Flickr recently announced that they have deployed Sentinel to provide automated Redis failover in their offline task processing subsystem despite worries about its consistency.
Last year, Factual engineer and distributed systems expert Kyle Kingsbury investigated the consistency properties of Redis as part of his Jepsen series. In it, he showed that he was able to construct a scenario using Redis and Sentinels in which "Redis threw away 56% of the writes it told us succeeded."
Kingsbury noted that this concerning outcome was the result of two issues with the Sentinel system.
First, notice that all the clients lost writes at the beginning of the partition … because they were all writing to n1 when the network dropped–and since n1 was demoted later, any writes made during that window were destroyed. The second problem was caused by split-brain: both n1 and n5 were primaries up until the partition healed. Depending on which node they were talking to, some clients might have their writes survive, and others have their writes lost.
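The two failure modes Kingsbury describes can be illustrated with a toy model. The sketch below is not Redis or Sentinel code: the function simulate_partition and its parameters are hypothetical, the node names n1 and n5 are borrowed from the quotation, and it assumes asynchronous replication in which a demoted primary discards its divergent writes when it resynchronizes from the newly promoted node.

```python
def simulate_partition(writes_before, writes_to_old_primary, writes_to_new_primary):
    """Toy model of acknowledged writes lost across a partition and failover.

    Assumption: the old primary (n1) keeps acknowledging writes after the
    network drops, and on heal it is demoted and takes a full copy of the
    promoted node's (n5's) data, discarding anything only it had seen.
    """
    n1 = set()     # data held by the original primary
    n5 = set()     # data held by the replica that gets promoted
    acked = set()  # every write a client was told had succeeded

    # Healthy cluster: writes go to n1 and are replicated to n5.
    for w in writes_before:
        n1.add(w)
        n5.add(w)
        acked.add(w)

    # Partition: n1 still accepts writes but can no longer replicate them.
    for w in writes_to_old_primary:
        n1.add(w)
        acked.add(w)

    # n5 has been promoted, so clients on its side write to it instead.
    for w in writes_to_new_primary:
        n5.add(w)
        acked.add(w)

    # Heal: n1 is demoted and resynchronizes from n5.
    n1 = set(n5)

    lost = acked - n5
    return acked, lost


acked, lost = simulate_partition(range(0, 10), range(10, 20), range(20, 30))
print(f"{len(lost)} of {len(acked)} acknowledged writes lost")
# prints "10 of 30 acknowledged writes lost"
```

In this model, every write acknowledged by n1 after the network dropped disappears, while writes sent to n5 survive, which matches the observation that the outcome depends on which node a client was talking to. The numbers are illustrative and are not meant to reproduce the 56% figure.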
Salvatore Sanfilippo, the creator of Redis, responded to the article by confirming the issue but pointing out that minimal data loss was not a design goal of Sentinel.
Just to be clear, the criticism is a good one, and it shows how Sentinel is not good to handle complex net splits with minimal data loss. Just this was never the goal, and what users were doing with their home-made scripts to handle failover was in the 99% of cases much worse than what Sentinel achieve as failure detection and handling of the failover process.
Flickr, while aware of these issues, began their move to Sentinel by first deciding on a set of aggressive SLA targets for their offline task processing subsystem. After noting that their existing manual failover process could not deliver on their 99.995% uptime target, they looked for other solutions and settled on Sentinel.
After significant testing of both the Sentinel system and its configuration, they were able to design an implementation that provided automated failover in 4-6 seconds, allowing them to meet their uptime targets. During their testing they were also able to reproduce Kingsbury's findings. However, Flickr engineers Richard Thorn and Shawn Cook explain that "[a]lthough we believe our production environment is not immune from split-brain, we are pretty sure that the benefits outweigh the risks."
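To put the uptime target in perspective, the following sketch converts 99.995% into a downtime budget and compares it with the reported 4-6 second failover. It assumes the target is measured over a 365-day year and that each failover costs exactly its duration in downtime; neither assumption comes from Flickr, and the function name downtime_budget_seconds is hypothetical.

```python
from fractions import Fraction

# Assumption: the SLA window is one non-leap year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_budget_seconds(uptime_percent, window_seconds=SECONDS_PER_YEAR):
    """Return the allowed downtime in seconds for an uptime percentage.

    The percentage is passed as a string so it can be parsed exactly,
    avoiding floating-point error in 1 - 0.99995.
    """
    unavailable = 1 - Fraction(uptime_percent) / 100
    return unavailable * window_seconds


budget = downtime_budget_seconds("99.995")
print(float(budget))  # 1576.8 seconds, roughly 26 minutes per year

# Failovers that fit in the budget if each one takes 6 or 4 seconds.
print(budget // 6, budget // 4)  # 262 394
```

Under these assumptions, a failover that completes in seconds leaves room for hundreds of incidents a year, whereas a manual process that takes more than about 26 minutes to respond would exhaust the annual budget in a single incident.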