The Twilio team describes their foray into Chaos Engineering where they use Gremlin to inject failures into their homegrown queuing system shards to test for automated recovery.
Twilio provides SMS and phone gateway services via APIs that application developers can invoke from their code. A core part of Twilio's architecture is their distributed queuing and rate limiting system. It provides persistent queueing to handle system failures and latency in processing messages, avoiding message loss. The system is called Ratequeue, built by the Twilio team. Ratequeue rate limits the dequeue rates for large numbers - every phone number has its own queue - of ephemeral queues. Rate limiting is required because developers can invoke the Twilio APIs as fast as possible, but the rate at which Twilio pushes these into the phone networks has to be controlled. Ratequeue is built on top of Redis, and is horizontally sharded for isolation and load balancing. Failure of one shard does not impact the other shards. Additionally, each shard is composed of a master and its replica for HA.
When a primary shard fails in Ratequeue, the previous way to recover was for a human to intervene and manually promote the replica to master. This involved locating the host which has the same shard number as the primary and adding it to the LB. The team built two systems to remove this manual process - an automated failover system and a fault injection system to test the former. The fault injection system is presented as an example of chaos engineering, where a system is subjected to random faults at various levels to test its ability to recover from failure.
Zero data loss was a primary goal of the exercise - the others being automatic detection of failure and promotion of the new master. The team built a custom solution using Amazon Kinesis, Nagios and Lazarus - their cluster automation service. Each Ratequeue replica pushes heartbeats about its master’s health to Nagios, which in turn pushes notifications to Kinesis if a threshold is breached. Lazarus listens on Kinesis for these events, performs its own checks for the health of the cluster and starts the process of failover if required.
To test this automated failure recovery, the team created a tool called Ratequeue Chaos that would pick a shard, kill the primary and monitor its recovery. A service called Gremlin is used to generate and inject faults into the system, triggering the failover. Gremlin allows for controlled fault injections across the stack and is invoked by Ratequeue Chaos via its API. This process runs every four hours on Twilio’s staging environment.
The team also shared learnings during this process - have a hypothesis based testing model, a framework to run the test, and a rollback plan if this is done on production.