At QCon London, Tammy Butow explained why the world needs more resilient systems, and how this can be achieved with the practice of chaos engineering. Three primary prerequisites for chaos engineering were provided -- high severity "SEV" incident management, monitoring, and measuring the impact -- and a series of guidelines, tools and practices for creating a chaos testing practice were presented.
Butow, a principal SRE at Gremlin, began the talk by providing a definition of a resilient system:
A resilient system is a highly available and durable system. A resilient system can maintain an acceptable level of service in the face of failure.
A resilient system can weather the storm (a misconfiguration, a large scale natural disaster or controlled chaos engineering).
A series of industry-specific IT failure case studies was presented to demonstrate the need for resilience. Within the FinTech industry, people are changing jobs, moving home and traveling more, all of which depend on reliable financial systems. In 2014 the "Real Time Gross Settlement" payment system failed in the UK, which resulted in scheduled house buying transactions failing. Issues have also been seen within the travel industry, with British Airways experiencing an IT outage in 2017 that is estimated to have cost the airline 80 million pounds. Butow discussed several other case studies that demonstrated issues caused by a lack of resilience within IT systems, and argued that systems need to not only handle failure gracefully, but also provide value anytime and anywhere.
In the case studies presented, the primary concern of the user is the resilience of the system, in particular high availability. One way to encourage the building of more resilient systems is through chaos engineering, which is the practice of running thoughtful, planned experiments that are designed to reveal the weaknesses in systems. The core premise of this practice is to "inject something harmful, in order to build an immunity", much like hormesis.
Butow introduced chaos engineering as an analogue of a vaccination program, and suggested that a chaos engineer could be thought of as a vaccine research computer scientist. Site Reliability Engineers (SREs) and production engineers commonly practice chaos engineering, even if they have not explicitly labeled activities such as testing database failover, application restart and crash recovery as chaos engineering.
There are three primary prerequisites for implementing an effective chaos engineering practice:
- High severity incident management
- Monitoring
- Measuring the impact of downtime
High severity incidents, or "SEVs", must be carefully managed. Butow described in detail how to establish a high severity incident management program -- this is the practice of recording, triaging, tracking, and assigning business value to problems that impact critical systems. There are three levels of SEVs, with varying impact, resolution goals, and communication requirements.
A discussion on how to determine SEV levels, based on service level agreements and objectives -- referred to as SLAs and SLOs, respectively -- was followed by an overview of the SEV lifecycle. Stages within this lifecycle include detection, diagnosis, mitigation, prevention and closure, after which the cycle returns to detection. Teams must identify their critical systems (typically traffic management, databases and storage) and practice managing failures and the corresponding SEV lifecycle. The most effective way of doing this is typically through running incident simulation "game days" -- something Adrian Cockcroft refers to as "the fire drill for IT".
The second prerequisite for chaos engineering, monitoring, includes implementing effective aggregated metrics and log collection, and also creating corresponding critical services dashboards. The "four golden signals" of monitoring, taken from the Google SRE book, are latency, traffic, errors and saturation (which closely follow Brendan Gregg's USE method and Tom Wilkie's RED method). Butow warned that if engineers attempt to implement chaos testing without monitoring, then they simply will not understand what is happening within their systems, nor the true impact of failure.
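As a rough illustration of the four golden signals, the following minimal Python sketch summarises one monitoring window from a list of request records. It was not shown in the talk: the `Request` type, the `golden_signals` function, the use of status codes of 500 and above as errors, and the idea of passing in a single utilisation figure as the saturation signal are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP-style status code (assumed for this sketch)


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    if not sorted_values:
        return 0.0
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def golden_signals(requests, window_seconds, utilisation):
    """Summarise one window as latency, traffic, errors and saturation.

    `utilisation` is a hypothetical 0..1 figure (for example CPU or queue
    fullness) supplied by the caller; real systems measure it separately.
    """
    durations = sorted(r.duration_ms for r in requests)
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p95_ms": percentile(durations, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests) if requests else 0.0,
        "saturation": utilisation,
    }


# Ten sample requests in a five second window, one of which failed.
sample = [Request(duration_ms=10.0 * i, status=200) for i in range(1, 10)]
sample.append(Request(duration_ms=100.0, status=500))
signals = golden_signals(sample, window_seconds=5, utilisation=0.7)
print(signals)
```

A dashboard built on figures like these gives a baseline to compare against when a failure is injected, which is the point of treating monitoring as a prerequisite.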
Measuring the impact of downtime was presented as the third prerequisite. Fundamentally, engineers must understand how SEVs impact their customers and business. Impact includes system properties like availability and durability, and business impact such as (damaging) outcomes, cost and time.
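The arithmetic behind availability and cost impact can be sketched in a few lines of Python. This is an illustrative example rather than anything presented in the talk: the function names, the 30-day period and the revenue-per-minute figure are assumptions chosen only to make the calculation concrete.

```python
def allowed_downtime_minutes(slo_percent, period_days=30):
    """Downtime budget implied by an availability objective over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def availability_percent(downtime_minutes, period_days=30):
    """Availability actually achieved given the observed downtime."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


def downtime_cost(downtime_minutes, revenue_per_minute):
    """Direct revenue impact only; reputation and recovery costs are ignored."""
    return downtime_minutes * revenue_per_minute


# A 99.9% objective over 30 days leaves a budget of roughly 43.2 minutes.
budget = allowed_downtime_minutes(99.9)

# A single two hour SEV (120 minutes) exceeds that budget.
achieved = availability_percent(120)
cost = downtime_cost(120, revenue_per_minute=500)  # hypothetical revenue figure
print(round(budget, 1), round(achieved, 3), cost)
```

Expressing a SEV in terms of budget consumed and money lost makes it easier to assign business value to the incident, and to show a before and after comparison once resilience work has been done.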
A chaos engineering use case from Twilio was presented next, in which the Twilio engineering team created and tested hypotheses about failure within their internal distributed queuing system, "Ratequeue". With the increasing complexity and unpredictability of the distributed systems that the industry is adopting, the Twilio team highly recommends practicing chaos engineering wherever possible in order to test for coupling, and to increase a system's resilience.
After implementing Ratequeue High Availability and Ratequeue Chaos, we saw our first automated failover in production a few months ago complete in a little over a minute.
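The general shape of such an experiment (state a hypothesis, confirm steady state, inject a failure, then check whether the hypothesis still holds) can be sketched as follows. This is a toy, in-memory Python example under stated assumptions: `ToyQueueCluster`, `run_experiment` and `kill_primary` are hypothetical names, and nothing here reflects how Ratequeue or Twilio's tooling is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


class ToyQueueCluster:
    """Hypothetical stand-in for a replicated queue, for illustration only."""

    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]
        self.messages = []  # treated as fully replicated for simplicity

    def primary(self):
        # The first healthy node acts as primary, modelling automatic failover.
        for node in self.nodes:
            if node.healthy:
                return node
        return None

    def enqueue(self, message):
        if self.primary() is None:
            return False
        self.messages.append(message)
        return True


def kill_primary(cluster):
    """Failure injection: mark the current primary as unhealthy."""
    node = cluster.primary()
    if node is not None:
        node.healthy = False


def run_experiment(cluster, inject_failure, probe_count=5):
    """Hypothesis: the cluster keeps accepting messages after the failure."""

    def steady_state():
        return all(cluster.enqueue(f"probe-{i}") for i in range(probe_count))

    before = steady_state()
    if not before:
        # Never inject failure into a system that is already unhealthy.
        return {"aborted": True, "hypothesis_held": False}
    inject_failure(cluster)
    after = steady_state()
    return {"aborted": False, "hypothesis_held": after}


replicated = ToyQueueCluster(["queue-a", "queue-b"])
result = run_experiment(replicated, kill_primary)
print(result, replicated.primary().name)

single = ToyQueueCluster(["queue-a"])
print(run_experiment(single, kill_primary))
```

In this sketch the two-node cluster survives the loss of its primary while the single-node cluster does not, which is the kind of coupling or single point of failure that an experiment is meant to expose before it causes a SEV.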
Butow cautioned that before beginning chaos engineering, the objectives and practices should be shared widely within an organisation:
- Conduct a chaos engineering kick-off at an all-hands meeting
- Send email updates and progress reports
- Run monthly metrics reviews
- Deliver presentations
A brief demonstration of the Gremlin Chaos Engineering SaaS platform was presented, and open source tooling was discussed, such as the Netflix Simian Army, the chaos-toolkit, and PowerfulSeal. It is worth noting that several thought leaders within this space -- such as John Allspaw, co-founder at Adaptive Capacity Labs -- are cautioning that the human side of resilience engineering should not be forgotten, and are arguing that this is in fact more important than the associated tooling.
The talk concluded with a call to action to implement chaos engineering, and to make and measure improvements, "tell a story of before and after with metrics":
- Build - Build a new system / improve existing
- Borrow - Use open source / contribute to OS
- Buy - Use 3rd party systems
- Brush up - GameDays / Team training
- Break - Chaos Engineering / Failure injection
- Begone - Decommission systems / delete code
The slides of Tammy Butow's QCon London talk "Why the World Needs More Resilient Systems" (4MB PDF) can be found on the QCon website, and the video recording will be released on InfoQ later in the year. Interested readers can learn more about chaos engineering by reading the Gremlin Community website, and by joining the Gremlin Slack.