Earlier in the year, the fourth edition of “Chaos Community Day” was held at Work-Bench in New York City. Key takeaways from the day included: chaos engineering draws heavily from other domains, and software engineers can learn much from them; understanding and exchanging mental models is vital for establishing resilience within a system; observability is a prerequisite for conducting chaos experiments; and "gamedays" are a valuable way for engineers to practice handling failures in non-emergency scenarios.
After a welcome from the Chaos Community Day organiser, Casey Rosenthal, CEO at Verica and ex-engineering manager for the chaos team at Netflix, the first speaker of the day was Nora Jones, who presented “Chaos Engineering Traps”. Jones, head of chaos engineering and human factors at Slack, began the talk by stating that chaos engineering draws on concepts and ideas from several other disciplines, including resilience engineering, and from industries such as aviation, surgery, and law enforcement. She argued that software engineers do not talk about this enough, and therefore miss cross-discipline learning opportunities.
Echoing thoughts shared by John Allspaw at QCon London, she stated that exchanging mental models about the systems we work with is vitally important. Chaos engineering is also a company-wide effort, and safety must be thought about -- and communicated -- up front.
All phases of chaos engineering are important, and deserve equal attention. For example, the beginning phase allows the exchange of mental models, and we can start to identify any gaps between each other's models.
Jones presented a series of “traps” that introduced fallacies such as measuring the success of chaos engineering by counting the number of vulnerabilities found, the belief that engineers must fix all issues found, and “it’s not real chaos engineering unless you move beyond gamedays and experimenting in sandbox environments”. She believed that trap number 3, which was presented at the end of the talk, was the most important: there is no prescriptive formula for doing chaos engineering. Interested readers can learn more from the InfoQ Chaos Engineering eMag, which was curated by Jones.
For the next talk, Nathan Aschbacher, CEO at Auxon, cautioned that as engineers are now adding layers of software for the control of mechanical systems that were previously understood quite well -- such as aeroplanes -- this can result in complex systems whose failure modes are much harder to understand. He argued that “chaos engineering is about surfacing the unknown unknowns”, but also cautioned that engineers should start from first principles when seeking to understand a system. Aschbacher stated that there clearly is “benefit from starting with understanding your known knowns…”
Next to the stage was Charity Majors, co-founder and CTO at Honeycomb, and she began her talk by suggesting that chaos engineering is “really just a fancy new marketing term for testing in prod”. Majors argued that our idea of the software development life cycle is “due for an upgrade” in the era of distributed systems, and that although chaos engineering is very beneficial, engineers should still invest in pre-production testing. Reiterating a core idea that was discussed throughout the day, she noted that sharing mental models within a team is essential for success when creating software:
We should be thinking about our distributed systems like the national power grid. It has to be modular. There is no way you can keep an accurate model of the entire system in your head.
Majors stated that there are some prerequisites for chaos engineering, such as understanding your system at a fundamental level, and also having the ability to observe it: “Without observability, you don't have chaos engineering. You just have chaos.” Tammy Butow has previously discussed similar prerequisites for chaos engineering at QCon London.
In the afternoon talks, Deepak Srinivasan, senior manager of software engineering at Capital One, presented his team’s journey with exploring chaos engineering and creating more resilient systems. He explored three key stages of this journey: begin by obtaining an understanding of where an organisation and associated systems are now (the “operating point”); architect to reduce mean time to resolve (see image below); and improve human operations, by improving communication of failures, increasing transparency and accountability, and by challenging assumptions.
Next, Padma Gopalan from Google provided insight into how teams within Google run disaster recovery testing (DiRT). She began the talk by describing how the concepts of DiRT and chaos engineering share many similarities, and that they emerged at similar times from shared ideas. Gopalan stated that the reason engineers run DiRT sessions at Google is that this allows them to test mitigation and failure response processes under a controlled scenario, before they are encountered in emergency conditions. She shared several DiRT scenarios, an example of which is shown below:
Gopalan also described “Catzilla”, an automated DiRT tool that is used internally within Google (see image below).
She continued by stating that before any testing in production is undertaken, engineers must set clear goals for their tests, and ensure that confidence in failure injection scenarios is built incrementally: “start with 1% failure, then 5%... or test in dev environments, then staging, then prod”. Engineers must focus on the customer experience in support of service level objectives (SLOs), and ensure that customer-facing issues can be detected and mitigated, and that rollbacks work quickly.
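This incremental, SLO-guarded approach can be sketched in a few lines of code. The Python example below is not from the talk or from any Google tooling such as Catzilla; it is a minimal, hypothetical illustration of ramping a failure-injection percentage in stages and aborting (and rolling back) if the observed customer-facing error rate breaches the SLO.

```python
import random
import time

# Hypothetical sketch of incremental failure injection with an SLO guard.
# All names and numbers here are illustrative assumptions, not real tooling.

FAILURE_STAGES = [0.01, 0.05, 0.10]   # inject faults into 1%, then 5%, then 10% of requests
ERROR_RATE_SLO = 0.02                  # abort if the customer-visible error rate exceeds 2%

def handle_request(inject_failure: bool) -> bool:
    """Stand-in for a real request handler; returns True if the customer sees a success."""
    if inject_failure:
        # A resilient system should mask most injected faults via retries or fallbacks;
        # assume (hypothetically) that 90% of them never reach the customer.
        return random.random() < 0.9
    return random.random() > 0.001     # small baseline error rate

def run_stage(failure_fraction: float, requests: int = 1000) -> float:
    """Send traffic with the given injected-failure fraction and return the error rate."""
    errors = 0
    for _ in range(requests):
        inject = random.random() < failure_fraction
        if not handle_request(inject):
            errors += 1
    return errors / requests

def run_experiment() -> None:
    for fraction in FAILURE_STAGES:
        error_rate = run_stage(fraction)
        print(f"stage {fraction:.0%}: observed error rate {error_rate:.2%}")
        if error_rate > ERROR_RATE_SLO:
            print("SLO breached -- rolling back and stopping the experiment")
            return                     # in a real system, trigger the rollback here
        time.sleep(1)                  # in practice, wait long enough to observe real impact
    print("experiment completed within SLO at all stages")

if __name__ == "__main__":
    run_experiment()
```

The same staged approach applies across environments: the stages could just as easily be "dev, then staging, then prod", with the guard condition expressed in terms of the relevant SLOs for each environment.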
Echoing many other speakers throughout the day, Gopalan stated that chaos engineering is very much an organisation-wide effort, and experimentation plans and times, along with mitigation strategies, must be well communicated.