The rise of distributed computing in the form of microservices and cloud native architecture has created a challenge for many organisations. Failures in distributed computer systems are common, causing problems for users and potentially having a direct impact on a company's bottom line.
Professional software developers care about things like availability, the number of incidents that occur, and the operational burden of keeping a system up and running. The techniques of chaos engineering, which is broadly the business of deliberately testing how a system behaves under specific stresses, provide a mechanism by which engineers can proactively find and fix failures before they impact customers.
This is typically done by injecting failures at points where they are known to occur, such as remote procedure calls, caching layers, and persistence tiers, with individual engineers guiding each experiment.
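As a rough illustration of what injecting failure into a remote call can look like, here is a minimal Python sketch. It is not taken from any of the articles in this eMag; the names `with_chaos`, `InjectedFailure`, `failure_rate`, and `added_latency_s` are hypothetical and chosen only for this example.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedFailure(Exception):
    """Hypothetical error raised in place of a real dependency failure."""


def with_chaos(
    call: Callable[..., Any],
    *,
    failure_rate: float = 0.0,
    added_latency_s: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Any]:
    """Wrap a dependency call so an experiment can add latency or errors.

    failure_rate is the probability (0.0 to 1.0) that a call raises
    InjectedFailure instead of reaching the real dependency.
    """
    rng = rng or random.Random()

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        # Simulate a slow network hop or an overloaded backend.
        if added_latency_s > 0:
            sleep(added_latency_s)
        # Simulate the dependency being unavailable.
        if rng.random() < failure_rate:
            raise InjectedFailure("injected failure for chaos experiment")
        return call(*args, **kwargs)

    return wrapped


def fetch_profile(user_id: int) -> dict:
    """Stand-in for a remote procedure call (assumed for this sketch)."""
    return {"id": user_id, "name": "example"}


def get_profile_or_default(fetch: Callable[[int], dict], user_id: int) -> dict:
    """The behaviour under test: fall back gracefully when the call fails."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        return {"id": user_id, "name": "unknown"}
```

Wrapping the call with `with_chaos(fetch_profile, failure_rate=1.0)` and passing it to `get_profile_or_default` lets an engineer check that the fallback path really works. In practice the failure rate and the scope of the experiment would be kept small and tightly controlled.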
Creating a successful chaos practice isn’t purely an engineering problem. As with many aspects of cloud native computing it requires buy-in across the organisation. Whilst many large organisations have seen considerable success with chaos, many others are yet to apply it, and may be unsure how to get started. So with this eMag we’ve pulled together a variety of case studies to show mechanisms by which you can do so, even in tightly regulated industries where you might face considerable opposition.
The InfoQ eMag - Real World Chaos Engineering includes:
- The Abyss of Ignorable: a Route into Chaos Testing from Starling Bank - Greg Hawkins describes how Starling Bank introduced a chaos engineering practice, starting in 2016 with their own simple chaos daemon.
- Applying Chaos Engineering in Healthcare: Getting Started with Sensitive Workloads - Carl Chesser shares what the teams at Cerner Corporation, a healthcare information technology company, found to be effective in introducing chaos engineering with their systems.
- A Chaos Test too far… - Kathryn Downes, Engineer, Internal Products, and Arjun Gadhia, Principal Engineer, Internal Products for the Financial Times, describe how they dealt with unintentionally running a chaos experiment in production for Spark, an in-house content management system for creating and publishing digital content.
- SeaMonkeys - Chaos in the War Room - Glen Ford describes his experience applying a very early form of chaos testing to naval combat systems in the Australian military in the late 1990s and draws parallels to modern SRE.
- Chaos Conf Q&A: Adrian Cockcroft & Yury Niño Roa - Daniel Bryant sat down with Adrian Cockcroft and Yury Niño Roa to explore topics of interest in the chaos engineering community. Key takeaways included: the adoption of chaos engineering is still unevenly spread across organisations; there are clear benefits to running “game days” to develop psychological safety; and the future of chaos engineering points toward incorporating experiments focused on security and scaling up experiments to test larger failure modes.
InfoQ eMags are professionally designed, downloadable collections of popular InfoQ content - articles, interviews, presentations, and research - covering the latest software development technologies, trends, and topics.