In preparation for the upcoming Chaos Conf 2020, InfoQ sat down with Adrian Cockcroft and Yury Niño Roa to explore topics of interest in the chaos engineering community. Key takeaways included: the adoption of chaos engineering is still unevenly spread across organisations; there are clear benefits to running “game days” to develop psychological safety; and the future of chaos engineering points toward incorporating experiments focused on security and scaling up experiments to test larger failure modes.
Cockcroft, VP cloud architecture strategy at AWS, and Niño Roa, senior site reliability engineer at ADL Digital Lab, are both speaking at Chaos Conf 2020. This is a free virtual event organised by the Gremlin team, running 6-8 October, that will focus on the chaos engineering community. InfoQ was keen to explore several of the key themes of the event, such as reliability and “completing the DevOps loop”.
Although the use of game days is not a new practice, both interviewees highlighted the benefits of practicing incident management and remediation in a safe-to-fail environment. As Google has discussed previously, the concept of “psychological safety” is vitally important to creating high-performing teams. When dealing with production incidents and acting under duress, this concept is even more important.
Implementing observability in both the platform and application stacks is essential for capturing feedback for the operations and development teams. DevOps practices aim to assist in the removal of silos separating developers and operations engineers. However, the Chaos Conf team suggests that “too often it creates a one-way flow from developers to ops with no feedback from ops to help developers build more operable and reliable applications.”
Increasing automation and bringing developers and operations engineers together through chaos engineering can help to increase the fidelity and usefulness of the metrics collected for feedback.
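To illustrate what application-level observability might look like in practice, below is a minimal sketch of a Python service exposing request metrics via the prometheus_client library; the library choice, metric names, and port are illustrative assumptions rather than details from the interview.

```python
# A minimal sketch of application-level observability, assuming a Python
# service instrumented with the prometheus_client library.
# Metric names and the port are illustrative, not taken from the article.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("orders_requests_total", "Total order requests handled")
LATENCY = Histogram("orders_request_seconds", "Order request latency in seconds")

@LATENCY.time()
def handle_order() -> None:
    """Simulate handling one request so the metrics have something to record."""
    REQUESTS.inc()
    time.sleep(random.uniform(0.01, 0.2))

if __name__ == "__main__":
    start_http_server(8000)  # scrape endpoint at http://localhost:8000/metrics
    while True:
        handle_order()
```

During a game day, developers and operations engineers can then watch the same dashboards built from these signals, which is one way the feedback loop described above becomes two-way.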
Below is an edited version of the discussion between InfoQ, Niño Roa, and Cockcroft.
InfoQ: How successful has the uptake of chaos engineering been among enterprise organizations?
Yury Niño Roa: I believe that it varies a lot between organizations. In the case of banking, they are feeling the competitive pressure from each other and the FinTechs. According to Casey Rosenthal, the position of banking institutions has shifted from “well, we can’t do chaos experiments because there’s real money on the line” to “we are quickly picking up chaos engineering as a practice”. There are many success stories from large banks such as Capital One, who are treating this as a very useful, solid practice.
Proactively running experiments with chaos engineering has been demonstrated to be useful for large organizations and startups, as described in the book by Rosenthal and Nora Jones: “Chaos Engineering: System Resiliency in Practice.” The book contains perspectives, examples and narratives from Slack, Google, Microsoft, LinkedIn, and Capital One on adopting chaos engineering.
Adrian Cockcroft: It’s been particularly successful in the large enterprise organizations that have made significant moves to the cloud, and in organizations that run highly critical applications like financial services.
It’s generally seen as part of the modernization program, along with adopting continuous delivery, DevOps and cloud computing. However, I think most organizations are just tinkering with chaos engineering, and it’s only the “scaled digital native” companies that use it as a foundational part of their operational excellence program.
InfoQ: Understanding the flow of value, implementing effective observability, and responding to feedback are often cited as prerequisites for chaos engineering. Is a typical enterprise mature enough in this regard?
Cockcroft: I don’t see these as prerequisites for chaos engineering. Any organization will benefit from regular game day training sessions for its staff, regardless of the state of their tooling around observability.
I would always start with the people, and run paper exercises or simulated “what-if” tests. These will make it clear what the gaps in tooling and process really are. Set up a schedule, with the goal that each time the game day runs, it’s a bit more realistic, and the organization is a bit more capable.
Make it a continuous improvement process, not something that has lots of pre-requisites before you can start.
Niño Roa: It depends on the context and country; I think typical enterprises are not mature enough in this regard.
With large enterprises in countries like mine, we are just starting to adopt the concept of observability. In some entities that belong to the government, for example, the resources invested in observability are really limited. I have met customers working on-call who do not know the difference between logging, tracing and monitoring. To have an observability pipeline is a utopian dream for some of them.
My recommendation with observability is just to start; the first step is always the hardest. A more pragmatic path starts by taking a look at architecture designs to get an overview of things that could be interesting to observe. Afterwards, you should set the target metrics baseline, such as Service Level Indicators (SLIs) and Service Level Objectives (SLOs), per service to get an idea of how your system should behave. With the information gathered, you can start to form hypotheses, which is the first principle in the chaos engineering manifesto.
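As an illustration of that last step, the sketch below shows one hypothetical way to record per-service SLO baselines and express a chaos hypothesis against them in Python; the service names, thresholds, and the measure_sli() stub are assumptions made for the example, not part of Niño Roa's answer.

```python
# A minimal sketch of turning SLIs/SLOs into a chaos experiment hypothesis.
# Service names, thresholds, and the measure_sli() stub are hypothetical.
from dataclasses import dataclass

@dataclass
class Slo:
    service: str
    sli: str            # the indicator being measured, e.g. availability
    objective: float    # target value for the SLI, e.g. 0.999

# Baseline objectives gathered from the architecture review (illustrative values).
SLOS = [
    Slo("checkout", "availability", 0.999),
    Slo("checkout", "p99_latency_ms", 300),
]

def measure_sli(service: str, sli: str) -> float:
    """Stub: in practice this would query your monitoring system."""
    return {"availability": 0.9995, "p99_latency_ms": 250}[sli]

def hypothesis_holds(slo: Slo) -> bool:
    """Hypothesis: the steady-state SLI still meets its objective during the experiment."""
    value = measure_sli(slo.service, slo.sli)
    # Latency objectives are upper bounds; availability objectives are lower bounds.
    return value <= slo.objective if slo.sli.endswith("_ms") else value >= slo.objective

if __name__ == "__main__":
    for slo in SLOS:
        print(slo.service, slo.sli, "OK" if hypothesis_holds(slo) else "VIOLATED")
```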
InfoQ: How important is psychological safety within an organization?
Niño Roa: Psychological safety is critical for an organization. No matter the size or business domain of the organization, its success is directly related to how team members feel accepted and respected within their current roles.
I believe that chaos experiments and game days are a perfect way to promote psychological safety in an organization. When people know that they can experiment in a safe format, it removes the sense of shame or insecurity around failure. That emotional foundation lets everyone focus on what they can learn, without the distraction of who is going to get in trouble for what happened.
Cockcroft: It’s a characteristic of successful organizations, whether that be enterprises or sports teams. If problems can’t be safely surfaced without blame or shooting the messenger, the organization won’t learn to systemically fix and prevent problems. For example, if you look at the current dominance of the Mercedes Formula One team, and the way they talk about and react to setbacks, it’s clearly set up that way. Other teams have much more internal politics and have not been able to compete over the last few years.
InfoQ: What has been the most exciting topic or idea that has emerged from the chaos engineering community over the past year? What should we look out for over the next year?
Cockcroft: The most interesting area is the move from a focus on chaos testing individual microservices or request flows, to larger scale failure mode scenarios.
Niño Roa: In the last year, “Security Chaos Engineering” and “Security Chaos Testing” provided us with an opportunity to instrument security capabilities that must be continuously integrated in order to build confidence in the system's ability to withstand malicious conditions. Aaron Rinehart, who has been a sponsor and advocate of this practice, wrote an excellent chapter in “Chaos Engineering: System Resiliency in Practice”, and he is preparing a complete book about this, with a lot of resources, discussions, and experiences about chaos and resilience in security.
More details about the upcoming Chaos Conf 2020 virtual event, including free registration, can be found on the conference’s website.