In a new O'Reilly report, "Chaos Engineering Observability: Bringing Chaos Experiments into System Observability", Russ Miles explores why he believes the topics of observability and chaos engineering "go hand in hand". He argues that as engineers begin to run chaos experiments, they will need to be able to ask many questions about the underlying system being experimented on. Being able to observe and make sense of the running system is an important prerequisite for this, and as such, the report provides a brief guide for collecting metrics, logs, and traces, alongside running experiments.
Miles, CEO of ChaosIQ.io, worked alongside the Humio team during the production of this 32-page report, which Humio also sponsored. Topics covered include: chaos engineering signals, which explores how to obtain notifications of an experiment's progress and how to act on related events in order to control the experiment; logging chaos experiments, and the need for an effective centralised logging solution; and tracing chaos experiments, for example by emitting OpenTracing data in order to "piece together the crucial answers such as what happened, in what order, and who instigated the whole thing".
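To make the logging and tracing topics concrete, the following is a minimal sketch (not taken from the report) of how a single experiment action might be wrapped in an OpenTracing span and structured log events, so that the experiment's activity shows up alongside the rest of the system's telemetry; the action name and logger setup are illustrative assumptions.

```python
# Illustrative sketch only: wrap one hypothetical chaos experiment action in an
# OpenTracing span plus log events, so the experiment is observable alongside
# normal system telemetry. Uses the no-op global tracer unless a real tracer
# (e.g. a Jaeger client) has been registered.
import logging
import opentracing

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("chaos.experiment")


def run_action(action_name: str) -> None:
    """Run one experiment action, emitting a trace span and log events around it."""
    tracer = opentracing.global_tracer()
    with tracer.start_active_span("chaos-action") as scope:
        scope.span.set_tag("chaos.action", action_name)
        log.info("chaos action started: %s", action_name)
        # ... the actual failure injection would go here (kill a process, add latency, etc.) ...
        scope.span.log_kv({"event": "action-completed", "action": action_name})
        log.info("chaos action completed: %s", action_name)


run_action("terminate-dependency")
```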
InfoQ reached out to Geeta Schmidt, CEO of Humio, in order to understand how the Humio team sees the relationship between the work they are doing with their log management platform and chaos engineering.
Chaos engineering allows developers, security teams and operations managers to refine, recalibrate, and navigate the understanding of systems through intentional and careful experimentation in the form of failure and threat injection. The increases in understanding lead to a better experience for their customers and users and improved business outcomes.
Echoing the first step in the "chaos in practice" section of the Principles of Chaos, which states "start by defining 'steady state' as some measurable output of a system that indicates normal behavior", Schmidt argued that developing the ability to observe a system is important when beginning chaos experimentation in order to "detect when the system is normal and how it deviates from that normal state as the experiment is executed". She emphasised that when working with Humio customers, the team had learned that access to customisable and fine-grained near real-time metrics and logging led to better understanding of the impact of experiments.
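As a simple illustration of treating steady state as "some measurable output of a system", the sketch below checks a hypothetical health endpoint and latency budget, which could be run before an experiment and re-checked as it executes; the URL and threshold are assumptions for the example, not taken from the report or from Humio.

```python
# Illustrative sketch only: express "steady state" as a measurable output that
# can be checked before an experiment and re-checked while it runs.
# The health endpoint URL and latency budget below are hypothetical.
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # hypothetical health endpoint
LATENCY_BUDGET_SECONDS = 0.5                 # hypothetical "normal" response time


def steady_state_ok() -> bool:
    """Return True if the system looks normal: a healthy response within budget."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as response:
            healthy = response.status == 200
    except OSError:  # connection refused, timeout, HTTP error, etc.
        return False
    return healthy and (time.monotonic() - start) <= LATENCY_BUDGET_SECONDS


print("steady state holds:", steady_state_ok())
```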
In related work, Tammy Butow, principal SRE at Gremlin, presented at last year's QCon London, and argued that there are three primary prerequisites for implementing effective chaos engineering practices: high severity incident management, monitoring, and the ability to measure the impact of downtime. The original O'Reilly Chaos Engineering report, initially sponsored by Netflix, and written by Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri, presented a broad overview of the topic and also discussed the importance of identifying metrics that should be watched during an experiment.
Both Michael Kehoe and Aaron Rinehart discussed the need to be able to observe failure injection experiments and derive new insights from the results of controlled experiments in their respective articles "LinkedIn's Waterbear: Influencing resilient applications" and "Using Chaos Engineering to Secure Distributed Systems" in the recent InfoQ Chaos Engineering eMag, of which Nora Jones was the guest editor. John Allspaw, co-founder at Adaptive Capacity Labs, has cautioned that the term "observability" has recently diverged from its previous meaning (which was backed by empirically supported research), and that the importance of human factors in relation to this should not be ignored.
InfoQ recently sat down with Miles and discussed the topics of chaos engineering and observability.
InfoQ: Hi Russ, and thanks for joining us today. If teams are new to chaos engineering and observability, where should they start?
Russ Miles: All roads tend to start with the Principles of Chaos.
From there, we'd suggest grabbing the Chaos Toolkit, working through some online tutorials, and getting stuck into creating some chaos experiments of your own. This will lead you almost immediately to question your own system's observability, and then grabbing a copy of the Distributed Systems Observability and Chaos Engineering books would be great next steps.
Finally, there are also a few great communities that you can join to compare notes and get lots of free advice. There is the Chaos Community on Google Groups, Gremlin's Chaos Engineering Community on Slack and, for those using the Chaos Toolkit and Platform, our team and community Slack as well.
Chaos engineering is an exciting journey towards building trust and confidence in your systems that never really stops. These communities are there to help you get going and keep going as we all see chaos engineering as being a key mindset and discipline to helping us all build more robust, resilient and, ultimately, reliable, safe and secure systems.
InfoQ: How important is it to be able to observe a system before running chaos experiments?
Miles: The key word in this question is "before". While it is very useful to have good system observability regardless of whether you are embarking on chaos engineering, it is not exactly a prerequisite.
The two approaches, chaos engineering and observability, work well together.
In some cases you will already have some system observability before you do any chaos engineering, and this will definitely help when your chaos engineering experiments surface deviations that need to be diagnosed as system weaknesses. However, starting to conduct chaos engineering experiments also acts as a "forcing function": it will highlight the need for better system observability if it is not present.
For this reason, in our experience good system observability is less a strict prerequisite to chaos engineering and more a capability that grows alongside it. As you adopt chaos engineering, any lack of system observability will come to the fore, and the two techniques will mutually develop and reinforce one another over time.
Interestingly, this is also why chaos engineering is a good "forcing function" for other non-functional or operationally-focussed system improvements. Chaos engineering often makes such easily-forgotten or de-prioritised concerns utterly unignorable, and is a great way of making everyone, even those not responsible for or intimate with production systems, aware of these challenges and able to design and implement system robustness and resilience strategies to meet them.
In a nutshell, chaos engineering makes everyone a better operationally-sensitive system owner, and system observability is one capability that benefits strongly almost immediately.
InfoQ: Are there any particular challenges with implementing chaos engineering? And how important is it to be able to understand and observe a chaos experiment itself?
Miles: Chaos engineering can fail in an organisation for a number of reasons, from not understanding the mindset right through to chaos engineers subjecting other teams to chaos in a conflict-style relationship. Chaos engineering done badly can easily end up with the discipline being rejected.
The ChaosIQ team believes that everyone in an organisation will eventually begin to practice chaos engineering. Rather than being a technique just for the chosen few, chaos engineering is a useful mindset and discipline for everyone who owns mission-critical business systems. Understandably, this means that, even for a moderately sized organisation, there is the challenge of chaos engineering at scale. And at scale, the opportunities for failure of the chaos engineering approach increase.
One such opportunity for the failure of chaos engineering, which appears at all scales of adoption, is when chaos engineering is practiced as a surprise. What this means is that if anyone is surprised when an experiment executes, then that's usually a bad moment for everyone, and this can frequently lead to requests not to run those experiments again.
Sometimes it is useful to surprise your team with the conditions of a chaos experiment, say as part of an organised Game Day. In order to explore how your team responds to turbulent conditions, perhaps to expose any people/practices/process weaknesses, you may not tell your colleagues the exact nature of the conditions being applied, and so it could be argued that those conditions are a surprise. However, the Game Day itself is not a surprise, and this is an important point.
When executing chaos experiments, it is useful for those experiments not to be a surprise either. Everyone potentially affected by a chaos experiment's execution should be able to know that chaos is being executed, when, against what, and, ideally, why.
The combination of the Chaos Toolkit and Platform's experiment format and pluggable chaos engineering observability tools makes this level of cross-team and organisation visibility possible, avoiding the dangers of surprise chaos and the adverse reactions we've seen that can happen at that point.
So in many respects, regardless of the size of your teams and chaos engineering capability, bringing your chaos experiment execution into your overall system observability picture is crucial to avoid these pitfalls. Chaos engineering experiments are also citizens within your system, and so they should be good citizens by making their activities known and observable alongside all the other aspects of your runtime systems.
Miles' new report, "Chaos Engineering Observability: Bringing Chaos Experiments into System Observability", is available for download via the Humio website. Registration is required.