Key Takeaways
- While chaos engineering is a proven technique for improving the resilience of systems, there is often a reluctance amongst stakeholders to introduce the practice when a system is viewed as critical.
- With critical systems, start by running experiments in dev/test environments to minimize both actual and perceived risk. The findings from these early experiments become the evidence you use to show stakeholders that production, being larger and more complex, stands to benefit even more from the practice.
- Using real production traffic, as opposed to a synthetic workload, can improve the usefulness of the experiments in these early stages.
- A good chaos engineering practice helps you to improve both the resilience of the system, and its observability when incidents do occur.
Chaos engineering is a discipline that has seen growing interest over recent years. It gives a name and a structure to the valuable practice of improving the reliability of your systems by embracing the fact that systems will fail. While literature and talks abound on how this approach can be applied, there is often hesitation when systems are viewed as “critical” or too important to fail. Although the reasons to apply this approach can be even more compelling for critical systems, stakeholders in these systems will understandably be sensitive to anything new that could add risk. In this article, I will share what our teams at Cerner Corporation, a healthcare information technology company, found to be effective in introducing this practice with our systems.
Organizing
Before you begin trying to apply these types of experiments, you want to make sure stakeholders in your system are aligned on the approach. This is important in the early stages, as some of your findings may alter your software delivery schedule. You will want to ensure they understand this new part of your development process, and that findings may take priority over other planned development. A useful comparison is how addressing reliability issues discovered in a production incident would take priority over other planned features. The issues you discover through chaos testing may be equally critical, but you will be discovering them proactively rather than reactively during an actual incident.
When preparing to introduce these types of experiments, you need to establish which components of your system will be part of the initial focus. To minimize the amount of stakeholder communication, it is helpful to include the owners of these components in the initial meetings so they understand what is being applied and why, and so you get their full buy-in. As with anything new, it may be easiest to agree to work jointly on the first experiment that includes their component, so they can be closely involved, see the value for themselves, and give feedback on how best to keep them engaged going forward. Their participation is valuable because they can immediately interpret the system behaviors being observed, rather than you having to document those behaviors in enough detail for them to provide feedback later. By being part of the experiment, they can dig more deeply into their component at the point of discovery, which is the most effective time to gather additional details.
Identify Safe Points of Introducing Chaos
With critical systems, it can be a good idea to first run experiments in your dev/test environments to minimize both actual and perceived risk. As you learn new things from these early experiments, you can explain to stakeholders that production is a larger and more complex environment which would further benefit from this practice. Equally, before introducing something like this in production, you want to be confident that you have a safe approach that allows you to be surprised by new findings without introducing additional risk.
As a next step, consider running chaos experiments against a new production environment before it is handling live traffic, using synthetic workloads to exercise it. You get the benefit of starting to test some of the boundaries of the system in its production configuration, and it is easy for other stakeholders to understand how this will be applied and that it will not introduce added risk to customers, since live traffic isn’t being handled yet.
To introduce more realistic workloads than synthetic traffic can provide, a next step may be to leverage your existing production traffic. At Cerner, we built a traffic management capability into our API gateway, which handles ingress traffic to our systems, that replays incoming requests to another system to generate a form of “shadow traffic.” When looking at candidates for shadow traffic, you want to make sure requests can safely be replayed. Read-only requests can make good candidates, but with sensitive workloads, even read operations can still have side effects. An example is a service that emits events to support an alerting system. You will need to take care when identifying candidates for this traffic and understand how to control problematic side effects in the system.
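To make the idea concrete, the following is a minimal sketch of how request mirroring could sit in front of two backends. This is not Cerner's gateway implementation; the endpoints and the X-Shadow-Request header are hypothetical, and only GET requests are treated as safe to replay.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"time"
)

// shadowHandler forwards every request to the primary backend and, for
// requests considered safe to replay (here: GETs), asynchronously mirrors a
// copy to the shadow system. The shadow response is discarded.
type shadowHandler struct {
	primaryURL string // hypothetical live system
	shadowURL  string // hypothetical system under experiment
	client     *http.Client
}

func (h *shadowHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	body, _ := io.ReadAll(r.Body) // buffer the body so it can be sent twice
	r.Body.Close()

	// Forward to the primary system and return its response to the caller.
	primaryReq, _ := http.NewRequest(r.Method, h.primaryURL+r.RequestURI, bytes.NewReader(body))
	primaryReq.Header = r.Header.Clone()
	resp, err := h.client.Do(primaryReq)
	if err != nil {
		http.Error(w, "upstream error", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	for k, v := range resp.Header {
		w.Header()[k] = v
	}
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)

	// Only replay requests that are safe to repeat.
	if r.Method != http.MethodGet {
		return
	}
	shadowReq, _ := http.NewRequest(r.Method, h.shadowURL+r.RequestURI, bytes.NewReader(body))
	shadowReq.Header = r.Header.Clone()
	shadowReq.Header.Set("X-Shadow-Request", "true") // hypothetical shadow annotation
	go func() {
		// Failures of shadow requests are logged but never affect the caller.
		if sresp, err := h.client.Do(shadowReq); err == nil {
			io.Copy(io.Discard, sresp.Body)
			sresp.Body.Close()
		} else {
			log.Printf("shadow request failed: %v", err)
		}
	}()
}

func main() {
	h := &shadowHandler{
		primaryURL: "http://primary.internal", // hypothetical hosts
		shadowURL:  "http://shadow.internal",
		client:     &http.Client{Timeout: 10 * time.Second},
	}
	log.Fatal(http.ListenAndServe(":8080", h))
}
```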
Having the ability to annotate traffic as shadowed will help in these later experiments. For example, if a shadow request fails in a system, you don’t necessarily want it counted toward the service’s overall production metrics (like its failure rate), as you could end up alerting the owners of that service when it is acceptable for shadowed requests to fail during these experiments. Being able to annotate and control this at the individual request level gives you the right granularity to manage these side effects and measure their behavior. Furthermore, it is valuable to have consistent correlation identifiers on your traffic (e.g., a Correlation-Id HTTP header) when applying a shadow traffic approach, as it gives you a reliable way of comparing the system handling live traffic with the system handling shadow traffic. Since you may only be able to safely replay a subset of requests to the other system, you can’t accurately apply aggregate comparisons of behavior (e.g., failure rate or 99th percentile response time) without knowing which subset of that traffic should be compared. By using correlation identifiers and knowing which request went to which system, you can compare the behavior of the two systems.
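On the receiving side, a small piece of middleware can honor that annotation. The sketch below reuses the hypothetical X-Shadow-Request and Correlation-Id headers from the previous example: shadow requests are logged with their correlation ID so they can be compared against live requests later, but they do not feed the production metrics that would otherwise drive alerting.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// statusWriter captures the response status code for metrics.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// instrument records request metrics but keeps shadow traffic out of the
// production failure-rate and latency numbers. Header names are illustrative.
func instrument(next http.Handler, record func(status int, d time.Duration)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(sw, r)
		elapsed := time.Since(start)

		corrID := r.Header.Get("Correlation-Id")
		if r.Header.Get("X-Shadow-Request") == "true" {
			// Logged for later comparison, but excluded from production metrics.
			log.Printf("shadow correlation_id=%s status=%d duration=%s", corrID, sw.status, elapsed)
			return
		}
		log.Printf("live correlation_id=%s status=%d duration=%s", corrID, sw.status, elapsed)
		record(sw.status, elapsed)
	})
}

func main() {
	record := func(status int, d time.Duration) {
		// In a real service this would update failure-rate and latency metrics.
	}
	handler := instrument(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}), record)
	log.Fatal(http.ListenAndServe(":9000", handler))
}
```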
With shadow traffic, you can replay requests to another system that can essentially fail without impacting the origin of those requests. This type of setup is an effective way to grow these experiments: because you control the flow of traffic in your system, you control the risk, which allows you to continue this type of testing once your system is live.
Apply Experiments
When it comes to running experiments, start simple and plan for additional time to effectively digest any surprises that are uncovered. For example, if you think the experiment will take two hours, plan for the whole afternoon. If you don’t need all the time, that’s fine, but what you don’t want is to feel compelled to end early due to time constraints. It also helps to have a central location where you document these experiments for others to discover and learn from. In this central location, build out your planned scenarios and the expected behavior before you apply the experiment. Capturing what you expect before you observe the behavior is valuable, as you will often be surprised by the actual behavior.
Early on, you may not have much automation for injecting faults into the system. A common example is manually shutting down services, whether by powering off physical infrastructure or the virtual machines supporting your system. When multiple people are coordinating tasks as part of an experiment, it is therefore important to centralize communication in one location. That may mean having everyone in the same chat and ensuring their calendars are blocked off for that time so their attention isn’t directed elsewhere. Having everyone in the same chat also helps in capturing the timeline of events for the experiment, which aids in correlating events and important findings, like the total time for the system to recover from the fault.

It is valuable to use tooling that can quickly illustrate the effect of these experiments on the actual consumer. Vegeta is a simple HTTP load testing tool that you can use for synthetic traffic generation, and it can quickly illustrate the client experience with its real-time charting in your console. It is important to capture client-side metrics, as there may be a discrepancy in the overall experience when only looking at server-side metrics. For example, if you inject a network fault that only affects the client-side application, you could miss the fact that specific requests never even arrived in your system. Making it easy to both generate traffic and visualize the effects helps when running these experiments and assessing the impact. Furthermore, it avoids the additional work of generating charts around error rate or latency, since the tooling can provide this in real time during the experiment.
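As a rough illustration, Vegeta can be driven from the command line (vegeta attack piped into vegeta report or vegeta plot) or as a Go library. The sketch below assumes the v12 library import path and a hypothetical read endpoint, and captures the client-side success ratio and latency percentiles during an experiment.

```go
package main

import (
	"fmt"
	"time"

	vegeta "github.com/tsenart/vegeta/v12/lib"
)

func main() {
	// Send 50 requests per second for 30 seconds against a hypothetical
	// read endpoint of the system under test.
	rate := vegeta.Rate{Freq: 50, Per: time.Second}
	duration := 30 * time.Second
	targeter := vegeta.NewStaticTargeter(vegeta.Target{
		Method: "GET",
		URL:    "http://system-under-test.internal/api/status", // hypothetical endpoint
	})
	attacker := vegeta.NewAttacker()

	var metrics vegeta.Metrics
	for res := range attacker.Attack(targeter, rate, duration, "chaos-experiment-baseline") {
		metrics.Add(res)
	}
	metrics.Close()

	// Client-side view of the experiment: success ratio and latency percentiles.
	fmt.Printf("success ratio: %.2f%%\n", metrics.Success*100)
	fmt.Printf("p50=%s p95=%s p99=%s\n", metrics.Latencies.P50, metrics.Latencies.P95, metrics.Latencies.P99)
	fmt.Printf("errors: %v\n", metrics.Errors)
}
```

Running a client-side harness like this alongside the fault injection gives you a record of what consumers actually experienced, which you can compare against the server-side dashboards after the experiment.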
Improving the System
The findings of these experiments are often not what you expect. Sometimes the behavior is similar enough to what you originally believed that the primary outcome of the experiment is what you learn about the actual system behavior. In other cases, you may discover things that you will want to address before pursuing other experiments. As mentioned earlier, you want to make it easy for these discoveries to feed into the team’s backlog so they are addressed sooner rather than later. The reason is that you want to support a quick feedback loop on these experiments; otherwise, it can introduce an unnecessary source of friction.
It is valuable to be able to quickly and easily change how you build visibility into your system, so that when you encounter surprises, you can repeat the experiment with additional visibility into a given component. The more you practice this, the better you will get at tuning your observability setup to capture the necessary context, ultimately reaching the point where you no longer need to make changes because you can already answer most of the common initial questions. At Cerner, we use New Relic across many of our products as our observability platform and Splunk for our log aggregation. Tuning this configuration can be accomplished centrally through a common container build process. When we find improvements around visibility, like adding context to logs, they can be made in one location for the other parts of the system to leverage.
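As a small illustration of adding log context in one location, the sketch below shows a shared logging helper that every service could import from a common base. The package name, field names, and headers are hypothetical and simply reuse the annotations from the earlier shadow-traffic examples; it does not represent Cerner's actual configuration.

```go
// Package obslog is a sketch of a shared logging helper that services could
// import so that context improvements (correlation IDs, shadow-traffic
// markers) are made in one place. Field names are illustrative.
package obslog

import (
	"log/slog"
	"net/http"
	"os"
)

// Logger emits structured JSON logs that a log aggregator (such as Splunk)
// can index on consistent field names.
var Logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

// ForRequest returns a logger pre-populated with the request's correlation ID
// and whether it is shadow traffic, so individual services don't repeat this.
func ForRequest(r *http.Request) *slog.Logger {
	return Logger.With(
		slog.String("correlation_id", r.Header.Get("Correlation-Id")),
		slog.Bool("shadow", r.Header.Get("X-Shadow-Request") == "true"),
	)
}
```

A service would then log with obslog.ForRequest(r).Info("request handled", "status", 200), and every entry automatically carries the same correlation and shadow-traffic context for the log aggregator to index.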
Share the Value
As you begin identifying issues and improving your system based on these experiments, you want to make sure others understand how this has been effective. Many of the primary stakeholders who take part in running the experiments are already aware of the benefits: they see how the system behaves under these failures, what is learned, and how improvements are applied. However, it may not be obvious to those closely involved that stakeholders further removed from the work, including the leadership team, may not see its value. Leadership often gets visibility into production incidents, but they may not always hear when an incident is avoided.
To counter this, make it a practice to document each experiment as a story that can be shared in leadership updates: what was found, how it was discovered, what the impact to customers was, and what is being done to address it. When this story is framed as a proactive approach, where you prepared the system and planned for surprising failures, you have already reduced the risk to your customers (or at the very least kept it to a minimal level). When you can share these stories of how the team has learned and improved systems without impacting customers, it becomes an obvious choice to continue using these methods. Where possible, the weight of these stories can be extended further by comparing a finding to a past production incident that may be more memorable to your stakeholders. Sharing how something was discovered and avoided by proactively seeking the boundaries of failure in the system typically gets more people interested in these approaches, as the sense of control it gives outweighs the reactive posture that results from production incidents. This practice also improves your team's ability to diagnose and troubleshoot issues in production: both the observability improvements and the knowledge gained about the system prepare the team to handle future incidents.
Conclusion
Change is never easy, especially when it involves sensitive workloads in a production environment. Chaos engineering can bring significant value to your systems by giving your teams a proactive way to identify and resolve risks to your system’s availability. By working early with the stakeholders of these systems and sharing the value of the practice, you build important momentum that helps you take this approach from early environments to your actual production systems.
About the Author
Carl Chesser is a principal engineer supporting the service platform at Cerner Corporation, a global leader in healthcare information technology. The majority of his career has been focused on evolving and scaling the service infrastructure for Cerner's core electronic medical record platform, Millennium. He is passionate about growing a positive engineering culture at Cerner and contributes by organizing hackathons and meetups and giving technical talks. In his spare time, he enjoys blogging about engineering-related topics and sharing his poorly made illustrations.