Key Takeaways
- Modern architectures and infrastructure are ephemeral and dynamic, with unpredictable user behavior intersecting with unforeseeable events. Our systems have begun to function less like predictable metal machines, and more like biological machines with emergent behaviors.
- Disaster recovery has been around for many years, but it's expensive, custom, and fragile, so it's only implemented where it's essential and it's exercised infrequently if at all. Chaos Engineering takes advantage of the APIs and automation now available in cloud native architectures.
- Everyone within an IT team should focus on chaos engineering, but the area of focus may differ. For example, there are layers of chaos: at the operating system level (CPU, memory), at the network level, and at the application layer. Even product managers who focus on user experience should plan and execute chaos exercises as part of their feature roll-outs.
- Chaos Engineering is a good way to provoke the emergence of unknowns and unanticipated failures within a managed time frame, in order to help improve the resilience of the system. Chaos Engineering should be run continuously.
- There are a few important prerequisites for Chaos Engineering. First, you need to make sure you have appropriate monitoring in place. Then you need to be able to determine what your top five critical services are. Next, pick a service and determine a hypothesis and chaos experiment you want to perform on it.
In preparation for the upcoming Chaos Conf event, running in San Francisco, USA, on 28th September, InfoQ sat down with a number of the presenters and discussed the benefits, challenges and practices of chaos engineering.
The inaugural Chaos Conf will be run by the Gremlin Inc team, and aims to provide a forum for experts, practitioners, and those new to the topic of chaos engineering to share experiences and learn about good practices that are emerging. Presentations will cover topics such as introducing the ideas of running chaos experiments to non-technical stakeholders, the history of chaos engineering, implementing fault injection, patterns for failure management, and “breaking containers” and running chaos experiments with Kubernetes.
InfoQ recently sat down with the following Chaos Conf speakers, and discussed all things chaos engineering: Kriss Rochefolle, Director of Operational Excellence @ OUI.Sncf; Adrian Cockcroft, VP of Cloud Architecture @ AWS; Charity Majors, CEO @ Honeycomb; Mark McBride, CEO & Founder @ Turbine Labs; Vilas Veeraraghavan, Director of Engineering @ Walmart Labs; Ronnie Chen, Engineering Manager @ Twitter; Mikolaj Pawlikowski, Software Engineer @ Bloomberg; and Tammy Butow and Ana Medina, Empresses of Chaos @ Gremlin.
InfoQ: Welcome, and many thanks for taking part in the Chaos Conf pre-event Q&A. Could you briefly introduce yourself please?
Kriss Rochefolle: Hi, I’m a Quality & Safety Engineer from the Ecole des Mines de Nantes, France (now IMT Atlantique). I built my experience by setting up software quality teams at software companies and start-ups, especially in multi-site and offshore environments.
After helping to lead an Agility & DevOps approach at one of the first French flash-sales websites, I joined the OUI.sncf group two years ago to continue the DevOps transformation with an Operational Excellence approach and a pinch of Chaos Engineering.
Adrian Cockcroft: Hello, I’m AWS VP Cloud Architecture Strategy, and part of my job is to run the open source community team for AWS. The rest of my job is to talk to customers, and speak at conferences.
Charity Majors: Hey, I'm co-founder and CEO of honeycomb.io, where we build observability for software engineers. Previously of Parse/Facebook, co-author of O'Reilly's book "Database Reliability Engineering", long time operations engineer and breaker of systems.
Mark McBride: Hi, I’m the founder and CEO of Turbine Labs. We make Houston, a service mesh with a focus on self-service for microservices. Prior to Turbine Labs I ran server-side engineering at Nest, and before that I worked at Twitter. In both places, I led the transition to microservices.
Vilas Veeraraghavan: Hello, I run the cloud application testing and deployment pipeline at Walmart. We own testing, analysis, sizing of clusters for our hybrid cloud deployment and chaos engineering.
Ronnie Chen: Hi, I'm currently an engineering manager at Twitter. I started in the tech industry as a backend engineer, and then moved into data engineering. For Chaos Conf, however, I'm speaking primarily in my capacity as a technical diver (SCUBA and rebreather).
Mikolaj Pawlikowski: I'm a Software Engineer at Bloomberg, where I work on a microservices platform running on top of Kubernetes. It's a large distributed system, which led me to the Chaos Engineering paradigm.
Tammy Butow & Ana Medina: We both practice Chaos Engineering. We are on the cutting-edge of failure and constantly exploring how Chaos Engineering can be used to make systems more resilient. We also dedicate time to helping the greater community learn more about Chaos Engineering. We’ve practiced CE at Gremlin, Dropbox, Uber, DigitalOcean and the National Australia Bank. We’ve learned a ton over the years and it’s great to be able to give back.
InfoQ: What makes Chaos Engineering important, and why do you believe the topic is receiving more attention now?
Rochefolle: Availability of a service is now as important as security for IT Directors; we must mitigate the impact of critical incidents and outages:
- The average number of critical IT events (CIEs) per European organisation per month is three.
- 65% of European organisations report that a past CIE has led to reputational damage and associated financial losses.
The most recent major example is the Amazon Prime Day crash.
Cockcroft: Disaster recovery has been around for many years, but it's expensive, custom, and fragile, so it's only implemented where it's essential and it's exercised infrequently if at all. Chaos Engineering takes advantage of the APIs and automation now available in cloud native architectures (whether on premises using Kubernetes, or on AWS) to make DR low cost, productized and robust enough to be exercised continuously, to prove that safety margins exist.
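Editor's note: as a rough illustration of the cloud APIs and automation Cockcroft refers to, the sketch below terminates one randomly chosen EC2 instance via the AWS boto3 SDK, Chaos Monkey-style. The tag filter, region and dry-run default are assumptions for the example; a real exercise would add guard rails such as blast-radius limits and an abort switch.

```python
# A minimal, hypothetical instance-termination experiment using the AWS API.
# The "chaos-target" tag, region, and dry_run default are illustrative assumptions.
import random
import boto3
from botocore.exceptions import ClientError

def terminate_random_instance(tag_value, region="us-east-1", dry_run=True):
    """Terminate one running instance tagged chaos-target=<tag_value>."""
    ec2 = boto3.client("ec2", region_name=region)
    # Find running instances that belong to the service under test.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-target", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)  # limit the blast radius to one instance
    try:
        # DryRun=True asks EC2 to validate the call without terminating anything;
        # a successful dry run is reported as a DryRunOperation error.
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise
    return victim
```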
Majors: Because modern systems demand it. There's been an explosion of complexity in the infrastructure space in just the past few years. Modern infrastructure is composed of far-flung distributed systems, loosely coupled with third-party vendors and SaaS or IaaS, plus containers and schedulers, on top of virtualization, with redundancy between regions and sub-second latency guarantees, with a proliferation of languages and environments and orchestration layers and "many" storage systems.
Oh and don't forget microservices! Modern infra is ephemeral and dynamic, with unpredictable user behavior intersecting with unforeseeable events, amidst an ever-growing hunger for both reliability and rich new feature sets. Our systems have begun to function less like predictable metal machines, and more like biological machines with emergent behaviors.
Yet, most of the tools and practices we use for interacting with these systems are a legacy of the LAMP-stack era, when you could look at a dashboard and know how healthy the system was. You could predict how those systems would fail, and monitor for those things. You can no longer do either of those things now, which is why chaos engineering -- and observability, and other disciplines -- are starting to emerge.
They are predicated on the assumption that failure is inevitable, but unknowable; that the best way to understand your systems and interact with them is to embrace that impossibility and build for it. Instead of asking everyone to predict the future and build perfect systems, you aim to manage failures constantly, detect and remediate them quickly, and sometimes inject them under somewhat controlled circumstances to accelerate your learnings. Thus chaos engineering.
The world is moving fast from a world of known-unknowns to one of unknown-unknowns, from "predict failures and monitor for them" to "instrument and explore". The tools and techniques we use to deal with these are fundamentally different.
McBride: At Twitter and Nest, the transition to microservices created an ever-increasing gap between our knowledge of how we thought our applications would fail and how our applications actually failed. Across all teams, I saw markedly different approaches, ranging from creating expensive staging simulations to allowing developers to manage production safely and directly. Tooling up and directly interacting with production was vastly more effective.
Today more than ever, critical high-scale services are run by small teams. Chaos Engineering closes the loop, giving them real-world information about how their services behave in production. Otherwise, teams are just guessing what will break, and the stakes are too high to guess.
Veeraraghavan: In the cloud engineering ecosystem, specifically with a microservices-based architecture, there are vulnerabilities that cannot (or are too expensive to) be discovered by traditional devops and/or quality engineering. Chaos engineering makes finding those hard-to-find cases much simpler. The topic will continue to become more and more mainstream as private cloud deployments become the norm at all companies.
Chen: The increased complexity of the average system definitely contributes to the attention that the movement is getting, but the practice of running disaster drills for training purposes has existed in other industries for years.
Pawlikowski: Chaos Engineering helps fill in the gaps left by other, more conventional ways of testing our software. In particular, it addresses the problems that result from the interaction of various components of a distributed system, each of which is otherwise tested and shown to be working correctly.
I think that Chaos Monkey, published by Netflix, played a role in getting more attention for Chaos Engineering. Other factors include the increasing popularity of the cloud, microservices and large distributed systems, where the failure of various subcomponents is the norm.
Butow & Medina: The last thing any company wants is to have any sort of downtime or failure in their systems. A multi-day outage is unacceptable. With the complexity of distributed systems, cloud infrastructure, serverless, microservices and more, it is harder for us as engineers to know what steady state even looks like. We need to be prepared for failure, and Chaos Engineering enables that.
InfoQ: Should everyone focus on Chaos Engineering, or just the Ops team?
Rochefolle: Having the best customer experience is the goal for everyone, not just the Ops team.
Practicing Chaos Engineering is one way to improve it by limiting the impact when an incident does occur:
- Dev teams use resilience patterns in their applications,
- the Ops team provides and operates a resilient platform, and
- the UX/UI team designs the customer experience to absorb the impact of an incident.
Cockcroft: Availability is a developer concern; Chaos Engineering makes sure developers build resilient systems and learn how to make them stronger. There doesn't even need to be an ops team for Chaos Engineering to be useful.
Majors: What even is an ops team anymore? In modern organizations all engineering teams ship software, everyone owns their service health. Those used to be called "dev" and "ops" but that is increasingly an archaic distinction.
If you're asking if only infra engineering teams should care about chaos engineering, I would answer "do only infra engineers have SLAs or care about quality?" Hopefully not. Every engineering team should care passionately about (and be held accountable for) the health of their service. Chaos eng can help by accelerating the rate at which you discover failure scenarios and compensate for them.
McBride: Everybody should focus on Chaos Engineering. The ops team needs to be the champion, because the key to a successful effort is a platform that allows teams to perform chaos experiments safely. Once that’s in place, though, the service teams are the ones that really find edge cases and rough edges, and they’re in the best place to fix them.
Veeraraghavan: Everyone should focus on chaos engineering, although the area of focus could be different. For example, there are layers of chaos: at the operating system level (CPU, memory), at the network level, and at the application layer. Even product managers who focus on user experience should plan and execute chaos exercises as part of their feature roll-outs.
Chen: Everyone who does development in a complex system should be cognizant of how the system can fail and think about failure modes.
Pawlikowski: Good tooling, which automates the task and can report on problems that are detected in a continuous fashion, can be very helpful in achieving that.
Butow & Medina: We strongly believe everyone should be focused on it. Chaos Engineering can be used across the entire stack, and in the same way that we believe all developers should be on-call for their own code, we think all developers should be building for failure and putting it to the test via chaos engineering experiments. Chaos Engineering is not just about the backend or infrastructure; this is something for all engineers.
InfoQ: What is the difference between Chaos Engineering and Resilience Engineering?
Rochefolle: Resilience Engineering aims to prevent incident impact by building resilient applications and platforms. However, as our systems grow in complexity, it is becoming very hard to anticipate everything.
Chaos Engineering is a good way to provoke the emergence of unknowns and unanticipated failures (dark weaknesses) within a managed time frame, in order to help improve the resilience of the system. Running Chaos Engineering continuously also helps us have confidence that resilience is not degrading.
In a word: Resilience Engineering is DETERMINISTIC whereas Chaos Engineering is NONDETERMINISTIC. We need both to have the best customer experience.
Cockcroft: No effective difference if that's what your team is called. Taleb argues that Anti-fragility is different to resilience because an anti-fragile system becomes stronger when stressed, and a resilient system has fixed characteristics. Chaos engineering implies more anti-fragile characteristics, as the system is stressed to find weaknesses.
Majors: Hell if I know. Is that like black hats and white hats? Pen testing and ... just breaking stuff?
McBride: They’re similar, in that they both encourage developers to accept that failure is inevitable. Resilience Engineering only encourages teams to design for failure. Chaos Engineering takes this a step further and asks teams to test their theories, ultimately resulting in much stronger systems.
Veeraraghavan: I use them interchangeably. However, I find that resilience engineering strikes a more positive note with senior leadership whereas chaos engineering scares them :)
Chen: I'm probably not the right person to arbitrate that distinction. My understanding is that chaos engineering focuses on a fault injection centered approach and resilience engineering is the larger discipline of building more fault tolerant systems.
Pawlikowski: On the other hand, Chaos Engineering is more about injecting failure on purpose to see whether the system as a whole can continue working as expected. This also enables developers to detect bugs they hadn't thought of testing for.
Butow & Medina: Chaos Engineering is a way to achieve resilience engineering. We prefer to focus on antifragile systems, systems that not only fail gracefully but also get stronger when you inject chaos and failure. A system should always be improving.
Editor’s note: John Allspaw is doing a lot of very interesting research and thinking in this space. Please see this greaterthancode podcast for more information and additional references.
InfoQ: Can you recommend your favourite tooling in the Chaos Engineering space?
Rochefolle: We built our own tooling because we are not on Amazon AWS, and most of the tools target that platform.
Cockcroft: I think Gremlin are doing a fine job leading this space; I was an advisor to them before I joined AWS. The chaostoolkit open source project is very nicely documented and a great place for everyone to collaborate on developing the field further. The CNCF Chaos Working Group is also a good starting point.
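Editor's note: for readers curious what a Chaos Toolkit experiment looks like, the sketch below assembles a minimal experiment declaration as a Python dict and writes it out as JSON, which would then typically be executed with `chaos run experiment.json`. The health-check URL, status tolerance, and process name are hypothetical placeholders.

```python
# A minimal Chaos Toolkit-style experiment, assembled in Python and saved as JSON.
# The URL, tolerance, and "my-worker" process name are illustrative assumptions.
import json

experiment = {
    "version": "1.0.0",
    "title": "Service survives the loss of one worker process",
    "description": "Kill a worker and verify the health endpoint keeps answering.",
    "steady-state-hypothesis": {
        "title": "The service is healthy",
        "probes": [
            {
                "type": "probe",
                "name": "health-endpoint-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://my-service.example.com/health",
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "kill-one-worker-process",
            "provider": {
                "type": "process",
                "path": "pkill",
                "arguments": ["-f", "my-worker"],
            },
        }
    ],
    "rollbacks": [],
}

with open("experiment.json", "w") as fh:
    json.dump(experiment, fh, indent=2)
# Run with: chaos run experiment.json
```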
Majors: This is gonna sound selfish, but it is honest and predates honeycomb: your observability tooling. If you don't have the ability to observe each and every failed request, and drill down to isolate the cause of it, and back out to see what else was affected -- you aren't doing engineering, you're guessing. I am a big fan of testing in production, and a big fan of injecting chaos under controlled circumstances. But if you can't observe the consequences with precision -- if all you have are dashboards and aggregates, no unique request identifiers, no ability to break down by high-cardinality dimensions -- then you're not doing "chaos engineering", you're just breaking stuff. Observability is the missing link in most modernizing systems.
McBride: The best companies have a tight feedback loop between failure and fixes. My personal favorite chaos setup pairs Gremlin for injecting failure into systems with Envoy, a service mesh proxy from the CNCF.
Envoy provides a consistent view of how services are performing, Gremlin breaks things, and Envoy provides the tools to mitigate entire classes of failure. It’s an amazingly powerful system, and it doesn’t require re-implementing fixes in different languages or in bespoke ways across multiple services.
Veeraraghavan: We are writing a lot of tools in-house (to be released as open source, eventually). There are lots of tools out there that cover a lot of resiliency needs, but most are still in the evolution phase. As more companies (Netflix excluded) start executing chaos exercises, the tools will mature.
Chen: A junior-level engineer with just enough access permissions to be dangerous.
Pawlikowski: One good resource I recommend is Awesome Chaos Engineering, which lists many related resources and tools, in no particular order.
I am obviously partial to PowerfulSeal, a chaos testing tool for Kubernetes clusters that we wrote and use extensively at the company. We published it as an open source project at KubeCon + CloudNativeCon North America 2017 in Austin last year.
Butow & Medina: The space is still very small and more tools will be developed in the future, but after working on Uber’s own chaos monkey tooling, uDestroy, I (Ana) learned to appreciate what out-of-the-box automated Chaos Engineering software like Gremlin offers. I (Tammy) built my own tooling in the past while at Dropbox to perform Chaos Engineering experiments on the Database Infrastructure. After seeing the Gremlin product I thought it was so much more awesome than anything I had even dreamed of that I asked if I could join the team : ) Now I work at Gremlin; I’ve been here for 10 months and I love it!
InfoQ: If a team wanted to get started with Chaos Engineering, what is the first step they should take?
Rochefolle: Prepare for the cultural change by organising a gameday.
Cockcroft: Install chaos tooling and processes in new environments before any applications are deployed, and make a test environment that tortures applications before they go to the production environment.
Majors: Make sure their instrumentation is sufficiently mature so they can actually explain the unknown-unknown events that happen in their system. Without instrumentation, the war is lost before it's even begun.
McBride: The first step is to simply start doing it. It’s better to break a service intentionally with everybody paying attention than it is to wait until 2am when there’s only one sleepy developer awake.
Chaos Engineering is all about testing your knowledge of the system. As a first step, pick something that you think should be resilient and break it in a way you believe the system will recover from gracefully. If it does, great. And if it doesn't, you've learned something valuable!
Veeraraghavan: Be upfront about what your expectations are, and realize that it is supposed to be a scientific experiment. Just because the word “chaos” is in the name doesn't mean you abandon your responsibility as an engineer. Gremlin has some awesome blog posts that we refer to when planning and executing experiments.
Chen: Most teams would be well served by first examining how far their runbooks and standard recovery procedures have drifted from their actual system state before even getting started with actual fault injection. Get someone without the operational context to follow your documentation and validate that before you start throwing more chaos into the mix.
Pawlikowski: I would start with Awesome Chaos Engineering, read the Principles of Chaos Engineering manifesto, and then poke around to try and break your system using the various tools available. Breaking your carefully designed system is good fun!
Butow & Medina: We believe there are a few important prerequisites for Chaos Engineering. First, you need to make sure you have appropriate monitoring in place. Then you need to be able to determine what your top five critical services are. Next, pick a service, form a hypothesis, and determine a Chaos experiment you want to perform on it.
The goal is to run the experiment in a way that doesn't cause harm to customers; your systems should fail gracefully. Use Chaos Engineering experiments to confirm your hypotheses for failure scenarios. Next, observe and analyse the results, and then increase the blast radius.
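Editor's note: the skeleton below is one way to wire those steps together: confirm the steady state from monitoring, run a small experiment against a single hypothesis, observe the results, and only widen the blast radius if the hypothesis held. Every helper here (get_error_rate, inject_latency, halt_experiment) is a hypothetical placeholder for your own monitoring and fault-injection tooling.

```python
# Hypothetical skeleton of a hypothesis-driven chaos experiment. The three
# helpers below are stand-ins for real monitoring and fault-injection tooling;
# replace them before using this for anything real.
import time

STEADY_STATE_MAX_ERROR_RATE = 0.01  # hypothesis: error rate stays under 1%

def get_error_rate(service: str) -> float:
    """Placeholder: query your monitoring system for the current error rate."""
    return 0.0

def inject_latency(service: str, hosts: int, delay_ms: int) -> None:
    """Placeholder: add network latency on a limited number of hosts."""

def halt_experiment(service: str) -> None:
    """Placeholder: immediately remove all injected faults."""

def run_experiment(service: str, blast_radius_hosts: int = 1, latency_ms: int = 300) -> dict:
    # 1. Confirm the steady state before injecting anything.
    if get_error_rate(service) > STEADY_STATE_MAX_ERROR_RATE:
        raise RuntimeError("system is not healthy; do not start the experiment")

    # 2. Inject a small, well-scoped fault (a single host to start with).
    inject_latency(service, hosts=blast_radius_hosts, delay_ms=latency_ms)
    time.sleep(300)  # give the fault time to show up in the metrics

    # 3. Observe and analyse; abort immediately if customers could be harmed.
    observed = get_error_rate(service)
    halt_experiment(service)
    if observed > STEADY_STATE_MAX_ERROR_RATE:
        return {"hypothesis_held": False, "error_rate": observed}

    # 4. The hypothesis held; the next run can widen the blast radius.
    return {"hypothesis_held": True,
            "error_rate": observed,
            "next_blast_radius": blast_radius_hosts * 2}
```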
About the Panelists
Kriss Rochefolle is Director of Operational Excellence @ OUI.Sncf. He champions the organisations, methods and tools that improve the quality and agility of systems and teams, and is currently exploring and experimenting with Chaos Engineering, all with passion and pleasure!
Adrian Cockcroft is VP of Cloud Architecture @ AWS. Adrian played a crucial role in developing the cloud ecosystem as Cloud Architect at Netflix and later as a Technology Fellow at Battery Ventures. Prior to this, he held positions as Distinguished Engineer at eBay and Sun Microsystems.
Charity Majors is CEO @ Honeycomb. Charity is a former systems engineer and manager at Facebook, Parse, and Linden Lab, and the co-author of Database Reliability Engineering. "With distributed systems, you don't care about the health of the system, you care about the health of the event or the slice."
Mark McBride is founder and CEO of Turbine Labs, makers of Houston, a modern traffic management plane. Prior to Turbine Labs he ran server-side engineering at Nest, and before that he worked at Twitter on migrating their Rails code base to JVM-based equivalents.
Vilas Veeraraghavan is Director of Engineering @ Walmart Labs. Vilas joined Walmart labs in 2017 and leads the teams responsible for the testing and deployment pipelines for eCommerce and Stores. Prior to joining Walmart Labs, he had long stints at Comcast and Netflix where he wore many hats as automation, performance and failure testing lead. He loves breaking things and believes that Chaos engineering is the new normal for testing complex application ecosystems.
Ronnie Chen is an engineering manager at Twitter. Previously, she was a data engineer at Slack and Braintree, and a backend engineer at PayPal. She is a deep sea technical diver and was also the sous chef of a Michelin-starred restaurant in a previous life.
Mikolaj Pawlikowski, a software engineer with Bloomberg, is building a microservices platform based on Kubernetes, and evangelising Cloud Native and Chaos Engineering. He previously built two startups, worked as a freelance consultant, and collaborated on open source projects like https://cozy.io/en/. In his free time, he’s a sports nut and also researches productivity and happiness.
Tammy Butow is a Principal SRE at Gremlin. She previously led SRE teams at Dropbox responsible for Databases and Storage systems used by over 500 million customers. Prior to this Tammy worked at DigitalOcean and one of Australia's largest banks in Security Engineering, Product Engineering, and Infrastructure Engineering.
Ana Medina is currently working as a Chaos Engineer at Gremlin, helping companies avoid outages by running proactive chaos engineering experiments. She last worked at Uber where she was an engineer on the SRE and Infrastructure teams specifically focusing on chaos engineering and cloud computing. Catch her tweeting at @Ana_M_Medina mostly about traveling, diversity in tech, and mental health.