Nora Jones, a senior software engineer on Netflix's Chaos Team, talks with Wesley Reisz about what chaos engineering means today. She covers what it takes to build a chaos engineering practice, how to establish a strategy, how to define the cost of impact, and key technical considerations when adopting chaos engineering.
Key Takeaways
- Chaos engineering is a discipline where you formulate hypotheses, perform experiments, and evaluate the results afterwards.
- Injecting a bit of failure over time is going to make your system more resilient in the end.
- Start with Tier 2 or non-critical services first, and build up success stories to grow chaos further.
- As systems become more and more distributed, the need for chaos engineering grows.
- If you're running your first experiment, get your service owners in a war room and get them to monitor the results of the test as it is running.
Show Notes
How did you get started in chaos engineering?
- 01:50 I started at a home security and home automation company, where my role was focussed on creating different scenarios of sensors being tripped and needing to set off an alarm.
- 02:20 We didn't call it chaos engineering at that time, but it was a similar approach.
- 02:30 I was creating different failure scenarios on my desk - I had a light bulb on a timer set to trigger a camera, and tests to see whether the alarms were successfully triggered.
What did you do at Jet?
- 03:05 I was the first person hired to do developer productivity at Jet, when it was in startup phase.
- 03:20 As a startup, the site was going down periodically, and so there was a focus on how to improve the reliability of the site and reduce overnight call-outs.
- 03:30 This led to chaos engineering as an approach to add stability to the site.
Why is chaos engineering not called chaos testing?
- 04:00 Chaos engineering is the discipline as a whole, but when you're creating individual tests or experiments they aren't testing something in a binary success/fail mode.
- 04:20 Chaos experimentation is a new space; you have a formal method to generate new knowledge, you formulate a hypothesis, you perform an experiment and see what the result is.
- 04:30 If the hypothesis is not supported then it's a form of exploration.
Where did the rigour come from?
- 05:00 It came from experiences both from myself at Jet and existing practices at Netflix.
- 05:10 Chaos Monkey existed and tested resiliency, and it found a lot of issues; there was also Latency Monkey, which would increase the latency of operations.
- 05:20 Latency Monkey also found a bunch of things, but because there weren't hypotheses around it, there wasn't a scope to it, and there weren't safety mechanisms, it could cause more harm than good.
- 05:35 So chaos engineering has a bunch of benefits as a whole, but if you don't scope it out properly and have a strategy then it could cause more harm than good.
- 05:50 We felt it would help if we defined a process based on our experiences so that others who are just getting started with it can start off on the right foot.
Can you give me an example of what would make a good hypothesis?
- 06:00 If you think of how you create a scientific experiment, with a hypothesis, then that's what we're trying to do with chaos engineering.
- 06:10 We try to formulate it as “If X, then Y happens” and list it up front in the experiments that we are running.
- 06:20 When we were sitting down with application owners and creating these experiments, we asked them what they thought should happen in those circumstances.
- 06:35 If the service isn't expected to be resilient under those conditions, you have to be careful: run that test when the engineers are available or on call, and verify that the blast radius is suitably minimised as a safety mechanism.
- 06:50 One approach is to introduce latency into those services, instead of failing them.
- 06:55 So we can create a hypothesis like “If we fail a call from service A to service B, then we expect the fallback path to kick in” (see the sketch after this list).
- 07:05 So the interesting thing is: what happens if the fallback path doesn't kick in?
- 07:10 Another interesting hypothesis is: “If service C, which (as far as we know) isn't a critical service, is making calls to service D, does it affect our core business?”
- 07:30 We like to call that a ‘surprise’ hypothesis, where you discover that a previously assumed non-critical service is in fact critical.
- 07:45 Other companies have hundreds of micro-services categorised as ‘Tier 1’ and ‘Tier 2’, but how do you know that? Has it been empirically tested, or is it a gut feeling?
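To make the “If X, then Y” framing concrete, here is a minimal sketch of how such hypotheses might be written down as structured experiment definitions before running anything. All class, field, and service names here are hypothetical illustrations, not part of Netflix's FIT or ChAP tooling.

```python
# A minimal sketch of recording chaos hypotheses as data, so each experiment
# states "if X, then Y" and its safety net up front. Names are hypothetical.
from dataclasses import dataclass

@dataclass
class ChaosHypothesis:
    caller: str           # service whose outbound call we disturb
    callee: str           # dependency being failed or delayed
    treatment: str        # "fail" or "latency"
    expectation: str      # what we believe should happen
    abort_condition: str  # safety net: when to stop the experiment

experiments = [
    ChaosHypothesis(
        caller="service-A", callee="service-B", treatment="fail",
        expectation="the fallback path serves the response",
        abort_condition="error rate on service-A exceeds 1%"),
    ChaosHypothesis(
        caller="service-C", callee="service-D", treatment="fail",
        expectation="core business metrics are unaffected (C assumed non-critical)",
        abort_condition="key business metric deviates from baseline"),
]

for h in experiments:
    print(f"If we {h.treatment} calls from {h.caller} to {h.callee}, "
          f"then we expect: {h.expectation} (abort if: {h.abort_condition})")
```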
How do you define a safety net when getting started with chaos engineering?
- 08:20 As the chaos engineer, you have to keep your business goals in mind.
- 08:35 As you're kicking off the experiments, you have to ask what the business goals are, what the KPIs are, what is monitoring them, and what the cost of impact is if those KPIs are affected.
- 08:50 When you're at a home security company, the impact might be that someone can't lock their door - which is a high cost in terms of safety.
- 09:10 If you're an e-commerce company that's just getting started, and new customers see failures when trying to add their card, you might lose those customers permanently.
- 09:15 So you have to work out the cost of acquiring the customer and take that into your planning for failure.
- 09:20 At a streaming company, like Netflix, a lot of the users are existing customers, who would be unlikely to stop watching or using the service, so there's a different kind of customer mentality there.
- 09:30 If there are already problems in your existing site that are impacting customers at the moment, that would be where you want to start chaos engineering, by recreating those problems in order to understand how to mitigate or resolve issues.
Is using chaos engineering in testing or staging a good first step?
- 10:00 It's a great first starting step, which was one of the things I first did at Jet.
- 10:05 One of the first chaos experiments we ran took QA down for a week; it was great we didn't take production down for that long.
- 10:15 It did uncover some things that weren't known, which is a good way to find out and fix them.
- 10:20 However, with the advent of micro-services, it may not be possible to easily separate out or mirror QA and production environments.
- 10:35 You don't get exactly the same level and type of traffic patterns in QA that you do in production though.
- 10:50 You can uncover some issues, and it's a good starting point while you're trying to manage the chaos engineering perspective as well as the company's perspective.
- 11:05 Once that's been done, it is important to perform chaos testing in production as well.
- 11:10 Charity Majors recently wrote an article explaining why it's important to test in production - I think it highlights a lot of these points.
What are the most common arguments you hear about moving tests to a production environment?
- 11:25 I hear a lot about the “It won't happen to me” fallacy.
- 11:30 There are a lot of companies doing chaos engineering - banks (ING, Capital One) and organisations working on self-driving cars (PolySync, JPL).
- 12:00 The common fear is if your KPIs get hit in a meaningful way then it impacts the business goal in terms of costs or safety.
- 12:10 There's an implicit fear from businesses that chaos engineering causes more harm than necessary.
- 12:15 With the practice of chaos engineering, causing little bits of harm over time reduces the chance of much larger harm in the future.
- 12:20 We'll spend less time with incident management or recovery in the future.
- 12:30 Injecting a bit of failure over time is going to make your system more resilient in the end.
- 12:35 Having the culture shift is the hardest part of getting started.
How does chaos engineering fit into the larger SRE story?
- 13:15 I think there's a place within site reliability engineering (SRE) for chaos engineering, but I don't think in general people are placing an importance on chaos engineering specifically within SRE.
- 13:30 As systems become more and more distributed, the need for chaos engineering grows.
- 13:40 You can't hold all the interactions between moving components in your head, and you can sleep better at night as an SRE if you've used chaos engineering.
What is a good first test that you might use when getting started?
- 14:15 kill -9 is good to run for stateful applications - you'll be able to immediately tell what the impact is.
- 14:25 I know that AWS and VMware both implement kill -9 on the back end.
- 14:30 The problem is that it's harsh, and the cleanup on the backend may be complicated, so it's important to know what you're testing and have a strategy around it.
- 14:40 I wouldn't build my whole chaos programme around that, but I do think it's a good starting point, along with terminating instances.
- 14:50 Chaos Monkey has a bit of randomness about it, but being strategic about that randomness, letting people know what might be terminated, and allowing opt-outs are all important (a minimal process-termination sketch follows this list).
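As a rough illustration of the kill -9 style of first experiment, here is a minimal sketch that sends SIGKILL to a randomly chosen instance of a target process, with an explicit opt-out list. The service names and the opt-out set are invented for illustration; this should only ever run against instances you own and have announced.

```python
# A minimal sketch of a "kill -9" style experiment against a stateful process,
# with an explicit opt-out list. Process names are hypothetical.
import os
import random
import signal
import subprocess

OPT_OUT = {"payments-worker"}          # services that have opted out of chaos

def candidate_pids(process_name: str) -> list[int]:
    """Find PIDs whose command line matches the target service."""
    try:
        out = subprocess.check_output(["pgrep", "-f", process_name], text=True)
    except subprocess.CalledProcessError:
        return []                      # pgrep exits non-zero when nothing matches
    return [int(line) for line in out.split()]

def kill_random_instance(process_name: str) -> None:
    if process_name in OPT_OUT:
        print(f"{process_name} has opted out; skipping")
        return
    pids = candidate_pids(process_name)
    if not pids:
        print(f"no running instance of {process_name} found")
        return
    victim = random.choice(pids)
    print(f"sending SIGKILL to {process_name} pid {victim}")
    os.kill(victim, signal.SIGKILL)    # ungraceful stop: no cleanup handlers run

if __name__ == "__main__":
    kill_random_instance("cart-service")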
What kind of questions would you have and how would you start chaos engineering for a simple shopping cart application?
- 15:35 I would start by asking where the calls to the shopping cart come from, and injecting failure between that and its immediate child calls.
- 15:50 As a chaos engineer I think of injecting latency or failure between calls, such as throwing a random exception.
- 16:30 You can have a library called at the call site which randomly throws an error (a minimal sketch of this follows below).
- 16:45 The Chaos Automation Platform (ChAP) sits on top of the Failure Injection Testing (FIT) library, so some code will need to be added.
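Here is a minimal sketch of call-site failure injection, standing in for what a FIT-style library does: wrap an outbound call and, for a small fraction of requests, raise an error or add latency so the fallback path gets exercised. The decorator, exception, and service names are hypothetical; this is not the FIT API.

```python
# A minimal sketch of call-site failure injection. Names are hypothetical.
import functools
import random
import time

class InjectedFailure(Exception):
    """Raised deliberately by the chaos wrapper, not by the real dependency."""

def inject_failure(failure_rate=0.05, extra_latency_s=0.0):
    def decorator(call):
        @functools.wraps(call)
        def wrapper(*args, **kwargs):
            if extra_latency_s:
                time.sleep(extra_latency_s)              # latency treatment
            if random.random() < failure_rate:
                raise InjectedFailure(f"chaos: failing {call.__name__}")
            return call(*args, **kwargs)
        return wrapper
    return decorator

@inject_failure(failure_rate=0.1)
def get_cart(user_id):
    return {"user": user_id, "items": ["book"]}          # pretend remote call

def get_cart_with_fallback(user_id):
    try:
        return get_cart(user_id)
    except InjectedFailure:
        return {"user": user_id, "items": []}            # fallback: empty cart

print(get_cart_with_fallback("u123"))
```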
What is the chaos maturity model?
- 17:30 Netflix had a big success with Chaos Monkey and Chaos Kong, and those tools allowed them to get sophisticated in terms of the technology.
- 17:55 We started thinking about new tools like ChAP and FIT, which decrease the overhead for engineers to adopt chaos.
- 18:15 If the business metric you're looking at - at Netflix we use Starts Per Second (SPS) - deviates too far from what's expected, we stop the experiment.
- 18:40 That is enough to give a signal, but it stops before customers get unhappy and alerts fire.
- 19:00 With the current state of ChAP, you have to decide what you're starting and what you're failing.
- 19:10 If you're failing calls from Service A to Service B, or making them more latent, you have to specify that yourself.
- 19:20 We're focusing on making this more automated - it will find all the services that you are calling and create the experiments on its own (a rough sketch of the idea follows this list).
- 19:40 As the user, you can approve or reject those experiments from running.
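The automation described above, enumerating a service's downstream dependencies and proposing experiments for an owner to approve or reject, might look roughly like this sketch. The dependency data and names are invented for illustration; ChAP's real implementation is not shown here.

```python
# A rough sketch: enumerate a service's downstream dependencies and propose
# fail/latency experiments for each, pending owner approval. Names are invented.
downstream = {
    "api-gateway": ["cart-service", "recommendations", "user-profile"],
}

def propose_experiments(service: str) -> list[dict]:
    proposals = []
    for dep in downstream.get(service, []):
        for treatment in ("fail", "latency"):
            proposals.append({
                "caller": service,
                "callee": dep,
                "treatment": treatment,
                "status": "pending-approval",   # owner approves or rejects
            })
    return proposals

for p in propose_experiments("api-gateway"):
    print(p)
```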
What are the ideas behind ChAP that others might be able to adopt?
- 20:00 The most important idea is to minimise blast radius, minimising customer pain - focussing on safety to an extent.
- 20:10 Once you're trying to get adoption, teams are more likely to use chaos if they know that it's not going to run rampant or cause problems.
- 20:15 Importantly, it can stop itself if it sees a deviation from the customer experience that goes outside the desired boundaries.
- 20:25 The way it works is that it uses canaries: it performs an experiment alongside a control cluster, injects failure or latency into the experiment cluster, and watches where the two deviate (see the sketch below).
- 20:45 There may be different ways of achieving the same thing in different organisations who have operations structured differently.
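The control-versus-experiment comparison with an automatic stop might look like this minimal sketch: sample a key business metric (such as starts per second) from both clusters and abort if the experiment cluster deviates beyond a tolerance. The metric source is faked here; a real system would query its monitoring stack, and the threshold is an assumption.

```python
# A minimal sketch of comparing a business metric between a control cluster
# and an experiment cluster, aborting the experiment on excessive deviation.
import random

def read_metric(cluster: str) -> float:
    # Stand-in for a monitoring query; returns a noisy starts-per-second value.
    base = 100.0
    return base + random.uniform(-2, 2) - (5 if cluster == "experiment" else 0)

def should_abort(tolerance_pct: float = 3.0) -> bool:
    control = read_metric("control")
    experiment = read_metric("experiment")
    deviation_pct = abs(experiment - control) / control * 100
    print(f"control={control:.1f} experiment={experiment:.1f} "
          f"deviation={deviation_pct:.1f}%")
    return deviation_pct > tolerance_pct

if should_abort():
    print("metric deviated beyond tolerance: stopping the experiment")
```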
Where do you start injecting failure?
- 21:00 The shopping cart example earlier sounds like what I would call a Tier 1 or critical service - if it goes down, it may result in lost dollars for the business.
- 21:15 Instead of starting with the Tier 1 services, start with the Tier 2 services and see what the fallbacks to the shopping carts are.
- 21:20 If the shopping cart goes down when they fail, then maybe they are actually Tier 1 services and we didn't realise it.
- 21:35 I would start by asking what the biggest impact is, where the issues are happening at the moment, and try to dig in deeper there.
- 21:45 Injecting failure and latency between calls is a good thing to build your chaos programme around.
- 21:50 Terminating instances to see if you're resilient to that is a great first step.
How did you win the argument to add more code to services to inject failure or errors?
- 22:35 At Jet, we were doing graceful restarts at first, along with ungraceful restarts, so it wasn't something that engineers needed to get involved in (which I think is important).
- 22:45 The most they needed to do was recover where necessary and clean up afterwards, or choose to opt out of the tests.
The business owners can't have been easy to convince though?
- 23:00 There were times where I had business owners or service owners coming up and asking me to turn off chaos because they had a big release that day.
- 23:15 That's why I needed to start conversations about chaos engineering - it was being seen as a blocker or a hindrance, rather than an important part of the development process.
- 23:25 If your services aren't resilient to a single instance failure, then how are they going to be resilient to many more types of failure that may occur in production?
- 23:35 After that was accepted (after a few months) we didn't have anyone opt out any more.
- 23:50 We then focussed on targeted chaos - at the time Jet relied heavily on Kafka, which was also used to do regional failover.
- 24:05 We realised the potential for things going bad if the Kafka failover had problems.
- 24:15 We asked what steady state we wanted to maintain under these Kafka failures, and what we wanted to happen if that steady state got out of hand.
- 24:30 We looked at what would happen if the offsets got out of sync, if we dropped random packets, if we injected data, and so on (a minimal packet-loss sketch follows this list).
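For the "drop random packets" style of experiment, one common approach (not specific to Jet's tooling) is Linux tc/netem. Here is a minimal sketch; the interface name and loss percentage are assumptions, it needs root, and the cleanup step should always run.

```python
# A minimal sketch of injecting random packet loss with tc/netem via subprocess.
# Interface name and loss percentage are assumptions; run only on in-scope hosts.
import subprocess
import time

IFACE = "eth0"   # hypothetical network interface on the broker host

def start_packet_loss(loss_pct: int) -> None:
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "loss", f"{loss_pct}%"],
        check=True)

def stop_packet_loss() -> None:
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"],
                   check=True)

if __name__ == "__main__":
    start_packet_loss(5)        # drop roughly 5% of packets on this interface
    try:
        time.sleep(60)          # observe consumer lag / failover behaviour
    finally:
        stop_packet_loss()      # always restore normal networking
```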
What's happening in chaos engineering?
- 25:15 The chaos community meeting is in its third year; it's a community-driven event in San Francisco on September 8, and there's another in Minneapolis in November.
- 26:10 The state of the art is moving towards automation, and the meetings are becoming a regular thing.
- 26:20 It can't scale if chaos engineering needs an expert on site.
- 26:30 In the beginning, it's helpful having someone there and getting things started, but once buy-in has happened then making it automated is the way to go.
How do you automate scenarios?
- 27:05 At Jet, chaos engineering was one of six things I was doing, so I didn't have my full attention on it.
- 27:15 I think it's important to monitor both the culture (how the tools are being adopted, how they are being changed) and the business metrics.
- 27:30 Recording the success stories and monitoring them appropriately will allow you to get more success out of it, if it's in the company's best interest to continue down that path.
How do you notify the organisation for upcoming tests?
- 28:30 Some people have a lot of buy-in for chaos engineering, but other organisations need to be encouraged to get involved.
- 28:45 If you're starting from scratch, over-communicate. You don't know the outcome of the hypothesis under test before you run it, and if things go wrong and you haven't told anyone, the chances of being able to do this again are very low.
- 29:10 If you're running your first experiment, get your service owners in a war room and get them to monitor the results of the test as it is running.
Mentioned
Tools
Chaos Monkey
Resources
Charity Majors: "Testing in production: Yes, you can (and should)"
Ali Basiri on the Chaos Automation Platform
Netflix Articles: Chaos Kong, ChAP: Chaos Automation Platform, FIT
Book: Chaos Engineering
Netflix content on InfoQ