In today’s podcast we sit down with Ryan Kitchens, a senior site reliability engineer and member of the CORE team at Netflix. This team is responsible for the entire lifecycle of incident management at Netflix, from incident response to memorialising an issue.
Key Takeaways
- Top-level metrics can be used as a proxy for user experience, and can be used to determine which issues should be alerted on and investigated. For example, at Netflix, if the customer playback initiation “streams per second” metric declines rapidly, this may be an indication that something has broken.
- Focusing on how things go right can provide valuable insight into the resilience within your system, e.g. what are people doing every day that helps us overcome incidents? Finding sources of resilience is somewhat “the story of the incident you didn’t have”.
- When conducting an incident postmortem, simply reconstructing an incident is often not sufficient to determine what needs to be fixed; there is no single root cause within complex socio-technical systems such as those found at Netflix and most modern web-based organisations. Instead, teams must dig a little deeper, and look for what went well, what contributed to the problem, and where the recurring patterns are.
- Resilience engineering is a multidisciplinary field that was established in the early 2000s, and the associated community that has emerged is both academic and deeply practical. Although much resilience engineering focuses on domains such as aviation, surgery and military agencies, there is much overlap with the domain of software engineering.
- Make sure that support staff within an organisation have a feedback loop into the product team, as these people providing support often know where all of the hidden problems are, the nuances of the systems, and the workarounds.
Subscribe on:
Our discussion begins with an exploration of the role of SRE at Netflix, and continues by examining the related use of service level objectives (SLOs) and service level indicators (SLIs). Key SLIs and business-proxy metrics can be used to trigger alerting on a potential incident. We also cover the post-incident process, which typically includes a postmortem, and Kitchens suggests that failures within complex systems often do not have a root cause; engineers must look for what went right as much as they search for what went wrong.
When working with complex sociotechnical systems, like many of us do within modern software-powered organisations, much can be learned from the fields of cognitive systems engineering and resilience engineering.
Kitchens states that “failure happens all the time” within complex distributed systems, and as this is the new normal, we must develop skills for dealing with this -- and crucially, learning from this -- throughout an organisation. A key lesson shared is that we should ensure that support staff within an organisation have a feedback loop into the product managers, as the people providing frontline support often know where all of the hidden problems are, the nuances of the systems, and the workarounds.
Kitchens also shares his advice on running incident “game days” and the associated topic of chaos engineering, and concludes with an interesting failure story from which we can all learn.
Show Notes
What is your role at Netflix? -
- 02:30 I am on the core team; anything related to incidents, we have our claws in.
- 02:40 In the worst case, we are the first people to get a page when people cannot stream on Netflix.
- 02:50 There's a pretty giant breadth of possibilities in terms of incident response.
- 02:55 We also maintain the whole lifecycle; following the incident, memorialising it - which is post-incident documents - and following up with teams to learn how the incident occurred in the first place.
- 03:20 We then figure out what people do during everyday work that contributes to overcoming incidents or encountering them.
Is there a lot of overlap between what you do and Google's Site Reliability Engineering? -
- 03:45 There's a lot of overlap - it's not a one-to-one mapping with Google's model; SRE has its own incarnation at every company.
- 03:55 Not every team has Service Level Objectives (SLOs), not every team has an error budget; some do, but it's not mandated.
- 04:05 There's a lot of overlap with SRE - we take the job title, because that's the closest thing we have a name for.
- 04:10 There are some things we're doing at Netflix where we're pushing the envelope on what SRE is.
Can you break down SLOs and SLIs? -
- 04:30 They are Service Level Objectives and Service Level Indicators; the ultimate goal is an effort to reduce alert fatigue.
- 04:40 What you want to do is find key performance indicators for a service; if you're taking the SRE approach of SLIs, you can look at latency, response time, and error rates.
- 04:55 The SLOs are the goals/expectations that you set with other teams and would like to hit.
- 05:00 For example, our service will return within 250ms 99.9% of the time for these particular calls (a minimal sketch of checking such an objective follows these notes).
- 05:10 When you bring this into the realm of alerting, these top-level metrics are proxy indicators; if one goes off the rails, we need to look at it.
- 05:20 At Netflix, our greatest example is people pressing the 'play' button to start playing a video.
- 05:25 We have various alerts and secondary metrics, but if that metric deviates we know something is going on and we need to investigate deeper.
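The SLO example above (returning within 250ms for 99.9% of calls) can be made concrete with a small sketch. This is a minimal, hypothetical Python illustration of checking observed latencies against such an objective, not Netflix's actual tooling; the threshold and target values are assumptions taken from the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SLO:
    """A latency SLO: `target_fraction` of calls must complete within `latency_ms`."""
    latency_ms: float = 250.0       # hypothetical threshold from the example above
    target_fraction: float = 0.999  # i.e. 99.9% of calls

def slo_is_met(latencies_ms: List[float], slo: SLO) -> bool:
    """Return True if the observed call latencies satisfy the SLO."""
    if not latencies_ms:
        return True  # no traffic means nothing to violate
    within = sum(1 for latency in latencies_ms if latency <= slo.latency_ms)
    return within / len(latencies_ms) >= slo.target_fraction

# One slow call out of 2,000 still meets a 99.9% objective (1999/2000 = 99.95%).
observed = [120.0] * 1999 + [900.0]
print(slo_is_met(observed, SLO()))  # True
```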
You have a 'streams per second' metric? -
- 05:40 There's a lot of historical context on that one, which has evolved over the years, but we still use this one as the best proxy for user experience for people starting a video.
- 06:00 What becomes nuanced is finding metrics for teams that aren't on the streaming path, or metrics on a more individual basis that aren't at the product level.
- 06:10 If we look at the database teams, that's a much more involved system - you have a lot of different metrics that they could look at.
- 06:20 So the problem is finding out what the equivalent of the streams-per-second metric is for those teams.
- 06:35 Sometimes you have to use proxies; it's a case-by-case decision (a minimal deviation-check sketch follows these notes).
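To illustrate the earlier point about alerting when a top-level proxy metric such as 'streams per second' deviates rapidly, here is a minimal sketch assuming a simple trailing-baseline comparison; the window size and drop threshold are invented for illustration, and this is not the actual Netflix alerting system.

```python
from collections import deque
from statistics import mean

class ProxyMetricAlert:
    """Flag samples that fall sharply below a trailing baseline of recent samples."""

    def __init__(self, window: int = 60, drop_fraction: float = 0.3):
        self.history = deque(maxlen=window)  # e.g. one sample per second
        self.drop_fraction = drop_fraction   # how far below baseline counts as a rapid decline

    def record(self, streams_per_second: float) -> bool:
        """Record a sample; return True if it looks like a decline worth investigating."""
        alert = False
        if len(self.history) == self.history.maxlen:
            baseline = mean(self.history)
            if baseline > 0 and streams_per_second < baseline * (1 - self.drop_fraction):
                alert = True
        self.history.append(streams_per_second)
        return alert

# A sudden drop from ~1000 to 500 streams per second trips the check.
monitor = ProxyMetricAlert()
for sample in [1000.0] * 60 + [500.0]:
    if monitor.record(sample):
        print("Top-level metric deviated - investigate")
```

In practice such a check would need to account for seasonality (time of day, day of week) rather than a flat baseline, which is part of why these alerts require careful tuning.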
Do you think as an industry we focus too much on what went wrong? -
- 06:50 It's interesting; we do - through the reliability lens, we focus too much on failure.
- 07:00 You may find startup stories that focus too much on success and over-sell.
- 07:10 In our space, we focus on incidents and accidents - on how things go wrong all the time.
- 07:15 When we say 'how things go right', what are people doing every day that helps us overcome incidents?
- 07:25 We were talking about finding sources of resilience - the story of the incident that you didn't have.
- 07:30 If we look at post-incident documents/post-mortems, you can ask what went well - it could be literally anything.
- 07:45 Understanding the things you really want to highlight, and the challenges people had to overcome to get past that incident - that's where you start to look for improvements in the way people interact with systems.
- 08:00 Someone could be doing something that works nine times out of ten, but in the one time it didn't, what's going on there?
- 08:05 It's so much harder to discover the things that go right, because we take them for granted - people who are experts in a system hide the expertise and effort that goes into dealing with those problems.
How do you go about figuring this out; interviewing, or watching people? -
- 08:30 Incidents are the Trojan Horse to all of this; you have the interviews (though we don't use the term 'interviews' as it can be intimidating).
- 08:40 You can have a chat with someone, and ask them what went on there, looking for threads that sound novel or interesting, and you can pull on those.
- 08:55 You take that person's perspective, and then you do the same with other people's perspectives, and you try to build up a narrative story of how we got here.
- 09:05 The thing we try to go for is to tell a narrative story, broken up into a few different sections.
- 09:15 First of all, what existed that limited the damage to the incident? - that's our way of looking at what went well.
- 09:20 The other thing is what are the risks to generalise out of it - can we solve something for all of Netflix, instead of just this one team?
- 09:30 Another is highlighting conditions at the start of the incident that allowed it to occur.
- 09:35 What we're trying to do is to move away from causal explanations of incidents.
- 09:40 Just looking at what caused it isn't sufficient - we end up missing a lot of stuff that existed but which didn't necessarily cause the incident.
- 09:50 If we don't find those things, they could be present in the next incident.
- 09:55 If we're just looking at what caused this incident, then we fix that thing and move onto the next incident - all of those things could still be there.
- 10:00 Once you've fixed an incident, by definition your next incident is going to be totally different - no matter what you've done in one incident, it's not going to help you learn from the next one.
- 10:15 You need to dig into the things that persist across multiple incidents - we're looking for the patterns to dive deeper.
I'm assuming that root cause analysis isn't enough these days? -
- 10:40 It's an oversimplification - my point is that if we're looking at a causal chain, ironically root cause is a shallow term.
- 11:00 When people are on-boarding, we teach them how to really get to the bottom of it.
- 11:05 What we have to become comfortable with is that there really isn't a bottom to this - you can go on for ever.
- 11:10 It really depends on the perspective.
- 11:15 If I work on a particular team, I may care about a particular aspect of that incident that is many layers abstracted from what another team cares about.
- 11:20 When you're looking at causes, it's always through the lens of a perspective.
- 11:30 The whole idea of a narrative story that we tell is to bring all of those perspectives together, so that we can find the differences in people's mental models.
- 11:40 Everyone thinks they know how the system works, but they know things in a slightly different way - so these differences in opinions of the way that people think things work are data that we can use to help improve and re-adjust everyone's mental models.
- 11:55 Incidents never stop, which is another thing to become comfortable with.
- 12:00 In terms of management metrics, this becomes a difficult conversation, because a goal of zero incidents is unattainable.
- 12:15 You have to have a mindset that incidents are not preventable; they are inevitable.
How can you get management to care about these things? -
- 12:30 That's the first question that everyone wants to know.
- 12:35 If you don't have metrics, you won't know what you're looking for.
- 12:40 It's really easy if you are in a VP/CTO position, because in a lot of organisations you get to call the shots.
- 12:50 What's unique about Netflix is the freedom and responsibility culture.
- 13:00 There are no mandates - I don't have to do something because someone tells me to.
- 13:05 In light of that, how do you get buy-in? You have to find champions who are interested in this.
- 13:10 If we're talking about what kind of signals to look at for making a case for this, one is that more people will start to attend post-incident review meetings.
- 13:30 The documents you write up will start to get richer and more detailed, and new hires will use them.
- 13:35 These are some of the things you can look for that aren't metrics, but indicators that there is a need for them.
- 13:45 In a lot of ways, it's pivotal to the business' continued success that you learn from incidents - so the more buy-in you can get, the better; it doesn't have to come from the top.
- 13:55 You find that in cultures that are different from Netflix, which have more of a command-and-control style, you have to find a way to either do the work below the radar or start to convince your leadership that there is a different way.
- 14:20 You need to convince them to make space to question the general assumptions that the organisation has.
- 14:30 A lot of people tend towards flying under the radar; that is absolutely a way to do it.
Failure happens all the time. -
- 15:00 The seminal paper "How Complex Systems Fail" [https://www.researchgate.net/publication/228797158_How_complex_systems_fail] by Dr Richard Cook outlines a few key points.
- 15:10 One of the key points is that failure happens all the time; everything is failing in some nuanced way.
- 15:20 These are opportunities - we call them "unplanned investments" - it's really our responsibility to get as much out of those as we can, because they're going to keep happening.
- 15:30 The things that you do to make them better are going to have different failure modes; the world around you keeps changing, and we have to keep up the capability to adapt to it.
You have been creating resilient systems by building capacity for adaptation and reaction to failures. -
- 16:10 John Allspaw has popularised this in software, and a community has grown from this; the Lund University human factors programme in Sweden is an academic path to get there.
- 16:25 The resilient systems community is a bit at odds with academia in some ways.
- 16:30 A lot of people's reaction to hearing about resilience engineering as a multi-disciplinary field that has existed for more than ten years is "oh, they are academics".
- 16:45 What I find, particularly in software, is that a lot of practitioners are going into the resilience engineering community, so it's not academia-driven.
- 16:55 Even folks like David Woods, in the Cognitive Systems Engineering programme at Ohio State University, are deeply involved in academia and still incredibly practical - like his work on Three Mile Island and the Columbia disaster; on-the-ground stuff.
- 17:15 The things that the resilience engineering community is doing are not done under lab conditions.
- 17:30 A lot of the books that are out there - whether it's about aviation or medicine - are incredibly practical; you could replace those things with software, and it would be the same.
Do you have any book recommendations? -
- 17:45 The best introduction is "A Field Guide to Understanding Human Error" [ISBN:978-1472439055]
- 17:50 We've given it to some new hires here - it's like the "Hello World" of the field.
- 18:00 Another one is Etsy's "Debriefing Facilitation Guide" [https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf]
- 18:05 When we're talking about interviewing people, or facilitating post-mortem meetings, there's a lot of good advice in there.
- 18:15 One of the things we find is that experts aren't great at describing their expertise.
- 18:25 This is true at Netflix, or any company that claims to "hire the best" - the better you get, the more you actually need someone doing that questioning work to probe deeper.
- 18:40 It becomes more challenging for experts to do that; in a company that has mainly senior engineers, how do you share knowledge between people?
- 18:50 You have to make an active effort to build a learning organisation.
How do you model that? -
- 19:05 There are different audiences for everything - you can't just have one meeting or a giant incident review with a lot of people.
- 19:10 You have to have smaller working groups, you have to have education classes following incidents.
- 19:15 You have to work with people afterwards, and really understand how people do their jobs.
- 19:25 The responsibility is in amplifying the effort by getting people engaged in reading it and sharing it, so as you start to see these documents referenced in Slack or other documents, you know you're doing a good job.
Do you aspire to automate these responses? -
- 19:50 Some of it is automatable - when people automate things they are looking for time savings, and you're not going to find much of that in this sort of work; there's a lot of high-touch curation.
- 20:10 What it does help with is finding those patterns, visualising those timelines, articulating those things to make data consistent.
- 20:20 You can look at that as one of the metrics; as people start to do this stuff, they will start to create tools to help them.
Charity Majors said "Availability is made up; the nines don't matter unless the users are unhappy" - how does that impact the work you do at Netflix? -
- 20:45 We're not going to chase numbers; what I love about Netflix is that there's almost an aversion to these kind of metrics.
- 20:50 Once you get to a certain scale, when you're looking at a number of nines, the margin of error to make that number accurate is almost bigger than the nines you're trying to report on.
- 21:05 Trying to get that fourth nine - is it actually true or not?
- 21:10 Netflix has so many different devices in the field - TVs, games consoles, mobile devices etc. - those all have a dependency on their ISP.
- 21:20 When you're looking at nines, what side are you looking at - server-side or client-side? How do you aggregate them? (A sketch contrasting the two views follows these notes.)
- 21:30 Availability is a proxy metric - it has many different dimensions, and people care about those dimensions in different ways.
- 21:35 The first thing that I'd say is to bypass this conversation entirely: look at whoever is doing customer support in your organisation, and make sure they have a feedback loop into your product teams.
- 21:55 If you don't do that, you're missing out, because your front-line support team know where all the problems are.
- 22:00 They know about things that don't just impact availability; they know where the annoyances are that they're quietly solving to unblock users - work that no-one else knows they are doing.
- 22:10 It is incredibly important - the idea is to make the product team feel the pain.
- 22:25 That's going to impact your availability so much more than chasing a number and setting a goal.
- 22:35 Telling people to be better isn't going to make them better; you're going to have to build a system that looks at the interactions and the user experience.
- 22:45 What are the interactions people are having with the system that may not have anything to do with availability, but that they care about from a customer perspective?
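As a rough illustration of the point that availability depends on where and how you measure it, the sketch below converts success counts into 'nines' from a server-side and a client-side view; the numbers are entirely made up to show how the two perspectives can disagree once devices and ISPs are involved.

```python
import math

def nines(successes: int, total: int) -> float:
    """Convert a success ratio into a 'number of nines', e.g. 0.999 -> 3.0."""
    if total == 0 or successes >= total:
        return float("inf")
    return -math.log10(1 - successes / total)

# Hypothetical counts for the same hour of traffic.
server_side = nines(successes=9_999_000, total=10_000_000)  # requests the servers saw succeed
client_side = nines(successes=9_950_000, total=10_000_000)  # playback attempts devices reported

print(f"server-side: {server_side:.1f} nines")  # ~4.0 - looks great from the data centre
print(f"client-side: {client_side:.1f} nines")  # ~2.3 - device and ISP failures included
```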
Do you run failure game days? -
- 23:10 Our team runs one internally that helps us practise diagnosing the scenarios we can imagine.
- 23:15 It's sort of an on-boarding exercise (which is also great for people who have been here for a while) to learn what the tools are, or to imagine scenarios of what might happen and role-play what the responses might look like.
- 23:30 A lot of teams do that across Netflix.
- 23:35 As we expand the scope of my team to touch different business teams, we're encouraging them and partnering with them to facilitate game days.
- 23:40 This goes hand-in-hand with setting up chaos engineering experiments as well; you can start with a game day, then determine that a scenario might make a good chaos experiment, and prove that you can handle it (a minimal fault-injection sketch follows these notes).
- 23:55 It both uncovers surprising things in your system that you didn't know about, and it also works towards understanding what the co-ordination costs are in an incident.
- 24:15 It helps to uncover common ground, and the costs of communication in an incident - how are the Slack channels organised, who knows about a particular subject?
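To make the step from game day to chaos experiment above more concrete, here is a minimal, hypothetical fault-injection sketch that verifies a fallback path keeps serving users while a fraction of calls are failed artificially. The function names and failure rate are invented for illustration; Netflix's actual chaos tooling is far more sophisticated.

```python
import random

def fetch_recommendations(user_id: str) -> list:
    """Stand-in for a downstream call that the experiment will disrupt."""
    return [f"title-{i}" for i in range(3)]

def fallback_recommendations(user_id: str) -> list:
    """Degraded but acceptable response when the downstream call fails."""
    return ["popular-title-1", "popular-title-2"]

def with_fault_injection(primary, fallback, failure_rate: float = 0.5):
    """Wrap a call so that a fraction of requests fail artificially, exercising the fallback."""
    def wrapped(*args, **kwargs):
        try:
            if random.random() < failure_rate:
                raise RuntimeError("injected failure")  # simulated dependency outage
            return primary(*args, **kwargs)
        except RuntimeError:
            return fallback(*args, **kwargs)
    return wrapped

# Run the experiment: the hypothesis is that every request still gets some response.
call = with_fault_injection(fetch_recommendations, fallback_recommendations)
results = [call("user-42") for _ in range(1000)]
assert all(results), "a request returned nothing - the fallback hypothesis failed"
print("steady state held under injected failures")
```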
Can you share an interesting game day or failure result? -
- 24:45 One place I worked, we had services that could subscribe to a command-and-control protocol.
- 24:50 We could manipulate the fleet of services by matching against a name of an application and giving it commands to manipulate its functionality at runtime, which already sounds dangerous.
- 25:05 It was used primarily in emergency situations to manipulate the state of an application.
- 25:10 Someone implemented the command 'crash', which effectively restarted processes.
- 25:20 Because it listened to commands over RPC to a side-car process, it worked.
- 25:25 Matching against application names was based on regular expressions, and someone issued 'crash' with an empty query, which you'd think would match nothing.
- 25:40 However, an empty query in this regular expression engine matched everything, and so every service at the company that implemented the basic set of commands, including 'crash', subsequently crashed (a minimal illustration of this behaviour follows these notes).
- 25:45 Every service in the company globally crashed at the same time.
- 25:50 What we discovered were the strange loops of dependencies in what a cold start looks like at the company, where everything has gone away and you have to bootstrap to restart.
- 26:05 This was a good few hours of being down while we scrambled to get everything online, and figured out what to bring back online and in what order to start things up.
- 26:20 That's where you learn from success and not only failure - the scrambling to get everything together; going back and digging through the pieces, how the people involved knew the particulars to get their services up, and where the co-ordination came from.
- 26:50 It was OK that the person who pressed the 'crash' command did so, and we got a ton of information out of it; we were able to improve our service hand-over-hand.
- 27:00 It was a massive outage, but we learned a ton, and in the end we learned a lot that will benefit our customers in future.
- 27:10 I find that a lot of companies promise a failure like that will never happen again, which isn't as good as telling people how much you learnt from it.
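The surprising behaviour in the story above - an empty query matching every application name - is easy to reproduce with most regular expression engines. A tiny, hypothetical sketch:

```python
import re

applications = ["billing", "playback-api", "search", "recommendations"]

def targets(pattern: str) -> list:
    """Return the applications whose names match the given pattern."""
    return [app for app in applications if re.search(pattern, app)]

print(targets("playback"))  # ['playback-api'] - the intended, scoped use
print(targets(""))          # all four applications: an empty pattern matches everything
```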
Would you recommend anything else to allow people to learn more? -
- 27:45 We have started a GitHub repository to collect resilience papers [https://github.com/lorin/resilience-engineering], and several colleagues of mine run resiliencepapers.club [https://resiliencepapers.club], and you can get in touch with me on Twitter @This_Hits_Home - happy to chat with anyone about this.