Rethinking Reliability: What You Can (and Can't) Learn from Incidents


Summary

Courtney Nash discusses research collected from the VOID, challenging standard industry practices for incident response and analysis, like tracking MTTR and using RCA methodology.

Bio

Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. She has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Nash: I'm Courtney Nash. I am here to talk to you about rethinking reliability, what we can and can't learn from incident metrics. I'm an Incident Internet Librarian at Verica. I'm a researcher with a long background at a bunch of different places. I used to study the brain. I think mountain bikes are the coolest technology we've ever invented.

The Verica Open Incident Database (VOID)

I'm here to talk to you about this thing I made called the VOID. The Verica Open Incident Database is a place where public software-related incident reports are collected and made available to everyone and anyone. Our goal is to raise awareness and increase understanding of software-based failures in order to make the internet a more resilient and safe place. Why do we care about that? Because software has long since moved beyond hosting pictures of cats online to running transportation and infrastructure and hardware in healthcare systems, and devices in voting systems and autonomous vehicles. These modern online systems are expected to run 24 hours a day, 7 days a week, 365 days a year. Those increased pressures that you all deal with, combined with software models of interrelated, increasingly automated services that run in the cloud, have accelerated the complexity of these systems. As you already probably know from direct experience, when those complex systems fail, they fail in unexpected and chaotic ways. We all have incidents. Yes, that is a dumpster fire with a dragon lighting a volcano on fire. I think what you face is probably more like Calvin and Hobbes, where there's a monster under the bed, and you're never sure when it's going to come out.

The really important point is that the tech industry has an immense body of commoditized knowledge that we could share in order to learn from each other and push software resilience and safety forward. If you're at all skeptical about that, I get it, you might be. There's historical precedent for this. It's not our industry, it's a different industry. In the 1990s, in the United States, our aviation industry was in a bit of a crisis: we had a horrible safety record. Significant, high-consequence accidents were happening on a regular basis. The industry collectively, and from the ground up, decided to get together and try to do something about this. First, pilots from a number of different airlines got together and started sharing their incident data. They started sharing their stories and the patterns of what they were seeing. Eventually, more of the industry got on board: the regulatory bodies, the air traffic controller folks, a huge number of people got involved to share their incidents and find commonalities and patterns. Over the course of doing that, and obviously other activities, the safety record of our airline industry went way up. In fact, we didn't have a significant incident until some of the Boeing MAX stuff of recent years happened. It's possible to do it from the ground up as practitioners, before the regulatory people even show up. That's important.

What's in the VOID?

We're trying to collect those kinds of incidents together. There hasn't been any place in the past where we've had all of these, what you might call availability incidents. There's a number of people who've walked this ground before me. I'm not claiming that this was my idea. I've just been fortunate enough maybe to get more of them together in one place. At this point, we have almost 10,000 public incident reports from over 600 organizations, spanning from about 2008 up until basically now in a variety of formats. This is important. We're collecting them across social media posts, and status pages, and blog posts, and conference talks, and tweets, and full-blown comprehensive post-incident reviews, where you all sit down and write up a lot more information about these things. We want all of those because we want a really comprehensive picture of not just how you see these things, but how other people see them. Things like media articles are included in there as well.

Duration: Gray Data

What's in there is a bunch of metadata that we've collected: things like the organization, the date of the incident, the date of the report, and a bunch of other stuff. The thing we're mostly going to focus on for this talk is duration, and its relationship to a metric that is very influential in our industry, which is mean time to respond, or MTTR. First, let's dig into duration. Duration is what I like to call gray data. It's high in variability, low in fidelity. It's fuzzy: when does it start, when does it end? How are those things decided and recorded? It's sometimes automated, often not. It's sometimes updated, sometimes not. It's ultimately a lagging indicator of what's happened in your systems. It's inherently subjective. We decide these things. We think they're spit out of our systems, but that's not really entirely true. The reality is, when you take all of these gray areas, treat them as objective, average them, and think that's some key indicator of performance, you get one big gray blob. You're not able to rely on that.

I want to show you a little bit of this gray data in action. These are mean times to respond that we have collected based on almost 7,000 incidents' worth of duration data from the VOID. It's really tempting to look at these and try to use them as some indicator: maybe Cloudflare and New Relic are just better at incident response, or maybe their systems are less complex, or they get lucky more often. No, these data are not indicative of any of those things. As you can see, there's a huge amount of variability, even within a single company over time, depending on the window in which you try to calculate something like a mean over these data. The problem with trying to use numbers like these, derived from the gray data of duration, is exacerbated by the inherent distribution of those duration data.
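To make that windowing problem concrete, here is a minimal sketch, using simulated rather than VOID data, that generates a year of lognormal-ish incident durations and computes an "MTTR" over monthly versus quarterly windows. Every number and name in it is illustrative.

```python
# Hypothetical sketch (simulated data, not the VOID): how the "same" MTTR
# swings with the window you average over. Durations are drawn from a
# lognormal, roughly the skewed shape real incident data tend to have.
import numpy as np

rng = np.random.default_rng(42)

# Simulate a year of incidents: ~3 per week, durations in minutes (long right tail).
n_incidents = 150
start_days = np.sort(rng.uniform(0, 365, n_incidents))
durations_min = rng.lognormal(mean=3.5, sigma=1.2, size=n_incidents)

# Compute "MTTR" over monthly vs. quarterly windows.
for window_days, label in [(30, "monthly"), (90, "quarterly")]:
    window_means = []
    for window_start in np.arange(0, 365, window_days):
        mask = (start_days >= window_start) & (start_days < window_start + window_days)
        if mask.any():
            window_means.append(durations_min[mask].mean())
    print(f"{label:9s} MTTR estimates (min): "
          f"min={min(window_means):6.1f}  max={max(window_means):6.1f}")

# The same underlying process produces very different "MTTR" numbers depending
# on the window, even though nothing about the system's reliability changed.
```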

The Distribution Matters

I don't know how many of you have actually looked at the distribution, charted a histogram of your incident duration data, if you collect them. Let's back up a little bit. This is a normal distribution, something you're very used to seeing. It's your classic bell curve. There's a really clear middle. There are similar curves off both sides. You can nail the mean in the middle. You can get standard deviations. You can do all kinds of super fancy statistical things with this. That's not what your data look like, based on almost 10,000 incidents that we have in the VOID. This is what your incident data actually look like. These are histograms of the distribution of duration data across a variety of companies that have enough data to really do this, at least hundreds of incidents. They're skewed, you might notice. They are high on the left side of the graph, and then they drop down and they have a nice long tail there. The problem with skewed distributions like this is they can't be described well via central tendency metrics like the mean. Some of you are going to start thinking, we'll use the median, and we'll get there. Even then, with this much variability in your data, detecting differences is very difficult. That's what we're trying to do a lot of the time. We're trying to say, our MTTR got better, our MTTR got worse, what happened? It turns out, this just doesn't have meaning.
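As a small illustration of why that skew defeats central tendency metrics, here is a sketch using simulated lognormal durations, assumed here only as a stand-in for the shape described above, comparing the mean and the median.

```python
# Hypothetical sketch: what a long right tail does to central-tendency metrics.
# These are simulated durations, not VOID data.
import numpy as np

rng = np.random.default_rng(7)
durations_min = rng.lognormal(mean=3.5, sigma=1.2, size=5000)

mean = durations_min.mean()
median = np.median(durations_min)
share_below_mean = (durations_min < mean).mean()

print(f"mean   = {mean:7.1f} min")
print(f"median = {median:7.1f} min")
print(f"{share_below_mean:.0%} of incidents are shorter than the 'mean' duration")

# With a skewed distribution, the mean sits well above the typical incident,
# so changes in "MTTR" mostly track a handful of outliers in the tail.
```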

We looked at a report from an engineer at Google last year, who had a lot of data that looked very similar to ours. He ran a bunch of Monte Carlo simulations in that O'Reilly report I showed there. He found that when he simulated a decrease in MTTR across a big chunk of Monte Carlo simulations, there was so much variability in the data that the change was essentially undetectable. You're subtracting 10% off your durations, you're decreasing your MTTR. You run a bunch of simulations of the original data against those shorter durations, and you can't actually detect a difference. I'll show you a graph of this. What he found, and what we also replicated when we did the same thing with more companies across even more incidents, was that even when you introduce improvements, almost a third of the time the detected change in MTTR was negative. It's a little counterintuitive, but let's just say that means that things got worse. This is what those data look like. You subtract the original data from the changed data. A positive means that things have gotten better. These are our results replicating Davidovic's Monte Carlo simulations. Anything on the right side means it got better, and anything on the left side means it got worse, that your MTTR actually increased. What you'll see is that for a big chunk of most of those curves, there are places where you appear to have gotten even better than 10%, and places where you appear to have gotten even worse than 10%. That's really what's important. There's too much variance to rely on the mean. Those of you who've paid attention to distributions and know a bit about these things, or a lot about these things, even, might say, we could use the median. We ran these simulations across the median data, and the results were incredibly similar. There's still just too much variability and not enough sample size across most of these data to actually detect anything.
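Here is a rough sketch of that kind of Monte Carlo exercise, in the spirit of Davidovic's analysis rather than a reproduction of it: the durations are simulated, the 10% improvement is applied artificially, and the question is simply whether a real improvement is even detectable under this much variance.

```python
# Rough sketch of the Monte Carlo exercise described above (simulated data).
# We apply a genuine 10% reduction to every duration, then ask how often a
# resampled "after" dataset still looks WORSE than a resampled "before" one.
import numpy as np

rng = np.random.default_rng(0)
original = rng.lognormal(mean=3.5, sigma=1.2, size=200)  # one org's incident durations (min)
improved = original * 0.9                                # a true 10% reduction in every duration

n_sims = 10_000
looked_worse = 0
for _ in range(n_sims):
    # Resample each dataset, as if observing a different "year" of incidents.
    sample_orig = rng.choice(original, size=original.size, replace=True)
    sample_impr = rng.choice(improved, size=improved.size, replace=True)
    detected = sample_orig.mean() - sample_impr.mean()   # positive => improvement detected
    if detected < 0:
        looked_worse += 1

print(f"Simulations where MTTR appeared to get worse despite a real 10% improvement: "
      f"{looked_worse / n_sims:.0%}")
```

With heavy-tailed durations and only a couple hundred incidents per group, runs of this sketch land roughly in the neighborhood of the "almost a third of the time" result described above.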

Sample Size Matters

Let's talk about sample size. That company that had the big red one, they have the most incidents out of anybody, in all of the data that we had. They have enough data that you can start to maybe get some fidelity from things like mean and running these kinds of Monte Carlo simulations, and all of those types of things. That really means that the only way to get better fidelity of your incident data is to have more incidents. That's really not what any of us wants. You have to get into the upper hundreds, almost thousands, to start seeing tighter curves, and being able to detect differences amongst things. Nobody wants that. We don't want to have more incidents.

Summary of Duration, MTTR, and Severity

Just to sum up a bit on this, MTTR is not an indicator of the reliability of your system. I think this is a really common myth in our industry, that it tells you how reliable you are. There's nothing reliable about those data that you just saw. You don't want to use an unreliable metric to talk about the reliability or the resilience or any of those words of your systems. It can't tell you how agile or effective your team or organization is, because you can't predict these things. You definitely can't predict the ones that are surprising, and they're all surprising. The MTTR won't tell you if you're getting better at responding to incidents, whether the next one will be longer or shorter, or even how bad any given incident is. I want to spend a little bit of time on this, just a slight diversion, because most people harbor some suspicion, like the longer ones are worse, or the longer ones aren't as bad because we throw everyone at the bad ones, and those are done sooner. The longer ones are like, whatever, we can just let those run. Our data don't show that either. We had something like 7,000 incidents where we could collect the duration and the severity; these are mostly status pages from those companies. What you see in this graph is what it really looks like for most people. You have long ones that are more severe. They're in the yellow and red. They're the 1s and 2s. You have short ones that are 1s and 2s. You have long ones that are 3s and 4s; they're green and blue, they're not so bad. The look of this chart is what we found statistically. We ran a Spearman rank correlation, for anyone who really wants to get in the weeds, across severity and duration, and only two of the companies we looked at showed what anyone would consider to be super weak correlations between duration and severity: R values of -0.18 and -0.17, which did have significant p values. In correlational analyses, anything up to the 0.2 or 0.3 range is a very small effect. You might see a little something, but you also have to have a lot of data in order for an effect like that to show up.
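For anyone who wants to run the same check on their own incidents, this is a minimal sketch of a Spearman rank correlation between severity and duration. The data here are fabricated, so the output is illustrative only.

```python
# Hypothetical sketch of the severity-vs-duration check: a Spearman rank
# correlation between ordinal severity (1 = worst) and duration. Fabricated data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
durations_min = rng.lognormal(mean=3.5, sigma=1.2, size=500)
severities = rng.integers(1, 5, size=500)  # SEV1..SEV4, assigned independently of duration here

rho, p_value = spearmanr(severities, durations_min)
print(f"Spearman rho = {rho:+.2f}, p = {p_value:.3f}")

# R values like the -0.18 and -0.17 mentioned above are still weak effects:
# even with a significant p value, severity says very little about duration.
```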

Systems are Sociotechnical

These things, duration, MTTR, severity, they're what John Allspaw and many of us now come to refer to as shallow metrics, because they obfuscate the messy details that are really informative about incidents. I wanted to just give one metaphor about that, which is probably pretty relevant to any of us who live in the western United States right now. Your systems are inherently complex, and trying to measure incident response for those systems by things like duration or MTTR is like assessing how teams are fighting wildfires in the western U.S. by counting the number of fires, or how long it takes them to put them out, versus understanding what's happening for the teams on the ground in their reality, and each one is so remarkably different. That's not the answer you want. I understand. If it's not MTTR, don't panic. If it's not MTTR, then what is it? The answer to this, from my perspective is incident analysis. You're going to get really mad at me and you're going to scream, that's not a metric. No, it's not. It is the best thing to tell you about the resilience and the reliability of your systems.

What do those kinds of things look like? What kinds of data do you get out of incident analyses? I know we think data is numbers, and observability, and what comes out of Grafana. We have these expectations about what that looks like. There are numbers and metrics you can still get out of these kinds of approaches to studying your systems. Your systems are sociotechnical. They are a combination of humans and computers and what we expect them to do, which means you need to collect sociotechnical data. Some of the things that I pulled together here for you to think about are the cost of coordination: how many people were involved? Were they paged overnight? Did they have to wake up in the middle of the night to deal with this? How many unique teams were involved? Did PR and comms have to get involved? How many tools, how many channels? There are so many things that tell you whether any given incident has a big internal impact, along with what you believe the external impact is, and how well your teams are dealing with those, or how much your organization has to throw at something in order to deal with it.
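One way to picture capturing that kind of sociotechnical data is a simple record per incident. This is a minimal sketch; the field names are illustrative, not a standard schema.

```python
# Minimal sketch of per-incident sociotechnical data (illustrative field names,
# not a standard schema).
from dataclasses import dataclass, field
from typing import List

@dataclass
class IncidentAnalysis:
    incident_id: str
    responders: int                # how many people were involved
    unique_teams: int              # across how many teams
    paged_overnight: int           # how many people were woken up for this
    comms_or_pr_involved: bool     # did PR/comms have to engage?
    tools_used: int
    channels_used: int
    near_miss: bool = False
    report_readers: int = 0        # who is actually reading the write-up
    referenced_in: List[str] = field(default_factory=list)  # PRs, code reviews, docs

# Example: one incident's "cost of coordination" at a glance (made-up values).
example = IncidentAnalysis(
    incident_id="2023-04-12-checkout-latency",
    responders=14, unique_teams=5, paged_overnight=3,
    comms_or_pr_involved=True, tools_used=6, channels_used=4,
    referenced_in=["PR#4821", "payments-retry design review"],
)
print(example)
```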

Study near misses. How many near misses do you have compared to how many incidents you have? It's way more, which means your success to failure ratio is way higher than you realize it is. They tell you things that you don't get from metrics like duration, or MTTR. Then there's other really interesting side effects of investing in incident analysis, and that's, how much do people start to invest and participate in these things? How many people are reading these? You can totally track that and don't be creepy about it. The number of people who are actually showing up to these kinds of post-incident reviews and attending to hear what's happening? Are these things getting shared? Where else do they show up? Do they get referenced in PRs? Do they get referenced in code reviews? There's all kinds of places, and you can put a number on any of these if you really need to put a number on a slide for somebody higher up the chain.

I want to talk a little bit more about the near misses piece of that, which is that near misses are successes. When you look at the write-ups from these few near misses, and we have almost none in the VOID, well under one percent, they talk about the kinds of things found in near misses that you don't even find in incident reviews. These are rich sources of data about where pockets of adaptation exist in your organization or on your team. Carl Macrae has this really great book called "Close Calls" where he talks about near misses as providing the source materials to interrogate the unknown. You can test assumptions and validate expectations, become aware of ignorance, and also watch experience as it plays out in your organization. They also highlight where people leverage their expertise and provide adaptive capacity when their system behaves in unanticipated or unexpected ways, which is what incidents are. They often include much richer information about sociotechnical systems, including things like gaps in knowledge, breakdowns in communication, and even things like cultural or political forces on your systems.

Methods of Analyses in the VOID (Root Cause)

I want to talk next about one other thing that we track in the VOID so far, which is the methods of analysis that people use. We are really interested in the use of root cause analysis, or RCA. I'll talk about why. Last year, we saw about a 26% rate of organizations either formally stating they use root cause analysis or having some root cause identified in their reports. That number went way down this year, to about 6%. I just want to clarify: we went from about 2,000 incidents to about 10,000 incidents, so the denominator changed, and we didn't see that 6% number across the same set of reports. I want to be really clear here. We're beginning to build up this corpus in the VOID. That's a spurious result. I want to talk about why we care about root cause and some interesting developments that have happened since the last VOID report. A very formal root cause analysis posits that an incident has a single specific cause or trigger, without which it wouldn't have happened. It's very linear. Once the trigger is discovered, a solution flows from it, and the organization can take steps to ensure it never happens again. This linear, sequence-of-events mindset, which is also known as the domino model, has its origins in industrial accident theory. We've adopted it, we've taken that on. Because we don't go looking for them until after the fact, the reality is we don't find causes, we construct them. How you construct them, and from what evidence, depends on where you look, what you look for, who you talk to, who you've seen before, and likely who you work for.

A funny thing happened between the last report and this one. We had two really big companies in the VOID last year that were doing root cause analysis, Google and Microsoft, and we added Atlassian this year, which kept the number at least a little bit higher. Between that last report and June of this year, Microsoft stopped doing root cause analysis. I think that's fascinating, because if an organization the size of Microsoft Azure, a very large organization, can change their mindset and their approach to it, any of us can. Any of us can change our mindset and our approach to any of the ways that we think about and look at how incidents inform us about our organizations. I think language matters. Microsoft realized that. They realized that when you identify a number of contributing factors, especially if they're sociotechnical in nature, then you're modeling a very different approach for how your team thinks about the systems it deals with. If you go and read these new reports, they call them post-incident reports, PIRs, because we all still need an acronym. They're very different too: the detail in those reports, the approach, the way of thinking about their systems has shifted. I'm not saying all of that happened just because the Microsoft Azure team suddenly said, we're not going to do root cause; that was one piece of the way they chose to change their approach. I just have to say, if Microsoft can do it, so can you.

We need a new approach to how we think about, analyze, talk about, and study incidents, and how we share the information from those. The biggest piece of that is to treat incidents as an opportunity to learn, not to feel shame or blame. I am so excited to see so many more companies sharing their incidents, and not seeing that as a place where PR or marketing has to run around and worry. They also favor in-depth analysis over shallow metrics. We're going to move past MTTR, and worrying about duration and severity, and we're going to dive into what's happening. We're going to listen to the stories of the practitioners who are at the sharp end, who are close to how these systems work. We're going to use that to better understand those systems. We're going to treat those practitioners and those people as solutions, not problems, and study what goes right, along with what goes wrong. This new approach, I suspect, carries a competitive advantage. I do believe that companies that do this will be responsive in ways, and have adaptive capacity, that others don't, which will give them an advantage over companies who don't view the world this way.

Resources

I would like to encourage you to analyze your incidents. There are a couple of resources in here for thinking about how to do that. I would love for you to submit them to the VOID. There's a handy-dandy form if you don't have a lot; if you have a lot, you can get in touch and we'll help you get those in there. You can actually become a member and participate in a variety of more involved ways if you're so interested. For some of these data I referred back to the previous report from last year; a bunch of the things I talked about, plus a bunch more, will actually be in the next report, which is coming soon.

Questions and Answers

Rosenthal: You mentioned that there's another VOID report coming out soon, do you have a date now?

Nash: I do. The world is working in my favor now. It will be coming out on the 13th. People can go to the VOID website, and there'll be a handy-dandy big button to go and download the report.

Rosenthal: You basically made a case in your presentation that there's no metric that you can use to communicate the reliability or resilience of a site or an application. Is there a way that you could create a combined metric to distill down like one number or something that you could use to represent the resilience or reliability of your site or application?

Nash: No. I wish. Wouldn't that be magical, and wouldn't we all want to have that number and collect that number, and celebrate that number? Unfortunately, no, we have to do the hard work of digging into these things.

Rosenthal: You mentioned that you replicated the Monte Carlo simulation that somebody at Google did. Can you say a little bit more about how you replicated that? Did you have a team doing that? What's involved there?

Nash: Yes, we did. First of all, it involved somebody else who writes better R code than I do. We actually had an intern from the University of British Columbia, whose name is William Wu. He actually helped us with this. He's getting his PhD in evolutionary ecology, studies large, complex biological systems, which is really cool. He was really interested in how our complex sociotechnical systems work. What we did was we took all of those incident data and put them into two groups. You're able to then run experiments on that, like A/B tests. It doesn't take a lot of people to do this. I had one person who wrote some really great code, and that allowed us to really poke around a lot, too. We looked a lot at the distributions. He tried to fit power laws to things. Once we had all these data, we were able to separate them into different groups, and then test all these theories on them but, unfortunately, quite a few of them fell apart.
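As a loose sketch of that kind of distribution work, here is how one might fit heavy-tailed candidates to simulated duration data. This is not the VOID analysis or William Wu's code, just an illustration of the general workflow.

```python
# Loose sketch of distribution fitting on simulated duration data (not the
# VOID analysis itself): fit two heavy-tailed candidates and compare the fits.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
durations_min = rng.lognormal(mean=3.5, sigma=1.2, size=1000)

# Fit candidate distributions (location pinned at 0) and compare via KS statistics.
lognorm_params = stats.lognorm.fit(durations_min, floc=0)
pareto_params = stats.pareto.fit(durations_min, floc=0)

ks_lognorm = stats.kstest(durations_min, "lognorm", args=lognorm_params)
ks_pareto = stats.kstest(durations_min, "pareto", args=pareto_params)

print(f"lognormal fit: KS statistic = {ks_lognorm.statistic:.3f}")
print(f"pareto fit:    KS statistic = {ks_pareto.statistic:.3f}")

# A lower KS statistic means a closer fit; either way, both candidates have the
# long right tail that makes the mean a poor summary of incident durations.
```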

Rosenthal: You mentioned that Microsoft no longer does RCA at least with the Azure one.

Nash: The Azure health status page for all their various services, yes.

Rosenthal: If I work at a company that isn't Microsoft Azure, what would be the one thing that you would recommend for me to improve my availability process, not MTTR? If I could only change one thing, would it be, stop doing RCA or something else?

Nash: This is one where I say language matters. Even if you are doing what your team officially calls RCA, what you really want is to have someone, or some people, who have the dedicated skill set to dig into these kinds of incidents. I've seen people be able to do this work in environments where the belief is still that you can find one cause. We're seeing a trend of people hiring incident analysts, or taking folks who have been on the SRE team or whatnot. The skill set it takes to investigate your incidents is very different from the skill set it takes to build and run these systems. If I were to do one thing, I would invest in a person or a team of people who are skilled at investigating incidents and distributing that information back out to the organization so people can learn from it. That skill set developed within the Azure team, and that led to a greater awareness of it. I think the change in mindset there came from the investment in people and really taking a different approach to learning from incidents. Then eventually, they said, now we're doing this other thing. We can call it this other thing and we'll see what the world thinks of that. They actually asked for feedback on it, which I think is fascinating. The one thing is to have people who are really good at investigating incidents.

Rosenthal: What are your thoughts about the correction of errors approach that Amazon uses? Do you have any thoughts about that, as it relates to RCA?

Nash: When you investigate an incident, you're not correcting errors. You're learning about how your systems work. Along the way, you might fix some technical things, you hopefully change some sociotechnical things. I think that language has a similar mindset to it, that RCA does. Like, we'll correct the errors, and errors corrected. What were those errors? I think it's a very narrow view that the language reflects.

Rosenthal: It includes 5 whys, and action items, and other stuff that has been demonstrated elsewhere to not necessarily improve the reliability of a system, although they do keep us busy.

What are your thoughts on business perceived reliability versus having incident data?

Nash: I used this phrase in the talk, and it's a safety science term. It's maybe a bit obtuse. We talk about people at the sharp end, and then the blunt end, it's a wedge. The people at the sharp end have a typically different perception and understanding of the reliability of their systems than the people at the more blunt and further removed or business end. There's often a big disconnect between those two. I talk to a lot of SRE folks. I talk to a lot of just people running these types of systems who are often very frustrated at the lack of understanding at the business side of things. Ideally, some SRE practices aim to connect those a little bit better, trying to use things like SLOs which are designed in theory to align the performance of your system with the expectations of its customers, whatever those may be, like internal customers or whatnot. It's usually pretty disconnected. The interesting thing about companies that really invest in that detailed incident analysis, is it reveals that reality, if you can do the work to push that information further back into your organization.

My favorite example of this is David Lee at IBM. He's in the office of the CIO, a gigantic organization within a gigantic organization. IBM is huge. The office of the CIO is, I think, ultimately like 12,000 people. He's managed to take an incident analysis and review process and create an environment where executives show up to these learning-from-incidents reviews; I think they're monthly. He's actually tracking some of those data that I talked about: how many people are coming to the meetings, how many people are reading the reports, and all of that. He's watching that gain traction within the CIO organization. He's taking the data from the sharp end and he's bringing it into the business, and they're beginning to see what the reality on the ground is. It reminds me of the early days of DevOps, where we were trying to connect these two sides of the organization. I still see so much power in building a culture around these incident analyses; they reveal so much that, at a certain distance on the business side of the house, is rarely seen. There's a gap there that people have to bridge. There's work to be done to get those two groups aligned.

Rosenthal: You mentioned two ways not to do it. You outlined a little bit better process: investing in incident analysis. I'm going to post the Howie document, which is one company's take on a better process for incident analysis.

Nash: The best process is the one that you come up with for your own organization. There's not a template. There's not a form. That's very frustrating too. I've watched this happen in quite a few organizations: you just start doing this, and the process that you develop, if you're invested in learning from incidents, will be the one that you evolve and refine and change to work for your company. That is the take of one company that's very good at this. That doesn't mean that has to be exactly how you do it either. You just have to start.

 


 

Recorded at:

Aug 31, 2023
