Key Takeaways
- Over half of incidents tracked in The VOID are resolved externally within two hours.
- MTTR (and all related mean-based incident metrics) can be unreliable because incident duration data is positively skewed. Beyond being misleading, these metrics oversimplify the complexity of studying incidents.
- Instead of MTT* data, the report recommends using SLOs or cost of coordination data.
- Reporting on the root cause of an incident may be harming your analysis and your organization. Incidents are highly complex and rarely reducible to a single cause, and when they are, that cause tends to be mislabelled as "operator error".
- While the database contains few near-miss reports, these reports can often be more insightful because they are analyzed without the pressure of restoring service.
The Verica Open Incident Database (VOID) is assembling publicly available software-related incident reports. The goal is to make these reports freely available in a single place, mirroring what other industries, such as the airline industry, have done in the past. At the time of writing, the database contains over 1,800 reports from close to 600 organizations.
As noted in the 2021 edition of The VOID Report, "If we want to improve as an industry, we can’t just stumble around in the dark or imitate large players without understanding why their approach may not work or which caveats apply." Within this report, the VOID team has identified a number of key learnings from the current collection of data.
Over half (53%) of all reported incidents are externally resolved within two hours. The report notes that this aligns with findings from Štěpán Davidovič in his book “Incident Metrics in SRE: Critically Evaluating MTTR and Friends”. In both cases, incident durations follow a positively skewed distribution in which most of the values cluster on the left side.
External incident duration for all reports within The VOID regardless of organization size, frequency, or total number of reports (credit: The VOID)
This leads into the second finding: MTTR can be a misleading metric. The report notes that the mean is not a good representation of positively-skewed data. Davidovič demonstrated with both empirical data and Monte Carlo simulations that reliable calculation of improvements in incident duration (or incident count) is not possible, and his findings indicate that this applies to all forms of MTTx-style measurements. Because the mean is pulled around by the spread of the data and its outliers, the median may be a better representation.
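To make the statistical argument concrete, the sketch below (not taken from the report; the lognormal parameters, sample sizes, and 10% threshold are arbitrary choices for illustration) simulates positively skewed incident durations, compares the mean against the median, and runs a small Monte Carlo check of how often MTTR appears to shift between two periods drawn from the exact same distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# Positively skewed incident durations (minutes): most incidents are short,
# a few run very long -- a lognormal is a simple stand-in for that shape.
durations = rng.lognormal(mean=3.5, sigma=1.2, size=500)

print(f"mean duration (MTTR): {durations.mean():6.1f} min")
print(f"median duration     : {np.median(durations):6.1f} min")

# Monte Carlo check: draw two "quarters" of 50 incidents each from the
# same unchanged distribution and count how often MTTR appears to move
# by 10% or more, even though nothing about the system has changed.
trials = 10_000
false_signals = 0
for _ in range(trials):
    q1 = rng.lognormal(3.5, 1.2, size=50).mean()
    q2 = rng.lognormal(3.5, 1.2, size=50).mean()
    if abs(q2 - q1) / q1 >= 0.10:
        false_signals += 1

print(f"apparent MTTR change of 10%+ with no real change: "
      f"{false_signals / trials:.0%} of trials")
```

With data shaped like this, the mean lands well above the median, and quarter-over-quarter "improvements" in MTTR are frequently just noise.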
They note that these metrics are appealing because they seem to simplify a complex problem into something easily reportable. John Allspaw refers to such metrics as shallow incident data:
They are appealing because they appear to make clear, concrete sense of what are really messy, surprising situations that don’t lend themselves to simple summaries.
Alex Ewerlöf offers a similar warning in his InfoQ article How to Best Use MTT* Metrics to Optimize Your Incident Response:
Depending on which metric you choose, the optimization focuses on one or more areas. If the wrong metric is chosen, the signal you’re trying to optimize may get lost in the noise of a multivariable equation.
The report recommends moving away from MTTx data and, if metric reporting is required, towards SLOs and cost of coordination data. Alex Hidalgo, in his book Implementing Service Level Objectives, states that "SLOs allow for you to think about your service in better ways than raw data ever will." Dr. Laura Maguire, in her QCon presentation How Many Is Too Much? Exploring Costs of Coordination During Outages, describes how tracking how many people and teams are involved, and at what level those individuals sit in the org (i.e. individual contributors versus executives), can help showcase the “hidden costs of coordination” that add to the cognitive demands on people responding to incidents.
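For the SLO alternative, a rough sketch of the bookkeeping (hypothetical numbers; the target, window, and request counts are made up and not tied to Hidalgo's book or any particular SLO tool) shows how raw reliability data becomes an error budget the organization can reason about:

```python
# Hypothetical SLO bookkeeping over a 28-day window. All figures below are
# illustrative assumptions, not real service data.
SLO_TARGET = 0.999           # 99.9% of requests should succeed in the window
total_requests = 12_500_000  # requests observed in the window
failed_requests = 9_800      # requests that violated the SLI

# The error budget is the number of failures the SLO allows in the window.
error_budget = (1 - SLO_TARGET) * total_requests

availability = 1 - failed_requests / total_requests
budget_used = failed_requests / error_budget

print(f"availability     : {availability:.4%} (target {SLO_TARGET:.1%})")
print(f"error budget used: {budget_used:.0%}")
print(f"budget remaining : {1 - budget_used:.0%}")
```

Unlike MTTR, an error-budget view frames reliability as a decision-making tool, asking whether the remaining budget justifies riskier changes or calls for slowing down, rather than compressing past incidents into a single average.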
The report authors did find that root cause analysis is rarely used in the collected reports: only about 25% of reports use an RCA or, at a minimum, explicitly identify a root cause of the incident. It has been well documented that root cause analysis can be heavily misleading within complex systems shaped by a variety of socio-technical factors. That complexity means there is very rarely a single root cause, or even a neat collection of root causes, yet there is a common tendency to point to "operator error" as the root cause. Nora Jones, in her QCon Plus presentation Incident Analysis: Your Organization's Secret Weapon, states that this can lead to a blameful culture in which open investigation of how to improve both the software systems and the human interactions with them does not happen.
Jeffrey D. Smith explains in his book "Operations Anti-Patterns, DevOps Solutions" that "a blameless culture, whereby employees are free from retribution, creates an environment much more conducive to collaboration and learning". Sidney Dekker elaborates further noting “a ‘human error’ problem, after all, is an organizational problem. It is at least as complex as the organization that has helped to create it.”
The final key finding within the report is the scarcity of "near misses": "an incident that the organization noted had no noticeable external or customer impact, but still required intervention from the organization". The report authors posit this is due to a number of factors, including that near misses can be harder to detect, formal incident processes are often not engaged for them, and publicizing them can negatively impact public perception of the organization.
However, as noted by Fred Hebert, "Near misses are generally more worth our time, because they come without the pressure of dealing with post-incident fall-out and blame, and allow a better focus on what happened.”
The report concludes with a number of recommendations: treat incidents as opportunities to learn, embrace in-depth analysis over simplified metrics such as MTTR, view humans as solutions rather than problems, and invest time investigating what went right alongside studying what went wrong.
One of the lead researchers for The VOID, Courtney Nash, recently spoke on the Page it to the Limit podcast with Mandy Walls. InfoQ sat down with Nash to discuss the findings and learnings from VOID in more detail.
InfoQ: You mention that "you have failures because you are successful". This seems like a counterintuitive statement; what do you mean by that?
Courtney Nash: First of all, I have to credit this notion to Dr. Richard Cook. If no one is coming to your site or using your service, it’s not likely to fail. With scale comes consequences, both good and bad. Building and maintaining those products and services always involves various forms of tradeoffs, and as such, complex software/internet failures arise because your system is always in some kind of degraded state—unforeseen actions/pressures can suddenly push it into a more perturbed state. Often, success (more traffic, more customers, etc) is what pushes you into that perturbed state.
InfoQ: The idea that incidents are caused by factors much more complex than the reductionist explanation of "human error" is not new. Why do you think we continue to point to human failure when explaining incidents?
Nash: I’d argue it’s still new to more people than it’s not. The press regularly covers incidents and accidents in this light, and we still see technology incident reports that name it as a primary “cause.” It’s still a pervasive idea amongst many security teams as well. Moreover, it matters who still believes this—folks on DevOps teams or SREs are likely to embrace this view, but if an individual SRE or incident responder conducts a blameless postmortem and seeks to learn from their incidents, and folks higher up the chain want someone to pin it on, that will still define the culture in which incidents are evaluated.
As for the second question, humans don’t do well with uncertainty. Engineering organizations are often incentivized and evaluated based on metrics that reflect how they are doing with keeping their systems up and running at high throughput. “Complex systems fail in complex ways” doesn’t look great on a Board presentation. When something fails, we feel a guttural urge to identify what caused it, and come up with a fix for it so it “never happens again.” Ironically, we still believe that it's easier to “fix” human behaviors that arise from people navigating complex, high-tempo systems under pressure than it is to address the complexity of the human-machine (aka sociotechnical) systems we’re building, and use new methods and approaches to navigating that complexity. As Sidney Dekker says in The Field Guide to Understanding ‘Human Error’, “Abandon the fallacy of finding a quick fix.”
InfoQ: You state that "the belief that there is a root cause can lead to a lot of really unhealthy practices". What sort of practices do you see as a result of this belief and how can organizations start to self-correct?
Nash: In the face of complex systems failures, the notion of a root cause leads to reductionist approaches and thinking that detract from, or at least don’t incentivize, learning from the incident. My favorite analogy about this comes from John Allspaw, who notes that we never ask for the “root cause” of a successful product launch. So why do we do so for a failure, when it’s a product of the same (usually successful) system? As Ryan Kitchens said at SRECon in 2019, “The problem with this term isn't just that it's singular or that the word root is misleading: there's more. Trying to find causes at all is problematic...looking for causes to explain an incident limits what you'll find and learn.” Moreover, when human error is identified as the root cause, this can lead to a culture where people are afraid to speak up, whether it be something they think is a lurking issue that they may have “caused” somehow, or to relate their experience if they were involved in an incident. Countless other industries have shown us that this is a path to less safety, and more incidents.
One possible shift is to talk about “contributing factors” instead of a root cause, which is a small language change that can facilitate a much bigger psychological/organizational change. It doesn’t remove entirely the possibility that humans will be blamed for an incident, but if you embrace that there are always multiple contributing factors, then you are on the path to understanding that people are just part of sociotechnical systems that can fail, and when they do, the focus should be on remediating the system, not the individual.
InfoQ: With the current data in The VOID, have you drawn any new conclusions or do you have any new theories on incident management?
Nash: The biggest result in the VOID report we published in late 2021 is that MTTR (Mean Time to Remediate) isn’t what you think it is. Based on the data we see in the VOID, measures of central tendency, like the mean, aren’t a good representation of positively-skewed data, in which most values are clustered around the left side of the distribution while the right tail of the distribution is longer and contains fewer values. The mean will be influenced by the spread of the data, and the inherent outliers. When you have data like that, you can’t rely on the mean value to accurately represent your data. This reinforces the results that Google engineer Stepan Davidovic published in his 2021 report on MTTR.
The second problem with MTTR is that in using it, we are trying to simplify something that is inherently complex. MTTR tells us little about what an incident is really like for the organization, which can vary wildly in terms of the number of people and teams involved, the level of stress, what is needed technically and organizationally to fix it, and what the team learned as a result.
Another thing we’re noticing—it’s not so much a theory but more a trend—is a new role emerging in the industry: the Incident Analyst. Organizations that are bringing these people on board (or in many cases, training them in-house) are acknowledging that incident response and incident analysis aren’t the same set of skills—while those responding are increasingly the very same people who built the systems, incident analysts tend to have a skill set that is more journalistic in nature. They do have to have enough expertise in the systems involved, but they’re also skilled at collecting and analyzing documentation and system data (monitoring/observability, chat logs, etc), interviewing and assembling responders’ stories, and stitching together an objective, comprehensive narrative.
InfoQ: In the podcast you explain a theory you have that organizations that understand that incidents will always have a latent set of contributing factors "have a better adaptive capacity". Could you elaborate on that?
Nash: System safety and human factors researcher David Woods defines adaptive capacity as “the ability to continue to adapt to changing environments, stakeholders, demands, contexts, and constraints.” Essentially, it’s the ability to adapt how and when the system adapts. Given the nature of complex systems, we can’t predict or eliminate all possible failures, so instead we want to be poised and capable of adapting when they do occur. If you believe you’ve found the root cause of an incident, remediated it through action items and moved on, you’re missing the bigger picture. You don’t really know where your safety boundaries are, what pressures will potentially perturb your system, and what kinds of actions could help vs. make it worse. You’ve only fixed the prior problem, which, given the nature of complex systems, won’t happen again—that particular state and latent set of contributing factors likely won’t recombine.
Dr. Richard Cook calls this the “Cycle of Error” and we discuss it in detail in the VOID report:
Organizations that take a “contributing factors” view of their systems (instead of the above, more common view) are more likely to do three important things:
- Treat incidents as opportunities to learn about their systems
- Prioritize understanding the safety boundaries of those systems
- Invest in their people who best understand those boundaries and how their systems can shift/perturb them
All of those things will give them an advantage in adaptive capacity when the next unexpected failure occurs, especially because they are more likely to view humans as critical parts of their sociotechnical systems, as opposed to external actors (who make mistakes).
InfoQ: How can our readers contribute or assist with The VOID project?
Nash:
- Analyze and write up your incidents.
Incident Analysis platform Jeli just wrote up an excellent overview of which incidents you might choose to analyze and write up, and they also recently published a thorough guide to incident analysis.
- Share them with the VOID.
You can easily drop a link in via the site, but if you have a large corpus that isn't already in the VOID, please feel free to get in touch so we can help ingest those.
- Chat about an interesting incident on the VOID podcast. The podcast seeks to hear the stories directly from the people who were involved in an incident they analyzed and published. We’re a big fan of near misses, and stories where people learned something surprising or unexpected about their systems, teams, or organization as a whole.
- Become a VOID Member.
Members don’t just get their logo on the site, they get a range of learning and collaborative opportunities with experts in the field of distributed systems, Chaos Engineering, and incident management and analysis.