How Did It Make Sense at the Time? Understanding Incidents as They Occurred, Not as They are Remembered


Summary

Jacob Scott explores the basics of failure in complex systems, the theory and practice of how it made sense at the time, and actions to take.

Bio

Jacob Scott is a technologist who is deeply curious about reliability in complex socio-technical (software) systems. He is currently a staff software engineer in the Platform & Ecosystem group at Stripe, focused on user-facing event systems.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Scott: With apologies to Beauty and the Beast, we're kicking off with a tale as old as time, song as old as rhyme. An engineer hits return, and just like with the Ever Given container ship that launched a million memes when it got stuck in the Suez in 2021, what happens next is an incident. I learned on Wikipedia, of course, that the Ever Given can fit 20,000 containers on board. If anyone watching is responsible for operating Kubernetes, please let me know if this is the world's longest and largest CrashLoopBackOff. I'm Jacob. I work as a staff engineer at Stripe.

Outline

I'm here to discuss the question, how did it make sense at the time? Specifically, I'm going to use examples of humans triggering incidents to explore the theory and practice of asking, how did it make sense at the time, with the goal of making users happy. I believe you should care about how it made sense at the time, because it will lead to happier users. There are many paths to happier users. Here's one concrete example relevant to us, laid out backwards. Fewer HTTP 500s mean happier users. Better decisions in incidents mean fewer 500s. Better learning in post incident activities means better decisions in incidents. Asking, how did it make sense at the time, means better learning in post incident activities. Of course, before we can ask, how did it make sense at the time, we need to understand what we're asking. As I've highlighted, we'll focus on the last two steps in this chain for the rest of the talk. Where we'll spend our time is as follows. First, we have a few more preliminaries to run through to provide context and set the stage. Second, we'll flesh out theory and concepts. Third, we'll explore applying that theory in post incident activities. Finally, we'll wrap up.

Incident Triggers (The Human)

Now I want to unpack incident triggers. Focusing on triggers means ignoring most of what happens during incidents. We're only looking at how they start. I'm doing this because human triggers carry with them a common sense or intuition of, could they just have not pushed that button? This is strongly in tension with the perspective of, how did it make sense at the time? This tension in turn is fertile soil for those aha moments where everything clicks. We're focusing narrowly to spend our time in that fertile soil. I've said, how did it make sense at the time, probably 10 times already. As a point of order, moving forward, I'm going to use the acronym HDIMSATT interchangeably with how did it make sense at the time, to have a little bit less of a word salad.

Caveats

First, I'm taking artistic license. We'll be covering fragments of incidents to help us wrangle HDIMSATT, but we shouldn't confuse that with a deep understanding of them, let alone how they were experienced by those involved at the time. Second, I'll be critiquing some parts of publicly available incident write-ups. I can only do that because they were written about. It can be fraught to explain your failures to the world, especially to customers who pay you or prospective customers considering paying you. If none were published, our ability to learn from each other would be crippled. Courtney Nash gave a really great talk in the effective SRE track, where she presented research built on top of the VOID, an open repository of over 2000 incident reports from software organizations.

Salesforce's May 11, 2021 Multi-Instance Service Disruption

Let's explore the theory of HDIMSATT through the lens of human triggers of incidents. This is going to be the bulk of the talk. My goal here is to spark your curiosity and provide sufficient foundation to support discussion of HDIMSATT in practice. On May 11, 2021, Salesforce went down for 5 hours. This might have been a snow day for your go-to-market organization, since many of them basically live in Salesforce. In my experience, engineering rarely has a good understanding of how others experience incidents. I challenge you to find a friendly account executive and ask them if they remember May 11th of last year, and if so, how it impacted them. Let's look at what happened according to Salesforce's public explanation of this multi-instance service disruption. The trigger for this incident was DNS. An engineer made a configuration change to Salesforce's BIND DNS servers and hit an edge case in a long-existing script where BIND didn't restart cleanly. Basically, pushing the change to a DNS server shut it down. The engineer pushed the change globally via an emergency break fix process. Then, basically, all of Salesforce globally went down. That's the incident for our purposes, because, as we've said, we're interested in the human trigger, which is the engineer running the script.
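To make that failure mode a little more concrete, here is a minimal sketch, in Python, of the kind of rollout guardrail the write-up implies was missing: validate the configuration, restart one server, confirm it came back, and only then move on, rather than pushing globally. The host names, the use of `named-checkconf` and `systemctl`, and the soak time are illustrative assumptions on my part, not Salesforce's actual tooling.

```python
import subprocess
import sys
import time

# Hypothetical host list; Salesforce's real fleet and tooling are not public.
DNS_SERVERS = ["dns1.example.internal", "dns2.example.internal", "dns3.example.internal"]


def validate_config(path):
    # named-checkconf exits non-zero on syntax errors. It cannot catch every edge
    # case (the Salesforce script reportedly hit one), but it is a cheap first guard.
    subprocess.run(["named-checkconf", path], check=True)


def restart_one(host):
    # Restart BIND on a single host and verify that it actually came back up.
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "named"], check=True)
    result = subprocess.run(
        ["ssh", host, "systemctl", "is-active", "named"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() == "active"


def staggered_rollout(config_path, soak_seconds=300):
    validate_config(config_path)
    for host in DNS_SERVERS:
        if not restart_one(host):
            print(f"BIND did not restart cleanly on {host}; halting rollout.")
            sys.exit(1)
        # Canary soak: let the change bake before touching the next server, so a
        # bad restart takes out one resolver instead of the whole fleet at once.
        time.sleep(soak_seconds)


if __name__ == "__main__":
    staggered_rollout(sys.argv[1])
```

Even a sketch like this only narrows the gap: a staggered rollout would likely have turned a global outage into one unhealthy resolver, but the edge case in the restart path would still have been there, waiting.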

Now we can look at how Salesforce understood and explained what happened. This is a section of the document that they titled, Root Cause Analysis. I feel compelled to note here that I have problems with the term root cause analysis. Salesforce's RCA has two findings that are interesting to us. First, the staggered canary change management policy was not followed or enforced. Second, the emergency break fix process was subverted. Again, subverted is the term used by Salesforce. That is, they consider the use of this process during this change to be against policy. We could lift the RCA findings into a human-trigger-centric version to explore as follows. An engineer applying a DNS change did not follow the staggered canary process and subverted the emergency break fix process, and Salesforce went down for 5 hours. Now we hit a major kicker. People don't show up to work to cause 5-hour outages. If there's anyone watching who could ask a coworker, are you planning to cause a 5-hour outage today, and get a serious yes back, I'd love to hear about it. People don't show up to work intending to cause outages, but sometimes take actions which lead to outages. This is what motivates us to ask, how did it make sense at the time? Bringing this insight back to what happened to Salesforce on May 11th, we can ask, how did it make sense at the time to the engineer who triggered the incident to not follow and/or subvert these processes? Now we'll dive into that question.

First off, what Salesforce's write-up doesn't touch on is how frequently the emergency break fix process is "subverted" in this way without anything going wrong. The background here is of a woman jaywalking, because, until recently, jaywalking was against the law in California. Everyone jaywalks. I think we all understand that we jaywalk because it saves us time, and we look both ways first to establish that it's safe. In my experience, I'm a lot more likely to jaywalk if I'm in a hurry because I'm late, or busy with a lot to do that day. If you have jaywalked before, and especially if you jaywalk regularly, consider how it makes sense to you at the time to jaywalk, and how those considerations and motivations might translate to an engineer making a DNS change. In the academic discipline of modern safety science, which is where, give or take, the idea of just culture as used in our track title originates, the jaywalking intuition is formalized as the gap between work as written and work as done. That is to say, documents, runbooks, and rules are an abstraction which do not and cannot map exactly to work as it is actually done. For example, you won't find a pictogram of a dad struggling to put together an IKEA crib in the official instruction guide. Ryan Reynolds is definitely having a chaotic time here.

Another thing that we can ask is what the purpose of this DNS change was. Who asked for it? What pressure might the engineer making the change have felt to complete it quickly? I don't think that any of the folks pictured on this slide asked for the change. If a VP or director of XYZ was trying to do a demo that required the change and needed it in a tight timeframe, what do we expect, or what does Salesforce expect, the engineer to do next? While Salesforce's write-up didn't include any information on this, just like we did with jaywalking, we can consider how you might have responded to two different time-sensitive requests, one from a teammate, the other from an executive in your reporting chain. Which request would you be more willing to cut corners on in order to deliver on time? The point here is that pressure applied, even implicitly, from those who hold power in an organization can impact decision making. I want to highlight why at the time is so important. The late Dr. Richard Cook was a seminal researcher in the field of modern safety science, and wrote an amazing short paper titled, "How Complex Systems Fail," 22 years ago. One of the observations made in that paper is that all practitioner actions are gambles. Some of you may also be familiar with the business book, "Thinking in Bets," by former professional poker player, Annie Duke. Unsurprisingly, given the title of this bestseller, Duke argues that considering decisions as bets is key to long-term business success. People generally do not gamble intending to lose. When someone gambles, just like when an engineer presses a button, they do not know the future. That is, they take actions which make sense to them at the time.

Atlassian's April 5, 2022, 14-Day Cloud Downtime

Let's talk about another incident. This time, in April of 2022, Atlassian's cloud went down for 14 days. That's 60 times longer than the Salesforce outage we just spoke about. What happened? An engineer was given a list of about 1000 IDs of a legacy plugin to delete as part of a migration related to an acquisition. They did a substantial amount of testing, including testing on staging, and taking a smaller sample of those IDs, running them on prod, and everything worked. Then they ran the rest of the IDs. It turned out that the folks who gave them those IDs gave them 30 plugin IDs for the initial prod test, but then gave them site IDs for the other 900 IDs. Their script deleted 883 full installations. That's not the unimportant legacy plugin they intended to delete, but installations of Atlassian products that you've heard of, like Jira and Confluence. Again, that's the trigger of the incident. Again, an engineer pressed return. If we look at how Atlassian explained what happened, the two things that they called out were, one, the fact that the engineer running the script was given IDs of an unexpected type, and two, that the script didn't provide any feedback or warning prior to deleting entire installations.
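As a thought experiment, here is a minimal sketch of the two guardrails Atlassian's write-up calls out as missing: checking that each ID resolves to the kind of thing the operator expects, and showing what will be deleted before doing anything destructive. The `lookup_resource` and `delete_resource` helpers, the `kind` values, and the confirmation flow are all hypothetical; Atlassian's internal APIs are not public.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    id: str
    kind: str   # e.g. "plugin" or "site" in this hypothetical model
    name: str


def lookup_resource(resource_id):
    # Placeholder: a real tool would call an internal API to resolve the ID.
    raise NotImplementedError


def delete_resource(resource):
    # Placeholder for the destructive call.
    raise NotImplementedError


def delete_ids(ids, expected_kind="plugin", dry_run=True):
    resources = [lookup_resource(i) for i in ids]

    # Guardrail 1: refuse to proceed if any ID is not the kind the operator expected.
    unexpected = [r for r in resources if r.kind != expected_kind]
    if unexpected:
        raise SystemExit(
            f"{len(unexpected)} of {len(resources)} IDs are not '{expected_kind}' "
            f"(examples: {[r.kind for r in unexpected[:3]]}). Aborting."
        )

    # Guardrail 2: show what will be deleted and require explicit confirmation.
    print(f"About to delete {len(resources)} {expected_kind}(s), for example:")
    for r in resources[:10]:
        print(f"  {r.id}  {r.name}")
    if dry_run:
        print("Dry run only; re-run with dry_run=False to actually delete.")
        return
    if input("Type 'delete' to continue: ") != "delete":
        raise SystemExit("Confirmation not given; aborting.")

    for r in resources:
        delete_resource(r)
```

Note that neither guardrail requires hindsight; they simply encode the assumption the engineer was already making at the time, namely that every ID on the list was a plugin ID.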

Even from this summary, I think we got a pretty good grasp of why it made sense for the engineer to press return. They got IDs. They followed a clear rollout process, and all of their preflight tests worked cleanly. To be honest, when I first read this incident report, and I got to the part where the first 30 IDs were app IDs and the rest were site IDs, I shivered. If I was doing that job that day, I do not think that there's any way I would have done anything different. That said, there wasn't much in the incident report about the folks who supplied the IDs, so there's not much to say about HDIMSATT to them. Notice that HDIMSATT is often different for everyone, and divergence can be painful. The multi-agent version of this problem is out of scope for this talk. If you're curious, what I recommend Googling is the phrase, Common Ground and Coordination in Joint Activity.

Salesforce vs. Atlassian

Boiled down to the essentials, both Salesforce and Atlassian, or at least the shadow versions I've constructed for this talk, remember that I didn't experience these incidents firsthand, had long, painful outages after an engineer pressed return and a script did something surprising and unpleasant. At the same time, even skimming the surface of how these organizations publicly explained the trigger of these incidents shows real differences. After reading Salesforce's incident report, I think that Salesforce believes that had the DNS change been assigned to a "better engineer" who followed the organization's policy to the letter, the outage would have been completely avoided. After reading Atlassian's incident report, I think that the organization believes that running the script made sense at the time to the engineer who hit enter, and that if another engineer had been at the keyboard, it probably would have made sense to them too. From this difference in understanding and explanation, which organization is poised to have richer and more accurate learning in post incident activities? Atlassian.

If there's a part of your brain that's whispering, yes, but why couldn't the engineer have just not hit return, or, is this simply nihilistic chaos where no rules matter? This slide is for you. From the same paper we touched on earlier come two more observations. First, complex systems are heavily and successfully defended against failure. Salesforce and Atlassian both already have a tremendous amount of infrastructure, SREs, paging, automation, things that work well, and follow best-practice architectural patterns. That's how they got to be large businesses that are well known enough for me to select them as examples for this talk. All of that is, for better or worse, table stakes for where we are, because the second point the paper makes is that catastrophe is always just around the corner. This is really uncomfortable, but it's fundamental. What keeps catastrophe around that corner is people taking actions which make sense to them at the time. On the other hand, when catastrophe turns the corner and visits our system, who is frequently involved? Again, someone taking actions which make sense to them at the time: pedestrians jaywalk, and engineers hit return. To learn and adapt, we need to understand not just what went wrong, but why we believed it would go right.

A Near Miss - Naomi's Baby Monitor

I want to talk about one more very different non-incident. This is my baby daughter, Naomi, who's 15 months old. Earlier this month, my wife and I accidentally left Naomi's baby monitor off. Thankfully, unlike our previous examples, she slept through the night fine. Nothing bad happened. This is what modern safety science calls a near miss. It's really just random chance that Naomi slept through the night fine when we forgot the baby monitor. She could just as easily have woken up and been fussy, in which case our time to respond would have been longer than usual. Near misses are interesting because the lack of a bad outcome tends to make them less fraught to explore. Just think about the difference between an incident that costs your employer $10 million, and a very similar one that almost cost your employer $10 million, where maybe a feature flag disabled money movement in that latter incident. Which incident would you expect participants to be more nervous about? Which incident review meeting do you think you would learn more from?

Beyond the near miss, the other interesting thing here is that the trigger was not action but inaction. We basically didn't hit enter, and we could have had an incident because of that. HDIMSATT. Like I imagine most parents watching, we don't have a formal checklist for bedtime, like a surgeon might before operating, nor do we have paging configured for the baby monitor. Really, what happens is that during the past year that Naomi has been sleeping in her crib with plenty of fussing, we've come to understand intuitively that our bedroom is close enough to hers, that if something is really wrong, we'll hear her regardless. With a little bit of hand waving, basically as the importance of the baby monitor being enabled decreased, so did our cognitive vigilance. Another thing I want to point out is that learning from and adapting to Naomi's previous fussiness is what enabled our "high reliability baby ops." Failure-free operations require experience with failure.

In Practice, HDIMSATT is Contextual

How do we take what we've just learned to market at our jobs? How do we have impact and make things better? What does this look like when things stop being concepts and start getting real? In practice, HDIMSATT is squishy. Success in practice really depends on the organization. We'll go over some building blocks. The best high-level advice I have is to experiment, remix, and catch a wave. By catch a wave, I mean, for example, watch for something adjacent which spikes leadership's interest, frequently an incident, and sneak some HDIMSATT into what is already in progress. Practically, when's the right time to bust out HDIMSATT? One thing you can do is look out for terms like these, which are frequently associated with actions that mostly make sense in hindsight. Given a failure, you can almost always find a test case, alert, or dashboard which would have detected it. Of course, we build all of those things before the failure happens, so ask how it made sense at the time to write, or not write, those tests before we knew that something had failed.

I love this figure. I think it does a great job of showing the dangers of hindsight. Whoever triggered the incident was faced with a very complex maze and made a number of decisions, but eventually ended up at a boom. If we don't uncover how things made sense at the time, we are left with a view of the incident which not only isn't "correct," but won't be representative of how folks will engage with the world tomorrow. We can't change what happened yesterday, but if we understand how it made sense at the time, we might be able to do better tomorrow. An obvious place to ask HDIMSATT is in existing post incident activities: post-mortem meetings, incident reports, and so on. There's a wide range of options here, starting with asking HDIMSATT in post incident meetings, or putting the term in templates next to root cause or similar sections. Something more complex, which means a larger investment but also a larger reward, might look like updating your post incident activities to follow modern safety science informed approaches like those found in Etsy's debriefing facilitation guide, or Jeli's How We Got Here (Howie) guide. Understanding HDIMSATT during post incident activities often means asking adjacent questions to tease insight out of participants. We touched on many of these questions earlier when talking about Salesforce's outage. How does DNS work at Salesforce? What was the genesis of the change? And so on. It turns out that they serve well as a starting point for many other contexts.

Modern safety science can sometimes seem like a deep rabbit hole. As you've seen, luckily for us, HDIMSATT is near the surface. Some resources, for example, the textbook, "Foundations of Safety Science," by Sidney Dekker, are much deeper. In my opinion, perfect is the enemy of impact. I think you can almost certainly improve things in your organization without needing to venture too deep into this hole. I say that to dispel fear of the activation energy required for adoption. Let me be clear, this domain is fascinating. If you're curious, I encourage you to spelunk.

How HDIMSATT Applies, Beyond Just Incidents

We've spent the talk zoomed in on incident triggers as a way to optimize for sparking intuition. Very little of what we've discussed is actually trigger specific. Although we've framed HDIMSATT in terms of incidents, we can also show examples of how it applies beyond just incidents. One thing we can do is ask HDIMSATT about anything in the past. How did it make sense at the time to Steve Jobs and Apple, but not to BlackBerry, that the future of computing was full-touchscreen smartphones? How did it make sense at the time for SoftBank to invest in WeWork? The woman you see on the right is Dr. Katalin Karikó. Her research was considered so disappointing in the 1990s that she was demoted from a tenure-track position at the University of Pennsylvania. Today, she's famous and widely respected for her contributions to mRNA COVID vaccines. How did it make sense at the time?

HDIMSATT looking forward is perhaps the trickiest and the most valuable. Because we're looking at the future, we can work to align what makes sense with our desired outcomes. A canonical example I think of is break glass systems, which might allow you to deploy from a non-main Git branch in emergencies even if that's normally disallowed, or Salesforce's emergency break fix process that we saw earlier. These will be used, via work as done, in ways that make sense at the time. We have agency today to design them as a pit of success. We can also think about what we might ask in the future about how what we're doing now made sense, and feel a bit more motivated to write down answers and the context behind our decisions.
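As one sketch of what a pit-of-success break glass path could look like: it still works immediately in an emergency, but it asks for a reason, writes an audit record, and pages a human afterwards, so the context behind the decision is captured while it's fresh. Everything here, the `page_oncall` hook, the audit file, the branch check, and the deploy command, is an assumption for illustration, not any particular company's system.

```python
import datetime
import getpass
import json
import subprocess

AUDIT_LOG = "break_glass.jsonl"  # Hypothetical append-only audit destination.


def page_oncall(message):
    # Placeholder: wire this to your paging system so every break-glass use gets a
    # human follow-up, rather than disappearing silently.
    print(f"[page] {message}")


def break_glass_deploy(branch, reason):
    if not reason.strip():
        raise SystemExit("A break-glass deploy needs a reason; it will be reviewed later.")

    record = {
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "who": getpass.getuser(),
        "branch": branch,
        "reason": reason,
    }
    # Append-only audit trail: the raw material for asking "how did it make sense
    # at the time?" after the emergency, instead of reconstructing it from memory.
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

    if branch != "main":
        page_oncall(f"Break-glass deploy of non-main branch '{branch}' "
                    f"by {record['who']}: {reason}")

    # Placeholder for the real deploy command in your pipeline.
    subprocess.run(["echo", "deploying", branch], check=True)


# Example: break_glass_deploy("hotfix/dns-rollback", "prod DNS outage, change ticket to follow")
```

The design choice worth noticing is that the easy path and the safe path are the same path: nothing here blocks the emergency, it just makes the sense-making visible afterwards.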

Resources

This talk is largely a work of synthesis. It really mixes many building blocks put together by researchers and practitioners, going back to the 1990s. You can find breadcrumbs for those building blocks, as well as citations, attributions, the full list of folks I need to thank, and so on, at qcon2022.jhscott.systems.

Conclusion

There's a limit to the reliability of systems built on understanding failure as being caused by software bugs or human error. One way to get beyond it is to learn from how it made sense at the time. HDIMSATT helps us pierce the illusion of work as written, and consider work as done. Understanding the safety and reliability of complex systems is an exploration we're on together, and you can help by sharing what you learn.

Questions and Answers

Brush: Let's say you have some executive who keeps wanting to add counterfactuals to your incident report: why didn't we just, or we should have. How would you suggest, politely or not politely, resetting back to, how did it make sense at the time?

Scott: One of the things that I'm most curious about today is adoption. I think it really depends, unfortunately, on your relationship with this leader, what other allies or other people you have in your orbit, and how the rest of the organization is doing. There's a question about whether it makes more sense to say, ok, we can do both, or to nudge them and say, do you have time to read How Complex Systems Fail? Or to engage with them: what are you really concerned about? Counterfactuals may not be that useful for what you're trying to do. Actually, I submitted a panel discussion at the Learning From Incidents conference, I think in February 2023, on exactly this question, about learning from chances in failures, and adoption of learning from incident reports. It's the same thing.

Brush: We all know that public incident reports aren't exactly the same thing as internal ones, because, obviously, there's a PR component and a legal component. Is there anyone that you've ever read that you thought, this company gets it?

Scott: One is Laura Nolan, formerly a Senior Staff Engineer at Slack. If you go and look at Slack's incident reports, many of the trickiest ones, up until October or so, were written by her. I think they do a great job of covering both what happened from a user perspective and what happened from the perspective of folks participating in the incident, which overlaps with, how did it make sense at the time? The other person is Pete Shima, who works in reliability at Epic Games, helping make sure that Fortnite stays up. There are some interesting incident reports out of Epic when there have been big Fortnite outages. I think that one is especially interesting. Learning from incidents, or this perspective of modern safety science intersecting with software reliability, can really be anywhere. You don't have to be FAANG. You don't have to be X, Y, or Z. The people keeping Fortnite up can also apply this and have increased learning and build trust, and all those sorts of things, including through public incident reports.

Brush: It's interesting that you named two companies that are really what I would consider consumer-based. I know Slack has enterprise customers and enterprise deals as well, but I think we tend to think of them as a very consumer-focused company. Do you think this is harder for enterprise?

Scott: For sure. This goes to adoption as well. Maybe you could compare it to, although it's very different, publishing demographics for engineering organizations. Like, why would you want to do that if the numbers are bad, and think of the cultural sea change that had to happen. I think maybe it was Tracy Chou at Pinterest who drove a bunch of that. It's the bar to release that information. I think, yes, particularly for enterprise companies, if I simulate being an executive, it's like, what's the good? The good here is secondary, it's ephemeral. Maybe there's some kudos, we helped the industry a little bit at a time. The potential downside is someone at an important contract takes this the wrong way, because maybe they don't understand this perspective on safety. It's complicated. It's complex. I don't think you can expect everyone to do that. There's a lot of apprehension, and the reward is qualitative. It all ties back. It's one of these things with learning from incidents in complex sociotechnical systems. It's particularly challenging for enterprise folks.

Brush: I was wondering if you have to have two versions of the learning. One is the one your customers are mature enough and ready to see. Then there's the one for your organization that wants to do that.

Scott: For sure. I would point to, as you said, the differences between internal and external. The authority I would point to is Adaptive Capacity Labs, which is John Allspaw, Dr. David Woods, and the late Dr. Richard Cook, the latter two being professors. They have a great blog post on the multiple audiences for incident reports and for post-mortems. That's something interesting to think about as an engineer, from a systems thinking perspective or whatever: there are different audiences. This is why maybe the reports from Slack and Epic are so interesting. The more you can learn from the more people, the better, but also some of those learnings are highly contextual. I'm going to learn a lot more from an incident report from an incident I've been in that's written for engineers, because I know what systems they're talking about. One of the things from Sidney Dekker is that complex systems are path dependent. The history matters. How these decisions were made years ago, which led to the systems that you depend on now having weird quirks. That's not a thing that I think you can really expect to be in a public incident report. It's like, why that weird language choice? You have to hunt in the Google Drive. I really do think there are different audiences and different learnings possible in different settings. John Allspaw retweeted a meme he had made, which is the Pawn Stars one, or something, where there's the dad and the father, and it's like, it's helpful; it's no good because it's public; but they have to start somewhere. Look at John Allspaw's Twitter for a recent, well-said meme on this very topic.

Brush: What is the most interesting thing you think you've ever learned by asking, how did it make sense at the time, that you don't think you would have learned?

Scott: I had an incident where, by doing something incremental, we thought that we were taking a safer approach, but, in fact, there were rails for doing a larger chunk of work at a time that would have made it safer. It's not exactly, how did it make sense at the time, but it came from continuing to ask and be curious, and think about, how is this supposed to work? Or, when other people did this migration, how did they not explode? Even though it wasn't directly a remediation, we had a discussion about the incident as a team, and a few days later, we followed up with the other team who built the infrastructure we were building on. It turned out that, yes, actually, maybe the docs we were looking at were not exactly as good as we thought. They'd point us to a place in the code base which actually had some of this dual read, dual write stuff built in, but only for one specific path, so that if you chose the riskier path, you got a better outcome, because there was protection for the risky path but not for the safer path. If you back up from specifically, how did it make sense at the time, to related things like curiosity, systems thinking, whatever, that was pretty interesting. That was just in the past few months.

Brush: That seems related, to me, to when folks sometimes slow down their deployments so they don't trigger the alerting anymore.

Scott: Yes, all of these unintended consequences. Just after Black Friday and Cyber Monday, many folks may have had code freezes. The classic one is, you have a memory leak because your code was frozen for a week, when normally you deploy multiple times a day, and so your Java [inaudible 00:37:35]. Then you're like, this thing I did that should have made me safer, in fact, made it worse.

 


 

Recorded at:

Sep 14, 2023
