[Note: please be advised that this transcript contains strong language]
Transcript
Allspaw: We're going to talk about resilience engineering: what the research looks like, what we know about resilience, and what we know about what it means to engineer it.
Here is an obligatory slide about me, here are some places that I've worked, some books that I've written. I wrote a blog post about blameless postmortems a while ago. I gave a talk - really, a mediocre, at best, talk - with my friend Paul Hammond in 2009 that people seem to enjoy. I got my master's degree in Human Factors and System Safety while I was CTO at Etsy, after not having a lot of satisfying answers from computer science and software engineering to the questions that I had.
I'm also part of an industry-academic consortium called the SNAFU Catchers, which is effectively bringing resilience engineering, human factors, and system safety methods to understanding business-critical digital services in IT and software. Last, but not least, I'm in a small consulting group, and I work with my absolute heroes. If anybody saw Crystal's talk earlier, which was excellent, she mentioned a book called "Behind Human Error." Two of those authors are my partners, so I'm feeling very excited.
Resilience Engineering
Some background here: resilience engineering is a field of study. It first emerged around the early 2000s, largely from a related field - really a fueling field - called cognitive systems engineering. You could think of that as an offshoot of human factors; in the UK, human factors is actually termed ergonomics, but you get the gist.
It was mostly an acknowledgement of, or a response to, a couple of accidents that NASA had in the late 90s and 2000. Really, the acknowledgement is that, on paper, most of what NASA and other high-consequence, high-tempo domains were doing shouldn't work as well as it does. They're entitled to way more accidents than they actually had, and the available answers for that were deeply unsatisfying. So resilience engineering was born out of this idea that there must be something more, even if we don't know what that is. It has since developed; there have been about seven symposia over about 12 years. Here are some of the symposia proceedings; you can get them as books on Amazon.
Second bit about resilience engineering: it is a community, and it's largely made up of researchers and practitioners from these fields. You can imagine this is a seriously multidisciplinary community. The best part that I really love about this community, and most of the communities that fuel it, is that there's no such thing as purely academic work. It's not like mathematics - sorry for any math majors here - where you can just go off and do your work. It's not even really like psychology; it's always grounded in naturalistic, real-world conditions. That's what's really fascinating: there's very little theory that doesn't emerge from real studies of real work.
The second part is that it's a community that is fueled, at this point, now 12 years on, by all of these domains. You can see that these domains have these things in common: high consequences for getting it right or wrong, and time pressure. You'll notice that software engineering isn't on here, and that's why I set aside this little bit, because it's actually quite new. It's only been maybe three or four years that software engineering has been part of this community; I would say it's still nascent. In fact, what I'm about to explain a little later is a meaning of resilience that you're not necessarily familiar with, but it has been, and is, reserved for something different from what you might all be used to.
For anybody who's read a little bit about this, here are some of the cast of characters; these are people that I now hang out with. Crystal mentioned Erik Hollnagel, Sidney Dekker, David Woods, Richard Cook, Anne-Sophie Nyssen. Population-wise, resilience engineering is actually much more centered in Europe than it is in the States. However, I will add these folks in the bottom row here - it's got my beautiful mug there - Nora Jones, Casey Rosenthal, Jessica DeVita, and J. Paul Reed. These are folks who are now enrolled in master's degree programs studying what resilience looks like in general, and yet they all come from software.
Model of The World
In order for me to talk a little bit more, I need to set some context and give you the lens through which resilience engineering and cognitive systems engineering view our world in software. This is going to be a brief description. I'm going to start with this line, and this above-the-line and below-the-line region.
We're going to start with this. What we see here is supposed to represent your service, your app, your API - the thing that you are delivering to your customers. You see your code, you see your databases, and you see the consumers; the actual system includes your users, the customers, and all that sort of thing. Of course, if we take this to be the system, we miss out on a whole bunch of stuff. It is definitely familiar to the people in this room.
What we're going to do here is connect it up to all of these things that you use to make your service work. You have your deployment tools, your code repositories, testing tools, monitoring, all that sort of bit. You could say that this is the system, because many of you spend much of your time focused on these things - not the things over here on the right but, really, these things here. These are the things that are mostly front of mind, and you use them to manipulate, or at least make changes to, this other thing. But if we stay here, we wouldn't see where real work happens.
What we're going to do is draw a line here, and we call this the line of representation. Above this line, we see you: all of the people who are getting stuff ready, adding stuff to the system, changing stuff, taking stuff away; the people doing the architecture, the structural framing of how it all hangs together; and the people keeping track of what it's doing, how it's doing it, what it's done in the past, and what's going on with it. You'll notice that in this picture each one of these people has some sort of mental representation of what that system is, and if you look a little more closely, you'll notice that none of them have the same representation.
By the way, that's very characteristic of operators' roles in all of these domains: nobody has the same representation. I'm going to explain this in a little bit. Let's zoom out and summarize: we've got your product or service represented here, we've got the stuff that you build and maintain with, and then we have you all. I'm going to say that the work happens here.
This is our model of the world, and it includes not just the things that are running there, but all of you and the kinds of activities that you're performing - the cognitive work that you're doing to keep that world functioning. We're going to elaborate a little bit; there's a busy slide coming up, but if we play with this a little more, we end up with this sort of model. This model has the line of representation going through the middle, and you interact with the world down below the line via a set of representations. Your interactions are never with the things themselves; you never actually change the systems.
What you do is interact with the representation, and that representation says something about what's going on down below. You can think of these green things as the screens, these little keyholes that you know show only a part of the universe behind them. During the day, that's what you pay attention to, but the only information you have about the system comes from these representations.
What's significant about that is that all of the activities that you do - the observing, the inferring, the anticipating, the planning, the reacting, the modifying - have to be done via those representations. There's a world above the line and a world below the line, and though we mostly talk about the world below the line as if it's very real, as though it's very concrete, as if that's the thing, the big surprise is this: you never get to see it. In some sense, it doesn't really exist.
In a real sense, there's no below the line that you actually touch. You never see code execute, you never see the system actually work, you never actually touch these things; what you do is manipulate them. It's not imaginary, it's very real, but you manipulate a world you cannot see via this set of representations. That's why you need to build these mental models, those conceptions, those understandings about what's going on. Those are the things driving the manipulation. It's not the world below the line that's doing it, it's your conceptual ability to understand the things that have happened in the past, the things that you're doing now, why you're doing those things, what matters at the time, and why what matters matters.
Once we adopt this perspective, once you step away from the idea that below the line is the thing you're dealing with and understand that you're really working above the line, a whole bunch of things change. This is the lens through which resilience engineering and related fields view our world. In other words, these cognitive activities, both collectively in teams and individually up and down an organization, are what make the business actually work.
We've been studying this in detail for a little less than 10 years, maybe about 8 years. The mission that I'm still on is to bring these methods - which had largely been applied in the cockpit, the air traffic control tower, the surgical trauma unit, military intelligence, law enforcement, and firefighting - in to look at us. Finally, the most important part of this is, and we know this: all of this is changing, and it's changing over time. It's a dynamic process.
Incidents in Software
Let's come back to the title of the talk. First, we'll have to talk a little bit about what this is, what this means. The term has become pretty popular, actually, and it's taken on some new meanings. Second, we'll talk about what these sources are in typical environments that you may be a part of. Here's what we see when we look closely at incidents in software.
We find people who bring different contexts and knowledge when they show up to incidents. I'm going to make it clear why we look at incidents versus other things a little later; it'll probably become apparent. What are the types of things that inform their expertise, their knowledge, their context? What's been happening in the world - is there an election, geopolitical events, other outages of vendors? They're informed by what they're using: tracing and observability tools and alerts. They have in their memory recent changes in the technology stack at their company that may or may not be related. Time-series data, new dependencies, dependencies that have always been there, dependencies that might be there.
What's been investigated thus far when they arrive on the scene? The status of other ongoing work - everybody's doing a legacy migration. Observations and hypotheses don't happen in a vacuum; your relayed observations may cause someone else to develop a hypothesis that wouldn't necessarily have existed without them. Logs, time of year, day of year, time of day. I used to work at Etsy; it turns out Cyber Monday is a different type of Monday than most Mondays. Who's on vacation? Who's at a conference? These are contexts that people bring, and this is a lot of stuff. It doesn't get neatly categorized but, yet, you all have it.
What we see is multiple people - Liz's talk is spot on. I would argue it's never not been a team sport; individual people don't work through problem solving by themselves. So when we look at people who resolve outages, especially significant and really costly incidents filled with uncertainty, we see a huge variance. We see variance across the people responding in tenure, which turns out to be really significant. We see domain experience: the ability of a database engineer to get across what they're seeing in language that a network engineer might be able to understand, and vice versa, tells you about the critical importance of this Venn diagram of shared understandings. Finally, past experience with the details.
Here are some other things that we see. We see multiple perspectives that emerge as an incident unfolds: what it is that is happening, what can and cannot be done to stem the bleeding or reduce the blast radius, who has authority to take certain actions. In the most successful organizations, this is somewhat malleable, but there are certainly other organizations where people might be like, "I think I could do this, but I don't have permission to do that." Seeking permission, coordinating what that permission looks like, making sure that's all clear - these are things that we see. We also see things that are stated as absolutely off the table, that should not be tried.
We also see multiple threads of activity. Some of these activities are parallel, some of them are serial, some of them are productive, and some of them are unproductive. Of the unproductive ones - thanks, Crystal, for pointing out hindsight bias - unproductive threads are always seen as unproductive in hindsight. All unproductive threads start out as potentially productive threads. We don't have enough time, but there's a whole bunch of technical terms; I'll have some links where you can learn about these. These terms might not mean what you think they mean, but they have a history in the resilience engineering research.
Am I wrong in thinking that what I'm describing here is familiar to people? Incidents are messy. Unfortunately, they don't follow the step one, step two, step three that you saw at the conference or that you read in the book. This is a representation of these lines of reasoning and some relationships between them. I think if you look reasonably carefully, you'll see that a lot of this rings true. By the way, this is a diagram drawn by David Woods in 1994, and it was in response to how control room operators in nuclear power plants understand and solve problems.
The key thing that I want to keep coming back to is time pressure and high consequences. This is what makes these things pressurized and consequential, and pressure and consequence are what make incidents the ideal thing to study when we want to look at decision making and coordination. What are the guts? It's one thing to say you have to collaborate - unhelpful advice. What does collaboration look like? We do need to bring some science to this and, actually, as it turns out, there is some.
By the way, this messy bit right here, these words - we use these words largely because there's no alternative, but debugging and troubleshooting, I'm going to go on the record, are not sufficient to describe the environments and scenarios that I just described to you. A more technical and much wordier, certainly more academic-sounding, pair of terms would be anomaly response and dynamic fault management. It's a mouthful; you're not going to be putting that in your marketing materials. You'd rather just say debugging - Allspaw doesn't like it, but I'm just going to say debugging anyway.
I will tell you right now that when you say debugging and when you say, "I'm going to debug something," if people don't understand that what you're talking about is time pressured, super concentrated, and the longer it goes, the worse it will be, then it's probably not the best word for you, stick with anomaly response.
Terminology
This brings me to terminology. What is resilience? Getting terms clear is worth the trouble; resilience has become a bit of a loaded word. There are terms that take on new meanings over time, like DevOps and Agile. Resilience, unfortunately, is one of them - and this is a natural thing; there's nothing to get worked up about, no "Oh, my God, I can't believe there's a co-opting of the term." I want to give you something here to think about. Does anybody recognize this man from this photo? Do you know who it is? He's Ward Cunningham. I've had the privilege of meeting Ward and talking with him. He is such a killer guy, a brilliant engineer, and super humble. He's a great guy; he works at New Relic now. In 1992, he wrote something whose summary, basically, is that software development may incur future liability in order to achieve short-term goals. What did he call this? Technical debt.
This is often cited, but never read. Technical debt, in Ward's words, was debt that is explicit when you take it out. Ward coined the term, and that term did not include accidentally taking out technical debt. I think he just told me, "You don't accidentally take out a loan." That was his original intent, and yet: why did you have that outage? Technical debt, I guess. Why are you having culture problems? I've just got too much technical debt. In fact, it's quite convenient: if you have a problem and you don't really have an explanation, just say it's technical debt and you'll be fine.
I do want to be very clear: you have likely gotten a lot of practical mileage from this term. Did he point something out that wasn't familiar to people at the time? No. Anybody in software engineering knew what he was talking about; it was familiar to them. It was a practice, it just didn't have a term. If you see Ward Cunningham, you should say thank you, because he's probably helped justify budgets for head count and roadmaps for all of you.
Having said that, the resilience engineering perspective - I'm going to say: 12 years, NASA, scary shit, people who know what they're talking about - they'll say this. Resilience is not preventative design, it is not fault-tolerance, it is not redundancy. If you want to say fault-tolerance, just say fault-tolerance. If you want to say redundancy, just say redundancy. You don't have to say resilience. You can, and you absolutely are able to - I wish you wouldn't, but you absolutely can, and that'll be fine as well.
I just want to point out that when you read about resilience, if I'm lucky, you'll start reading a little bit more, and you'll see that this is a term that's different. It's also not chaos engineering, although the two are very closely related - I wasn't originally thinking this, but I now think chaos engineering fits very well with resilience engineering. Why? Because chaos engineering is a tacit, in fact probably implicit, acknowledgement that we cannot understand what our system's behaviors are by simply pulling the parts out, looking at the parts and the components, and putting them back together.
It's an acknowledgement that behaviors are surprising - if behaviors weren't surprising, you wouldn't have to do an experiment. Therefore, it's not about hardware or software. It's not a property that a system has, as one person said. Resilience is aimed at setting and keeping conditions such that the unforeseen, the unanticipated, and the unexpected - which are all hallmarks of complex systems and complex systems failures - can be handled. I want you to really drink this in: unforeseen, unanticipated, unexpected, fundamentally surprising, which means they are not imagined.
"Things that have never happened before happen all the time." This is a quote for Scott Sagan who wrote a book on the limits of safety, this book was about nuclear power. By the way, my colleagues and most people who I know who have been focused on resilience engineering since the beginning have mentioned to me that, unfortunately, they would credit the accident at Three Mile Island in 1979 as really the start of their career. That's a completely different talk, but it's hard to overstate how pivotal that event was.
Instead, what they say is: look, this thing that's in here that's not robustness, that's not fault-tolerance - what is it? They would say that it's proactive activities aimed at preparing to be unprepared, and here's the key part: without an ability to justify it economically. You can't justify it economically because it hasn't happened, and it's not even imagined. We're not talking about what could possibly go wrong and what is inside a BCP document. What's inside a BCP document, or a DR plan, or scenario planning are things that you could think of.
Let's say that it's sustaining the potential for future adaptive action when conditions change. It's a continued investment in anticipating what's happening, learning from what has happened, responding given that anticipating and learning, and keeping an eye out for how things could possibly change - which may cause you to rely on resources that are not officially sanctioned. It's not like, oh, resilience is having slack. No, it's actually not, because if it were that easy, you would know how much slack to have.
It's something that a system does, not something that a system has. What you won't see are people who have taken this seriously over the past 12 years saying you can have more or less resilience, like you can put a unit on it - I've got six units of resilience, really need to get to eight. You're not going to hear that. It is not simply reactive; the idea is that it sits at a higher level.
All analogies have limits, and so you need to humor me here, but resilience is not what results from doing chaos experiments. Resilience is about funding the teams that develop and perform chaos experiments. Resilience isn't having a spare tire, resilience is the ability to find ways of getting to your destination. Do you see where that is?
Sustained Adaptive Capacity
Another way of thinking of resilience is sustained adaptive capacity. My colleague Richard Cook has said this: "poised to adapt" - you'll see it in the slides, and this is from the Velocity talk that you should see anyway. In other words, finding sources of resilience, of sustained adaptive capacity, means finding and understanding cognitive work, because I've already established that that is where work really happens.
Here's the thrust, and I think we can agree on this. A resilience engineering perspective might ask this question: what are the things - people, maneuvers, workarounds, knowledge - that went into preventing it from being worse? Quite often, when we look at post-incident reviews, what's called the Safety-I perspective is: let's find all the gaps, and our corrective actions are to fill the gaps. That's fine and totally reasonable; it's also a very traditional way of thinking about it.
Safety-II is a different paradigm that flips it around. It asks: what are all the things that prevent us from having these incidents normally? It's hard to say, "Let's get in a room and talk about all of the outages that didn't happen today," because you don't know where to direct your attention; there are a boatload of possibilities that didn't happen. So the key is to hijack, Trojan-horse, a postmortem of an event that did happen, and ask deeper, closer questions about what, despite the uncertainty and ambiguity, made it go not nearly as badly as it could have.
How do we find this? You should find incidents that have a high degree of surprise but whose consequences were not severe, and look closely at the details of what went into making it not nearly as bad as it could've been. Once you do that, then you've got a shot at protecting and explicitly acknowledging these sources. What do I mean by this? I'm going to break it down a little bit.
Indications and Severe Consequences
What are indications of surprise and novelty, indications that people are struggling in uncertainty? I'm going to put up a couple of snippets from chat logs of real events, and I think you'll get it. I can't do the methods of cognitive task analysis and qualitative research in the next five minutes, but this gives you a sense, these are pretty unambiguous signs that shit's ambiguous.
Indications of contrasting mental models: when you see people saying that they're confused about something, but they clearly give you some sense that they know about these things and yet a part of it is strange; or when they contrast what their expectancies were with what has actually transpired or been revealed to them. Indications about actions: what to do, what sequence to do them in, what's possible, what direction we could go in - when you don't know much - that could possibly result in some information that would be helpful.
Why not look at incidents with severe consequences? First and foremost, because scrutiny from stakeholders with face-saving agendas tends to block deep inquiry. In a lot of organizations, when it's severe - and when I say severe, it doesn't have to mean loss of life, although it can - there's a significant amount of press and media coverage, and there's activity amongst shareholders, or investors, or boards of directors. Paradoxically, the greater the visibility of an untoward event, the less time and resources are going to be allocated, because you need to find an answer now. You'll actually have more time and space to do a closer, more detailed post-incident review with a not-so-severe event. That's where we go look.
With medium-severity incidents, the cost of getting details and descriptions from people - especially their perspectives and where they were at the time, getting in the tunnel with them, trying to understand what their experience was like, and painting that multiple-perspective picture - is really low relative to the potential gain. It's very difficult to learn from super-severe events; it's not really great. You need this sort of Goldilocks zone: enough significance and severity to mean something and direct attention, but not so much that the pressures from legal, political, and organizational constraints take over.
Some (Contextual) Sources
Let me tell you what we've seen. When I say we look at some of these, it's not just in software; it also reflects what's been seen in these other types of domains. By the way, if any of these sound unfamiliar to you, that would be shocking to me. This is putting some empirical meat on conceptual bones. Esoteric knowledge and expertise in an organization - what does this mean?
We call them knowledge islands - the people in your organization. You wake up in the middle of the night: "Oh, my God, this is going on, looks like it's the database, I don't understand what it's doing, but it's showing us this. What's going on? Oh, call Silvia." Everybody knows. If you've been there for three months, you know that when weird shit happens to the database, you call Silvia. How Silvia knows this is unclear; if you talk with Silvia, she'll tell you a little bit about the tools that have built up around all of this.
When I say esoteric, I mean the real nooks and crannies. "Oh, yes, there was this one time with that MySQL version. We had this thing that takes care of it, and I've got this shell alias." This is a source of resilience, a serious source of resilience. By the way, everyone has knowledge islands; in fact, I would argue that it's impossible to get rid of them. That doesn't mean that the effort to spread that knowledge isn't worthwhile.
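To make that concrete, here is a minimal, hypothetical sketch of the kind of guardrail that grows up around this sort of esoteric knowledge. The host and user names are made up for illustration; --safe-updates is a real MySQL client flag that rejects UPDATE or DELETE statements that don't use a key in the WHERE clause or a LIMIT.

```bash
# Hypothetical knowledge-island guardrail: a shell alias that bakes a
# hard-won MySQL lesson into the command people actually type.
# Host and user names are illustrative; --safe-updates is a real mysql
# client option that refuses UPDATE/DELETE without a keyed WHERE or LIMIT.
alias proddb='mysql --safe-updates -h prod-db.internal -u app_readonly -p'
```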
We also see flexible and dynamic staffing for novel situations. In really successful organizations - ones I would say have very resilient behaviors and actions - there are teams whose current day-job constraints and production pressure can be relaxed in the face of an event. Can you borrow expertise from another team and have those people be productive? They can go back afterwards. This is well beyond the cartoonish idea of a bus factor.
Authority that's expected to migrate across roles. There are a number of really famous incidents, one in particular being the Knight Capital incident. If you're not familiar with it: in 2012, a financial trading firm rolled out new code, and there was a bug. They rolled back the code, which unfortunately made it seven times worse, and they lost about $480 million in about 20 minutes. A decent part of those 20 minutes was spent trying to get permission: the engineers wanted to shut the system down, and they didn't have permission to, because it was thought that maybe it would still work. By the time the 20 minutes was up, they were effectively bankrupt. That's a not-great, brittle example of what happens when migration of authority can't take place. Does this person have the ability to halt things, or completely stop the flow of some revenue-producing service?
A constant sense of unease that drives exploration of normal work. We heard an example earlier today - I can't remember exactly from whom - of a scenario that was only one character away. You're all only one character away from making it all explode; you're clear on that, right? No, seriously, you are, it's been proven. I really hope that I never prove or revalidate that, and I hope you all never do, but it is absolutely possible. In fact, take the flip side: I always ask, especially when I talk to security folks, raise your hand if, given an afternoon, you could find one character that screws it all up. Everyone raises their hand.
The capture and dissemination of near-misses. The rush of cortisol that your amygdala firehoses across your brain when you hit enter, and for a split second you go, "Oh, my God. I'm in dev." You change your PS1 and PS2 to be red in PROD but blue in DEV - but was it blue, or was it green? Oh, shit. SSH in a for loop. Come on. When those happen, what we see in really successful organizations, ones that do a decent job of preparing to be unprepared, is that these near-misses are disseminated like they were public service announcements.
I've seen it, I've seen it in a number of organizations. "Dear everyone who touches a computer at this company, check this out. I was going to do blah today, and I went to go do X and Y, and I almost did Z. Holy crap. Oh, my God. I had to go for a walk. It was terrible. So I just want you to know, don't do any of the A and B. I've got some ideas, you may have better ideas, but I'm thinking we should probably make it so that you could never do that again."
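As an aside, here is a minimal sketch of the prompt-coloring habit mentioned above: make the shell prompt red on production hosts and blue everywhere else, so a near-miss gets caught at a glance. The "prod-*" hostname convention is an assumption for illustration, not something from the talk.

```bash
# Minimal sketch: color the bash prompt by environment so you can tell at a
# glance whether you're in prod. Assumes production hosts are named "prod-*";
# adjust the pattern to your own naming scheme.
case "$(hostname -s)" in
  prod-*) PS1='\[\e[1;31m\]\u@\h \W \$ \[\e[0m\]' ;;  # bold red in prod
  *)      PS1='\[\e[1;34m\]\u@\h \W \$ \[\e[0m\]' ;;  # blue everywhere else
esac
```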
Summary
Resilience is something a system does - I want you to consider this - not something a system has. At the end of the day, you can't say, "I added more resilience today by installing another server; we had one, but now we've got two behind a load balancer." Think of it as something beyond that. This is why the term was reserved.
Creating and sustaining adaptive capacity in an organization, while being unable to justify doing it specifically, is resilient action. If I were you, I would recognize that many things your organization pays for do not have accurate or precise ROIs - many things. Here's one: brand design. Super important, crazy expensive, and nobody knows if it was worth it. I don't know a lot about marketing, but I do know that.
How people - that is to say, the flexible elements of The System - cope with surprise is the path to finding sources of resilience. Another way of summing this up comes from my colleague, Dr. Richard Cook, who said, "Resilience is the story of the outage that didn't happen."