
Resilience Hides in Plain Sight


Summary

John Allspaw describes what resilience is, and how incredibly hard it is to recognize.

Bio

John Allspaw has worked in software systems engineering and operations for over twenty years. John's publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to "The DevOps Handbook." His 2009 Velocity talk with Paul Hammond, "10+ Deploys Per Day: Dev and Ops Cooperation," helped start the DevOps movement. John served as CTO at Etsy.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Allspaw: In the days following the 2001 World Trade Center attack, a critical priority was to restore electric power to lower Manhattan. Because of where the Twin Towers were located, their collapse produced massive disruptions to all kinds of infrastructure systems, many of which were highly interdependent. Multiple organizations from government and the private sector, many of which had never needed to work together before, settled on a combined strategy with three parts. The first was shutting down the networks that were providing power to areas that were at risk or vulnerable in some way, places where it wasn't good for power to keep flowing. The second was provisioning spot power, in the form of trailer-mounted diesel generators. Finally, they installed temporary distribution cables, which carried power from the undamaged, functioning parts of the grid to the damaged, non-functioning ones. Here's my question. What made this possible? The short answer is cheeky, and it's resilience.

My name is John Allspaw. I work for a company called Adaptive Capacity Labs.

What is Resilience?

There are lots of claims about what resilience is in the software world. There are lots of diverse uses, hijackings by agendas, bending it to mean whatever you want it to mean. It's quite convenient in that way. We can just say it, and we look and we nod and, yes, resilience, and we walk away from the conversation with completely different ideas. There's just a little bit of nomenclature that I want to set out, because in order for the rest of the talk to make sense, I need to at least beg your humoring me temporarily, a leap of faith, a suspension of disbelief. Resilience, in the way that I'm using it, and the way everyone else in this track is using it, is that which comes from the field of resilience engineering, a field that is about 20 years old, and one that software has only started exploring in the last 5 or so years. Resilience isn't reliability++. It's not reliability, but harder, but more awesome. It's not any of these things. It's not fault tolerance or high availability. It's not chaos engineering, although it is incredibly related in a bunch of different ways. It's not about hardware or software at all. If I were to make a connection, on the fly, to a bit of what Vanessa talked about: software and hardware don't resolve outages, do they? People do. To be a little bit more specific, people with expertise, calling upon resources that already existed prior to the outage, are what resolve outages.

Stripe puts out Increment magazine, and issue 18 includes an interview with my colleague, David Woods, one of the founders of the field of resilience engineering. He was asked this: what's the difference between reliability and resilience as it relates to complex systems? He said, "The problem is that reliability makes the assumption that the future will be just like the past. That assumption doesn't hold because there are two facts in this universe that are unavoidable. There are finite resources, and things change." My other colleague, Richard Cook, put it a little differently. He said that one way of thinking about resilience is the ability of a complex system to adapt to challenges, which is all right. It's not really precise, but it helps at least with the contrast here. Resilience is inferred from the behavior of systems in response to challenges. It's not a thing that you have. As a matter of fact, Erik Hollnagel, another founder of the field, has pointed out that resilience is something you do, not something you have. Woods, who's been working with Hollnagel for going on 4 decades, put it even more succinctly: resilience is a verb, not a noun. To contrast this with reliability, think about it this way: reliability is the likelihood that one of many identical things will fail over a period of time. Reliability is derived by testing or observing populations of these identical things over time. Are your systems the same every day? Are your systems the same every second? I think it's a fine colloquial usage, but in this contrast, you see that the two aren't the same.

Same with this idea of robustness. You have two examples in a vehicle: shock absorbers and struts in the front. They're designed to mitigate or dampen variations in the road as you drive, to make them less disruptive. It's limited in its range. There are a couple of potholes that it won't handle. The other is a spare tire, a redundant piece of kit that lots of cars have. Neither of those things is resilience. I want to get this across: both of those, the spare tire and the shock absorber, are preventative designs, preparations for a few vehicle-specific situations. A spare tire doesn't help with congested traffic. It doesn't help with shutdowns of routes along your journey, or critical detours, or the driver falling ill. This is sometimes uncomfortable, because as engineers we're hardwired with an almost unspoken assumption, like an ethos, something visceral, something at our core, an aspiration: that we can design and build things in a way such that they won't break, that they'll continue working even if stuff happens. We're going to make it bulletproof. It's a very reasonable aspiration, but it's just that, an aspiration. Because our daily experience tells us that that's a fantasy, and stuff does break. We want to believe that we can make it so that we're prepared for everything. Resilience isn't about designing for failures that you can imagine. Resilience in the resilience engineering sense is reserved for unforeseen, unanticipated, unexpected, fundamentally surprising situations. Situations that you didn't anticipate were going to be the case, or that you anticipated might be the case and were confident you were prepared for, but that turn out to be different.

Adaptive Capacity

You're going to hear the word surprise come up a good deal. There are a number of different qualities and facets to the concept of resilience. The one that I want you to leave with, the one that I want you to search and replace in your mind, is the term adaptive capacity. Adaptive capacity is a term that comes from the ecological and environmental sciences. In those worlds it's related to fitness: the fitness of an organism to meet the demands of its local and neighboring environments, diversity in nature, and all of that sort of thing. For us, I've got two ways of saying something similar, two definitions that take complementary stances. One way is to think of adaptive capacity as the potential to adapt in the future following a disturbance. By disturbance, we mean a change in conditions, or information, goals, or difficulties that threatens a plan-in-progress. This one, as well as the next one, is irritatingly vague. Possibly, unless you're used to reading academic papers, you can't exactly wrap your head around it in a concrete way. The thing I want to emphasize here is that adaptive capacity is about the potential to adapt. It is not the adaptation itself; it is what made adaptation possible. If you don't have the raw ingredients that make adaptation possible, there is no adaptation. The word for that is brittleness. Another way of thinking about it, from my colleague, Dr. Woods: adaptive capacity can be seen as a system's capacity to adapt to challenges ahead, when the exact challenge to be handled cannot be specified completely in advance. I think that's a better one. Or, when attempts to specify future challenges end up missing important aspects of those challenges when they do occur. Meaning, I'm confident about this thing that could possibly happen, and now it's happening, but it's actually not exactly that. It's a little bit like that, but it's different enough that it might as well be not at all related. Surprise.

On-Call NERFs (New Relic Emergency Response Force)

Now that I've given a little bit of terminology groundwork, I'm going to walk you through a concrete example of what the actual engineering of adaptive capacity looks like. Why am I saying it that way? Because the majority, 10 to 12 of the 20 or 22 years that resilience engineering has been a field, was spent just identifying sources of adaptive capacity and resilience in the wild, ones that already exist. Because if you want to deliberately make something happen, you have to have a handle on what that concrete, non-abstract, non-generalized phenomenon is. It's like software engineering: you can't write code when someone says, write the code to do the thing that I want. What is that thing? Just write the code. You haven't even told me what it is. Here is an example. A couple of years ago, my colleagues Richard Cook and Laura Maguire and I were engaged in a research-industry consortium known as the SNAFU Catchers, where researchers from resilience engineering partnered with companies to study them. There's a paper that was written by Cook and my new colleague, Beth Long, in Applied Ergonomics; the title is, "Building and revising adaptive capacity sharing for technical incident response: A case of resilience engineering." I would encourage you to read the paper. It's a good paper. For those of you who don't have access to something to get past the $40 or whatever the paywall charges, there's a translation that I worked on with Beth and Richard in that same Increment issue. It was about observations that we made at a company called New Relic, where Beth worked at the time. Typically, what would happen in an incident situation is that somebody would be on-call, like, "This is a weird thing. This is definitely an incident. We're going to gather a bunch of people together from my team. Then we'll see what's happening." There are diagnostics, and there's developing of options, which we do. For the most part, these are pretty successful. This is a garden variety, textbook incident.

There were other incidents, however, that were spicy, a term that New Relic was using at the time. These are situations that genuinely challenge the people involved, even the people who wrote the software. You try to work out: what is happening? What can we do about it? Will those things work? Will any of those things make things worse? What else could be happening that we're not seeing? What would happen, and there was this observation, is that there were a handful of people who would sometimes volunteer, jump in to help these local teams, and it would make a difference. Their involvement would make a difference. Now I'm going to ask, has anybody ever had an experience in your organization or anywhere you've worked, where you're responding to an incident, and really wrestling with, "What's happening? I'm not entirely sure what's going on here. This is weird." Then a particular person appears in the Zoom, or the Slack or IRC channel, whatever, and everybody gets a sense of, I'm feeling a little bit better, because this person has shown up. Richard Cook spent a lot of his career as an anesthesiologist, and he would describe this in the same frame as the emergency room: there are these people with expertise who show up. There will also be other people who come to the door and say, "What can I do to help?" and to some of them you say, "No, we're good. We're fine." These people, whom we would sometimes call knowledge islands, you tend to think of as both the greatest strength but also sometimes a liability. The fact of the matter is, it helped a great deal.

What the study did was capture what ended up happening when New Relic made this very, what seemed to be quite obvious, observation. This is from the paper: they established a volunteer support cadre to assist in response to high-severity or difficult-to-resolve events. This volunteer incident group would provide a deep technical resource that could be called upon to support incident response. There were a handful of these people throughout the organization. Picture a team, let's say, at New Relic. We've got a manager and a bunch of engineers. In the green, we've got a member of this group; they called it the NERF team, the New Relic Emergency Response Force, or something like that. They just referred to them internally as NERFs. That's the person in the green. What would happen is they established an on-call. In the first iteration, there were eight NERFs located in eight different home teams. They had a day job: I'm working on this team, I'm writing some code, I'm doing some stuff, whatever. They were distributed across different teams. What would happen is, again, when an incident showed up, a team for the most part could handle it themselves. If you were a NERF, you would have a weekly on-call rotation, and you wouldn't be involved in every incident. What you would do is keep an eye out. You would be aware of incidents that were ongoing. People throughout the organization knew who the on-call NERF was, and they could reach out to them to say, can you get some eyes on this with us? Or they're doing their work with one eye, in their peripheral vision, on the incident Slack channels. They'd have split attention. If they were called, they would be able to come up to speed much quicker, because they would have been paying half attention the entire time. Even doing that increases their understanding of what's currently happening across the organization. In theory, this on-call NERF engineer would participate in incidents only occasionally, but in practice, they usually monitored the incident progress, effectively staying on hot standby. Again, being alert to active incidents meant they could come up to speed.

Here's the rub, and here's where this is otherwise a fine, novel move. This team up here on the right is the NERF's home team. That manager understands that they have 4 engineers, but they don't have 100% of 4 engineers. They have 100% of 3 engineers, and during that NERF's rotation, anywhere from 0% to 70%, maybe 80%, of the fourth, depending on what happened. They gave up a local benefit, sacrificed local productivity, for a more global advantage. In resilience engineering nomenclature, these would be expressions of initiative and reciprocity. Members of the group were aware that their participation in this volunteer support rotation took time away from their primary work, and they tried to build in backstops to compensate. For example, a team of five engineers with one NERF could expect that engineer to be focused on incidents about one week out of eight. That NERF would schedule work for their on-call week that was both interruptible and less taxing than that of their teammates. The workload of the engineer's home team, however, stayed the same. The on-call NERF engineer's decreased productivity was treated as overhead and understood to be a worthwhile investment. A person had asked earlier about being able to justify this economically. They couldn't, and they were fine with that.

Advantages of NERFs

What were some of the advantages? The organization, once they set this up, benefited quickly. This was a bottom-up, grassroots thing that they did to establish the NERFs. For one, some incidents were resolved faster. Bringing expertise and diverse incident experience, NERFs were able to help first responders identify and resolve the issues a bit more efficiently. Second, the NERFs relieved some of the strain of managing incidents with severe consequences. First-responder engineers knew that a specific person could appear when an incident was severe, long running, or really spicy, which helped lessen their anxiety about it. It reduced the fire alarm effect. Previously, a serious incident would capture the attention and efforts of many senior engineers across the organization, disrupting work across lots of teams. Now, non-responders could safely stay focused on their own tasks, knowing that the NERF group member on-call would engage if needed. Group members who were not on-call, meanwhile, could focus on their home team's work for seven out of eight weeks. I'm just highlighting a handful of points here; there's so much more to this particular case. It is probably one of the most important research cases that's been written up, either in the engineer-friendly Increment article or in the more academic Applied Ergonomics paper. At the time, 2019, it was one of the very few real-world, concrete, grounded cases of adaptive capacity being engineered deliberately.

What Made This Possible?

The question we ought to ask when identifying adaptive capacity is: what made this possible? There are a handful of things that I want to highlight. First, they had four reserved, standing Slack channels used for incident handling. I think they called them emergency rooms, like A, B, C, and D. Those were always there. If there was an incident and you wanted to know what was going on, you'd go to emergency room A. If there was more than one going on, then the second one would use B. Very rarely would there be four, and then they'd realize at some point, no, these are probably related, we can collapse them, that sort of thing. They knew where to go. Engineers knew where to go. NERFs knew where to go. There was no, what's the incident channel? Can you show me? Will you invite me? All of that stuff that pulls people away from handling the incident. Second, this volunteer support cadre was composed of engineers with greater-than-average tenure. They had been at the company for a long time. Because the company and its engineering organization were quite good at supporting internal mobility, these engineers had held a bunch of different roles on a bunch of different teams. Databases: ok, I'm working a year or two here. Now I'm working, same system, same company, here and here. They were able to see it from different perspectives. These are, of course, the ingredients of expertise. It's what makes expertise different from experience. It is the variety of different ways of looking and experiencing, plus, they'd been around for a long time, so they saw a huge swath, a broad collection of different ways of getting surprised.

This almost sounds banal, like, of course, to some of you: they had access to critical resources. All the code repos were available to every engineer in the company, save for a very tiny, exceptionally secret few. This doesn't sound like a very big deal. Is there anybody who works in a company where that's not the case, where you only get to see certain repos? Presumably that's accurate for some of you. In some cases, it's mandated, usually for security or compliance or those sorts of regulatory reasons. Sure. In this situation, this was a source of what made the NERFs, and actually all incident response, possible, because there wasn't any, "I can't see your stuff. You can't see my stuff." Deploy logs, deploy data, runbooks, telemetry, all of that was open across all of the engineering teams. That is a source of adaptive capacity. Flip that around: raise your hand if that's a situation you work in now, where all engineers can see all repos, or at least look at logs. Now, as a thought exercise, I want you to imagine changing all of that, so it's no longer the case. Will you be just as good at responding to incidents? It's not hard to justify locking that stuff down in the spirit of compliance and regulation. It's an investment to protect that access being open, and it doesn't come for free. Anybody who's ever had a change of CTO or technical leadership knows that there's a lot of swagger that happens, we're going to change things around here, says the recovering CTO.

It wasn't just about those conditions or activities already present; it was also those that were introduced. I'm going to read this from the paper. What happened is that leadership and management recognized that this NERF arrangement was very valuable, and it wasn't their idea. Recognizing that this resource had become an important part of incident response, the organization sought to raise the status of the role: if you were a NERF, that was a big deal. They provided financial incentives to NERFs and made inclusion in the group an explicit part of career advancement. They went further as it evolved: term limits were established for group members, so that turnover would better distribute the additional work, and more people had the opportunity to sustain this critical resource. In other words, they prepared to be unprepared. Do you see how that distinction is different from reliability, and high availability, and fault tolerance, and all of that? Establishing this reserve group was not a better process or a framework. It wasn't a matter of, this is what we do now. What they did was rearrange existing activities and roles in a way that made it easier for the organization to reconfigure and reprioritize when surprise showed up. They did not recognize what they were doing as building adaptive capacity or resilience, but that's exactly what they did. The NERF program had been in play for a couple of years before we as researchers got there and recognized this as adaptive capacity: not just building and maintaining but evolving and protecting this investment, this valuable way of working that can't be justified ahead of time. It was expensive to do this. There was a cost to the organization, because they gave up productivity in the home teams.

What Are Additional Sources of Adaptive Capacity?

I want to leave you with some questions to mull over. Not just questions, of course, but comments, rants, and raves. What "stuff", conditions, norms, activities, do you and your colleagues critically rely on? Meaning things you rely on and maybe don't even think about, because it's just what you do. Things you rarely, if ever, think of as being critical. I'm going to ask you to think about whether there is any stuff, and I'm using stuff to be as broad as possible, that already exists and makes it easier to handle time-pressured, consequential incidents in creative or improvised ways. As my colleague has said, the resilience is coming from inside the house. You already have it, in the form of things you are doing. I'm literally saying that the organizations you inhabit and work at right now have sources of adaptive capacity, in fact a huge variety of them. Why do I know this? Because you are, almost entirely, from almost every standpoint, successful. Our field measures success in literally how many decimal places after 99%. The vast majority of changes made work and are great, and the vast majority of days and weeks and months do not have incidents; the ones that do stick out to us in an outsized way. The real truth is that things work because of all of you doing your everyday work, which is hard to consider very novel. We don't pay much attention to that, but yet it's always successful. You're preventing incidents constantly. Does this fit for people? What doesn't connect for you?

Questions and Answers

Participant 1: At some point, you talked about this being like a biological system, and you talked about the incentive structure that sprung up around NERF. Without that fitness function, without that incentivization, how does the org know that this is as useful as, say, growing a third arm? It could be a waste of resources if not incentivized, though, like, third arms are useful: you can grab your coffee and your newspaper and scroll your phone. Without the extra pay, how do we know that we're not just dealing with one person from a team? Without the reduction in the length of incidents, how do we know this is actually the right adaptive evolution?

Allspaw: I wanted to present the case that happened at New Relic not as advice but as a real-world demonstration of what adaptive capacity looks like. I wanted to use it in an explanatory, descriptive frame, not as a "you should do what they did." Going back to the other big, meaty part of your question, there are two things that come to mind. The first is, you're employed right now; your organizations are successful enough to have you employed. Were you incentivized? Was financial incentive the motivator for you preventing the incidents that you did last week? Or, could I incentivize somebody who doesn't have enough expertise such that they'd be as successful as somebody who's not financially incentivized but has more expertise? The other is that there are lots of expensive activities that your organizations are already engaged in, all of you, that don't get financial or economic scrutiny, very expensive things. Have any of your companies undergone what's known as a brand redesign? It's one of the most expensive endeavors that a modern business can undertake. There are reams and decades of research that validate that the ROI on brand redesigns cannot be quantified. One of the most expensive: UPS changed its logo. Was it worth it? They don't know either.

Participant 2: In our company it really looks like that. The only difference is that we rotate the on-call NERF each week within the team. We have juniors as well as seniors, and we share the pain, because, yes, when you're on-call, if you do that every week, it takes a toll on your life too. Beyond sharing the pain, what the incidents give you is a holistic awareness of your system, so you grow faster as an individual, even as a junior, if you're exposed to those things. Then you have a second thought when you come to develop afterwards. You have this awareness.

Allspaw: You feel more comfortable knowing that shit's going to break regardless. That's a great example. In fact, you said share the pain; understand also, just as important, that you share the exposure. Share the opportunity across a broader group of people to see weird shit. Seeing weird shit is what builds expertise. The broader the weirdness and the circumstances, the better. That's it. It hides in plain sight. On-call shadowing prior to going on-call for a new hire, same thing. Any situation where you have someone more tenured or more expert talk and/or work, even in a formalized way, with people with less or different expertise could be seen as that, because those are preparations for situations that haven't happened yet and that you can't predict. You just know that it's valuable to do that.

Participant 3: The way you described the NERF organization initially was as a group of people who had wide expertise, who were very good, the sorts of people you want to come help you when you have an incident and weird stuff going on. Then it sounds like the company went ahead and incentivized everybody in the company to join that group. What was their experience? Did that help it? Did that preserve it, or did that make it such a desirable thing to join that it actually killed the golden goose?

Allspaw: The short answer is that the explicit support wasn't really about any of those bullets. The fact that management and leadership recognized this thing, this adaptation that hands-on practitioners put together because it was valuable, was in and of itself a demonstration that, even though management and leaders are distant and have very widely varying understandings of what actually happens, expertise is valued in this company beyond others. I wouldn't overfocus on that financial incentive. It's not the lottery. In fact, it might even just be a token amount. The power there is that it's a recognition. To zoom out a little bit further: a couple of years after, external market factors had impacts. Did you know that happens sometimes? There are broader economic conditions. I don't know what the current situation is with the NERFs, whether it continued on, but that's a different case.

Participant 4: In large institutions, greenfield teams can try this, even in large banks and large financial institutions. But most of the complex incidents are rooted in legacy infrastructure. In those worlds, it's very difficult to create that kind of adaptive capacity when you have, like, one typical expert and then you spread them across everything. In my experience, that's where the true pain points happen.

Allspaw: I think you're absolutely right. This is what made this case remarkable enough to study in depth. This was an exemplar. The purpose of having an exemplar is to demonstrate that particular aspects of that exemplar's performance are possible. Nothing goes anywhere without it. I say that as somebody who gave a talk about continuous deployment once, early on.

Participant 5: You defined adaptive capacity as a system's capacity to adapt to challenges ahead. This example was about perhaps building human capacity plus capability. Is there an example which is not directly related to building human capacity? Are there other examples of something else, maybe?

Allspaw: No. People are the only adaptable element in complex systems. Full stop. That's it. It's not that hardware, software, architectures, frameworks, and programming languages can't be built to be malleable and adaptable, but they don't have an ability to adapt to situations that were unforeseen. That's what resilience and adaptive capacity are reserved for. You heard the term fundamentally surprising earlier; that's another little bit of terminology. There was a paper by a man named Lanir, on the Yom Kippur War, that established these concepts of situational surprise and fundamental surprise. The best way I can explain it to you is how another colleague, Dan Eisenberg, who's an operations research person, put it. A situational surprise is one that can be imagined, one you can noodle on with likelihoods and probability statistics and that sort of thing. Fundamental surprises cannot. He said it this way, "Situational surprise is when you buy a lottery ticket and you win the lottery. A fundamental surprise is when you win the lottery and didn't buy a lottery ticket." That's the thing that we as engineers are confronting. The fact of the matter is, your software doesn't resolve incidents. If you have on-call, then you have already recognized, and are demonstrating, the value of human expertise. If that weren't the case, then you'd write code to build the company and have the code run the company.

Participant 6: People are trying to do that.

Allspaw: I'm not going to lose any years of my life holding my breath, I think. You'll note that all AI- and LLM-centered companies always have SREs and infrastructure engineers listed on their career pages.

Participant 7: It's not an either/or. There are remediation systems that work.

Allspaw: It's not. I'm being snarky. Not for unforeseen scenarios, though. The point that I'm getting at is, without focusing on an incident, forget about that, what makes all of the non-incidents happen? What they did was expand that, to the point that you were making, by doing that rotation and making space, explicitly giving support to practices like shadowing and mentoring, all those internal conferences, all that stuff. You're increasing the fuel that already prevents incidents from happening.

 


 

Recorded at:

Mar 07, 2024
