Prepare to Be Unprepared: Investing in Capacity to Adapt to Surprises in Software-Reliant Businesses

Key Takeaways

  • Building and maintaining resilience requires intentionally creating conditions where engineers can share, discuss, and demonstrate their expertise with one another.
  • Engineering resilience means enhancing and expanding how people successfully handle surprising scenarios, so creating opportunities for people to share the "messy details" of their experience handling an incident is paramount.
  • The primary challenge in resilience engineering is understanding what does not go wrong in order to expand what goes well. We notice incidents, but we tend not to notice when they do not happen.
  • Investing time and attention in learning about other groups' goals and the constraints they typically face supports reciprocity, where groups mutually assist each other when needed.
  • When people make mistakes, their actions are looked at closely, but when people solve problems, their actions are rarely looked at in detail. Resilience Engineering emphasizes the critical importance of the latter over the former.

Typical approaches to improving outcomes in software-reliant businesses narrowly focus on reducing the incidents they experience. There is an implied (almost unspoken) assumption that underlies this approach: that incidents are extraordinary aberrations, unconnected to "normal" work. They’re often seen as irritating distractions. For over twenty years, the field of Resilience Engineering has aimed at flipping this approach around - by understanding what makes incidents so rare (relative to when and how they do not happen) and so minor (relative to how much worse they can be), and deliberately enhancing what makes that possible.

This article touches on a few aspects of this perspective shift. 

Being prepared to be unprepared 

Resilience represents the activities, conditions, and investments in capabilities that enable people to adapt to situations which cannot be anticipated or foreseen. This is not my definition; it's one that the Resilience Engineering community has developed after over 20 years of studying adaptation in uncertain, surprising, complex, and ambiguous situations.

I realize this is a very wordy definition. Another way to describe resilience is to point to a fundamental element of the concept: adaptive capacity. This term has a long history in the fields of biology, ecology, environmental science, and even climate research. The Resilience Engineering community recognized this concept as applicable at a human and organizational level.

Adaptive capacity can be defined as "the potential to adapt in the future following a disturbance (a change in conditions, information, goals, or difficulties) that threatens a plan-in-progress." Note the emphasis on the potential to adapt - it’s not adaptation itself, but the investments made prior to needing to adapt.

Richard Cook and David Woods (pioneers of the field) have called this being "poised" to adapt. This refers to the expertise, resources, and conditions already in place which make it possible for people to respond to an incident.

Resilience is not reliability++

My colleague Dr. David Woods has written about how reliability is different from resilience:

The problem is that reliability makes the assumption that the future will be just like the past. That assumption doesn’t hold because there are two facts about this universe that are unavoidable: There are finite resources and things change.

In other words, reliability is better understood as a probability: it is the likelihood that one of many identical things will perform without failure over a specified period of time. Reliability is derived by testing or observing populations of things that have known (and precise) ideal dimensions and behaviors. Making predictions using reliability data assumes (incorrectly) that the future will look just like the past.
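
As a toy illustration (the numbers and snippet here are my own, not drawn from any study), reliability can be estimated empirically by observing a population of identical components over a fixed period:

# Illustrative only: estimating reliability as a probability from
# observations of a (hypothetical) population of identical components.
population = 1000   # identical units put into service
failures = 12       # units that failed during the observation window

reliability = 1 - failures / population
print(f"Estimated reliability over the period: {reliability:.3f}")  # 0.988

That estimate only predicts the future if tomorrow's population and operating conditions look like yesterday's - which is exactly the assumption Woods challenges above.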

Related is the concept of robustness: all the processes, mechanisms, or automations we put in place to handle known failure modes. It contrasts with making predictions from reliability data because it anticipates specific future failures and puts countermeasures in place to either prevent them or lessen their impact. We can build robustness around failure modes we can predict, but the world changes in ways that surprise us - in ways that defy prediction.
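
To make the contrast concrete, here is a minimal sketch (my own hypothetical example, in Python) of a robustness mechanism: a retry wrapper that handles one known, anticipated failure mode - transient connection errors - and nothing else:

import random
import time

def fetch_with_retries(fetch, attempts=3, base_delay=0.5):
    # A countermeasure for a failure mode we predicted: flaky network calls.
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # the known countermeasure is exhausted
            # Exponential backoff with jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

A surprise that doesn't raise ConnectionError - a dependency returning subtly wrong data, say - sails straight past this countermeasure; handling it falls to the people operating the system, which is where resilience comes in.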

Resilience is about the already-present skills and capabilities people draw on when responding to surprises, and the related ability for the system (people and the technology they operate as a whole) to anticipate and adapt in response to the surprises that occur.

Resilience hides in plain sight

A well-known and contrarian adage in the Resilience Engineering community is that Murphy's Law - "anything that can go wrong, will" - is wrong. What can go wrong almost never does, but we don't tend to notice that.

People engaged in modern work (not just software engineers) are continually adapting what they’re doing, according to the context they find themselves in. They’re able to avoid problems in almost everything they do, almost all of the time. When things do go "sideways" and an issue crops up that they need to handle or rectify, they are able to adapt to these situations due to the expertise they have.

Research in decision-making described in the article Seeing the invisible: Perceptual-cognitive aspects of expertise by Klein, G. A., & Hoffman, R. R. (2020) reveals that while demonstrations of expertise play out in time-pressured, high-consequence events (like incident response), expertise is built through experience with the varied situations of "ordinary" everyday work. It is "hidden" because the speed and ease with which experts do ordinary work masks how sophisticated that work is. Woods and Hollnagel (2006) call this the Law of Fluency:

"Well"-adapted cognitive work occurs with a facility that belies the difficulty of the demands resolved and the dilemmas balanced.

In other words: as people gain expertise, becoming more familiar and comfortable with confronting uncertainty and surprise, they also become less able to recognize their own skill in handling such challenges. This is what leads more-novice observers to remark that the experts "make it look easy."

In addition to this phenomenon where the ingredients of expertise become more "invisible" as it grows, there are many activities people engage in that support resilient performance but that practitioners view as just "good practices."

Peer code review is an example of such an activity. On the surface, we can view reviewing code written by peers as a way the author of a given code change can build confidence (via critique and suggestions from colleagues) that it will behave as intended. Looking a bit deeper, we can see benefits for the reviewers as well: they have an opportunity to understand not only what new or changing functionality their peers are focused on, but also to gain insight into how others understand the codebase, the language they’re writing in, specific techniques which may apply in their own projects, and a myriad of other sometimes subtle-but-real benefits. Seen in this way, the ordinary practice of peer code review can have a significant influence on how participants are able to adapt to surprising situations. Given the choice between an incident responder who doesn’t engage in code review and one who does, the latter has a clear advantage.

Adaptive capacity comes from amplifying expertise

In order to increase (and sustain) adaptive capacity, we first need to look closely at what makes it possible for people in your organization to respond to incidents as well as they do – what practices, conditions, and resources they depend on. Incidents can always end up worse than they are. Look at the concrete things people do to keep them from getting as bad as they could have been. How did they recognize what was happening? How did they know what to do in that situation? Once you can identify these sources of adaptive capacity, you can a) enhance them and b) protect them from being eroded in the future.

Here are a few examples:

  •  New hire on-call shadowing. When a new teammate first takes on-call rotation responsibilities, they will shadow an experienced person for their first rotation. This provides an opportunity for the novice to understand what scenarios they may find themselves in and what a veteran does in those situations. It also gives the veteran a chance to explain and describe to the novice what, how, and why they are doing what they’re doing. This is a practice which can easily erode, especially under economic tightening: why pay two engineers to be on-call when it only "takes" one?
  • Visibility across code repositories and logs. Many companies have a (sometimes implicit) policy of allowing all engineers access to all code repositories in use by the organization. This accessibility can help engineers explore and understand what mechanisms might be in play when trying to diagnose or rectify unexpected behavior. This is another example of a critical source of adaptive capacity, even though companies with this sort of policy don’t often recognize it as such; it’s just seen as 'the way we do things.' It's also not too difficult to imagine this sort of all-engineers/all-repos access being removed or significantly reduced in the name of compliance or security concerns. One way such a policy might be made explicit is sketched just below.
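
As a hypothetical sketch (assuming, purely for illustration, that the organization hosts its code on GitHub; the org name and token are placeholders), the "everyone can read everything" policy can be codified by setting the organization's base repository permission rather than leaving it as an unwritten habit:

import os
import requests  # third-party HTTP client (pip install requests)

# Hypothetical example: give every organization member at least read access
# to the organization's repositories. "example-org" and GITHUB_TOKEN are
# placeholders, not real values.
response = requests.patch(
    "https://api.github.com/orgs/example-org",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={"default_repository_permission": "read"},
    timeout=10,
)
response.raise_for_status()

The specific API call matters less than the act of writing the policy down: once the access is explicit, it is easier to notice - and to push back - when someone proposes taking it away.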

Resilience is fueled by sharing adaptive capacity across the organization

To share adaptive capacity means first taking the initiative to understand: 

  • what makes work difficult for other "neighboring" units (teams, groups, etc.),
  • what makes them good at handling surprises when they arise, and
  • how to stay updated, in a fluid way, on how well they are handling those surprises.

Only then can groups commit to mutually assisting each other when situations that need that neighboring support begin to emerge. This is called reciprocity.

How does this happen, practically? By investing time and attention in learning about the goals other groups have and the constraints they typically face. One way this can happen in software teams is when engineers are encouraged to take on voluntary, temporary "rotations" working on other teams, even for a short period of time.

The best and most concrete study on how adaptive capacity can be shared is "Building and revising adaptive capacity sharing for technical incident response: A case of resilience engineering" by Dr. Richard Cook and Beth Long in the journal Applied Ergonomics. Their study of an organization’s approach resulted in two main findings:

  1. The ability to borrow adaptive capacity from other units is a hallmark of resilient systems.
  2. The deliberate adjustment of adaptive capacity sharing is a feature of some forms of resilience engineering.

An accessible version of this research paper, On adaptive capacity in incident response, describes a practitioner-led effort that created a new way of handling particularly difficult incidents. Under certain conditions, incident responders could "borrow" from a deep reserve of expertise to assist in effective and fluid ways. This reserve consisted of a small group of tenured volunteers with diverse incident experience. Bringing their expertise to bear in high-severity and difficult-to-resolve situations meant that incidents were handled more efficiently. Members of this volunteer support group noted that while the idea for this emerged from hands-on engineers, leaders at the company recognized its value and provided the support and organizational cover so that it could expand and evolve.

Success, in any form, tends to produce growth in things such as new markets and customers, use cases, functionality, partnerships, and integrations. This growth comes with increased complexity in technical systems, organizational boundaries, stakeholders, etc. This additional complexity is what produces novel and unexpected behaviors which require adaptation in order for the organization to remain viable. In other words: success means growth, growth means complexity, and complexity means surprise.

Sharing adaptive capacity is what creates conditions which enable an organization’s success to continue. 

Investments in the sharing of adaptive capacity pay off in the ability to sustain success and keep brittle collapse of the organization’s work at arm’s length.

Building skills in incident response is building expertise (not just experience)

The best people can do when it comes to responding to an incident is to a) recognize immediately what is happening, and b) know immediately what to do about it. Anything that can bolster people’s expertise in support of those two things is paramount. Everything else that is commonly thought of as a skill is secondary and often misses the forest for the trees. 

For example: it’s not difficult to find guidance akin to "clear communication to stakeholders." On the surface, this is quite reasonable advice. But when it’s unclear or ambiguous what is happening (which is often the case at the beginning of an incident), reporting "we don’t know what is happening and therefore we’re unsure how long it will take to resolve" isn’t something non-responding stakeholders typically view as "clear" communication. Yet, it’s the reality in many cases. Note also that efforts to communicate "what" is happening and "when" it’ll be resolved take attention away from resolving the issue. This is an unsolvable dilemma. Dr. Laura Maguire’s PhD dissertation work explored this very dilemma and the phenomena that surround it, and she wrote a piece summarizing her findings, Exploring Costs of Coordination During Outages with Laura Maguire at QCon London.

So, what activities are productive in building skills to respond effectively to incidents? Understand the "messy details" of incidents that have already happened, from the perspective of the people who actually responded to them! What made it difficult or confusing? What did they actually do when they didn’t know what was happening? What did they know (that others didn’t) which made a difference in how they resolved the problem? These are always productive directions to take.

Finding existing sources of resilience

Look at the incidents (otherwise known as "surprises which challenge plans") you’re experiencing and identify what made handling those cases possible.

What did people look at? Telemetry? Logs? How did they know what to look for? How did they know where to look? Did they go looking in parts of code repositories or logs that others are responsible for? If they called others for help: how did they know who to call, or how to actually contact them?

Many of these activities required first having access to the data people relied on - and they had that access. This may seem too ordinary to be worth calling out, but it’s often not difficult to come up with reasons why that access might be taken away in the future.

People make use of whatever they think they need to when faced with solving a problem under time pressure with consequences for getting things wrong. More often than not, what people actually do isn’t given a good deal of attention. Unless, of course, their actions are seen as triggering a problem in the first place, or making an existing issue worse.

In other words: when people make mistakes, their actions are looked at closely. When people solve problems, their actions are rarely looked at in detail (if at all). 

Resilience Engineering emphasizes the critical importance of the latter over the former.
