
Adaptive Responses to Resiliently Handle Hard Problems in Software Operations

Key Takeaways

  • Resilience - adapting to changing conditions in real time - is a hallmark of expert performance.
  • Findings from Resilience Engineering studies have revealed generalizable patterns of human cognition when handling complex, changing environments.
  • These studies guide how software engineers and their organizations can effectively organize teams and tasks.
  • Five characteristics of resilient, adaptive expertise include early recognition of changing conditions, rapidly revising one’s mental model, accurately replanning, reconfiguring available resources, and reviewing to learn from past performance.
  • These characteristics can be supported through various design techniques for software interfaces, changing work practices, and conducting training.

As software developers progress in their careers, they develop deep technical systems knowledge and become highly proficient in specific software services, components, or languages. However, as engineers move into more senior positions such as Staff Engineer, Architect, or Sr Tech Lead roles, the scope of how their knowledge is applied changes. At the senior level, knowledge and experience are often applied across the system. This expertise is increasingly called upon for handling novel or unstructured problems or designing innovative solutions to complex problems. This means considering software and team interdependencies, recognizing cascading effects and their implications, and utilizing one’s network to bring appropriate attention and resources to new initiatives or developing situations. In this article, I will discuss several strategies for approaching your role as a senior member of your organization.

Resilience in cognitively demanding software development work

Modern software engineering requires many core capabilities to cope with the complexity of building and running systems at speed and scale and to adapt to continuously changing circumstances. Resilience Engineering offers several concepts that apply to adapting to inevitable pressures, constraints, and surprises.

Resilience has been described in many ways by different fields. It has been used to describe psychological, economic, and societal attributes but comes primarily from ecology. It is used to describe adaptive characteristics of biological and ecological systems, and over the years, our understanding of resilience has changed. In software, perhaps the most impactful description of resilience is from safety researcher David Woods and the Theory of Graceful Extensibility. He defines it as "the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries".

This means an organization does not just "bounce back" or successfully defend itself from disruptions. Instead, it can respond in such a way that new capabilities emerge. Consider how, as Forbes notes in their article on business transformations, during the pandemic, commercial airlines responded to decreased travel by turning routes into cargo flights or how hotels that had lost travelers began offering daily room rates for employees working from home to stay productive safely.

Similarly, this resilience perspective is helpful for software engineering since "surprises" are a core characteristic of everyday operations in large-scale, continuous deployment environments. A core aspect of system design that allows for more resilient and reliable service delivery comes from designing, planning, and training for surprise handling.

Resilience Engineering techniques for everyday performance improvement

Researchers studying performance in high-demand work - like flying a fighter jet at 1800 mph close to terrain, rapidly shutting down a nuclear power plant after an earthquake, or performing open heart surgery on an infant in distress - have identified important human perceptual, reasoning, and response capabilities that allow someone to respond quickly and appropriately to a given event.

Even with extensive preparations and training, unexpected events can make following a playbook or work process difficult or impossible. Add time pressure and uncertainty about what is happening and how quickly things might fail, and the situation becomes overwhelmingly hard to manage.

People are forced to adapt in these kinds of surprising situations. They must rapidly identify a new way to handle the situation as it deteriorates to prevent the failure's impacts from spreading. Successful adaptation is often praised as "quick thinking", and in this article, we’ll explore the basis for quick thinking - or resilient performance - during software incidents.

The theoretical basis for quick thinking is drawn from research into high-consequence work settings. When applied to software operations, it can enhance resilient performance and minimize outages and downtime. Here, these findings are adapted into strategies for action for individual engineers and their organizations. These are:

  1. Recognizing subtly changing events to provide early assessment and action
  2. Revising your mental model of the situation in real time to adjust your actions
  3. Replanning in real time as conditions change
  4. Reconfiguring your available technical and human resources
  5. Reviewing performance for continuous learning

Together, these five capabilities enable quick and accurate responses to unexpected events, allowing teams to move quickly without breaking things.

Recognizing: The importance of early detection

Early recognition of a problem, of changing circumstances, or of the need to revise our understanding of a situation is crucial to resilience. Early detection is beneficial in that it allows:

  • more possibilities for action because the situation has not progressed very far
  • the opportunity to gather more information before needing to act
  • the ability to recruit additional resources to help cope with the situation

Early detection is not always possible due to a lack of data or poor representation of the available data. However, engineers can better recognize problems earlier by continually calibrating their understanding of how the system operates under varying conditions and noticing even subtle changes quickly. Here are three practical ways for software engineers to achieve this in day-to-day work:

Calibrating to variance: One approach is to become more familiar with expected vs. unexpected system behavior by regularly observing the system under different operating conditions, not just when there is a problem. An active monitoring practice helps calibrate expectations about variance, such as distinguishing when a spike in volume indicates a problem from when it simply reflects a certain time zone or customer heavily utilizing the service.
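As an illustrative sketch of this idea (my own, not from any particular monitoring tool), calibrating to variance can be as simple as comparing a live metric against a per-hour-of-day baseline, so that an expected morning traffic spike is not confused with an anomaly. The metric values and the three-standard-deviation threshold below are hypothetical:

```python
from statistics import mean, stdev

def build_baseline(history):
    """history: dict mapping hour-of-day (0-23) to a list of
    observed request counts for that hour on normal days."""
    return {
        hour: (mean(counts), stdev(counts))
        for hour, counts in history.items()
    }

def is_anomalous(baseline, hour, observed, k=3.0):
    """Flag the observation if it falls more than k standard
    deviations from the historical mean for this hour of day."""
    mu, sigma = baseline[hour]
    return abs(observed - mu) > k * sigma

# Example: 1000 requests at 9am is within normal variance;
# the same count at 3am is far outside it.
history = {
    9: [900, 950, 1020, 980, 940],
    3: [110, 95, 120, 100, 105],
}
baseline = build_baseline(history)
print(is_anomalous(baseline, 9, 1000))  # False: expected daytime load
print(is_anomalous(baseline, 3, 1000))  # True: off-hours spike
```

The point is not the specific statistics but the practice: a baseline built from many operating conditions lets the same raw number mean different things in different contexts.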

Expanding knowledge about changes: Another strategy is to develop a practice of reading incident reports and reviewing what the dashboards looked like at the earliest indication of trouble to get better at noticing what an anomalous event looks like.

Encouraging knowledge transfer: Lastly, another technique for lightweight calibration to help early detection is asking, "What did you notice that caused you to think there was a problem?" whenever a coworker describes a near miss or a time they proactively averted an outage. Their explanations and your interpretations of these vicarious experiences reinforce a more elaborate mental model of nominal and off-nominal behavior.

Revising: The role of mental models in solving hard problems

A mental model is an internal representation of a system’s behavior. All software engineers construct mental models of how the system runs and fails. Mental models typically include information about relationships, interdependencies, and interactivity that allow for inferences. They can also help predict how a system will likely respond to different interventions.

For software engineers, this means mentally sifting through possible solutions, issues, and interactions to determine the most reasonable action. What is reasonable depends on assessing each action against the current and expected future conditions, goals, priorities, and available resources - in other words, simulating how different choices will impact desired outcomes. A well-calibrated mental model helps engineers run these simulations effectively and be better prepared to assess the pros and cons of each option and the risks involved.

But mental models can be - and often are - wrong. As noted in Behind Human Error, mental models are partial, incomplete, and flawed. This is not a criticism of the engineer. Instead, it acknowledges the complex and changing nature of large-scale software systems. No one person will have complete and current knowledge of the system. No one has a perfect understanding of the dependencies and interactions of a modern software system. Most software systems are simply too big and change too much, too quickly, for anyone’s knowledge to be consistently accurate.

Having poorly calibrated knowledge is not the problem. The problem is when you don’t know you have poorly calibrated knowledge. This means engineers must continually focus on model updating. A strategic resilience approach is cultivating a continual awareness of how current or stale your understanding of the situation may be. As a researcher studying how engineers respond to incidents, I constantly look for clues indicating how accurate the responders’ mental models are. In other words, is what they know or believe about a situation or a system correct? When it is not, that is a signal that model updating is needed. A high-performing team can quickly identify when they’ve got it wrong and rapidly find data to update their understanding of the situation. Some approaches to continual revising include:

Call out the uncertainties and ambiguities: A technique that helps teams notice when their mental models are incorrect or differ is to ask clarifying questions like "What do you mean when you say this query syntax is wrong?" It’s a simple and direct question that my research has shown is not commonly asked. Explicitly asking creates opportunities for others to reveal what they are thinking and allows all involved to make sure they have the same understanding. This is especially crucial as situations are rapidly changing. Teams can develop shorthand ways of ensuring model alignment to avoid disrupting the incident response.

State assumptions and beliefs explicitly: Developing a practice of explicitly stating assumptions and beliefs lets those around you track the line of reasoning and quickly identify an incorrect assumption or faulty belief. This seems simple, but once you start doing it, you realize how much you let slide about inaccurate or faulty mental models in yourself or others, either because a discrepancy seems too small to be worth revising or because time pressure prevents revising it. A more junior engineer may be apprehensive about asking clarifying questions about a proposed deployment or hesitate to talk through their understanding of the risks of rolling back a change for fear of being wrong. The more senior engineer may not realize there is a gap in their mental model or may not want to publicly call out faulty knowledge.

Learn to be okay with being wrong: Software engineers must accept that their mental models will be wrong. Organizations need to normalize the practice of "being wrong". This shift means that the processes around model updating - like asking seemingly obvious questions - become an accepted and common part of working together. Post-incident learning reviews or pair programming are excellent opportunities to dig into each party’s mental models and assumptions about how the technology works, interacts, and behaves under different conditions.

Replanning: It’s not the plan that counts, it’s the ability to revise

Software engineers responding to a service outage are, for the most part, hard-wired to generate solutions and take action. Researchers Gary Klein, Roberta Calderwood, and Anne Clinton-Cirocco studied expert practitioners in various domains and showed that anomaly recognition, information synthesis, and taking action are tightly coupled processes in human cognition. The cycle of perception and action is a continuous feedback loop, which means constant replanning based on the changing available information. The replanning gets increasingly tricky as time pressure increases, partly due to the coordination requirements.

Consider replanning in an everyday work situation, such as a sprint planning meeting where the team is deciding how to prioritize one feature over another. In this scenario, there is time to consider the implications of changing the work sequencing or priorities, and it is possible to reach out to any parties affected by the decision and account for their input on how the plan may impact them. It is relatively easy to reorganize the workflow with little disruption for anyone.

Contrast that with a high-severity incident where there may be potential data loss in a critical, widely used internal project management tool. The incident response team thinks the data loss may be limited to only a part of the organization. While there is a slight possibility they could recover this data, it would mean keeping the service down for another day, impacting more users. One team has a critical meeting with an important client and needs the service restored within the next hour. This means responders must determine the blast radius of impacted users, the extent of their data loss, and the implications for those teams while the clock is ticking. Time pressure makes any kind of mental or coordinative effort more challenging, and replanning with limited information can have significant consequences: needed perspectives may be unavailable to weigh in, causing more stress for all involved and forcing unexpected shifts in priorities or undesirable tradeoffs.

In a recent study looking at tradeoff decisions during high-severity incidents, my colleague Courtney Nash and I found that successful replanning decisions were inevitably "cross-boundary". A major outage often requires many different roles and levels of the organization to get involved. This means that an understanding of the differing goals and priorities of each role is essential to being able to quickly replan without sacrificing anyone’s goals. And when goals and work do need to change, the implications of doing so become clearer to the replanning effort. These findings and others from the resilience literature suggest an important strategy for resilient replanning:

Create opportunities to broaden perspectives: Formal or informal discussions highlighting implicit perceptions and beliefs can influence how and when participants take action during an incident or work planning. They can use this information to revise inaccurate mental models, adjust policies and practices, and help organizations identify better approaches to team structure, escalation patterns, and other supporting workflows. A greater understanding of goals and priorities and how they may shift in response to different demands aids in prioritization during replanning. A crucial part of coping with degraded conditions is to assess achievable goals given the current situation and figure out which ones may need to be sacrificed or relaxed to sustain operations.

Reconfiguring: Adjusting to changing conditions

Surprises seldom occur when it is convenient to deal with them. Instead, organizations scramble to respond with whoever is available and whatever expertise, authority, or budget may be available. Organizations that use available resources flexibly can support effective problem-solving and coordination even in challenging conditions. This can be simple things like having a widely accessible communication platform that doesn’t require special permissions, codes, or downloaded apps, allowing anyone who could help to join in the effort seamlessly. It may be more complex - such as an organization that promotes cross-training for adjacent roles.

Or it could be holding company-wide game days so that engineers from multiple teams can work together efficiently on a significant outage because they have common ground - they know each other, have some familiarity with parts of the system other than the ones they usually work on, and can rely on their shared experiences to accurately predict who may have the appropriate skills to perform complex tasks. Just as you might add, delete, or move resources within your network configuration, a strategy of dynamic reconfiguration of people and software helps resilience by moving expertise and capabilities to where they are needed while minimizing any impacts of degraded performance in other areas. A resilient strategy for reconfiguration in software organizations includes:

Cultivating cross-boundary awareness: Reconfiguring allows an organization to share resources more efficiently when there is accurate knowledge about the current state of the goals, priorities, and work underway of adjacent teams or major initiatives within the organization. Research looking at complex coordination requirements has shown better outcomes for real-time reconfiguring when the parties have a reasonably calibrated shared mental model of the situation and the context for the decision. This enables each participant to bring their knowledge to bear quickly and effectively, to support collaborative cross-checking (essentially vetting new ideas relative to different perspectives), and to allow for reciprocity (being able to lend help or relax constraints) across teams or organizations.

Maintaining some degree of slack in the system: Modern organizations are fixated on eliminating waste and running lean. But what is considered inefficient or redundant before an incident is often recognized as critical, or at least necessary, in hindsight. In many incidents I’ve studied, Mean Time To Repair (MTTR) is reduced by engineers proactively joining a response even when they are not on-call. This additional capacity is not typically acknowledged or accounted for when assessing the actual requirements for maintaining the system, but it is nonetheless critical. It exists because of engineers’ professional courtesy to one another. It is highly stressful to be responsible for a challenging incident or deal with the pressure of a public outage, and I’ve seen engineers jump into Slack to assist even while putting babies to bed or on vacation. Burnout, turnover, and changing roles are inevitable. Maintaining a slightly larger team than is optimally efficient can make the team more resilient by increasing communication, opportunities to build and maintain common ground, and cross-training for new skills.
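The cost of running without slack can be sketched with a standard back-of-the-envelope queueing argument (my illustration, not a claim from the studies above). In an M/M/1 model, the expected time a task spends in the system is 1 / (service rate - arrival rate), so response times degrade non-linearly as a team approaches full utilization. The rates below are hypothetical:

```python
def avg_time_in_system(service_rate, arrival_rate):
    """M/M/1 queue: expected time a task spends in the system
    (waiting + service), in the same time units as the rates.
    Only valid while arrival_rate < service_rate."""
    assert arrival_rate < service_rate, "queue is unstable"
    return 1.0 / (service_rate - arrival_rate)

# A team that can handle 10 tasks per day, at increasing load:
mu = 10.0
for load in (7.0, 9.0, 9.9):
    w = avg_time_in_system(mu, load)
    print(f"utilization {load / mu:.0%}: avg time in system = {w:.2f} days")
```

At 70% utilization a task spends about a third of a day in the system; at 99% it spends ten days. Real incident response is messier than an M/M/1 queue, but the shape of the curve is why a team sized slightly larger than "optimally efficient" absorbs surprise so much better.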

Reviewing performance: Continuous learning supports continued performance

There is a difference between how we think work gets done and how work actually gets done. Learning review techniques that focus on what happened, not what was clear after the fact, helps to show how the system behaves under varying conditions and how organizations and teams function in practice. Discussing the contributing factors to the failure, hidden or surprising interdependencies, and other technical details should also include details about organizational pressures, constraints, or practices that helped or hindered the response. This is true even around the "small stuff", like how an engineer noticed a spike in CPU usage on their day off or why a marketing intern was the only one who knew an engineering team was planning a major update the day before a critical feature launched. When the post-incident review broadens to include both the social and technical aspects of the event, the team can address both software and organizational factors, creating a more resilient future response.

Some strategies for enabling continuous learning to support resilience include:

Practice humility: As mentioned before, inaccurate or incomplete mental models are a fact of life in large-scale distributed software system operations. Listening and asking clarifying questions helps to create micro-learning opportunities to update faulty mental models (including your own!).

Don’t assume everyone is on the same page: Where possible, always start with the least experienced person’s understanding of a particular problem, interaction, or situation and work up from there, adding technical detail and context as the review progresses. This gives everyone a common basis of understanding and helps highlight any pervasive but faulty assumptions or beliefs.

Make the learnings widely accessible: Importantly, organizations can extend learning by creating readable and accessible artifacts (including documents, recordings, and visualizations) that are easily shared, that allow for publicly asking and answering questions to promote a culture of knowledge sharing, and that are available to multiple parties across the organization, including non-engineering roles. A narrative approach that "tells the story" of the incident is engaging and helps the reader understand why certain actions or decisions made sense at the time. It’s a subtle but non-trivial framing. It encourages readers to be curious and promotes empathy, not judgment.

Resilience takeaways

Like any other key driver of performance, resilience requires investment. Even organizations unwilling or unable to invest in a full systems approach can support resilient performance in small but repeatable ways by maximizing the types of activities, interactions, and practices that allow for REcognition, REvision, REplanning, REconfiguring, and REviewing. In doing so, software teams can coordinate and collaborate more effectively under conditions of uncertainty, time pressure, and stress to improve operational outcomes.
