Key Takeaways
- Covid-19 was a surprising and disruptive event for many organizations. The ability to sustain ongoing operations during disruptions is a hallmark of resilient capacity.
- While the term resilience has many definitions, technology companies should aim for sustained adaptability over merely bouncing back or withstanding pressures.
- The ongoing ability to adapt allows organizations to meet new challenges and exploit new opportunities that emerge as a result of the changing landscape.
- Taking an integrated systems approach involves both structured systems design choices (organizational, process & tool design) as well as enabling front-line practitioners by providing access to the right information, tools, and practices, and giving them sufficient autonomy to adjust performance.
- Organizational performance improvement efforts - like chaos engineering and incident analysis - can be used to enhance safe adaptation by supporting the flow of information across the organization and creating micro-learning loops to update engineers’ mental models of how their services perform under different conditions.
To most software organizations, Covid-19 represents a fundamental surprise - a dramatic event that challenges basic assumptions and forces a revision of one’s beliefs (Lanir, 1986).
While many view this surprise as an outlier event to be endured, this series uses the lens of Resilience Engineering to explore how software companies adapted (and continue to adapt), enhancing their resilience. By emphasizing strategies to sustain the capacity to adapt, this collection of articles seeks to more broadly inform how organizations cope with unexpected events. Drawing from the resilience literature and using case studies from their own organizations, engineers and engineering managers from across the industry will explore what resilience has meant to them and their organizations, and share the lessons they’ve taken away.
The first article starts by laying a foundation for thinking about organizational resilience, followed by a second article that looks at how to sustain resilience with a case study of how one service provider operating at scale has introduced programs to support learning and continual adaptation. Next, continuing with case examples, we will explore how to support frontline adaptation through socio-technical systems analysis before shifting gears to look more broadly at designing and resourcing for resilience. The capstone article will reflect on the themes generated across the series and provide guidance on lessons learned from sustained resilience.
As a researcher, my job is to study software engineers and the companies they work in, to surface insights about the problems they face and the solutions they pursue, and to track the trends that impact both the day-to-day operations of individual companies and the industry more broadly. For the last four years, I’ve worked closely with teams across the industry to examine sources of resilience in software engineering, advise on training and tool design to support practitioners, and develop work practices and strategic initiatives that help organizations safely and reliably work at speed and scale. At the outset of the pandemic, my colleagues in the research community and I quickly recognized that the well-established body of knowledge from over 40 years of integrated systems engineering and, more recently, resilience engineering was directly applicable to helping organizations cope with this surprising event.
Looking back over the last nine months, organizations’ responses to the pandemic appear to fall into one of four quadrants. There were those poorly set up to cope, with little to no resources, which promptly collapsed [1], [2]. There were companies that may have been poorly set up to adapt but had substantial reserves (both money and human resources) to absorb the costs of any unpreparedness [1], [2]. There were organizations that had the foresight and preparedness to build the capacity to adjust that they imagined they might one day need. The fourth type were those who may not have been prepared but were able to adjust [1], [2]. This article uses examples from the past year and the lens of Resilience Engineering to examine the ability of site reliability engineers (and their companies) to adapt.
Specifically, in this article we will look at what an organization can do structurally - through organizational design, tooling, work practices and procedures - during (and prior to) a surprising and disruptive event to establish the conditions that help engineering teams adapt in practice and in real time as the disruptive event unfolds. Organizations create the conditions for adaptation through how they structure the system of work surrounding individuals and teams - in doing so, they either expand or constrain the potential for safe adaptation. We’ll explore an example of this by looking at two typically independent performance improvement initiatives - incident analysis and chaos engineering - that, when integrated, both improve system performance and increase the capacity of individuals within the organization to recognize and cope with problems in real time.
In this way, developing adaptive capacity and resilience are two sides of the same coin - the system must be oriented towards changing in real time for engineers to be able to anticipate and adjust performance relative to the problem demands, whether those demands come from a service outage or a pandemic.
What is a resilient response to Covid-19?
To set the stage, it’s worth acknowledging that there are many overlapping definitions of what it means to be resilient to unexpected disruptions (Woods, 2015). Two common definitions - resilience as the ability to rebound to a baseline and resilience as the ability to withstand pressures or disruptions - fall short in describing the ways that Covid-19 has forced organizations to adjust.
This is because, as the pandemic continues, there has been no ‘baseline’ to return to, and the continued demands of coping with Covid have eroded organizations’ ability to simply withstand pressures without also changing in some other way. Put differently, in experiencing a widespread and sustained disruptive event, organizations themselves are fundamentally changed, so a definition of resilience should also capture how the organization has changed.
A broader definition is therefore needed - one that captures these changing performance capabilities. For example, as shutdowns of core societal functions - office buildings, schools, and childcare or eldercare - became widespread, many engineering teams carried on their work even while they handled the additional demands inherent in the upheaval of normal life (from concurrent caregiving and teaching responsibilities to navigating the activities of daily living in a pandemic).
It is this ability for extended performance and sustained adaptability that represents another form of resilient capacity. Extended performance means reorganizing existing resources to meet new demands or capitalizing on emergent opportunities to reach new markets. Sustained adaptability means people are able to "adapt in the face of variation, but much more importantly, are able to sustain adaptability as the forms and sources of [variation] continue to change ... over longer cycles" (Woods, 2018).
One might argue that an Agile approach to software development is the same thing as resilience, since at its core it is about iteration and adaptation. However, Agile methods do not, by themselves, guarantee resilience or adaptive capacity.
Instead, a key characteristic of resilience lies in an organization’s capacity to put the ability to adapt into play across ongoing activities in real time; in other words, to engineer resilience into the system by way of adaptive processes, practices, and coordinative networks that support people in making necessary adaptations.
Adaptability, as a function of day-to-day work, means revising assessments, replanning, dynamically reconfiguring activities, and reallocating and redeploying resources as conditions and demands change. Each of these "re" activities reflects an orientation towards change as a continuous state. This seems self-evident - the world is always changing, and the greater the speed and scale, the more likely changes are to impact your plans and activities. However, many organizations do not recognize the pace of change until it’s too late. Late-stage changes are more costly - both financially at the macro level and attentionally for individuals at the micro level. Think about it: the earlier you recognize that a legacy system is causing substantial disruption to your teams, that a competitor is moving into your market, or that customer needs are shifting, the easier it is to adjust your strategies to maintain performance. Wait too long and the financial costs of developing and integrating new tooling or regaining market share increase. In addition, the demands of longer hours and more stressful or intensive workloads have real "costs" to your workforce.
Despite the criticality of early detection of disruptive events (and its implications), it is no small task to recognize when conditions have changed substantially enough that the organization’s trajectory can no longer sustain operations through the disruption. While an engineer or manager may recognize the need to replan and reallocate resources, structural elements - such as organizational bureaucracy, silos, or a lack of access to tools that can support the activities needed to adapt - often hinder their ability to do so. In the current context, when we’ve been coping with an ongoing event, it can be difficult to detect new disruptions, and organizations need to be particularly sensitive.
While important in the short term, supporting sustained adaptability during your pandemic response can also become a strategic advantage as we transition back to "normal times". Investments made now can enhance organizational performance by making your organization more sensitive to changing market demands and your teams better able to adjust their performance in real time. By analyzing examples from the response to the pandemic, framed as adaptive capacity, we can lay the foundation for designing and managing a more adaptive organization long after the pandemic ends.
Coping well with Covid-19 conditions through adaptation
As previously noted, some companies without substantial financial or human resources have weathered the pandemic well. Those who have managed to adjust to the new normal exhibit sustained adaptability as they make ongoing adjustments to meet new or changed demands (Firestone, 2020).
Many leaned on technology to handle perhaps the most significant adaptation - the shift from collocated to distributed operations. The relatively smooth transition to virtual operations is largely a testament to the advances in distributed work tooling over the last five years. The capabilities of software platforms, the use of cloud technology, and widespread accessibility have allowed tools like Zoom, Slack, and others to scale rapidly and, largely, reliably - supporting organizations with the foresight to have invested in making these technologies available to their workforce.
Enabling the workplace to move to the home or mixed home-office patterns without substantial disruptions was a starting point. However, sustaining this adaptation requires organizations to think ahead and continue developing the infrastructure that supports their teams in facing ongoing demands - some of which may not be clearly defined at this point.
Supporting the ability to adjust in real time takes anticipatory thinking. Anticipatory thinking is the "process of recognizing and preparing for difficult challenges, many of which may not be clearly understood until they are encountered" (Klein, Snowden & Lock Pin, 2007, p. 1). A simple example of using anticipatory thinking to dynamically reconfigure is found in organizations that run 24/7 operations. Minimizing downtime in these operations means addressing the difficult challenge of having the right collection of skill sets available to deal with an undefined problem at a moment’s notice.
To do so, they invested in tooling that supports the rapid assembly of cognitively demanding distributed work - monitoring, alerting, web conferencing, and real-time messaging. They developed schedules that distribute the load of carrying the pager across their (often scarce) highly skilled responders. They enabled adaptive responses by being able to rapidly recruit the people needed in real time and quickly bring them up to speed depending on the types of problems encountered. They used distributed tools to maintain continued coherence around emerging or existing problems, to stay sensitive to changing demands, and to ensure smooth handoffs between various parts of the business.
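As a concrete illustration of distributing the pager load, the sketch below builds a simple round-robin weekly rotation. It is a minimal, hypothetical example - the `build_rotation` function and its fields are assumptions for illustration, and real schedulers also handle time zones, overrides, holidays, and follow-the-sun handoffs.

```python
from datetime import date, timedelta
from itertools import cycle

def build_rotation(responders, start: date, weeks: int):
    """Distribute on-call weeks evenly across a pool of responders (round-robin)."""
    schedule = []
    pool = cycle(responders)
    for week in range(weeks):
        schedule.append({
            "week_of": start + timedelta(weeks=week),
            "primary": next(pool),
            "secondary": next(pool),  # backup responder for escalations
        })
    return schedule

if __name__ == "__main__":
    rotation = build_rotation(["ana", "bo", "chen", "dee"], date(2020, 11, 2), 4)
    for shift in rotation:
        print(shift["week_of"], "primary:", shift["primary"], "secondary:", shift["secondary"])
```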
Organizations experienced with distributed teaming had already realized the value of being able to connect their sources of expertise across space and time - a capability that proved critical in a suddenly remote world. Their investments in developing this capacity within their site reliability and DevOps teams could more easily scale across organizational units not typically used to working in a distributed fashion.
Other organizations may have developed similar capabilities for distributed work in order to integrate their global operations or to be able to hire specialized skill sets that may not have existed in their local markets. Whatever the reason, investments in alternate work arrangements had generated an additional source of resilience to cope with the unexpected demands of the pandemic.
Anticipatory thinking can also be applied more strategically - to the kinds of disruptions that are potentially existential. An example of a strategic dynamic reconfiguration is a ridesharing app whose business evaporates overnight when lockdown begins and which pivots to use its platform and driver fleet to offer food delivery services. In other words, it redirects available resources across parts of the business that may previously have been loosely coupled, willingly degrading some aspects of performance to reallocate effort into building out new features that can meet the new conditions. Woods (2018) calls this extended performance "graceful extensibility", describing how an organization stretches as it approaches the boundaries of the capabilities it was designed for and extends those boundaries to include new forms of organization or activities.
Technology companies that utilize DevOps or Agile methodologies and cloud computing are particularly well positioned to adapt in this manner, as many have business models based on sensitivity to changing customer needs, iteration, and adjustable scale. While insufficient on their own, these preconditions lend themselves to becoming and remaining an adaptive organization. Relatedly, an organization must not only design for adaptability, but must also support and enable its most flexible elements - its people.
The resilient components in the socio-technical system
Organizations that have not designed for, or invested in, the capabilities for adaptation but have nonetheless coped well with Covid have done so almost certainly because of the willingness and flexibility of their workforce to continue working. This is not an inconsequential point because, in many cases, operational continuity happened in spite of the organization, not because of it. In the years preceding the pandemic, many companies rescinded remote work policies and invested in centralized office buildings and en-suite teleconference capabilities, even while the technology to support high-bandwidth distributed work continued to improve. As mentioned previously, collaborative technologies allowed many of the joint activities of collocated work to continue - albeit at a higher cognitive cost to the people involved (Maguire, 2020) - across geographically dispersed teams.
In March, many organizations that were unprepared for large-scale distributed work scrambled to adjust at scale: negotiating enterprise licensing agreements for basic remote work tools on the fly, delivering company-wide training to bring people up to speed on the tools, and revising policies and work practices to accommodate new configurations. Concurrently, facilitated research sessions conducted in the spring surfaced several examples of bottom-up efforts that smoothed the transition and kept teams functioning through the uncertainty (Maguire & Jones, 2020). In the early days, with no company guidance, software engineers across multiple organizations were able to quickly seek out and spin up the tools and new methods needed to coordinate newly remote teams and sustain operations - all from their couches, kitchen tables, and home offices. In organizations unprepared to support distributed work, locally adaptive strategies prevailed, with small-scale innovations that met the immediate needs of teams. Those who had been remote workers coached and advised their peers in adjusting to distributed life by writing blog posts and hosting web chats. The adaptability of their workforce enabled organizations to harness this innovation and continue deploying code and managing their services without significant disruption.
However, not all adaptation is appropriate under conditions of uncertainty. Locally adaptive strategies can be globally maladaptive (Woods & Branlat, 2011). Safe adaptation depends on being well calibrated to the changing conditions. Being well calibrated is a function of recognizing change as a continuous state for modern operations.
Change as a continuous state
Earlier, it was said that the world is continuously changing around us - an obvious statement with perhaps less obvious implications. In a rapidly evolving situation, how individuals and organizations recognize change and consider its possible trajectories has profound implications for how ready they are to respond. Think about driving on a busy highway during rush hour, where traffic is moving quickly and people are constantly changing lanes and merging on and off the road. Drivers who recognize that the increased volume of cars operating in closer proximity raises the possibility of accidents, and so necessitates greater monitoring of others’ movements, will watch those around them more closely, monitor the traffic ahead for unexpected slowdowns, and may preemptively take steps to minimize their risk.
Similarly, reliability engineers who recognize changing system dynamics - such as increased load on the system or a highly critical upcoming event such as a database migration - will adjust their monitoring to watch for subtle variation that could be evidence of a larger disruption - all so they are prepared to act and adapt in real time. Being well calibrated means monitoring relative to the state of the changes noticed.
This is not an insignificant point. Even in periods of relatively low rates of change, there’s evidence to suggest that maintaining reliability of dynamic digital service systems requires near continuous monitoring. As Maguire (2020) notes:
This is completed by continuous, small, lightweight ‘checks’ that happen regularly during work hours and even intermittently during off-hours. Many of the engineers studied push monitoring alerts to their smartphones, and even in off-hours, will glance at them regularly. When the alert indicates expected performance, the engineer ignores it. When the alert is unusual, even if it does not trigger a page out (meaning it hasn’t reached a certain predetermined threshold for paging responders), engineers will begin formulating hypotheses about where the source(s) of the problem come from.
This may be simply thinking it over and trying to connect their knowledge of recent or expected events (‘are we running an update this weekend?’) or involve gathering more information (checking dashboards, looking in channel to see if there are any reports). (p. 150)
But unlike traffic, which may subside, allowing drivers to revert to less frequent shoulder checks or glances in the mirrors, constant monitoring is an inherent characteristic of continuous deployment systems operating at speed and scale. Engineers need to be continuously probing the system to remain well calibrated so their adaptations - the need for which often arises unexpectedly - are safe.
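To make the threshold distinction in the quoted passage concrete, here is a minimal, hypothetical sketch of the kind of lightweight check an engineer might glance at: a reading is either expected, unusual enough to prompt hypothesis-forming, or past the predetermined paging threshold. The metric, threshold values, and function name are illustrative assumptions, not a description of any particular team’s tooling.

```python
# Hypothetical sketch of a lightweight 'check': classify an incoming metric
# reading as expected, unusual (worth a closer look), or page-worthy.
WARN_THRESHOLD = 0.02   # error rate that merits attention but no page
PAGE_THRESHOLD = 0.05   # error rate that pages the on-call responder

def classify_reading(error_rate: float) -> str:
    if error_rate >= PAGE_THRESHOLD:
        return "page"       # wake up the on-call responder
    if error_rate >= WARN_THRESHOLD:
        return "unusual"    # no page, but an engineer may start forming hypotheses
    return "expected"       # glanced at and ignored

for rate in (0.001, 0.03, 0.08):
    print(rate, "->", classify_reading(rate))
```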
It’s because of this that the code freezes put in place by many organizations to limit perceived exposure to additional risk (Maguire & Jones, 2020; Cook, 2020; McLaughlin, 2020) are misguided. This strategy has the unintended effect of making engineers less familiar with the state of the system: they have less recency in the code base, and their incident response skills go unpracticed during this period.
A fundamental principle of modern continuous deployment/continuous integration systems is that change is a constant. Therefore, organization-level adaptations such as instituting a code freeze - while intended to limit risk and reduce the load on engineers tracking changes during a period of distraction - can actually backfire, increasing the effort required to remain well calibrated enough to adapt safely and creating additional demands on responders. These additional demands can overload engineers when they arise during an incident, as extra effort is expended reorienting to the current state of the system and the code.
In summarizing this additional and often invisible effort, I noted, "To be clear, I am not suggesting this should become the standard, I’m suggesting it is the standard" (Maguire, 2020, p. 150). Change is a constant - the answer to safer operations is not to attempt to stop change, it’s to design better practices and instrument your systems with better tooling to enhance monitoring, so your engineers are well calibrated and better equipped to adapt safely when required.
Safely adapting in the face of disruption
Adaptation under conditions of uncertainty requires constant calibration and recalibration. Seeking out information about the system - by monitoring dashboards, reviewing log files, checking code deploys - are all forms of calibration. In doing so, engineers seek to validate whether their mental model (their internal representation of the system and its performance boundaries) is adequately calibrated to the actual state of the operating system. This is an ongoing process whose frequency fluctuates according to the rate of change as described in the driving example. When the rate of change exceeds the rate at which model updating happens, engineers’ perspective becomes stale and the ability to safely adapt decreases.
An engineer who is not monitoring the system as carefully as the conditions of change dictate is only one example of slow and stale calibration. Organizational processes also fall behind the pace of change: runbooks don’t get updated, and monitoring thresholds don’t get recalibrated as system performance boundaries shift - leaving them either too sensitive, alerting when there is no issue, or not sensitive enough, failing to alert responders when a problem is occurring. This is not unusual. Retrospectives or post mortems, for example, are system recalibration points where updating runbooks or recalibrating tooling are reasonable steps to take to minimize the future impact of similar incidents. However, many corrective actions and well-intentioned post-incident activities end up in a backlog as ongoing organizational demands take precedence, perpetuating stale models of how the system functions.
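As a small illustration of what recalibrating tooling can look like in practice, the sketch below derives a paging threshold from a window of recent observations rather than a fixed value set long ago. It is a hypothetical, simplified example - the 3-sigma rule, the window, and the latency metric are assumptions for illustration, not a recommendation of a specific alerting design.

```python
import statistics

def recalibrate_threshold(recent_latencies_ms, sigmas: float = 3.0) -> float:
    """Set the alert threshold relative to the recent performance envelope."""
    mean = statistics.mean(recent_latencies_ms)
    stdev = statistics.pstdev(recent_latencies_ms)
    return mean + sigmas * stdev  # alert only on genuinely unusual latencies

# A threshold tuned to last quarter's baseline may be too tight (noisy alerts)
# or too loose (missed problems) once traffic patterns change.
old_threshold = 250.0
new_threshold = recalibrate_threshold([180, 195, 210, 205, 190, 220, 215])
print(f"old: {old_threshold} ms, recalibrated: {new_threshold:.1f} ms")
```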
There are two differing responses to this common problem. The first is to enhance technical robustness by stressing organizational discipline in cleaning up the backlog - reprioritizing work or completing lower-value tickets - which takes resources away from other priorities. While it’s difficult to argue that well-prioritized work and resolving outstanding issues are not an appropriate focus, this can become unwieldy as the amount of work exceeds the resources available to complete it, creating more effort to manage increasingly dated tickets that may or may not still be relevant. One prominent video streaming platform takes a different approach: backlog tickets "age out" after a month on the board, and if they haven’t been resolved by then, they are accepted as part of the inevitable imperfections in the system, on the assumption that anything important enough will resurface. While working to minimize the backlog remains an important function to manage appropriately, it needs to be considered relative to the changing conditions, to ensure backlogged items are still relevant and latent issues are not neglected because of more recent demands.
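A minimal sketch of that "age out" policy might look like the following. The 30-day cutoff and ticket fields are hypothetical; the point is only to show the rule as a small, mechanical loop rather than a manual grooming exercise.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=30)  # hypothetical cutoff: roughly a month on the board

def age_out(backlog, now=None):
    """Split a backlog into tickets kept on the board and tickets aged out."""
    now = now or datetime.utcnow()
    kept, aged_out = [], []
    for ticket in backlog:
        (kept if now - ticket["created"] <= MAX_AGE else aged_out).append(ticket)
    return kept, aged_out

backlog = [
    {"id": "OPS-101", "created": datetime(2020, 9, 1)},
    {"id": "OPS-152", "created": datetime(2020, 11, 20)},
]
kept, dropped = age_out(backlog, now=datetime(2020, 12, 1))
print("kept:", [t["id"] for t in kept], "aged out:", [t["id"] for t in dropped])
```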
A second focus for organizations seeking to aid adaptation is to enhance human and social capabilities by prioritizing the creation and maintenance of information-rich, short-cycle feedback loops that help engineers update their models (Leveson et al., 2006). In effect, this approach acknowledges that while formal protocols and tooling are important to supporting continuous deployment operations, they will always be in service of supporting engineers coping with often novel and ambiguous problems.
In complex adaptive systems, formalized system models (like runbooks and organizational processes) - like engineers’ mental models - are inevitably partial and incomplete in some form. However, unlike runbooks and tooling, engineers are constantly calibrating. For example, in the course of their daily work, engineers are adding new knowledge about how the system works under varying conditions and how it interacts with other dependent components - two critical areas of expertise that can influence service recovery efforts (Allspaw, 2015). Because each responder interacts with different parts of the system and the code base, their knowledge is going to be slightly (or significantly) different than their colleagues (Grayson, 2018).
This variability is a feature, not a bug. Collectively, it allows for a broader set of experiences and knowledge to be brought to bear during an incident. Organizations can aid safe adaptation by structuring interactions that enable diverse areas of knowledge to interact more frequently (through cross-team/cross-boundary knowledge sharing) and by increasing the amount of exposure to these kinds of opportunities to learn about the system.
One method of creating a more resilient and adaptive socio-technical system is by more closely coupling sources of information-rich data that can help model updating and monitoring with short-cycle feedback loops that help detection and recognition of anomalous change. An integrated approach to two common (and often independently managed) programs - incident analysis and chaos engineering - is used to illustrate this concept.
Chaos engineering & incident analysis should be tightly integrated: one is a program that purposefully introduces disruptions into the system to see how it responds, while the other is focused on understanding the implications of disruptions on the system. Integration of these programs can maximize the investment of time and effort. However, these are often separate organizational initiatives that may be managed or utilized by different engineering teams with few interactions. Findings from these programs are often only locally distributed and poorly engaged with.
Figure 1
Because these programs are managed independently, the return on investment is low and very little value is realized. Performance improvements are limited because of:
- a narrow focus on what broke, not how it broke or under what conditions it degraded
- a focus on human error that creates a fear of being blamed
- limited distribution of findings and poor cross-functional interaction during retrospectives - the kind of interaction that could help teams recognize how differing perspectives of the system develop across different parts of the business
- the lack of feedback loops to integrate the lessons learned more effectively
Therefore, despite often substantial investments in these programs and practices, organizations see very little return on them.
Designing for greater integrations between these programs can maximize benefits and show more immediate and long-term value.
Creating information-rich, short-cycle feedback loops across the programs improves performance. When an incident happens:
- Investigations are about learning, not blame
- Investigation reports contain important information about how the system functions and can be used for training and to increase cross-functional learning
- Expertise is shared across the team
- There is better awareness of specialized skill sets across responders
- Greater shared knowledge can aid in handling new kinds of failures
When chaos experiments are run:
- They target vulnerabilities identified in incident reports and become better opportunities for learning
- Teams gain greater confidence in their skills & knowledge through practice
- Individual responders gain more specialized knowledge around vulnerabilities to help in future incidents
- They can help prioritize technical work arising from incident analysis
Greater integration of these kinds of programs makes better use of stretched organizational resources, provides immediate benefit to individual engineers, and increases opportunities for cross-functional collaboration that breaks down silos and builds trust across teams - all characteristics needed for safe adaptation in high-pressure situations.
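To make the integration concrete, the sketch below shows one way findings from incident reviews might feed directly into a backlog of candidate chaos experiments, closing the feedback loop between the two programs. The data structures, field names, and `propose_experiments` function are hypothetical illustrations under these assumptions, not a description of any specific tool.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IncidentFinding:
    incident_id: str
    summary: str
    suspected_weakness: str          # e.g. "retry storm on cache failover"
    services_involved: List[str]

@dataclass
class ChaosExperiment:
    hypothesis: str
    fault_to_inject: str
    target_services: List[str]
    source_incidents: List[str] = field(default_factory=list)

def propose_experiments(findings: List[IncidentFinding]) -> List[ChaosExperiment]:
    """Turn incident-review findings into candidate chaos experiments."""
    return [
        ChaosExperiment(
            hypothesis=f"The system degrades gracefully when '{f.suspected_weakness}' recurs",
            fault_to_inject=f.suspected_weakness,
            target_services=f.services_involved,
            source_incidents=[f.incident_id],
        )
        for f in findings
    ]

findings = [IncidentFinding("INC-42", "Checkout latency spike",
                            "cache node loss under peak load", ["cache", "checkout"])]
for exp in propose_experiments(findings):
    print(exp.hypothesis, "->", exp.target_services, "from", exp.source_incidents)
```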
Aiding model updating - for individual engineers, across teams, and across business units more broadly - is a critical function for safe adaptation. Efficient incident response that limits service interruption results from well-calibrated engineering teams that can coordinate effectively with other parts of the organization, who are themselves calibrated to the demands and potential implications of failures. Put another way, while runbooks and automated processes are important, your engineering team remains the most efficient and adaptable part of your system. Organizations increase their resilience by investing in helping their engineers develop broader and more accurate knowledge that can be quickly brought to bear on novel problems. When this broad base of knowledge is distributed across teams, it provides a more complete basis for the common ground needed to work effectively in high-tempo, cognitively demanding incidents. That distributed knowledge base, in a well-coordinated team, serves to reduce errors and generates more insights during real-time incident response (Maguire, 2020).
Conclusion
The scale and severity of the disruptions brought about by Covid-19 are a rare occurrence. However, surprise and disruption are a normal part of operating critical digital services at speed and scale. The lessons learned from adapting to the pandemic are relevant for companies that rely on being resilient to everyday disruption. These lessons should be used both to structurally design for adaptive capacity - ensuring organizational elements like team resourcing and performance improvement programs are integrated - and to enable front-line adaptation by providing short-cycle "micro-learning" opportunities and the authority to adapt to changing conditions in real time.
Organizations that are able to anticipate abnormal variation and quickly adjust to change are better positioned, at a minimum, to rebound to their baseline and withstand pressures when faced with unexpected demands. For those seeking to go further, being poised to adapt and designing for safe adaptation represent an opportunity to adopt resilience engineering as a competitive advantage. The next article in this series extends the concept of designing systems that enhance resilience by examining a systematic approach to learning at a large global digital service company.
References
- Allspaw, J. (2015). Trade-Offs under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages.
- Firestone, K. (2020). Op-ed: Companies that adapted to Covid-19 could be the winners after the vaccine as well.
- Grayson, M. R. (2018). Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems [Masters in Integrated Systems Engineering]. The Ohio State University.
- Klein, G., Snowden, D., & Lock Pin, C. (2007). Anticipatory Thinking.
- Lanir, Z. (1986). Fundamental surprise. Eugene, OR: Decision Research.
- Leveson, N., Dulac, N., Zipkin, D., Cutcher-Gershenfeld, J., Carroll, J., & Barrett, B. (2006). Engineering resilience into safety-critical systems. Resilience engineering: Concepts and precepts, 95-123.
- Maguire, L. (2020). Controlling the Costs of Coordination in Large-Scale Distributed Software Systems. (Electronic Thesis or Dissertation).
- Maguire, L. & Jones, N. (2020). Learning from Adaptations to Coronavirus. Learning from Incidents blog. Retrieved from: https://www.learningfromincidents.io/blog/learning-from-adaptations-to-coronavirus
- McLaughlin, K. (2020). JPMorgan, Goldman Order Software ‘Code Freezes’ Around Election.
- Narayanan, V. K., & Nath, R. (1993). Organization Theory: A Strategic Approach. Irwin, p. 29.
- Woods, D. D., & Branlat, M. (2011). Basic patterns in how adaptive systems fail. Resilience engineering in practice, 2, 1-21.
- Woods, D. D. (2015). Four concepts for resilience and the implications for the future of resilience engineering. Reliability Engineering & System Safety, 141, 5-9.
- Woods, D. D. (2018). The theory of graceful extensibility: basic rules that govern adaptive systems. Environment Systems and Decisions, 38(4), 433-457.
About the Author
Laura Maguire leads the research program at Jeli.io, where she studies software engineers keeping distributed, continuous deployment systems reliably functioning and helps to translate those findings into a product that is advancing the state of the art of incident management in the software industry. Maguire has a Master’s degree in Human Factors & Systems Safety, a PhD in Integrated Systems Engineering from the Ohio State University and extensive experience working in industrial safety & risk management.