Laura Maguire spoke at QCon London [slides] about how coordination efforts during outages carry a high cognitive cost. Maguire found that coordination during anomaly response is difficult, that existing models can undermine speedy resolution, and that the strategies for controlling the cost of coordination adapt to the type of incident. Moreover, tooling adds its own coordination costs, even when it is intended to reduce them.
Maguire was part of the SNAFUcatchers consortium, a group of organizations interested in resilience engineering. She had access to data that allowed her to explore the hidden costs of coordination during outages while studying the incident command system (ICS) model. ICS is a standardized approach to managing emergency responses by establishing a hierarchy among the incident responders (the people working to resolve the incident). The basic flow is as simple as figuring out the problem, repairing it, and moving on. However, this model is not a good fit for systems that continuously change.
Incident response in software engineering is very different from incident response in other domains, mainly because software is complex and lives in a continually changing environment. In software, therefore, there is an implicit need to keep learning how systems work. Consequently, the types of failures organizations face are often quite challenging, have broad consequences, and require multiple forms of expertise. Incidents need different people to handle events, but they also have a high cost in terms of attention, said Maguire. In addition, the ICS has hidden costs that organizations usually see as a burden.
According to Maguire, the alternative to the ICS is:
The ability to seamlessly synchronize activities in a larger joint effort is quite meaningful. And we can see that if it typically runs smoothly, but each agent in this sort of distributed network can more fluidly adapt and adjust to the demands, we can lower the costs of coordination.
Maguire calls this ability "adaptive choreography," which is about being able to dynamically adjust how coordination happens. The role of the incident commander is still essential. There are times when decisions need to be made quickly, and the team needs to know who the centralized authority is, someone who has the bigger picture in mind, said Maguire.
After evaluating many high-performance teams, Maguire found that:
In high-pressure events, very rapid, but straightforward interactions amongst the (incident) responders typically worked really well. And that's because they fulfilled the functional requirements of coordination, as they were carrying out their tasks. They're able to anticipate what needed to be done next, and they were able to take the initiative to do it. They listened in on what others were doing and they were able to better sequence the timing of those actions. So, they were able to provide input into critical decisions and point out potential threats and the implications of different courses of action.
Moreover, Maguire said that the tools a team uses to resolve incidents carry a substantial cost of coordination. For instance, lag or delays on a web conference call can add cognitive demand. Time spent selecting the tool that best fits the moment is another cost, because adopting a particular tool can mean that the organization has to adapt to a new form of coordinating. All of these problems represent the hidden costs of coordination, and Maguire said that it's important to acknowledge these additional cognitive and coordination demands.
Finally, Maguire closed by saying:
I believe that software engineering could lead a step-change in incident management practices, and it's my hope that you'll continue to push the boundaries of what is possible in how we coordinate.