Applying resilience throughout the incident lifecycle by taking a holistic look at the sociotechnical system can help to turn incidents into learning opportunities. Resilience can help folks get better at resolving incidents and improve collaboration. It can also give organizations time to realize their plans.
Vanessa Huerta Granda gave a talk about a culture of resilience at QCon New York 2023.
Often, organizations don’t really do much after resolving the impact of an incident, Huerta Granda argued. Some organizations will attempt a post-incident activity, traditionally a root cause analysis or 5 Whys, and some teams will do a postmortem. Either way, it’s usually focused on figuring out a root cause and preventing it from ever happening again, she said.
Huerta Granda mentioned reasons why folks aren’t doing activities for deeper learning:
- The skills required to successfully build resilience into your culture are not traditional engineering skills; they are communication skills, analytics, presenting information and convincing people, and getting folks to talk to each other.
- You need time and training to get good at this, and organizations often don’t give their engineers the bandwidth for it.
- Many organizations stop at incident response without going on to become a learning organization.
- Some organizations are stuck in the old-fashioned pattern of thinking that every outage happens because of a root cause, without looking at the sociotechnical system.
We can apply resilience throughout the incident lifecycle by taking a holistic look at the sociotechnical system, Huerta Granda said. She mentioned that we have to understand that an incident is never "release a bug - revert the bug - everything is back to normal". Instead, think through the conditions that led to the incident happening the way it did. What did people think was happening? What tools did they have available? How did they collaborate and communicate? This paints a fuller picture of our systems and helps us in the future, Huerta Granda said.
Resilience can help folks get better at resolving incidents, at understanding what is happening, and at collaborating more effectively with each other, Huerta Granda mentioned.
For the organization, when folks aren’t stuck in a cycle of incidents, they will have time to complete the plans the organization has on its roadmap, she said.
To foster a culture of resilience, we need to give people the time to talk to each other and to be curious, looking past the technical root cause and into the contributing factors around the experience of an incident, Huerta Granda concluded.
InfoQ interviewed Vanessa Huerta Granda about learning from incidents.
InfoQ: How big can the costs of incidents be?
Vanessa Huerta Granda: It can be huge. It can erode the trust your customers have in you and, depending on the industry, companies can lose their licenses because of incidents. And then there’s the cost it has on your culture: when folks are constantly stuck in a cycle of incidents, they’re not going to have the bandwidth to be creative engineers.
InfoQ: What tips do you have for creating action items?
Huerta Granda: Some tips are:
- They need to be decided by the people actually doing them.
- Management needs to be ok with giving folks time to complete them.
- They should move the needle.
- Always have an owner and a due date (so we know they can be completed).
- It’s OK to give people an out.
Giving people an out means that action items should not be set in stone. If the owner of an item tries a fix and realizes it doesn’t work or that it will take way longer to complete than expected, they can decide it’s not the best course of action. In that case, let them close the action item with an explanation of the work done.
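For teams that track action items in their own tooling, the tips above could be sketched roughly as follows. This is only an illustration, not something shown in the talk: the `ActionItem` class, its field names, the status values, and the sample item are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """Hypothetical post-incident action item; names are assumptions."""
    title: str
    owner: str        # every item has an owner
    due_date: date    # and a due date
    status: str = "open"  # "open", "done", or "closed_not_pursued"
    closing_note: Optional[str] = None

    def complete(self) -> None:
        """Mark the item as finished."""
        self.status = "done"

    def close_with_explanation(self, note: str) -> None:
        """The 'out': close the item, recording the work that was done."""
        if not note.strip():
            raise ValueError("an explanation of the work done is required")
        self.status = "closed_not_pursued"
        self.closing_note = note


# Made-up example: the owner tried the fix and decided against it.
item = ActionItem(
    title="Add retries to the payment client",
    owner="sam",
    due_date=date(2023, 9, 1),
)
item.close_with_explanation("Tried retries; they amplified load on the backend.")
print(item.status, "-", item.closing_note)
```

Requiring a note when an item is closed without completion keeps the explanation of the work done attached to the item, so the reasoning is not lost.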
InfoQ: How can we gain cross-incident insights?
Huerta Granda: You need to focus on individual incident insights first. When you have a body of work, then look at commonalities between your incidents; you may ask yourself, "Are the incidents that take longer to resolve related to a particular technology? Are folks aware of the observability tools that are available?" Once you have found the data you want to share, make sure to always add context to the data that you are providing; this turns "data" into "insights".
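As a rough illustration of the first question, a team with a handful of reviewed incidents might group them by technology and compare resolution times. The sketch below is not from the interview: the record fields (`technology`, `minutes_to_resolve`), the function name, and the sample data are all made up for the example.

```python
from collections import defaultdict
from statistics import median

# Made-up sample records; a real data set would come from incident reviews.
incidents = [
    {"id": "INC-1", "technology": "database", "minutes_to_resolve": 95},
    {"id": "INC-2", "technology": "database", "minutes_to_resolve": 140},
    {"id": "INC-3", "technology": "cdn", "minutes_to_resolve": 30},
    {"id": "INC-4", "technology": "cdn", "minutes_to_resolve": 50},
    {"id": "INC-5", "technology": "queue", "minutes_to_resolve": 60},
]


def median_resolution_by_technology(records):
    """Group incidents by technology and return the median minutes to resolve."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["technology"]].append(record["minutes_to_resolve"])
    return {tech: median(times) for tech, times in grouped.items()}


summary = median_resolution_by_technology(incidents)

# Print the slowest technologies first.
for tech, minutes in sorted(summary.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tech}: median {minutes} minutes to resolve")
```

A number like this is only the "data" half; the context (what made those incidents hard, which tools responders had available) still has to come from the individual incident reviews.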