Incident Response Content on InfoQ

Articles

Culture & Methods

Exploring the Unintended Consequences of Automation in Software

This article lays out some of the common assumptions and misconceptions about automation and its role in software (and software incidents), what our research has found regarding how automation shows up in software incidents, and some ideas on how to design automated tools that help people better handle software incidents.

Courtney Nash
on Oct 10, 2025
Culture & Methods

The Incident Lifecycle: How a Culture of Resilience Can Help You Accomplish Your Goals

Don’t get stuck with overwhelmed systems that can cause an outage, like what happened with Taylor Swift concert tickets. Build organizational resilience to incidents through improved coordination and communication during the response, and blameless reviews, root cause analysis, and insightful communication afterward to enable meaningful change.

Vanessa Huerta Granda
on Jan 11, 2024
Culture & Methods

Tips on How Staff Engineers Can Impact Incidents

Staff engineers can influence behaviors during and after incidents by modeling transparency and questioning assumptions to strengthen engineering culture. As incident commanders, they can coordinate workstreams, communicate with stakeholders, and prevent responder burnout. In retrospectives, staff engineers can model root cause analysis to address underlying cultural issues.

Erin Doyle
on Dec 29, 2023
DevOps

Moving Past Simple Incident Metrics: Courtney Nash on the VOID

The Verica Open Incident Database (VOID) is assembling publicly available software-related incident reports. InfoQ talks with Courtney Nash about their recent findings including how MTT* metrics may not be beneficial, the average time to incident resolution, and the importance of studying near-miss reports.

Courtney Nash Matt Campbell
on Feb 14, 2023
DevOps

Building an Effective Incident Management Process

A good incident management framework can help organizations manage the chaos of an outage more effectively, leading to shorter incident durations and tighter feedback loops. This article introduces the components necessary for a healthy incident management process.

Anil Kumar Ravindra Mallapur
on Oct 04, 2022
DevOps

The Hows and Whys of Effective Production-Readiness Reviews

At QCon Plus November 2021, Nora Jones, CEO and founder of Jeli, talked about how to build production readiness reviews (PRR) with emphasis on context and psychological safety. Her talk focused on the particulars of a PRR process that relate to incidents.

Nora Jones
on Sep 15, 2022
DevOps

Analyzing Incident Data across Organizations: Courtney Nash on the VOID

The Verica Open Incident Database (VOID) is assembling publicly available software-related incident reports. InfoQ talks with Courtney Nash about their recent findings including how MTT* metrics may not be beneficial, the average time to incident resolution, and the importance of studying near-miss reports.

Courtney Nash Matt Campbell
on Jun 28, 2022
DevOps

DevOps and Cloud InfoQ Trends Report – June 2022

This article summarizes how we see the "cloud computing and DevOps" space in 2022, which focuses on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.

Steef-Jan Wiggers Matt Campbell Lena Hall Renato Losio Daniel Bryant Feynman Zhou Mostafa Radwan Shaaron A Alvares
on Jun 21, 2022
DevOps

How to Best Use MTT* Metrics to Optimize Your Incident Response

Selecting the correct MTT* metric to improve your incident response is important. If the wrong metric is chosen, the improvements may get lost in the noise of a multivariable equation. This article reviews the various MTT* metrics available and discusses the best scenarios for selecting each one.

Alex Ewerlöf
on Mar 17, 2022
DevOps

DevOps and Cloud InfoQ Trends Report - July 2021

This article summarizes how we see the "cloud computing and DevOps" space in 2021, which focuses on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.

Matt Campbell Steef-Jan Wiggers Shaaron A Alvares Helen Beal Daniel Bryant Lena Hall Rupert Field Aditya Kulkarni Jared Ruckle Renato Losio Holly Cummins
on Jul 19, 2021
Culture & Methods

Designing & Managing for Resilience

The fourth article in a series on how software companies adapted and continue to adapt to enhance their resilience explores the strategies used by engineering leaders to help create the conditions for sustained resilience. It provides stories, examples, and strategies towards designing an organizational structure to support resilient performance and managing for resilience.

Laura Maguire
on Apr 15, 2021
DevOps

Piercing the Fog: Observability Tools from the Future

Gaining visibility into distributed systems and how they are performing is challenging. Despite all the observability tools available for site reliability, debugging remains incredibly difficult, and many SREs would agree that their debugging processes have only marginally improved. This article explores how observability for troubleshooting could be done from the user’s point of view.

Srinath Perera
on Feb 10, 2021
