InfoQ Homepage Incident Response Content on InfoQ
-
The Incident Lifecycle: How a Culture of Resilience Can Help You Accomplish Your Goals
Don’t get stuck with overwhelmed systems that can cause an outage, like what happened with Taylor Swift concert tickets. Build organizational resilience to incidents through improved coordination and communication during the response, and blameless reviews, root cause analysis, and insightful communication afterward to enable meaningful change.
-
Tips on How Staff Engineers Can Impact Incidents
Staff engineers can influence behaviors during and after incidents by modeling transparency and questioning assumptions to strengthen engineering culture. As incident commanders, they can coordinate workstreams, communicate with stakeholders, and prevent responder burnout. In retrospectives, staff engineers can improve model root cause analysis to improve underlying cultural issues.
-
Moving Past Simple Incident Metrics: Courtney Nash on the VOID
The Verica Open Incident Database (VOID) is assembling publically available software-related incident reports. InfoQ talks with Courtney Nash about their recent findings including how MTT* metrics may not be beneficial, the average time to incident resolution, and the importance of studying near-miss reports.
-
Building an Effective Incident Management Process
A good incident management framework can help organizations manage the chaos of an outage more effectively leading to shorter incident durations and tighter feedback loops. This article introduces the components necessary for a healthy incident management process.
-
The Hows and Whys of Effective Production-Readiness Reviews
At QCon Plus November 2021, Nora Jones, CEO and founder of Jeli, talked about how to build production readiness reviews (PRR) with emphasis on context and psychological safety. Her talk focused on the particulars of a PRR process that relates to incidents.
-
Analyzing Incident Data across Organizations: Courtney Nash on the VOID
The Verica Open Incident Database (VOID) is assembling publically available software-related incident reports. InfoQ talks with Courtney Nash on their recent findings including how MTT* metrics may not be beneficial, the average time to incident resolution, and the importance of studying near-miss reports.
-
DevOps and Cloud InfoQ Trends Report – June 2022
This article summarizes how we see the "cloud computing and DevOps" space in 2022, which focuses on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.
-
How to Best Use MTT* Metrics to Optimize Your Incident Response
Selecting the correct MTT* metric to improve your incident response is important. If the wrong metric is chosen, the improvements may get lost in the noise of a multivariable equation. This article reviews the various MTT* metrics available and discusses the best scenarios for selecting each one.
-
DevOps and Cloud InfoQ Trends Report - July 2021
This article summarizes how we see the "cloud computing and DevOps" space in 2021, which focuses on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.
-
Designing & Managing for Resilience
The fourth article in a series on how software companies adapted and continue to adapt to enhance their resilience explores the strategies used by engineering leaders to help create the conditions for sustained resilience. It provides stories, examples, and strategies towards designing an organizational structure to support resilient performance and managing for resilience.
-
Piercing the Fog: Observability Tools from the Future
Visibility into those distributed systems and how they are performing is challenging. Despite all the observability tools available for site reliability, debugging remains incredibly difficult, and many SREs would agree that their debugging processes have only marginally improved. This article explores how observability for troubleshooting could be done from the user’s point of view.
-
Adaptive Frontline Incident Response: Human-Centered Incident Management
The third article in a series on how software companies adapted and continue to adapt to enhance their resilience zeros in on the sources that comprise most of your company’s adaptive resources: your frontline responders. In this article, we draw on our experiences as incident commanders with Twilio to share our reflections on what it means to cultivate resilient people.