Incident Response Content on InfoQ

News

Development

How CNAME Ordering in RFC Specs Caused Cloudflare 1.1.1.1 Outage

In a recent article titled "What came first: the CNAME or the A record?", Cloudflare explains how an ambiguous RFC specification caused its popular 1.1.1.1 service to break. After identifying the breakage and the ambiguity in older DNS standards regarding record order, Cloudflare proposes a clarified specification.

Renato Losio
on Feb 07, 2026
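
The record-order ambiguity can be illustrated with a minimal sketch of a client that does not depend on where CNAME records appear in a DNS answer. This is not Cloudflare's code or any real resolver's API: the `Record` tuple, the `resolve_addresses` function, and the example names and addresses are all hypothetical, and real DNS parsing involves much more (TTLs, classes, AAAA records, truncation).

```python
from typing import NamedTuple


class Record(NamedTuple):
    """Simplified, hypothetical view of one record in a DNS answer section."""
    name: str    # owner name of the record
    rtype: str   # "CNAME" or "A" in this sketch
    value: str   # target name for a CNAME, IPv4 address for an A record


def resolve_addresses(qname: str, answers: list[Record]) -> list[str]:
    """Return the A-record addresses for qname, following any CNAME chain.

    The lookup indexes all CNAMEs first, so it gives the same result
    whether CNAME records come before or after the A records.
    """
    # Build an alias map from every CNAME in the answer, ignoring position.
    cnames = {r.name.lower(): r.value.lower() for r in answers if r.rtype == "CNAME"}

    # Walk the chain from the queried name; `seen` guards against CNAME loops.
    name = qname.lower()
    seen = set()
    while name in cnames and name not in seen:
        seen.add(name)
        name = cnames[name]

    # Collect the addresses owned by the final name in the chain.
    return [r.value for r in answers if r.rtype == "A" and r.name.lower() == name]


if __name__ == "__main__":
    # Hypothetical answer with the A record listed before the CNAME.
    answers = [
        Record("cdn.example.net", "A", "192.0.2.1"),
        Record("www.example.com", "CNAME", "cdn.example.net"),
    ]
    print(resolve_addresses("www.example.com", answers))
    print(resolve_addresses("www.example.com", list(reversed(answers))))
```

A parser that instead scans the answer once from top to bottom and expects each CNAME before the records it points to would return nothing for the first ordering, which is the kind of order sensitivity the summary describes.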
DevOps

Cloudflare Launches "Code Orange: Fail Small" Resilience Plan after Multiple Global Outages

Cloudflare recently published a detailed resilience initiative called Code Orange: Fail Small, outlining a comprehensive plan to prevent large-scale service disruptions after two major network outages in the past six weeks.

Craig Risi
on Jan 16, 2026
DevOps

How Authress Designed for Resilience and Survived a Major AWS Outage

Identity and authentication services company Authress shared its strategy to stay operational during major cloud infrastructure outages like the massive October 2025 AWS outage that disrupted many major services. According to Authress CTO Warren Parad, the company's resilience architecture relies on strategies like multi-region deployment and minimizing reliance on AWS control plane services.

Sergio De Simone
on Dec 28, 2025
DevOps

AWS Debuts “DevOps Agent” to Automate Incident Response and Improve System Reliability

AWS recently announced the public preview of AWS DevOps Agent, a new "frontier agent" that aims to help organizations react more quickly to production incidents, identify root causes, and proactively strengthen system reliability.

Craig Risi
on Dec 17, 2025
DevOps

Report Finds LLMs Not Yet Ready to Replace SREs in Incident Management

A study by ClickHouse found that large language models (LLMs) can't yet replace Site Reliability Engineers (SREs) for tasks such as finding the root causes of incidents. The study tested five leading models against real-world observability data to determine whether AI could autonomously identify production issues.

Matt Saunders
on Sep 27, 2025
DevOps

PagerDuty's Kafka Outage Silences Alerts for Thousands of Companies

PagerDuty, the incident management platform used by thousands of organisations to alert them to problems on their systems, suffered a major outage itself on 28 August 2025. In a comprehensive outage report, the company detailed the scope of the problem, the customer impact, and how it is working to prevent a recurrence.

Matt Saunders
on Sep 16, 2025
Architecture & Design

Datadog Employs LLMs for Assisting with Writing Accident Postmortems

Datadog combined structured metadata from its incident management app with Slack messages to build an LLM-driven feature that assists engineers in composing incident postmortems. While working on this solution, the company dealt with the challenges of using LLMs outside interactive dialog systems and of ensuring that the generated content was high quality.

Rafal Gancarz
on Apr 13, 2025
DevOps

How SREs and GenAI Work Together to Decrease eBay's Downtime: an Architect's Insights at KubeCon EU

During his KubeCon EU keynote, Vijay Samuel, Principal MTS Architect at eBay, shared his team’s experience of enhancing incident response capabilities by incorporating ML and LLM building blocks. They realised that generative AI is not a silver bullet but can help engineers work through complex incident investigations by explaining logs, traces, and dashboards.

Olimpiu Pop
on Apr 05, 2025
DevOps

Atlassian Announces Opsgenie Consolidation into JIRA Service Management

Atlassian recently announced that it is consolidating its IT Operations offering and transitioning Opsgenie’s capabilities into JIRA Service Management and Compass.

Aditya Kulkarni
on Mar 29, 2025
DevOps

How Locking, Saturation and CDN Network Issues Brought down Canva

The Canva engineering team recently published their post-mortem on the outage they experienced last November, detailing the API Gateway failure and the lessons learned during the incident.

Renato Losio
on Feb 08, 2025
DevOps

Cloudflare Experiences Major Incident in November, Resulting in Log Loss

Cloudflare recently confirmed that on November 14th it experienced an incident affecting Cloudflare Logs in which 55% of logs were lost over a 3.5-hour period. The incident impacted most customers using the service, with a misconfiguration triggering a cascading series of system failures and exposing weaknesses in handling unexpected spikes in demand.

Renato Losio
on Dec 07, 2024
DevOps

Grafana Frees up Engineers to Fix Problems with Improved Incident Management

Grafana Labs, a leading provider of observability solutions, has unveiled significant enhancements to its Incident Response and Management (IRM) platform. These changes help teams manage and respond to incidents more effectively by streamlining incident management processes and reducing response times.

Matt Saunders
on May 15, 2024
DevOps

Grafana Introduces ML Tool Sift to Improve Incident Response

Grafana Labs has introduced "Sift," a feature for Grafana Cloud designed to enhance incident response management (IRM) by automating system checks and expediting issue resolution. Sift automates various aspects of incident investigation and provides insights into potential issues within Kubernetes environments, helping engineers focus on resolving incidents.

Matt Saunders
on Sep 28, 2023
Culture & Methods

How Resilience Can Help to Get Better at Resolving Incidents

Applying resilience throughout the incident lifecycle by taking a holistic look at the sociotechnical system can help to turn incidents into learning opportunities. Resilience can help folks get better at resolving incidents and improve collaboration. It can also give organizations time to realize their plans.

Ben Linders
on Jun 15, 2023
DevOps

Can MTTR Be an Effective Business Metric?

In a recent blog post, Sidu Ponnappa argued that MTTR should be a key business metric for measuring engineering efficiency. Ponnappa notes that tracking uptime alone provides no goals to target for improvements. In a talk at SREcon22, Courtney Nash, senior research analyst at Verica, countered that MTTR can misrepresent what is actually happening during incidents and can be an unreliable metric.

Matt Campbell
on Oct 26, 2022
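
One way to see how MTTR can misrepresent incidents is a small worked example. The sketch below is not taken from either Ponnappa's post or Nash's talk: the incident durations are made up, and `mttr_minutes` is a hypothetical helper that treats MTTR as the arithmetic mean of time-to-restore values.

```python
from statistics import mean, median


def mttr_minutes(durations_minutes):
    """Mean time to restore: the arithmetic mean of per-incident durations."""
    return mean(durations_minutes)


if __name__ == "__main__":
    # Hypothetical data: four short incidents and one long outage, in minutes.
    durations = [12, 15, 9, 14, 480]

    print("MTTR (mean):", mttr_minutes(durations))
    print("Median:", median(durations))
```

With these made-up numbers the mean is 106 minutes while the median is 14 minutes, so a single long outage dominates the headline figure even though most incidents were resolved in about a quarter of an hour.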
