Incident Response Content on InfoQ

Articles


Culture & Methods

Adaptive Frontline Incident Response: Human-Centered Incident Management

The third article in a series on how software companies adapted and continue to adapt to enhance their resilience zeros in on the sources that comprise most of your company’s adaptive resources: your frontline responders. In this article, we draw on our experiences as incident commanders with Twilio to share our reflections on what it means to cultivate resilient people.

Emily Ruppe, Ryan McDonald
on Feb 05, 2021
Culture & Methods

Learning from Incidents

Jessica DeVita (Netflix) and Nick Stenning (Microsoft) have been working on improving how software teams learn from incidents in production. In this article, they share some of what they’ve learned from the research community in this area, and offer some advice on the practical application of this work.

Jessica DeVita, Nick Stenning
on Jan 27, 2021
Culture & Methods

Shifting Modes: Creating a Program to Support Sustained Resilience

The second article in a series on how software companies adapted and continue to adapt to enhance their resilience explores how organizations can shift to a Learn & Adapt safety mode and describes the traits of an organization that is well poised to sustain this mode shift. The shift will not only make organizations safer but will also give them a competitive advantage.

Alex Elman
on Jan 11, 2021
Culture & Methods

Meeting the Challenges of Disrupted Operations: Sustained Adaptability for Organizational Resilience

The first article in a series on how software companies adapted and continue to adapt to enhance their resilience starts by laying a foundation for thinking about organizational resilience. It looks at what organizations can do structurally during surprising and disruptive events to establish conditions that help engineering teams adapt in practice and in real time as disruptive events occur.

Laura Maguire
on Dec 31, 2020
Culture & Methods

Q&A on the Book Techlash

The book Techlash by Ian Mitroff and Rune Storesund explains why companies need to become socially responsible by considering the potential negative outcomes of technology. It explains how proactive crisis management can help prevent a crisis by the early detection and correction of deviations from expected conditions.

Ben Linders, Ian Mitroff, Rune Storesund
on Aug 26, 2020
DevOps

Failover Conf Q&A on Building Reliable Systems: People, Process, and Practice

One of the biggest engineering challenges associated with maintaining or increasing the reliability of a system is knowing where to invest time and energy. InfoQ recently sat down with several engineers and technical leaders who are involved with the upcoming Failover Conf virtual event, and asked for their opinions on best practices for building and running reliable systems.

Angel Rivera, Tiffany Jachja, Heidi Waterhouse, Jim Walker, Dave Nielsen, Laura Hofmann
on Apr 20, 2020
DevOps

Crafting a Resilient Culture: Or, How to Survive an Accidental Mid-Day Production Incident

While working at Etsy, Ryn Daniels accidentally upgraded Apache on every single server that was running it, which caused a production incident. Explore lessons learned in this article, including that although automation and orchestration can be great, you should make sure you understand what’s happening under the hood and what to do if your automation goes awry.

Ryn Daniels
on Mar 06, 2019


