Incident Response Content on InfoQ
-
Learning from Incidents
Jessica DeVita (Netflix) and Nick Stenning (Microsoft) have been working on improving how software teams learn from incidents in production. In this article, they share some of what they’ve learned from the research community in this area, and offer some advice on the practical application of this work.
-
Shifting Modes: Creating a Program to Support Sustained Resilience
The second article in a series on how software companies adapted and continue to adapt to enhance their resilience explores how organizations can shift to a Learn & Adapt safety mode and describes the traits of an organization that is well poised to sustain this mode shift. This shift will not only make them safer but will also give them a competitive advantage.
-
Meeting the Challenges of Disrupted Operations: Sustained Adaptability for Organizational Resilience
The first article in a series on how software companies adapted and continue to adapt to enhance their resilience starts by laying a foundation for thinking about organizational resilience. It looks at what organizations can do structurally during surprising and disruptive events to establish conditions that help engineering teams adapt in practice and in real time as disruptive events occur.
-
Q&A on the Book Techlash
The book Techlash by Ian Mitroff and Rune Storesund explains why companies need to become socially responsible by considering the potential negative outcomes of technology. It explains how proactive crisis management can help prevent a crisis by the early detection and correction of deviations from expected conditions.
-
Failover Conf Q&A on Building Reliable Systems: People, Process, and Practice
One of the biggest engineering challenges associated with maintaining or increasing the reliability of a system is knowing where to invest time and energy. InfoQ recently sat down with several engineers and technical leaders who are involved with the upcoming Failover Conf virtual event, and asked for their opinions on best practices for building and running reliable systems.
-
Crafting a Resilient Culture: Or, How to Survive an Accidental Mid-Day Production Incident
While working at Etsy, Ryn Daniels accidentally upgraded Apache on every single server that was running it, which caused a production incident. Explore lessons learned in this article, including that although automation and orchestration can be great, you should make sure you understand what’s happening under the hood and what to do if your automation goes awry.