Incident Response Content on InfoQ
-
Google Cloud Incident Root-Cause Analysis and Remediation
Google disclosed its root-cause analysis of an incident affecting a few of its Cloud services that increased error rates between 33% and 87% for about 32 minutes, along with the steps it will take to improve the platform's performance and availability.
-
What Resiliency Means at Sportradar
At this year's QCon London conference, Pablo Jensen, CTO at Sportradar, talked about the practices and procedures in place at Sportradar to ensure its systems meet expected resiliency levels. Jensen explained that reliability is influenced not only by technical concerns but also by organizational structure and governance and by client support, and that it requires ongoing effort to continuously improve.
-
Post-Mortems Trends and Behaviors
Eric Siegler presented his findings at Velocity from analyzing 1000 post-mortems run by 125 different organizations over a six-month period. The main trends include the prevalence of blameless post-mortems; the fact that only 1 in 100 post-mortems refers to "human error"; and the observation that analyzing the lifecycle of incidents can provide useful insights into weaknesses in the incident response process.
-
Q&A with Sanjeev Sharma on His DevOpsDays NZ Keynote
Raf Gemmail speaks with IBM's Sanjeev Sharma about his upcoming DevOpsDays NZ closing keynote on the DevOps and SRE lessons we can learn from Apollo 13.
-
Handling Incidents and Outages
David Mytton, CEO at Server Density, shared with the devopsdays Amsterdam 2015 crowd how the company handles incidents and outages. The process is grounded in a key set of principles: frequent public updates, exhaustive logging of response activities, team effort, and effective escalation. Server Density draws a lot of inspiration from the aviation industry, which is renowned for its safety procedures.