Incident Response Content on InfoQ
-
Netflix Presents Telltale, an Application Health Monitoring Tool
The Netflix Engineering team recently blogged about Telltale, a monitoring and alerting tool that uses a variety of data sources to learn the typical health of an application. Telltale shows only the relevant data from an application. It also surfaces information about important events, such as nearby deployments and regional traffic evacuations.
-
GitHub Availability Report: Monthly Report Examining Incidents
Going beyond publishing the post mortem of major incidents, GitHub recently introduced the Availability Report. This report will not only have a description of incidents but also highlight what is being done to advance GitHub's engineering systems and practices.
-
Cloudflare’s 27 Minutes Outage Explained
Cloudflare recently suffered a partial outage that lasted 27 minutes. The outage caused traffic across the network to drop by 50%.
-
Incident Management During Remote Work
Michael Fisher, a technology enthusiast and group product manager at OpsRamp, recently blogged about how IT operations and DevOps teams can take a problem-first approach to the incident management process. Along the same lines, Dr. Laura Maguire and Nora Jones wrote about similar challenges as the world reacts to COVID-19.
-
GitHub Was down Multiple Times Last February: Here's Why
GitHub completed its internal investigation into the multiple service interruptions that affected its service for over eight hours last February. The root cause was a combination of unexpected database load variation and database configuration issues.
-
Improving Incident Management through Role Assignments and Game Days
John Arundel, principal consultant at Bitfield Consulting, shared his thoughts on how to ensure incidents are handled smoothly and quickly. He suggests assigning specific roles to each team member responding to the incident. Red team versus blue team exercises can also be leveraged to ensure the team is prepared to respond accurately and quickly.
-
Exploring Costs of Coordination During Outages with Laura Maguire at QCon London
Laura Maguire spoke at QCon London about how coordination efforts during outages carry a high cognitive cost. Maguire found that coordination during anomaly response is difficult, that existing models can undermine speedy resolution, and that the strategies to control the cost of coordination are adaptive to the type of incident. Moreover, tooling adds its own costs of coordination.
-
Netflix Open Sources Crisis Management Orchestration Tool
Netflix announced the release of Dispatch, their crisis management orchestration framework. Dispatch integrates with existing tools such as Jira, PagerDuty, and Slack to streamline the crisis management process. Dispatch includes integration endpoints for adding in support for additional tooling.
-
Involving Engineers in Incident Management: QCon London Q&A
Learning from past incidents can increase engineers' confidence in handling live incidents and convince them to join the on-call team. At QCon London 2020, Samuel Parkinson spoke about how we can benefit from past incidents and encourage engineers to get involved in incident management.
-
OpsRamp Introduces AI-Driven Suggestions for Incident Remediation
OpsRamp, a SaaS platform for hybrid infrastructure discovery, monitoring, management, and automation, has launched OpsQ Recommend Mode, a capability for incident remediation. OpsQ Recommend Mode provides predictive analytics to digital operations teams with the goal of reducing Mean Time to Resolution (MTTR).
-
How Did Things Go Right? Learning More from Incidents at Netflix: Ryan Kitchens at QCon New York
At QCon New York, Ryan Kitchens presented “How Did Things Go Right? Learning More from Incidents”. Key takeaways from the talk included: recovery is better than prevention; an incident occurs when there is a “perfect storm” of events -- there is no root cause; “stop reporting on the nines”, as user happiness is more important; and there is value in learning how things go right.
-
Splunk Releases Splunk Connected Experiences and Splunk Business Flow
Data analytics organisation Splunk recently released Splunk Connected Experiences, which delivers insights through augmented reality (AR), devices like Apple TV, and mobile applications. They also released Splunk Business Flow, which enables business operations professionals to gain insights across their customer journeys and business processes.
-
Scaling, Incident Management and Collaboration at New York Times Engineering
The New York Times Engineering Team wrote about their approach to scaling and incident management against the backdrop of increased traffic during the November 2018 US midterm elections.
-
OpsRamp Announces Improved Service Centricity, AIOps and Cloud Monitoring
OpsRamp, a service-centric AIOps software-as-a-service (SaaS) platform for the hybrid enterprise, has announced new topology maps, enhanced artificial intelligence for IT operations (AIOps) features and new monitoring capabilities for cloud native workloads.
-
Atlassian Announces Solutions for Incident Management
Atlassian announced on September 4 that they have launched a new product called Jira Ops and that they will acquire OpsGenie. Organizations can use Jira Ops for resolving incidents and doing post-mortems to learn from them. OpsGenie adds prompt and reliable alerting to Jira Ops.