Incident Response Content on InfoQ

News

DevOps

GitHub Was down Multiple Times Last February: Here's Why

GitHub completed its internal investigation into the multiple service interruptions that affected its service last February for over eight hours. The root cause was a combination of unexpected database load variation and database configuration issues.

Sergio De Simone
on Mar 31, 2020
DevOps

Improving Incident Management through Role Assignments and Game Days

John Arundel, principal consultant at Bitfield Consulting, shared his thoughts on how to ensure incidents are handled smoothly and quickly. He suggests assigning specific roles to each team member responding to the incident. Red team versus blue team exercises can also be leveraged to ensure the team is prepared to respond accurately and quickly.

Matt Campbell
on Mar 25, 2020
DevOps

Exploring Costs of Coordination During Outages with Laura Maguire at QCon London

Laura Maguire talked at QCon London about how coordination efforts during outages impose a high cognitive cost. Maguire found that coordination during anomaly response is difficult, that existing models can undermine speedy resolution, and that the strategies to control the cost of coordination are adaptive to the type of incident. Moreover, tooling carries additional coordination costs.

Christian Melendez
on Mar 13, 2020
DevOps

Netflix Open Sources Crisis Management Orchestration Tool

Netflix announced the release of Dispatch, its crisis management orchestration framework. Dispatch integrates with existing tools such as Jira, PagerDuty, and Slack to streamline the crisis management process. Dispatch includes integration endpoints for adding support for additional tooling.

Matt Campbell
on Mar 12, 2020
Culture & Methods

Involving Engineers in Incident Management: QCon London Q&A

Learning from past incidents can increase engineers' confidence in handling live incidents and convince them to join the on-call team. Samuel Parkinson spoke at QCon London 2020 about how we can benefit from past incidents and encourage engineers to get involved in incident management.

Ben Linders
on Mar 04, 2020
DevOps

OpsRamp Introduces AI-Driven Suggestions for Incident Remediation

OpsRamp, a SaaS platform for hybrid infrastructure discovery, monitoring, management, and automation, has launched OpsQ Recommend Mode, a capability for incident remediation. OpsQ Recommend Mode provides predictive analytics to digital operations teams with the goal of reducing Mean Time to Resolution (MTTR).

Helen Beal
on Feb 26, 2020
DevOps

How Did Things Go Right? Learning More from Incidents at Netflix: Ryan Kitchens at QCon New York

At QCon New York, Ryan Kitchens presented “How Did Things Go Right? Learning More from Incidents”. Key takeaways from the talk included: recovery is better than prevention; an incident occurs when there is a “perfect storm” of events, so there is no root cause; “stop reporting on the nines”, as user happiness is more important; and there is value in learning how things go right.

Daniel Bryant
on Jul 05, 2019
DevOps

Splunk Releases Splunk Connected Experiences and Splunk Business Flow

Data analytics organisation Splunk recently released Splunk Connected Experiences, which delivers insights through augmented reality (AR), mobile devices like Apple TV, and mobile applications. It also released Splunk Business Flow, which enables business operations professionals to gain insights across their customer journeys and business processes.

Helen Beal
on May 31, 2019
DevOps

Scaling, Incident Management and Collaboration at New York Times Engineering

The New York Times Engineering Team wrote about their approach to scaling and incident management against the backdrop of increased traffic during the November 2018 US midterm elections.

Hrishikesh Barua
on Mar 02, 2019
DevOps

OpsRamp Announces Improved Service Centricity, AIOps and Cloud Monitoring

OpsRamp, a service-centric AIOps software-as-a-service (SaaS) platform for the hybrid enterprise, has announced new topology maps, enhanced artificial intelligence for IT operations (AIOps) features and new monitoring capabilities for cloud native workloads.

Helen Beal
on Feb 05, 2019
Culture & Methods

Atlassian Announces Solutions for Incident Management

Atlassian announced on September 4 that they have launched a new product called Jira Ops and that they will acquire OpsGenie. Organizations can use Jira Ops for resolving incidents and doing post-mortems to learn from them. OpsGenie adds prompt and reliable alerting to Jira Ops.

Ben Linders
on Sep 20, 2018
Development

Google Cloud Incident Root-Cause Analysis and Remediation

Google disclosed its root-cause analysis of an incident affecting a few of its Cloud services that increased error rates between 33% and 87% for about 32 minutes, along with the steps it will take to improve the platform's performance and availability.

Sergio De Simone
on Jul 26, 2018
DevOps

What Resiliency Means at Sportradar

At this year's QCon London conference, Pablo Jensen, CTO at Sportradar, talked about the practices and procedures in place at Sportradar to ensure its systems meet expected resiliency levels. Jensen explained that reliability is influenced not only by technical concerns but also by organizational structure, governance, and client support, and that it requires ongoing effort to continuously improve.

Manuel Pais
on Apr 06, 2018
DevOps

Post-Mortems Trends and Behaviors

Eric Siegler presented his findings at Velocity from analyzing data from 1000 post-mortems run by 125 different organizations over a six-month period. Main trends include the prevalence of blameless post-mortems; the fact that only 1 in 100 post-mortems refers to "human error"; and that analyzing the lifecycle of incidents can provide useful insights into weaknesses in the incident response process.

Manuel Pais
on Nov 29, 2017
DevOps

Q&A with Sanjeev Sharma on His DevOpsDays NZ Keynote

Raf Gemmail speaks with IBM's Sanjeev Sharma about his upcoming DevOpsDays NZ closing keynote on the DevOps and SRE lessons we can learn from Apollo 13.

Rafiq Gemmail
on Sep 27, 2017
