Site Reliability Engineering Content on InfoQ

News

Development

GitHub Incident Analysis Shows How to Improve Service Reliability

On October 21, 2018, GitHub users experienced 24 hours of degraded service due to an incident triggered by routine maintenance work. During that time the site displayed outdated and inconsistent information, and webhooks and other internal services were unavailable. GitHub's post-incident report shows where things failed and suggests how to improve site reliability.

Sergio De Simone
on Nov 01, 2018
DevOps

Google Explains Why Others Are Doing SRE Wrong

Stephen Thorne, customer reliability engineer at Google, recently spoke at the DevOps Enterprise Summit London on what Site Reliability Engineering (SRE) is and why many organizations are failing to understand its basic premises and benefits.

Manuel Pais
on Jul 01, 2018
DevOps

Full Cycle Developers at Netflix: from Mindsets to Self-Service Tooling

The Netflix Tech Blog has shared the story of the "Edge Engineering" team's journey experimenting with approaches to building and operating services, which culminated in "Full Cycle Developers". The approach is showing promise at Netflix, where developers are responsible for certain operational aspects of service delivery and are supported by a range of self-service tooling.

Daniel Bryant
on Jun 17, 2018
DevOps

From Darwin to DevOps: John Willis and Gene Kim Talk about Life after The Phoenix Project

IT Revolution recently published an audiobook with nearly eight hours of conversation between Gene Kim and John Willis: Beyond the Phoenix Project – the Origins and Evolution of DevOps.

Helen Beal
on May 23, 2018
Architecture & Design

Microservices and Site Reliability Engineering

A recent article discusses how the complexity introduced by microservices can initially seem at odds with Site Reliability Engineering (SRE), and how companies such as Google are resolving that tension: development groups continue to embrace microservices, while they and their SRE teams get the tools and shared understanding needed to make the two work well together.

Mark Little
on Apr 29, 2018
DevOps

What It Means to Be a Site Reliability Engineer According to a Survey from Catchpoint

Site Reliability Engineering, an approach created at Google in 2003, sits at the intersection of software engineering and IT operations. Google described it in detail in its 2016 book, Site Reliability Engineering: How Google Runs Production Systems. Catchpoint, a digital experience intelligence provider, surveyed 416 Site Reliability Engineers (SREs) to understand what it means to be an SRE.

Helen Beal
on Apr 13, 2018
DevOps

How DevOps Principles Are Being Applied to Networking

Practices from the DevOps world are being adopted for managing networking services. Vendor hardware, configuration tools and deployment modes have made programmatic configuration and automation of network devices and functions easier.

Hrishikesh Barua
on Jan 22, 2018
DevOps

How ING Bank Does SRE

Janna Brummel and Robin van Zijll, from ING Netherlands, talked at the Velocity conference in London about how poor availability of the bank's internet banking systems prompted it to implement an SRE culture. A centralized SRE team was set up in the Netherlands to provide tooling, consulting and education on reliability to product teams (known internally as BizDevOps squads).

Manuel Pais
on Dec 30, 2017
DevOps

Q&A with Sanjeev Sharma on His DevOpsDays NZ Keynote

Raf Gemmail speaks with IBM's Sanjeev Sharma about his upcoming DevOpsDays NZ closing keynote on the DevOps and SRE lessons we can learn from Apollo 13.

Rafiq Gemmail
on Sep 27, 2017
DevOps

Choose Your Own Adventure: Chaos Engineering at QCon New York 2017

Nora Jones, senior chaos engineer at Netflix, talked about chaos engineering at QCon New York 2017. She presented the different stages of chaos engineering adoption and shared stories from her experiences at Jet and Netflix.

Pierre-Luc Maheu
on Aug 22, 2017

Newer News

Older News

Unlock the full InfoQ experience

Don't have an InfoQ account?

Topics

Expanding Swift from Apps to Services

[Video Podcast] Improving Valkey with Madelyn Olson

Building LLMs in Resource-Constrained Environments: A Hands-On Perspective

Scaling to 100+ as a Director: Lessons from Growing Engineering Organizations

From Alert Fatigue to Agent-Assisted Intelligent Observability

Helpful links

Choose your language

News

GitHub Incident Analysis Shows How to Improve Service Reliability

Google Explains Why Others Are Doing SRE Wrong

Full Cycle Developers at Netflix: from Mindsets to Self-Service Tooling

From Darwin to DevOps: John Willis and Gene Kim Talk about Life after The Phoenix Project

Microservices and Site Reliability Engineering

What It Means to Be a Site Reliability Engineer According to a Survey from Catchpoint

How DevOps Principles Are Being Applied to Networking

How ING Bank Does SRE

Q&A with Sanjeev Sharma on His DevOpsDays NZ Keynote

Choose Your Own Adventure: Chaos Engineering at QCon New York 2017

How CNAME Ordering in RFC Specs Caused Cloudflare 1.1.1.1 Outage

Expanding Swift from Apps to Services

Google Pushes for gRPC Support in Model Context Protocol

Uber Moves In-House Search Indexing to Pull-Based Ingestion in OpenSearch

[Video Podcast] Improving Valkey with Madelyn Olson

LinkedIn Leverages GitHub Actions, CodeQL, and Semgrep for Code Scanning

Getting Feedback from Test-Driven Development and Testing in Production

Scaling to 100+ as a Director: Lessons from Growing Engineering Organizations

The Technical Founder's Path: Code, Leadership, and Balance

Building LLMs in Resource-Constrained Environments: A Hands-On Perspective

Next Moca Releases Agent Definition Language as an Open Source Specification

Cloudflare Demonstrates Moltworker, Bringing Self-Hosted AI Agents to the Edge

Datadog Integrates Google Agent Development Kit into LLM Observability Tools

From Alert Fatigue to Agent-Assisted Intelligent Observability

Etleap Launches Iceberg Pipeline Platform to Simplify Enterprise Adoption of Apache Iceberg

QCon London

QCon AI Boston

QCon San Francisco

InfoQ Software Architects' Newsletter

News