Site Reliability Engineering Content on InfoQ
-
Thoughtfully Training SRE Apprentices: Establishing Padawan and Jedi Matches
This article shares how Padawans and Jedis can inspire and teach us how to help people of a wide variety of backgrounds, ages, and experience levels to observe and understand failures in production. It covers practical lessons learned and shares how you can create and roll out a program for SRE Apprentices within your organization. It also shares feedback from the SRE Apprentices themselves.
-
DevOps and Cloud InfoQ Trends Report - July 2021
This article summarizes how we see the "cloud computing and DevOps" space in 2021, focusing on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.
-
Site Reliability Engineering Experiences at Instana
With the popularity of distributed architectures, distributed databases, containers, and container orchestrators, an approach that emphasizes automation and a culture of collaboration is a natural fit for modern-day operations. Site Reliability Engineering takes engineering practices that have been established and proven in software engineering and applies them to the field of operations.
-
Shifting Modes: Creating a Program to Support Sustained Resilience
The second article in a series on how software companies adapted and continue to adapt to enhance their resilience explores how organizations can shift to a Learn & Adapt safety mode and describes the traits of an organization that is well poised to sustain this mode shift. This shift will not only make them safer but will also give them a competitive advantage.
-
Failover Conf Q&A on Building Reliable Systems: People, Process, and Practice
One of the biggest engineering challenges associated with maintaining or increasing the reliability of a system is knowing where to invest time and energy. InfoQ recently sat down with several engineers and technical leaders who are involved with the upcoming Failover Conf virtual event, and asked for their opinions on best practices for building and running reliable systems.
-
Data-Driven Decision Making – Product Operations with Site Reliability Engineering
The Data-Driven Decision Making Series provides an overview of how the three main activities in software delivery (Product Management, Development, and Operations) can be supported by data-driven decision making. In Operations, SRE’s SLIs and SLOs can be used to steer the reliability of services in production.
-
How to Avoid Cascading Failures in Distributed Systems
Cascading failures are failures that involve some kind of feedback mechanism. In distributed software systems they generally involve a feedback loop where some event causes either a reduction in capacity, an increase in latency, or a spike of errors. Laura Nolan explores them using public accounts of real production incidents.
-
SLOs Are the API for Your Engineering Team
SLOs provide a simple common language for evaluating risk in terms of error budgets. SLOs save everyone involved both time and energy, which you can redirect toward more important things, like keeping your customers happy.
-
Sustainable Operations in Complex Systems with Production Excellence
Successful long-term approaches to production ownership and DevOps require cultural change in the form of production excellence. Teams are more sustainable if they have well-defined measurements of reliability, the capability to debug new problems, a culture that fosters spreading knowledge, and a proactive approach to mitigating risk.
-
Book Review: Site Reliability Engineering - How Google Runs Production Systems
"Site Reliability Engineering - How Google Runs Production Systems" is an open window into Google's experience and expertise in running some of the largest IT systems in the world. The book describes the principles that underpin the Site Reliability Engineering discipline. It also details the key practices that allow Google to grow at breakneck speed without sacrificing performance or reliability.