Site Reliability Engineering Content on InfoQ

Articles

Culture & Methods

Employing Team-Based Agile Coaching to Establish SRE in an Organization

Establishing SRE in a software delivery organization typically requires a socio-technical transformation. Operations teams need to learn how to provide a scalable SRE infrastructure that enables development teams to run their services efficiently. This paper presents how agile coaching was employed to run an SRE transformation in a product delivery organization of 25 teams.

Philipp Gündisch Vladyslav Ukis
on Aug 23, 2022
Culture & Methods

Establishing a Scalable SRE Infrastructure Using Standardization and Short Feedback Loops

This article explores an SRE implementation where the operations team builds and runs the SRE infrastructure and the development teams build and run the services leveraging the SRE infrastructure. This SRE solution enables the software delivery organization to scale the number of services in operation without linearly scaling the number of people required to operate the services.

Philipp Gündisch Vladyslav Ukis
on Jun 27, 2022
DevOps

DevOps and Cloud InfoQ Trends Report – June 2022

This article summarizes how we see the "cloud computing and DevOps" space in 2022, which focuses on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.

Steef-Jan Wiggers Matt Campbell Lena Hall Renato Losio Daniel Bryant Feynman Zhou Mostafa Radwan Shaaron A Alvares
on Jun 21, 2022
Mobile

InfoQ Mobile and IoT Trends Report 2022

This report summarizes the views of the InfoQ editorial team and of several practitioners from the software industry about emerging trends in a number of areas that we collectively label the mobile and IoT space. This is a rather heterogeneous space comprising devices and gadgets from smartphones to smart watches, from IoT appliances to smart glasses, voice-driven assistants, and so on.

Sergio De Simone Abhijith Krishnappa Tridib Bolar
on Feb 21, 2022
Culture & Methods

Improving Speed and Stability of Software Delivery Simultaneously at Siemens Healthineers

In this article, we focus on the software delivery process at Siemens Healthineers Digital Health. The process is subject to the strict regulations that apply in the medical industry. We describe our journey of transforming the process towards speed and stability. Both measures improved at the same time during the transformation, confirming research from the “Accelerate” book.

Vladyslav Ukis
on Aug 24, 2021
Culture & Methods

Thoughtfully Training SRE Apprentices: Establishing Padawan and Jedi Matches

This article shares how Padawans and Jedis can inspire and teach us how to help people of a wide variety of backgrounds, ages, and experience levels to observe and understand failures in production. It covers practical lessons learned and shares how you can create and roll out a program for SRE Apprentices within your organization. It also shares feedback from the SRE Apprentices themselves.

Tammy Bryant Butow
on Aug 04, 2021
DevOps

DevOps and Cloud InfoQ Trends Report - July 2021

This article summarizes how we see the "cloud computing and DevOps" space in 2021, which focuses on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.

Matt Campbell Steef-Jan Wiggers Shaaron A Alvares Helen Beal Daniel Bryant Lena Hall Rupert Field Aditya Kulkarni Jared Ruckle Renato Losio Holly Cummins
on Jul 19, 2021
Culture & Methods

Site Reliability Engineering Experiences at Instana

With the popularity of distributed architectures, distributed databases, containers, and container orchestrators, an approach that emphasizes automation and a culture of collaboration is a natural fit for modern-day operations. Site Reliability Engineering takes engineering practices that have been established and proven in software engineering and applies them to the field of operations.

Bastian Spanneberg
on Apr 29, 2021
Culture & Methods

Shifting Modes: Creating a Program to Support Sustained Resilience

The second article in a series on how software companies have adapted, and continue to adapt, to enhance their resilience explores how organizations can shift to a Learn & Adapt safety mode and examines the traits of an organization that is well poised to sustain this mode shift. This shift will not only make them safer but will also give them a competitive advantage.

Alex Elman
on Jan 11, 2021
DevOps

Failover Conf Q&A on Building Reliable Systems: People, Process, and Practice

One of the biggest engineering challenges associated with maintaining or increasing the reliability of a system is knowing where to invest time and energy. InfoQ recently sat down with several engineers and technical leaders who are involved with the upcoming Failover Conf virtual event, and asked their opinion on the best practices for building and running reliable systems.

Angel Rivera Tiffany Jachja Heidi Waterhouse Jim Walker Dave Nielsen Laura Hofmann
on Apr 20, 2020
Culture & Methods

Data-Driven Decision Making – Product Operations with Site Reliability Engineering

The Data-Driven Decision Making Series provides an overview of how the three main activities in software delivery (Product Management, Development, and Operations) can be supported by data-driven decision making. In Operations, SRE’s SLIs and SLOs can be used to steer the reliability of services in production.

Vladyslav Ukis
on Mar 25, 2020
Architecture & Design

How to Avoid Cascading Failures in Distributed Systems

Cascading failures are failures that involve some kind of feedback mechanism. In distributed software systems they generally involve a feedback loop where some event causes either a reduction in capacity, an increase in latency, or a spike of errors. Laura Nolan explores them using public accounts of real production incidents.

Laura Nolan
on Feb 20, 2020
