Reliability Content on InfoQ
-
How Cell-Based Architecture Enhances Modern Distributed Systems
Cell-based architecture has emerged as a response to many challenges associated with distributed systems. It employs the bulkhead pattern to isolate failures to a fraction of the affected infrastructure footprint and prevent widespread impact. Cells can also help organize large architectures into domain-bound deployment and delivery units, which provides essential sociotechnical benefits.
-
Prepare to Be Unprepared: Investing in Capacity to Adapt to Surprises in Software-Reliant Businesses
Incidents are often perceived as extraordinary aberrations, unconnected to "normal" work. For over twenty years, the field of Resilience Engineering has aimed at flipping this approach around — by understanding what makes incidents so rare (relative to when and how they do not happen) and so minor (relative to how much worse they can be) and deliberately enhancing what makes that possible.
-
Data-Driven Decision Making - Software Delivery Performance Indicators at Different Granularities
Optimizing a software delivery organization is not a straightforward process, nor one that is standardized across the software industry. Getting the organization to analyze the data and act on it is a difficult undertaking. This article presents insights into how a socio-technical framework for optimizing a software delivery organization has been set up and brought to the point of regular use.
-
AIOps: Site Reliability Engineering at Scale
AIOps can simplify and streamline processes, which can reduce the mental burden on employees while improving communication and collaboration between departments.
-
Assessing Organizational Culture to Drive SRE Adoption
SRE adoption is greatly influenced by organizational culture. This article describes how to assess an organization's culture around production operations at the beginning of an SRE transformation. It provides a roadmap of small culture changes that accumulate over time, and shows how leadership facilitated the necessary culture changes.
-
The Service and the Beast: Building a Windows Service that Does Not Fail to Restart
Windows Services play a key role in the Microsoft Windows operating system, and support the creation and management of long-running processes. When “Fast Startup” is enabled and the PC is started after a regular shutdown, though, services may fail to restart. The aim of this article is to create a persistent service that will always run and restart after Windows restarts, or after shutdown.
-
Building & Operating High-Fidelity Data Streams
At QCon Plus 2021 last November, Sid Anand, chief architect at Datazoom and PMC Member at Apache Airflow, presented on building high-fidelity nearline data streams as a service within a lean team. In this talk, Anand provides a master class on building high-fidelity data streams from the ground up.
-
Employing Team-Based Agile Coaching to Establish SRE in an Organization
Establishing SRE in a software delivery organization typically requires a socio-technical transformation. Operations teams need to learn how to provide a scalable SRE infrastructure to enable development teams to run their services efficiently. This paper presents how agile coaching has been employed to run an SRE transformation in a 25-team-strong product delivery organization.
-
Establishing a Scalable SRE Infrastructure Using Standardization and Short Feedback Loops
This article explores an SRE implementation where the operations team builds and runs the SRE infrastructure and the development teams build and run the services leveraging the SRE infrastructure. This SRE solution enables the software delivery organization to scale the number of services in operation without linearly scaling the number of people required to operate the services.
-
Building Tech at Presidential Scale
Dan Woods discusses the unique challenges of building and running tech for a presidential cycle. Woods also describes how ML was applied at foundational points to reduce operating costs and some of the architectural choices made.
-
Improving Speed and Stability of Software Delivery Simultaneously at Siemens Healthineers
In this article, we focus on the software delivery process at Siemens Healthineers Digital Health. The process is subject to the strict regulations that apply in the medical industry. We show our journey of transforming the process towards speed and stability. Both measures improved at the same time during the transformation, confirming research from the “Accelerate” book.
-
Site Reliability Engineering for Native Mobile Apps
In this article, we will describe how we can apply Site Reliability Engineering (SRE) principles to mobile app development. First, we will describe the key SRE tenets and what tools can be used to implement them. Then, we will delve into organization topology, i.e. how an organization can be designed to adopt SRE for mobile app development.