Key Takeaways
- The first step in rolling out SRE in a software delivery organization is to assess the teams’ current operational capabilities. This assessment yields a classification of teams, which is used to set up a bespoke strategy for team-based agile coaching to institutionalize SRE in the organization.
- The rollout of the new SRE infrastructure needs to be done in an agile manner by establishing tight feedback loops between the operations team implementing the infrastructure and the service owner teams being onboarded.
- Ideally, the infrastructure should always be just one step ahead of the teams, no more, to ensure that its features solve a real need of at least one team and are applicable to many teams.
- A team will undergo many mindset changes while maturing their SRE practice. It is rarely possible to make big steps quickly. Rather, a team needs to make their share of mistakes and learn from experience to increase maturity in doing operations.
- Team-based agile coaching is a very effective way to accompany the teams on their journey. Each team faces unique challenges with people, process and technology, and needs dedicated facilitation to gradually improve its operational maturity over time. There is no silver bullet!
Introduction
Establishing SRE in a software delivery organization typically requires a socio-technical transformation. Operations teams need to learn how to provide a scalable SRE infrastructure to enable development teams to run their services efficiently. This might be new to the operations teams who often have experience with doing operations themselves instead of enabling others to do so.
Development teams need to learn to go on call for their services to the extent agreed in the organization. This might be new to the development teams who are often used to developing software and handing it over to operations teams to run in production.
Product management needs to learn to be involved in operational decisions to enable efficient and timely detection of user experience deterioration and prioritization to restore it. This might be new to the product management who are often used to delegating all operational concerns to operations.
SRE offers a model to align the product delivery organization on operational concerns. Introducing it in an organization new to the topic is a large transformational endeavor. It requires team facilitation by skilled coaches.
In this paper, we present how agile coaching has been employed to run an SRE transformation. Our experience is based on the SRE transformation we ran in a 25-team strong product delivery organization owning the Siemens Healthineers teamplay digital health platform.
Strategy
Strategically, we followed the SRE concept pyramid from "Establishing SRE Foundations" to establish SRE in the organization by climbing the pyramid from bottom to top.
The pyramid provides a roadmap for driving the SRE transformation in the organization. The higher a team climbs up the pyramid, the more sustainable its SRE practice becomes. The more teams climb up the pyramid, the more sustainable the organization’s SRE practice becomes, eventually yielding a de facto standard for running services in the organization.
Assessing the teams’ operational capabilities
When starting an SRE transformation, the first step is to assess the current operational capabilities of development and operations teams. These can range from being unfamiliar with modern operations to running an advanced operations practice.
The operations teams might be running software based on IT parameters such as memory utilization, CPU usage and similar, without deep knowledge of the product, let alone of the ways the product is actually used by customers. At the other end of the spectrum, the operations teams might own an easy-to-use and powerful infrastructure that enables development teams to do operations. Many different facets can reside in between the two extremes: shared on-call duty between operations and development teams, exclusive on-call duty for all production services by the operations teams, etc.
The development teams might be completely unaware of running software, except for solving issues in an ad-hoc manner when escalated by customers. This is a typical case for a traditional development organization that has never done operations before. At the other end of the spectrum, the development teams might be on call for the services they own during and outside business hours. Here, too, many different ownership facets can lie between the two extremes: shared on-call duty between operations and development teams, on-call duty during business hours for the development teams, ownership of the operations infrastructure by a dedicated development team, etc.
The result of the assessment is a classification of operations and development teams in terms of operational capability maturity. In all likelihood, the classification will not turn out to be very diverse within a product delivery organization. Typically, there is a prevalent way operations is done within the organization. This is beneficial for the SRE coaches because it allows them to think about a general approach to transforming the teams toward SRE that is suitable for the organization at hand.
SRE coaching is more directional than pure coaching, where coaches only guide using questions. However, SRE coaching is not mentoring either, where the mentor acts as a guru guiding the students. It is somewhere in between, leaning more towards coaching than mentoring. Oftentimes, SRE coaches explore the terrain with a given team using deep questions about the domain, the users, the technology, etc. While SRE coaches have the generic goal of institutionalizing SRE in the team, they definitely do not have clear answers readily available for how to do so.
The SRE coaches at teamplay found a way to drive changes in the operations teams so that, as a result of the transformation, the operations teams enabled the development teams to do service operations using a state-of-the-art SRE infrastructure. By the same token, the SRE coaches found a way to drive changes in the development teams so that, as a result of the transformation, the development teams go on call for their services to the extent agreed with the operations teams.
Critically, the SRE coaches needed to work with the operations, development and product teams to establish a suitable on-call model for the different services in the organization. Needless to say, all of that can only happen with general support for SRE from the organizational leadership.
Assessment
When we started our SRE transformation, none of our teams had experience with SRE, nor did they have a structured process for prioritizing reliability versus feature work. Given this, we created the following classification of development and operations teams:
- Operations teams
- A little experience in building infrastructure to enable development teams to do operations
- On-call for the infrastructure owned
- Development teams
- Category 1: "Too far away from production"
- Full focus on development
- No own production monitoring
- No on-call
- Fixing escalated production issues as they arrive
- Category 2: "Close to production"
- Adequate production awareness
- Own custom-built production monitoring with resource-based alerts
- Reactions to monitoring signals without a structured approach to on-call
- Fixing escalated production issues as they arrive
- Improving own monitoring and services as part of reactions to monitoring signals and fixing escalated production issues
- Category 3: "New team"
- Services not yet deployed to production
- Category 1: "Too far away from production"
The number of development teams in the organization was growing. We wanted to build up the SRE infrastructure quickly to put new teams on SRE at the beginning of their journey to production, in order to avoid the inevitably difficult SRE transformation down the road.
Using the SRE infrastructure
The SRE infrastructure is a set of tools, algorithms and visualization dashboards that together enable a team to operate their services reliably and efficiently at scale. The implementation, maintenance and operations of the SRE infrastructure requires a dedicated, possibly small, team of developers.
When building up the SRE infrastructure, we established a very tight working mode between the operations team doing the implementation and the development teams being onboarded onto the infrastructure (see "Establishing a Scalable SRE Infrastructure Using Standardization and Short Feedback Loops"). We ensured that the operations team was always just one step ahead of the development teams, not more, in terms of feature implementation. Every implemented feature went into immediate use by the development teams and was iterated upon relentlessly based on feedback.
Getting the SRE infrastructure used requires development teams to be ready for SRE adoption. The teams need to understand the advantages of operating the services using SRE as opposed to other ways of doing so (e.g. ITIL, COBIT, modeling, ad-hoc operations - see "Establishing SRE Foundations" for a comparison). In order to familiarize the teams with the SRE methodology, concepts and infrastructure, we used team coaching as a core method.
Team coaching
According to the three team categories defined above, the development organization was divided into three respective groups of teams. The SRE coaches started individual coaching sessions with each team. One of the most important aspects turned out to be the involvement of the entire team, including product owners, in regular SRE coaching meetings. This way, the mindset of all roles could be shifted towards SRE at the same time. This is a necessary prerequisite for succeeding with SRE in a team.
Another important aspect is the assessment of the mindset progress the team has made regarding SRE. For a team that is just beginning to understand the concepts of service ownership, the coaching sessions focused on conveying knowledge about their services’ actual performance in the cloud, because these teams tend to overestimate that performance. A team further ahead on the journey that has already set up use-case-based alerts can be accelerated directly by introducing SLOs to complete the service monitoring.
It is very important to understand the team's challenges at their current step on the road to SRE. Teams reluctant to adopt SRE first had to fail, e.g. by suffering serious outages, before drawing the right conclusions and stepping forward. Teams overestimating their service performance got flooded with alerts before realizing what their service performance looks like in reality. Let them fail, but ensure the right conclusions are drawn and support the teams with the information they need at their current step. The shortest path is often not the fastest. Coaching the teams this way dramatically speeds up the progress in mindset change.
We started with basic introductions to SRE concepts, checking how logging is done and enabling logging as required by the SRE infrastructure. Following this, we defined initial SLOs, deployed alerts on the initial SLOs with the SRE infrastructure, obtained the SLO breaches and let the teams do an initial analysis of the breaches.
Through iteration, we coached the teams to calibrate the SLOs so that their breaches reflect the customer experience well. A broken SLO should mean a significantly impacted customer experience. Otherwise, the team will not have the appetite to investigate the SLO breach and bring the service back within the SLO. Finally, we introduced error budget-based decision-making to prioritize reliability investments using SRE visualization dashboards.
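To make these steps concrete, the sketch below shows, in plain Python, how an initial SLO record and a simple breach check might look. It is a minimal illustration, not the teamplay SRE infrastructure; the service name, targets and field names are assumptions made for the example.

```python
# A minimal illustration of an initial SLO record and a breach check.
# Service names, targets and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Slo:
    service: str
    sli: str                 # e.g. "availability" or "latency_p95_ms"
    target: float            # e.g. 0.995 for availability, 800 for p95 latency in ms
    higher_is_better: bool   # availability: higher is better; latency: lower is better

def is_breached(slo: Slo, measured: float) -> bool:
    """An SLO is breached when the measured SLI falls on the wrong side of the target."""
    return measured < slo.target if slo.higher_is_better else measured > slo.target

initial_slos = [
    Slo("report-service", "availability", 0.995, higher_is_better=True),
    Slo("report-service", "latency_p95_ms", 800, higher_is_better=False),
]

measured = {"availability": 0.991, "latency_p95_ms": 650}

for slo in initial_slos:
    if is_breached(slo, measured[slo.sli]):
        print(f"SLO breach: {slo.service} {slo.sli}={measured[slo.sli]} vs target {slo.target}")
```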
Standardizing logging
To begin with, we needed to standardize the naming conventions of services, endpoints and deployments. This is required for the SRE infrastructure to process the logs of all teams in a uniform manner. The logs should contain all the necessary information to identify which service produced them.
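As an illustration of the kind of standardization meant here, the sketch below emits a log record that carries the standardized identification fields alongside the payload. The concrete field and deployment names are assumptions for the example, not the actual teamplay conventions.

```python
# Illustrative log record carrying standardized identification fields.
# Field names and values are assumptions, not the actual teamplay conventions.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("report-service")

def log_request(endpoint: str, status_code: int, duration_ms: float) -> None:
    """Emit a structured log line that identifies service, endpoint and deployment."""
    record = {
        "service": "report-service",   # standardized service name
        "endpoint": endpoint,          # standardized endpoint name
        "deployment": "weu-prod-01",   # standardized deployment identifier
        "status_code": status_code,
        "duration_ms": duration_ms,
    }
    logger.info(json.dumps(record))

log_request("GET /reports/{id}", 200, 123.4)
```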
As a next step, we ensured that all the teams use the same logging facilities. We focused on a Microsoft Azure facility called Azure Monitor. Specifically, we use two Azure Monitor features: Application Insights and Log Analytics.
Application Insights offers fully automated tracing of HTTP calls between services. The effort to activate it is minimal for the development teams. However, the depth of telemetry automation comes with a large amount of enforced standardization for the logs, which limits the flexibility for custom logging.
By contrast, Log Analytics allows creating tables with a custom schema so that the attribute extraction of custom data is performed automatically at log ingestion time. This saves some effort in log analysis, but the HTTP tracing capabilities are not present.
Depending on the use case, the teams can therefore use Application Insights, Log Analytics, or both. The log query language is the same everywhere: Kusto Query Language (KQL). At teamplay we often use Application Insights and Log Analytics simultaneously: Application Insights is used to trace all the HTTP traffic for monitoring latency and availability SLOs, while Log Analytics stores custom logs for debugging and use-case monitoring.
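As an example of the kind of query this setup enables, the sketch below computes a per-endpoint availability figure with KQL over request telemetry, using the azure-monitor-query SDK. It assumes a workspace-based Application Insights resource where request telemetry lands in the AppRequests table; the workspace ID is a placeholder and the query is illustrative, not the teamplay production query.

```python
# Sketch: per-endpoint availability over the last day from request telemetry.
# Assumes workspace-based Application Insights (AppRequests table);
# the workspace ID is a placeholder.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

# Availability SLI: share of requests not answered with a 5xx result code.
KQL = """
AppRequests
| summarize total = count(), bad = countif(toint(ResultCode) >= 500) by Name
| extend availability = 1.0 * (total - bad) / total
| order by availability asc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, KQL, timespan=timedelta(days=1))

for table in response.tables:
    for row in table.rows:
        print(row)
```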
Another important aspect was to ensure that the teams use HTTP return codes properly, according to their definition. For instance, returning 500 Internal Server Error should be done only when the error is genuinely unexpected. In cases of known errors, there are many other HTTP return codes beyond 500 that can and should be used. Wrong use of HTTP return codes leads to skewed availability error budget calculations, as unavailability is detected based on specific error codes.
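The effect of return-code discipline on the availability SLI can be illustrated with a small calculation. The numbers below are made up: the same 20 known validation errors are once reported with a client-facing 4xx code and once, wrongly, with a 500.

```python
# Illustrative calculation: how misused 5xx codes skew the availability SLI.
def availability(status_codes: list[int]) -> float:
    """Share of requests not answered with a 5xx status code."""
    total = len(status_codes)
    bad = sum(1 for code in status_codes if code >= 500)
    return (total - bad) / total if total else 1.0

correct = [200] * 980 + [422] * 20   # known validation errors reported as 422
sloppy  = [200] * 980 + [500] * 20   # same errors wrongly reported as 500

print(availability(correct))  # 1.0   -> no error budget consumed
print(availability(sloppy))   # 0.98  -> 2% of requests look unavailable
```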
Setting SLOs
Selecting relevant SLIs and setting SLOs for them is at the heart of running services using SRE.
One of the biggest learnings from setting SLOs is that it is not straightforward to get the entire team to clearly articulate the customer for whose experience the SLO is optimized. The better this can be done, the more effective the SLO will be. Starting with a clearly articulated customer segment leads to an SLO that reflects the experience of that customer segment; if the SLO is broken, the experience of that customer segment is severely impacted. As a result, the people on call understand the cause-and-effect relationship between the broken SLO and the resulting customer impact well. This, in turn, creates high motivation for the people on call to quickly bring the service back within the SLO.
Our experience is that the teams initially set the SLOs too tightly for what their services can deliver in terms of reliability. Typically, the teams cannot defend the SLOs set in the first round. This results in lots of SLO breaches, whose analysis goes beyond the team’s cognitive capacity. Iteration is required to arrive at SLOs that satisfy three properties:
- The SLOs reflect well the experience of a clear customer segment
- The SLOs can be defended by the team
- The SLOs are agreed by all team members, including the product owner
Iterating on SLOs remains a key aspect of the operations process under SRE also after the initial rounds. Every time new features are added to a service, the team needs to check whether the existing SLOs can still be defended and whether new SLOs would make sense. Every time the service starts serving new customer segments, existing SLOs need to be reconsidered and the potential for new SLOs needs to be discussed. Acting this way requires a very long maturation process in teams.
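A simple way to ground this calibration is to compare the first-round target with what the service has actually delivered over past error budget periods. The numbers below are illustrative; the principle is that a defendable target starts at or below demonstrated performance and is tightened iteratively.

```python
# Illustrative SLO calibration from historical availability per error budget period.
achieved = [0.9962, 0.9987, 0.9971, 0.9993]  # last four error budget periods (made up)

first_round_target = 0.9995                  # too tight: the team cannot defend it
calibrated_target = min(achieved)            # defendable: met in every observed period

breached_periods = sum(a < first_round_target for a in achieved)
print(f"first-round target {first_round_target} would have been breached in "
      f"{breached_periods} of {len(achieved)} periods")
print(f"calibrated target {calibrated_target} can be tightened step by step "
      f"as reliability work pays off")
```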
Reacting to SLO breaches
After the SLO definition, getting the teams to react to the SLO breaches is the next big milestone on the SRE adoption journey. In fact, this is the moment where the rubber meets the road and the teams literally have to change their daily work habits. Until a team reacts to the SLO breaches at least on a daily basis, the service operations are not going to improve.
The retrieval of the SLO breaches was initially done using simple means such as dedicated Slack and Teams channels. As maturity grows, the need for professional on-call management tools, such as PagerDuty (used by us) or OpsGenie, emerges. At some point, the SRE infrastructure should support both options.
The professional on-call management tools require time-intensive onboarding. In the initial stages of learning how to react to the SLO breaches, it is more important for a team to focus on changing its ways of working rather than on learning how to use useful but rather complicated tools.
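A sketch of the "simple means" mentioned above: pushing an SLO breach notification into a chat channel via an incoming webhook, before a professional on-call tool is introduced. The webhook URL is a placeholder; both Slack and Teams incoming webhooks accept a simple JSON payload with a text field.

```python
# Sketch: notify a chat channel about an SLO breach via an incoming webhook.
# The webhook URL is a placeholder; error handling is omitted for brevity.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify_breach(service: str, sli: str, measured: float, target: float) -> None:
    payload = {"text": f"SLO breach: {service} {sli}={measured} (target {target})"}
    request = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

notify_breach("report-service", "availability", 0.991, 0.995)
```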
What the team needs to rethink is how to transition from being a development team to being a development & operations team. In each team, experimentation will be required along the following dimensions:
- What are the on-call hours for the team?
- Who in the team has the knowledge to go on-call?
- Can useful runbooks be created to spread the on-call knowledge within the team?
- Who is going on-call?
- What is the rotation of going on-call?
- What is the duration of an on-call shift?
- Can the people on-call do any development, grooming or planning work?
- How to determine the criticality of an SLO breach?
- Who decides on a hotfix rollout or rollback, and how?
- Whom to contact, and how, if the person on call cannot analyze an SLO breach?
The questions above demonstrate that a team undergoes a major change in ways of working when confronted with reacting to SLO breaches in a consistent manner.
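One lightweight way to capture the answers to these questions, before a professional on-call tool is adopted, is a small team-owned on-call definition like the sketch below. All names, hours and URLs are illustrative assumptions.

```python
# Sketch: a team-owned on-call definition capturing answers to the questions above.
# All names, hours and URLs are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class OnCallPolicy:
    team: str
    hours: str                # e.g. business hours only, to start with
    shift_length_days: int
    rotation: list[str]       # engineers with enough knowledge to go on call
    runbook_url: str
    escalation_contact: str   # whom to contact if a breach cannot be analyzed

policy = OnCallPolicy(
    team="report-service team",
    hours="Mon-Fri 09:00-17:00",
    shift_length_days=7,
    rotation=["alice", "bob", "carol"],
    runbook_url="https://wiki.example.com/report-service/runbooks",
    escalation_contact="platform-operations",
)

def engineer_on_call(week_number: int) -> str:
    """Simple weekly rotation through the team members."""
    return policy.rotation[week_number % len(policy.rotation)]

print(engineer_on_call(week_number=7))  # -> "bob"
```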
Analyzing error budget depletion
The next step on the SRE adoption journey is not only to react to the SLO breaches on an ongoing basis, but also to take the error budget depletion into account. Every time an SLO gets broken, it chips away a little bit of the error budget. The SRE infrastructure visualizes the error budget depletion; the teams analyze it and take action on three time horizons: short, mid and long-term, as summarized in the table below.
| Time horizon | Action |
| --- | --- |
| Short-term | Ongoing reaction to the SLO breaches. |
| Mid-term | Analyze the error budget depletion trend in the current error budget period. Decide and act according to the error budget policy. |
| Long-term | Analyze the error budget depletion trend across error budget periods. Decide and act according to the error budget policy. |
It is worth noting that error budget analysis according to error budget policies is not possible with resource-based alerts, because the required historical data only becomes available once the SRE infrastructure tracks the error budget depletion over time.
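The bookkeeping behind the mid-term row of the table can be sketched with a simple calculation. The numbers are made up; the sketch assumes a request-based availability SLO and roughly uniform traffic across the error budget period, so that the budget burnt so far can be compared with the time elapsed.

```python
# Illustrative error budget depletion within the current error budget period.
# Assumes a request-based availability SLO and roughly uniform traffic.
slo_target = 0.995        # availability SLO
period_days = 28          # length of the error budget period
days_elapsed = 10

total_requests = 2_000_000   # requests observed so far in this period (made up)
failed_requests = 7_500      # requests counted against the SLO (made up)

# Project the full-period traffic and derive the total allowed failures.
projected_requests = total_requests * period_days / days_elapsed
allowed_failures = (1 - slo_target) * projected_requests

budget_burnt = failed_requests / allowed_failures
time_elapsed = days_elapsed / period_days

print(f"{budget_burnt:.0%} of the error budget burnt after {time_elapsed:.0%} of the period")
# Burning budget slower than time passes -> on track; faster -> act per the error budget policy.
```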
Our experience with the introduction of error budget policies in the teams is that it only works when a team has matured their SRE practice. Before that, the team is fully occupied with establishing the basic SRE foundations such as having SLOs reflecting customer experience well, a properly working on-call rotation, ongoing reactions to SLO breaches and up-to-date runbooks.
However, once the team has reached the necessary maturity, creating an error budget policy is well worth the effort! It tightens up the practice, brings up reliability prioritization discussions and puts the team on the path of data-driven continuous improvement in terms of reliability. The teams can set their own goals in a data-driven way, track progress and witness achievement over time.
Prioritizing reliability
Using the error budget depletion to drive backlog prioritization in terms of reliability is one of the advanced steps on the SRE adoption journey. This is the step that maximizes the effectiveness of the SRE introduction. With this step, unreliability is not only detected and reacted to on an ongoing basis, but also systematically eliminated by the development teams.
This only happens in a team if the product owner has been fully involved in the SLO definition process and has seen the value of keeping the services within the SLOs. Given that understanding of the value, the reliability prioritization becomes much easier.
A visual example of a dashboard to support the reliability prioritization process can be seen below:
The red cells in the dashboard represent error budget periods in which the error budget for the respective SLO was depleted before the end of the period. The product owners can immediately focus on the red areas to prioritize reliability work.
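The coloring rule behind such a dashboard can be expressed compactly: a cell turns red when an SLO’s error budget was fully depleted within its error budget period. The data below is made up for illustration.

```python
# Illustrative coloring rule for the reliability prioritization dashboard.
# Values are the fraction of error budget consumed per finished period (made up).
budget_burnt = {
    "report-service availability": [0.4, 1.3, 0.8, 1.1],
    "report-service latency p95":  [0.2, 0.3, 0.6, 0.5],
}

for slo, periods in budget_burnt.items():
    cells = ["RED" if burnt >= 1.0 else "green" for burnt in periods]
    print(f"{slo:32s} {cells}")
# RED cells point the product owner at the SLOs where reliability work needs prioritization.
```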
We gained experience with reliability prioritization using the error budget depletion in a couple of teams. Now, we are rolling out the topic in the entire organization as part of the overall software delivery continuous improvement program, using an indicators framework we described earlier.
Sustaining SRE
In addition to the team coaching described above, sustaining SRE requires cross-fertilization between teams. To that end, the SRE coaches need to establish structures for knowledge and practice sharing. Knowledge sharing can be facilitated well using lean coffee presentations and brown bag lunches. Practice sharing can be facilitated efficiently by setting up an SRE community of practice (CoP). This requires more involvement and dedicated leadership to keep the community alive over time. Done well, it brings interested people together to share tips, tricks, techniques, ways of working, problem-solving methods and ideas they have researched on the Internet.
To further advance the sustainability of SRE, the topic needs to be kept top of mind in the broader engineering community within the product delivery organization. This can be achieved through regular short articles about SRE practices on an organization-wide engineering blog.
Finally, the SRE infrastructure adoption needs to be measured on a periodic basis. In "Establishing a Scalable SRE Infrastructure Using Standardization and Short Feedback Loops" we showed how this can be done. Decisions on further investments in SRE infrastructure development should be supported by the data analysis of the SRE infrastructure usage patterns.
Acknowledgements
We would like to acknowledge many teams and people in various roles at the Siemens Healthineers teamplay digital health platform who enabled and contributed to driving the SRE adoption at teamplay.