Key Takeaways
- Cultural and process changes, rather than changes in tooling alone, are necessary for teams to sustainably manage services.
- We can reduce pager noise with Service Level Objectives that measure customer experience and with alerts only for crucial customer-impacting failures.
- Observability, through collecting and querying tracing and events, allows teams to debug and understand their complex systems.
- Teams must be able to safely collaborate and discuss risks to work effectively.
- Services run more smoothly when quantitative risk analysis allows teams to prioritize fixes.
[Site reliability engineering (SRE)] and development teams need to be willing to address issues head on and identify points of tension that need resetting. Part of SRE’s job is to help maintain production excellence in the face of changing business needs.— Sheerin et al., The Site Reliability Workbook
Production ownership, putting teams who develop components or services on call for those services, is a best practice for ensuring high-quality software and tightening development feedback loops. Done well, ownership can provide engineers with the power and autonomy that feed our natural desire for agency and high-impact work and can deliver a better experience for users.
But in practice, teams often wind up on call without proper training and safeguards for their well-being. Suddenly changing the responsibilities of a team worsens service reliability, demoralizes the team, and creates incentives to shirk responsibilities. The team members will lack critical skills, struggle to keep up, and not have the time to learn what they're missing. The problem is one of responsibility without empowerment. Despite the instinct to spend money to solve the problem, purchasing more tooling cannot fill the gaps in our teams' capabilities or structures. Tools can only help us automate existing workflows rather than teach new skills or adapt processes to solve current problems.
In order to succeed at production ownership, a team needs a roadmap for developing the necessary skills to run production systems. We don't just need production ownership; we also need production excellence. Production excellence is a teachable set of skills that teams can use to adapt to changing circumstances with confidence. It requires changes to people, culture, and process rather than only tooling.
Silo-ing operations doesn't work
Development teams that outsource their operations instead of practicing production ownership struggle with misalignment of incentives around operational hygiene. Such teams prioritize faster feature development to the detriment of operability. The need for human intervention grows faster than the operators can meet it. Operationally focused teams, who must keep the system running and scaling up, are left to pick up the mess no matter the human cost in distraction, late nights, and overtime shifts. A service that burns out all the humans assigned to operate it cannot remain functioning for long.
When dev and ops throw code at each other over a wall, they lose the context of both developer and operator needs. Releases require manual testing, provisioning requires human action, failures require manual investigation, and complex outages take longer to resolve due to escalations among multiple teams. Teams wind up either working through a heap of manual, tedious, error-prone actions or developing incorrect automation that exacerbates risk, such as restarting processes without examining why they stopped. This kind of repetitive break/fix work, or toil, scales with the size of the service and demands the team's blood, sweat, and tears. In organizations that use a dedicated SRE team structure as well as in organizations with a more traditional sysadmin team, teams can quickly become operationally overloaded.
Over time, operability challenges will also slow down product development, as disconnected knowledge and tooling (between dev and ops) mean that the separate teams cannot easily understand and repair failures in production or even agree on what is more important to fix. Production ownership makes sense: close the feedback loop so that development teams feel the pain of their operability decisions, and bring operational expertise onto the team rather than silo-ing it.
Handing out pagers doesn't work
Putting development teams on call requires preparation to make them successful. Technology companies split the roles of operations and development decades ago, aiming to find efficiencies through specialization. The modern process of forcing those two functions back together to improve agility cannot happen overnight. The operational tasks that the combined DevOps team now needs to do often feel distinct from the skills and instincts that dev teams are used to. The volume of alerts and manual interventions can be overwhelming, which reinforces the perspective that ops work isn't valuable. And because reducing the volume of alerts and manual work requires a blend of both dev and ops skills, many developers resist attempts to introduce production ownership onto a team.
Many commercial tools claim to reduce the inherent friction of DevOps or production ownership but instead further scatter the sources of truth and increase the noise in the system. Unless there is a systematic plan to educate and involve everyone in production, every non-trivial issue will escalate to one of the team’s very few experts in production. Even if those experts were inclined to share rather than hoard information, the constant interruptions prevent them from writing thorough documentation. Continuous integration and delivery metrics are a trap if they focus solely on the rate at which software ships, rather than the quality of the software.
Time to detect and repair incidents will remain long if teams cannot understand how the system is failing in production. In order to confidently deploy software, teams must ensure that it does not break upon deployment to production. Otherwise, in an environment with a high change failure rate, teams' hard work gets repeatedly rolled back or fixed forward under stressful conditions. While it is fairer to distribute existing operational pain among everyone on a team, true relief from operational overload can only come from reducing the numbers of pager alerts and break/fix tickets.
Specialization in systems engineering has value, takes time to develop, and has historically been undervalued. Product developers don't necessarily have a good framework for deciding which automation to write or which bugs to fix for the greatest reduction in noise. And even when there is time to address toil and other forms of operational debt, doing so requires planning in advance. By analogy, toil is the operational interest payment on technical debt; paying the interest alone will not reduce the outstanding principal. Production ownership is a better strategy than having a disempowered operations team, but by itself it leaves the level of production pain unchanged: blending team roles spreads that pain more fairly, whereas better solutions reduce it. Both customers and development teams benefit when we pay down technical and operational debt.
A better approach: Production excellence
Instead of trying to solve the problem of production ownership with tools or forced team integrations, a different people-centric approach is required: that of production excellence.
Production excellence is a set of skills and practices that allow teams to be confident in their ownership of production. Production-excellence skills are often found among SRE teams or individuals with the SRE title, but it ought not be solely their domain. Closing the feedback loop on production ownership requires us to spread these skills across everyone on our teams. Under production ownership, operations become everyone's responsibility rather than “someone else's problem”. Every team member needs to have a basic fluency in operations and production excellence even if it's not their full-time focus. And teams need support when cultivating those skills and need to feel rewarded for them.
There are four key elements to making a team and the service it supports perform predictably in the long term. First, teams must agree on which events improve user satisfaction and eliminate extraneous alerts for those that do not. Second, they must improve their ability to explore production health, starting with symptoms of user pain rather than potential causes. Third, they must ensure that they collaborate and share knowledge with each other and that they can train new members. Finally, they need a framework for prioritizing which risks to remediate so that a system that is reliable enough for customer needs can also run sustainably.
1. Measure what matters
Making on-call periods tolerable starts with reducing noise and improving the correlation between alerts and real users' problems. Service-level objectives (SLOs), a cornerstone of SRE practice, help create a feedback loop for keeping systems working well enough to meet long-term user expectations. The core idea of SLOs is that failures are normal and that we need to define an acceptable level of failure instead of wasting development agility in pursuit of perfection. As Charity Majors says, "Your system exists in a continuous state of partial degradation. Right now. There are so many, many things wrong and broken that you don't know about. Yet." Instead of sending alerts for the unactionable noise in each system, we should address broader patterns of failure that risk user unhappiness.
Therefore, we need to quantify each user's expectations through the quality of experience delivered each time they interact with our systems. And we need to measure in aggregate how satisfied all our users are. We can begin to draft such a service-level indicator (SLI) by measuring latency and error rates, logging factors such as abandonment rates, and obtaining direct feedback such as user interviews.
Ideally, our product managers and customer-success teams will know what workflows users most care about, and what rough thresholds of latency or error rate induce support calls. By transforming those guidelines into machine-categorizable thresholds (e.g., “this interaction is good if it completed within 300 milliseconds and was served with an HTTP 200 code,” or “this record of data is sufficiently fresh if updated in the past 24 hours”), it becomes possible to analyze real-time and historical performance of the system as a whole. For example, we might set a target for our system of 99.9% of eligible events being successful over a three-month period: an error budget of one in 1000 events within each period.
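To make those thresholds concrete, here is a minimal sketch in Python; the event fields, helper names, and figures are illustrative (drawn from the example above) rather than any particular monitoring product's API.

```python
from dataclasses import dataclass

# Illustrative figures from the example above; real thresholds come from product
# and customer-success input, not from engineering guesses.
LATENCY_THRESHOLD_MS = 300
SLO_TARGET = 0.999  # 99.9% of eligible events succeed over the window


@dataclass
class Event:
    latency_ms: float
    status_code: int


def is_good(event: Event) -> bool:
    """An interaction is good if it completed quickly and returned HTTP 200."""
    return event.latency_ms <= LATENCY_THRESHOLD_MS and event.status_code == 200


def sli(events: list[Event]) -> float:
    """Service-level indicator: the fraction of eligible events that were good."""
    if not events:
        return 1.0
    return sum(is_good(e) for e in events) / len(events)


def error_budget_remaining(events: list[Event]) -> float:
    """Share of the window's error budget still unspent (1.0 = untouched, <0 = blown)."""
    budget = 1.0 - SLO_TARGET          # one bad event in 1,000
    burned = 1.0 - sli(events)         # observed bad fraction
    return 1.0 - burned / budget
```

Evaluating the same functions over historical windows shows whether a proposed target would have been met in practice before anyone commits to it.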
By setting targets for performance and establishing a budget for expected errors over spans of time, it becomes possible to adjust systems to ignore small self-resolving blips and to send alerts only if the system significantly exceeds nominal bounds. If and only if it appears that the ongoing rate of errors will push us over the error budget in the next few hours will the system page a human. We can also experiment with our engineering priorities to see whether there's spare tolerance and more speed available, or refocus on reliability work if we're exceeding acceptable levels of failure. If a team chronically exhausts its error budget and users are complaining, perhaps the team needs to prioritize infrastructural fixes rather than new features.
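A hedged sketch of that alerting rule, continuing the illustrative figures above: page a human only when the current pace of failures would spend what remains of the error budget within a few hours. Real policies typically combine several lookback windows, but the core arithmetic is this simple.

```python
# Illustrative burn-rate check; the paging threshold is a judgment call each team tunes.
PAGE_IF_BUDGET_GONE_WITHIN_HOURS = 4


def budget_events_remaining(expected_events_in_window: float,
                            bad_events_so_far: float,
                            slo_target: float = 0.999) -> float:
    """How many more failures this window's error budget can still absorb."""
    return (1.0 - slo_target) * expected_events_in_window - bad_events_so_far


def should_page(recent_bad_events_per_minute: float,
                budget_remaining: float) -> bool:
    """Page only if the ongoing failure rate would exhaust the remaining budget soon;
    small self-resolving blips never reach a human."""
    if recent_bad_events_per_minute <= 0:
        return False
    minutes_until_budget_gone = budget_remaining / recent_bad_events_per_minute
    return minutes_until_budget_gone < PAGE_IF_BUDGET_GONE_WITHIN_HOURS * 60
```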
A written team commitment to defend the SLO if it is in danger provides institutional support for reliability work. An SLO that strikes the right balance should leave a user shrugging off a rare error and retrying later rather than calling support and cancelling their contract because the service is constantly down.
SLOs can shift as user expectations change and as different tradeoffs between velocity and reliability become possible. But it is far better to focus on any user-driven SLO, no matter how crude, than to be deafened by far too many metrics with no correlation to user impact.
2. Debug with observability
Having eliminated lower-level alerts, what should we do if our system determines that our service is not functioning according to its SLOs? We need the ability to debug and drill down to understand which subsets of traffic are experiencing or causing problems, to hypothesize about how to mitigate and resolve them, and then to carry out the experiment and verify whether things have returned to normal.
The ability to debug is closely related to the idea of observability, of having our system produce sufficient telemetry to allow us to understand its internal state without needing to disturb it or modify its code. With sufficient knowledge of our system and its performance, we hope to test hypotheses to explain and resolve the variance in outcomes for users experiencing problems. What's different about the requests now taking 500 milliseconds versus the ones taking 250 milliseconds? Can we close the gap by identifying and resolving the performance problem that causes a subset of requests to take longer?
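One first-pass way to answer that question is to compare the slow requests against the traffic as a whole and see which attribute values are over-represented among them. The sketch below assumes each request was captured as a dictionary of metadata with hypothetical field names such as duration_ms and region.

```python
from collections import Counter


def over_represented(events: list[dict], attribute: str, slow_ms: float = 500.0):
    """Rank values of one attribute by how much more common they are among slow
    requests than in the overall traffic."""
    slow = Counter(e[attribute] for e in events if e["duration_ms"] >= slow_ms)
    everyone = Counter(e[attribute] for e in events)
    total_slow, total = sum(slow.values()), sum(everyone.values())
    ratios = {
        value: (slow[value] / total_slow) / (count / total)
        for value, count in everyone.items()
        if slow[value] > 0
    }
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


# For example, over_represented(recent_events, "region") might reveal that nearly
# all of the 500-millisecond requests come from a single region or backend version.
```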
It is insufficient to merely measure and gather all the data. We must be able to examine the data and its context in new ways and use it to diagnose problems. We need observability because we cannot debug systems from only what we've thought in advance to measure. Complex systems failures often arise from new combinations of causes rather than from a finite set of fixed causes. As a corollary, we often cannot reproduce production failures in smaller-scale staging environments and cannot predict them in advance.
Regardless of our approach to observability, it is important to have sufficient information about how requests flow through our distributed systems. Each place where a request traverses a microservice is a potential failure point that we must monitor. This can take the form of wide events with metadata, distributed traces, metrics, or logs. Each approach makes tradeoffs in terms of tooling support, granularity of data, flexibility, and context. We often need multiple tools that work well together to provide the full set of required capabilities. And problems will stymie even the most skilled problem solver if we cannot follow the trail of clues down to specific instances of failing user workflows.
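As an illustration of the first of those forms, a wide event can be as simple as a bag of key/value context accumulated throughout a request and emitted once at the end. The sketch below uses hypothetical field names and writes JSON to stdout; in practice a tracing or structured-logging library would handle export and context propagation.

```python
import json
import sys
import time


class RequestEvent:
    """Accumulate context for one request and emit a single wide event at the end."""

    def __init__(self, name: str):
        self.fields = {"name": name}
        self.start = time.monotonic()

    def add(self, **fields):
        """Attach any metadata that might later explain an outlier."""
        self.fields.update(fields)

    def emit(self, stream=sys.stdout):
        self.fields["duration_ms"] = (time.monotonic() - self.start) * 1000
        json.dump(self.fields, stream)
        stream.write("\n")


# Inside a request handler:
event = RequestEvent("checkout")
event.add(customer_id="example-customer", region="us-east-1", cart_items=3, cache_hit=False)
# ... handle the request ...
event.emit()
```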
By recording telemetry with sufficient flexibility and fidelity, we don't need a service to remain broken to analyze its behavior. We can restore service to users as quickly as possible, confident that we can reproduce and debug the problem later. This lets us automatically mitigate entire classes of problems, such as by draining bad regions or reverting bad rollouts, and have a human investigate what went wrong during normal working hours. The system's operability improves when humans don't need to fully debug and resolve each failure in real time.
3. Collaboration and building skills
Even a perfect set of SLOs and instrumentation for observability do not necessarily result in a sustainable system. People are required to debug and run systems. Nobody is born knowing how to debug, so every engineer must learn it at some point. And as systems and techniques evolve, everyone needs to continually update their knowledge with new insights.
It's important to develop training programs, to write thorough, up-to-date documentation on common starting points for diagnosis, and to hold blameless retrospectives. Service ownership doesn't mean selfishness and silos; it means sharing expertise over time. Above all, teams must foster an atmosphere of psychological safety in which team members and those outside the immediate team can ask questions. This empowers individuals to further their understanding and brings new perspectives to bear on problems.
Cross-team interactions can encompass not just upstream/downstream relations but also job roles such as customer support, product development, and SRE. If a customer-support engineer was yelled at the last time they raised a false alarm, they will feel they can't safely escalate problems and the detection and resolution of issues will suffer.
It's critical to the learning process that everyone has some exposure to the ways things fail in production. However, many approaches to production ownership are inflexibly dogmatic about putting every developer on call 24 hours a day for seven days in a row. Developers object to participating in processes that are stressful and not tailored to their lives. Fear of having their job description shift under them, or of being perceived as shirking duties if they do not comply, harms team morale.
Production needs to be owned by engineers of all kinds, rather than just those fitting a specific mold. Involvement in production does not have to be a binary choice of on call or not on call. Involvement in production can take many different forms, such as triaging support tickets for a person who cannot cope with the stress of a pager or being on call only during business hours for those who have family responsibilities at night. Other examples could be a devout Jewish person not being on call during the Sabbath or evening on-call time for a manager who doesn't want pages to interrupt their business-hour one-to-ones. Together, a team can collaborate to fairly share the load of production ownership based on individuals’ contexts and needs.
4. Risk analysis
The final element of production excellence is the ability to anticipate and address structural problems that pose a risk to our system's performance. We can identify performance bottlenecks and classes of potential failure before they become crises. Being proactive means knowing how to replace hard dependencies with soft dependencies before they fail. Likewise, we should not wait for users to complain that our system is too slow before optimizing critical paths. We still need a portion of our error budget to deal with novel system failures, but that does not excuse us from addressing known risks in the system.
The trick is to identify which risks are the most critical and to make the case for fixing them. In the 1960s, the American Apollo moon-landing program collated a list of known risks and worked to ensure that these risks cumulatively fell within the program's safety parameters. In the modern era, Google SRE teams espouse the examination and prioritization of risks. They propose a framework of quantitative risk analysis that values each risk by the product of its time to detect and repair, the severity of its impact, and its frequency or likelihood.
We may not always be able to control the frequency of events but we can shorten our response times, soften the severity of impact, or reduce the number of affected users. For instance, we may choose canary deployments or feature flags to reduce the impact of bad changes on users and speed up their reversion if necessary. The success of an improvement is its reduction in bad minutes for the average user. A canary deployment strategy might shrink deployment-related outages from affecting 100% of users for two hours once a month (120 bad minutes per user per month) to affecting 5% of users for 30 minutes once a month (1.5 bad minutes per user per month).
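The arithmetic behind that comparison is easy to capture. The sketch below mirrors the illustrative figures above, treating bad minutes per user per month as the product of frequency, duration, and the fraction of users affected.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    incidents_per_month: float   # frequency or likelihood
    duration_minutes: float      # time to detect plus time to repair
    fraction_of_users: float     # share of users affected (severity)

    def bad_minutes_per_user_per_month(self) -> float:
        return (self.incidents_per_month
                * self.duration_minutes
                * self.fraction_of_users)


before = Risk("full rollout of a bad change", 1, 120, 1.00)
after = Risk("canaried rollout of a bad change", 1, 30, 0.05)

print(before.bad_minutes_per_user_per_month())  # 120.0
print(after.bad_minutes_per_user_per_month())   # 1.5
```

Reckoned as time, the 99.9% target over a three-month window from the earlier example allows roughly 130 bad minutes per user, which shows how a single unmitigated risk of this size could consume most of the budget on its own.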
Any single risk that could spend a significant chunk of our error budget is something to address swiftly, since it could cause the system to fail catastrophically and push us dramatically over budget. We can address the remaining multitude of smaller sources of risk later if we are still exceeding our error budget. And with a quantitative analysis enumerating the fraction of the error budget that each risk will consume, it becomes easier to argue for relative prioritizations and for addressing existential risks instead of developing new features.
Risk analysis of individual risk conditions, however, often fails to find larger themes common to all the risks. Lack of measurement, lack of observability, and lack of collaboration (the other three key aspects of production excellence) represent systematic risks that worsen all other risks by making outages longer and harder to discover.
Production support need not be painful
Successful long-term approaches to production ownership and DevOps require cultural change in the form of production excellence. Teams are more sustainable if they have well-defined measurements of reliability, the capability to debug new problems, a culture that fosters spreading knowledge, and a proactive approach to mitigating risk. While tools can play a part in supporting a reliable system, culture and people are the most important investment.
Without mature observability and collaboration practices, a system will crumble under the weight of technical debt and falter no matter how many people and how much money are ground into the gears of the machine. The leadership of engineering teams must be responsible for creating team structures and technical systems that can sustainably serve user needs and the health of the business. The structures of production ownership and production excellence help modern development teams to succeed. We cannot expect disempowered operations teams to succeed on their own.
About the Author
Liz Fong-Jones is a developer advocate, labor and ethics organizer, and site-reliability engineer with more than 15 years of experience. She is an advocate at Honeycomb.io for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.