Key Takeaways
- When analyzing outages, it's crucial to focus on "why" rather than "who": emphasize process improvement over blaming individuals, and assume that humans will make mistakes even with good intentions.
- When improving the Change Release Process, focus on preventing bugs from reaching production systems through local testing, code reviews, deployment pipeline automation, and pre-production alarms.
- Operate with the assumption that a bug will still reach the production environment, and plan for how to minimize the blast radius when that happens.
- When production systems are impacted, recovery time is critical to protect customer trust. The effects of a deployed change should be revertible within a few hours.
- Operating a system within a safe zone requires achieving equilibrium amidst management pressure to deliver, limited manpower resourcing, and safety campaigns to protect the system's health.
The recent CrowdStrike outage is a good reminder to consistently review and maintain a high bar for the processes followed to commit and roll out production changes. This is not a critique of CrowdStrike's outage or its processes, but rather a good opportunity to revisit best practices and look at a blueprint for analyzing outages, based on my experience reviewing multiple events on complex systems that handle millions of TPS (Transactions Per Second).
A dormant bug, which was apparently a missing null check, caused the CrowdStrike outage. Multiple social media posts censured the developer who may have made the change. Before we delve into other details, let me share one principle I have adopted while reviewing customer-impacting events: we focus not on "who" but on "why".
We start from the understanding that developers or operators are never at fault, and that participating in a blame game doesn't yield meaningful improvement to the system's state. If the root cause of a problem was human error, then our systems lack checks or a fail-safe mechanism, and we should focus on why an operator error was possible in the first place.
We should continuously aspire to build systems, processes, and tooling in a way that makes it increasingly harder, if not impossible, for operators to make errors that can have a broad impact on production systems.
A word of caution: never aspire to make your processes hard. Keep them simple while making it harder to do the wrong things (e.g., deleting a data store table). Here is a general blueprint for thinking about best practices that help prevent bugs before they reach production, or minimize the impact radius if a bug still reaches the production environment. The same process applies when you analyze an outage and decide how to improve your systems afterward.
You can generally divide the best practices and post-outage analysis into three categories:
- Complete prevention
- Minimize blast radius
- Fast detection and fast recovery
We will discuss these three steps in detail in subsequent sections.
1. Complete prevention
We start by asking the question: What can we do to catch the bug/issue before it reaches production systems? The answers include making testing easy in the local environment, a high bar on code reviews, unit and integration test coverage, deployment pipeline test automation, and pre-production alarms.
Supporting a fully featured sandbox (developer) environment: Developers need a fully featured and isolated environment to experiment and test rapidly with real data without impacting real users. While this is a North Star expectation for a sandbox (developer) environment, these environments are often unstable because many other developers are testing their changes in the same environment. One way to handle this is to have on-calls or support teams address these instability issues when they are not busy with high-priority production issues. This is, unfortunately, a constant struggle, but it's a necessary step to catch bugs/issues early on.
Have a high bar on the code review process: Setting a high bar on the code review process ensures that most bugs, if not all, are caught and fixed before they reach production systems. I have seen astute reviewers catch issues during code review that would impact parts of the service the change didn't even touch. Ideally, this shouldn't happen, but reality is different. Another popular approach is requiring two approvals before the code can be committed. A caution here is that adding too many reviewers can slow down the overall process, affecting agility, so these need to be balanced well.
If a code review is taking longer than a couple of weeks to close, it is quite possible that reviewers are trying to solve something bigger than the initial intent of the code change. The objective of a code review process is not to produce perfect code, but to improve the overall readability, understandability, and maintainability of the code. It is important to remember that your customers value the new changes/features that are delivered to them. Delivery is the currency of success. Here is a really good read to help you build the right mental model on this topic.
Have high unit test and integration test coverage: I have seen multiple events where insufficient unit or integration test coverage led to bugs being shipped to production. Even though many developers would confess privately that writing unit/integration tests is boring, it is an uncelebrated practice that prevents broader outages in the production system. Mandating that code reviews be approved only when new changes come with adequate unit and integration coverage is another process teams can follow.
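To make the last idea concrete, here is a minimal sketch of what such a coverage gate might look like as a CI step. It assumes a Cobertura-style coverage.xml report (as produced by coverage.py) and a hypothetical 80% threshold; both the file name and the threshold are illustrative, not a prescription.

```python
# coverage_gate.py - minimal sketch of a CI coverage gate (illustrative only).
# Assumes a Cobertura-style coverage.xml report produced by coverage.py.
import sys
import xml.etree.ElementTree as ET

THRESHOLD = 0.80  # hypothetical minimum line coverage for new changes


def main(report_path: str = "coverage.xml") -> int:
    root = ET.parse(report_path).getroot()
    line_rate = float(root.attrib["line-rate"])  # fraction of lines covered
    if line_rate < THRESHOLD:
        print(f"FAIL: line coverage {line_rate:.1%} is below {THRESHOLD:.0%}")
        return 1  # a non-zero exit code blocks the merge in CI
    print(f"OK: line coverage {line_rate:.1%}")
    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```

Wiring a step like this into the review or build pipeline turns "adequate coverage" from a social convention into an enforced check.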
Thorough Pre-Prod environment testing: Unlike the developer environment we discussed previously, which can have stability issues, a pre-prod environment is the closest replica of the production system. The changes in this environment have gone through code reviews, completed unit, integration, and manual testing, and are ready to go to the production system. Tools like SonarQube in the deployment pipeline can help automate some of these checks. Following an outside-in testing approach in this environment allows you to mimic what an external customer might experience while using your application. This approach requires you to set up an external process for your service that sends periodic requests and checks the results against a predetermined response for accuracy.
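As a sketch, such an outside-in check can be a small external process that periodically calls a known endpoint and compares the response with an expected value. The endpoint URL, expected payload, and interval below are hypothetical placeholders, and in practice you would emit a metric or alarm rather than print.

```python
# canary.py - sketch of an outside-in test process against a pre-prod endpoint.
# The URL and expected payload are hypothetical placeholders.
import json
import time
import urllib.request

ENDPOINT = "https://preprod.example.com/api/order-status"  # hypothetical
EXPECTED = {"status": "OK", "itemCount": 3}                # predetermined response
INTERVAL_SECONDS = 60


def check_once() -> bool:
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=5) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except Exception as exc:
        print(f"ALARM: request failed: {exc}")
        return False
    if body != EXPECTED:
        print(f"ALARM: unexpected response: {body}")
        return False
    return True


if __name__ == "__main__":
    while True:
        check_once()  # in practice, publish a success/failure metric here
        time.sleep(INTERVAL_SECONDS)
```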
One downside of a pre-prod environment is that it won't experience traffic similar to production systems, so it won't reveal scale-related problems in your code or service. Many teams therefore regularly run load tests, i.e., simulated traffic against the pre-prod endpoint as part of the deployment pipeline, to validate that all service metrics look healthy before proceeding to the production environment. Running this type of setup can quickly get expensive; however, it is easy to get around this with an on-demand setup that is torn down when not in use. This is usually the last step before real customers see the impact of a new code change. It is important to treat any availability or latency-related alarm in this environment with high priority and block the progress of the deployment until the investigation is completed. An ideal system will block pipeline promotion to production if any metrics start alarming.
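A rough sketch of that last idea, a promotion gate that holds the pipeline while any pre-prod alarm is firing, might look like the following. The `fetch_active_alarms` hook and the bake duration are hypothetical stand-ins for whatever monitoring system and policy you use.

```python
# promotion_gate.py - sketch of a pipeline step that blocks promotion to
# production while any pre-prod alarm is in the ALARM state.
import sys
import time


def fetch_active_alarms() -> list[str]:
    """Return names of alarms currently firing in pre-prod (stubbed here)."""
    return []  # replace with a real call to your monitoring system


def wait_for_healthy_bake(bake_minutes: int = 60, poll_seconds: int = 60) -> bool:
    """Poll alarms for a bake period; succeed only if nothing fired."""
    deadline = time.time() + bake_minutes * 60
    while time.time() < deadline:
        firing = fetch_active_alarms()
        if firing:
            print(f"Blocking promotion: alarms firing: {firing}")
            return False
        time.sleep(poll_seconds)
    return True  # full bake period elapsed with no alarms


if __name__ == "__main__":
    sys.exit(0 if wait_for_healthy_bake() else 1)
```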
The above steps help ensure that bugs get caught before deployment to the production environment. However, I have seen cases where even these steps fail due to edge cases that couldn't be tested for various reasons. The CrowdStrike event is just another example of the same (here are a couple more interesting reads on this topic: the AT&T network collapse from 1990 and the Zune bug caused by leap-year handling). We will move on to the next step assuming a bug still slips past the pre-production environment, and discuss how to handle these situations.
2. Minimize impact radius
If a bug still slips into production environments, how can we ensure the impact radius in production is as small as possible?
Note: The previous section covered preventing any production impact in the first place.
Testing Changes in a one-box environment: Before the code is deployed to the production environment, it is pushed to a one-box environment. A one-box environment usually comprises one or a few servers that handle a small part, usually 1%-5%, of actual production traffic. An important point to note here is that even though a one-box environment handles production traffic, its metrics should be separated from those of the production environment. This is necessary so that we can be alerted of any discrepancies in the metrics of the one-box environment and roll back the changes immediately.
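As an illustration, routing to a one-box fleet can be as simple as a weighted choice per request, with metrics published under a separate namespace so the one-box's health can be alarmed on independently. The fleet names, metric namespaces, and traffic share below are hypothetical.

```python
# onebox_routing.py - sketch of sending a small, fixed share of traffic to a
# one-box fleet while keeping its metrics in a separate namespace, so a
# regression there can alarm (and roll back) without being diluted by the
# main production fleet's metrics.
import random

ONE_BOX_SHARE = 0.02  # ~2% of production traffic, within the typical 1%-5% range


def pick_fleet() -> tuple[str, str]:
    """Return (fleet, metric_namespace) for a single incoming request."""
    if random.random() < ONE_BOX_SHARE:
        return "one-box-fleet", "service/one-box"
    return "prod-fleet", "service/prod"


if __name__ == "__main__":
    sample = [pick_fleet()[0] for _ in range(10_000)]
    share = sample.count("one-box-fleet") / len(sample)
    print(f"one-box share of sampled traffic: {share:.1%}")
```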
An ideal rollback here would be an automated one that gets triggered once alarms start firing. Ideally, we also want to ensure that any change made to production is reversible. However, as we saw during the CrowdStrike outage, not all changes are reversible. In such cases, a one-box deployment or the gradual rollout described later is a better option for controlling the impact radius. Shadow mode testing is another way to verify code changes before irreversible changes are pushed out. In shadow mode, you collect and analyze metrics or logs from the new code without changing the existing behavior. You can replace the existing behavior once those metrics or logs are fully verified.
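As a sketch, shadow mode can be implemented by running the new code path alongside the old one, recording mismatches, and still returning the old behavior to callers. The handler and metric names here are hypothetical, not any particular framework's API.

```python
# shadow.py - sketch of shadow-mode execution: the new code path runs alongside
# the old one, mismatches are recorded, and callers always get the old result.
import logging

logger = logging.getLogger("shadow")


def emit_metric(name: str, value: int) -> None:
    logger.info("metric %s=%d", name, value)  # stand-in for a real metrics client


def handle_request(request, old_handler, new_handler):
    old_result = old_handler(request)
    try:
        new_result = new_handler(request)             # shadow execution only
        emit_metric("shadow.mismatch", int(new_result != old_result))
    except Exception:
        logger.exception("shadow path failed")        # never impacts the caller
        emit_metric("shadow.error", 1)
    return old_result                                 # existing behavior unchanged
```

Once the mismatch and error metrics stay at zero for long enough, the new handler can be promoted to serve real responses.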
Running outside-in testing processes against a one-box environment: As described previously, an outside-in testing approach requires you to build an external process that continuously mimics customer traffic to your system and verifies its responses. This is critical since measuring server-side metrics alone may not reveal issues that end customers are experiencing, such as network latency. Measuring and alarming on these metrics helps detect problems before customers contact your support team. (After all, it is a "sin" if customers tell us about a problem in our system and we don't know about it.) The expectation is that we proactively detect issues in our system and actively work on a fix before customers start reaching out to our support team.
Gradual roll-out of changes to production: After a one-box environment, code changes should be rolled out in phases that impact only a tiny proportion of the overall traffic. There are multiple ways to do this. We can use deployment strategies like availability zone (AZ) aware deployment, i.e., deployments proceed one AZ at a time, assuming the service runs in multiple AZs, which is one of the best practices for guaranteeing high availability. This requires metrics dimensioned by AZ so we can catch any issues via our alarms and trigger an automated rollback, as discussed in the previous section. If changes can't be rolled back, it is better to find a way to apply them to only a few customers with whom you are directly working or have previously identified, so you have a better incident response ready. This should be a rare situation where you are working with a handful of customers. However, it is still a better approach than triggering a broad outage that affects all of your customers by pushing changes whose impact you have low confidence in.
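A rough sketch of what AZ-by-AZ deployment with an automated rollback check might look like follows. The `deploy_to_az`, `az_metrics_healthy`, and `rollback_az` hooks are hypothetical placeholders for your deployment tooling and AZ-scoped metrics.

```python
# az_rollout.py - sketch of an AZ-aware rollout: deploy one availability zone
# at a time, watch AZ-scoped metrics during a bake period, and roll back every
# touched AZ at the first sign of trouble. All three hooks are hypothetical.
import time

AZS = ["az-1", "az-2", "az-3"]
BAKE_SECONDS = 30 * 60  # illustrative bake time per AZ


def deploy_to_az(az: str, version: str) -> None:
    print(f"deploying {version} to {az}")          # stand-in for real tooling


def az_metrics_healthy(az: str) -> bool:
    return True                                     # replace with a real metric check


def rollback_az(az: str, previous_version: str) -> None:
    print(f"rolling back {az} to {previous_version}")


def rollout(version: str, previous_version: str) -> bool:
    completed: list[str] = []
    for az in AZS:
        deploy_to_az(az, version)
        deadline = time.time() + BAKE_SECONDS
        while time.time() < deadline:
            if not az_metrics_healthy(az):
                # Roll back the current AZ and every AZ already completed.
                for touched in [az, *completed]:
                    rollback_az(touched, previous_version)
                return False
            time.sleep(60)
        completed.append(az)
    return True
```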
As discussed in the previous section, if there is a way to perform shadow testing in production, it is a compelling way to detect regressions before impacting customers. Another approach, similar to A/B testing, is to enable the effect of the new change in increments of 1%, 5%, 10%, 25%, 50%, and 100% of customers. This gives better control over the impact of faulty changes. If a service handles millions of TPS, like some of the services I have worked on, it is wise to look into cellular architecture.
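One common way to implement such an incremental ramp is to hash a stable customer identifier into a bucket and compare it against the current rollout percentage, so the same customers stay enabled as the percentage grows. This is a generic sketch, not any particular feature-flag product; the salt string is hypothetical.

```python
# ramp.py - sketch of a deterministic percentage ramp: each customer hashes to
# a stable bucket in [0, 100), and the new change is enabled when the bucket
# falls under the current rollout percentage (1 -> 5 -> 10 -> 25 -> 50 -> 100).
import hashlib

ROLLOUT_STEPS = [1, 5, 10, 25, 50, 100]


def bucket(customer_id: str, salt: str = "new-change") -> int:
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(customer_id: str, rollout_percent: int) -> bool:
    return bucket(customer_id) < rollout_percent


if __name__ == "__main__":
    # At 5%, a small, stable subset of customers sees the change; that subset
    # only grows (never reshuffles) as the percentage is raised.
    enabled = [c for c in (f"cust-{i}" for i in range(1000)) if is_enabled(c, 5)]
    print(f"{len(enabled)} of 1000 sample customers enabled at 5%")
```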
We won't be able to dive into the details of cellular architecture here. However, one simple way to think about cells is that instead of running one production environment in a region, you run multiple environments, with traffic from pre-allocated sets of customers routed to different cells. You can also partition traffic using other keys of your system, e.g., customer account IDs. I wouldn't encourage implementing cells unless your systems handle upward of hundreds of thousands of TPS; cells bring in different types of complexity, and the additional operational overhead may not be worth it for all systems. A downside of gradual rollout with cells is that it increases overall deployment time, since a region previously served by one deployment is broken into multiple cellular deployments, which usually proceed sequentially, cell by cell.
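To make the idea a little more concrete, here is a minimal sketch of assigning a customer account ID to one of a fixed set of cells by hashing. Real cellular architectures involve much more (routing layers, explicit allocation tables, cell migration, capacity management); the cell names below are hypothetical.

```python
# cells.py - sketch of static cell assignment: a customer account ID hashes to
# one of N pre-provisioned cells, and deployments then proceed cell by cell.
import hashlib

CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]  # hypothetical cell names


def cell_for_account(account_id: str) -> str:
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]


if __name__ == "__main__":
    # A faulty deployment rolled out to cell-2 only affects the customers whose
    # accounts map to cell-2; the other cells keep serving normally.
    for account in ("acct-1001", "acct-1002", "acct-1003"):
        print(account, "->", cell_for_account(account))
```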
3. Fast detection and fast recovery
Once an unintended bug is deployed to production, we want the ability to detect the problem as early as possible and follow that up with actions to revert the system into a stable state.
Build granular metrics and alarming: Alarming on overall service health metrics like availability and latency is a good place to start, but as your system evolves, it is important to regularly evaluate whether you need more granular metrics to capture customer experience accurately. For instance, if a few large customers dominate a system's overall traffic while other customers contribute only a tiny proportion of it, the P99 or P99.9 availability metrics may not capture the experience of those smaller customers.
In such cases, you want to focus on building a custom metric, like a Per Customer Level Availability (PCA) metric, that allows you to measure the experience of all customers irrespective of whether they generate a lot of traffic or a small proportion of your overall traffic. You will need a way to capture every customer's experience and then aggregate this data across all customers to identify how many customers were successfully served within the accepted margin of faults. This does not mean you stop tracking overall P99 and higher availability or latency metrics; rather, you use the PCA metric in conjunction with service-level metrics to get a better picture of customer experience.
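As an illustration, a PCA-style aggregation can first compute each customer's own success rate over a window and then report what fraction of customers met a target availability, regardless of how much traffic each one sends. The target value and the example data below are made up.

```python
# pca.py - sketch of a Per Customer Level Availability (PCA) style aggregation:
# compute each customer's success rate over a window, then report the fraction
# of customers whose availability met a target, regardless of traffic volume.
from collections import defaultdict

TARGET_AVAILABILITY = 0.999  # hypothetical per-customer availability target


def pca(events: list[tuple[str, bool]]) -> float:
    """events: (customer_id, request_succeeded) pairs for one time window."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [successes, total]
    for customer_id, ok in events:
        totals[customer_id][0] += int(ok)
        totals[customer_id][1] += 1
    healthy = sum(1 for s, t in totals.values() if s / t >= TARGET_AVAILABILITY)
    return healthy / len(totals)


if __name__ == "__main__":
    # Made-up data: a big customer with perfect availability masks a small
    # customer whose requests are all failing. Overall availability is ~99.95%,
    # yet PCA reports that only 50% of customers had a healthy experience.
    window = [("big-customer", True)] * 10_000 + [("small-customer", False)] * 5
    print(f"PCA = {pca(window):.2%}")
```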
I recall a situation where overall metrics (e.g., P99, P99.9) were acceptable, but a small group of customers who used a less widely adopted feature had consistent availability issues. In that situation, the overall availability metrics were well within the SLA defined by the owners. One way to catch such issues proactively is to add a separate metric dimension for the feature to cover those customers' experience. We discussed outside-in testing for a one-box environment, but these external test processes should run against production fleets too, to capture the end-customer experience. I recall another situation where service metrics looked fine, but the team caught an issue that was only visible from the client side due to a network-related problem we had no visibility into from service metrics. As discussed previously, we never want to be caught in a situation where our customers alert us to an issue that we don't know about and are not working to fix.
Improvements to root-causing process: Another area to analyze, in addition to metrics, is how quickly developers can root-cause an issue once a system alarm fires, which usually starts with analyzing logs and relevant metrics. If the team has recently experienced an outage, it is worthwhile to dig into whether the time taken to find the root cause could be further reduced if a similar issue happens in the future. This could mean improving logging or surfacing better metrics that allow on-calls to pinpoint where the error is. This ensures that the response to future outages benefits from learnings identified in the past. A team-level knowledge base, runbook, or "first-aid kit" used to share broader learnings can help both the direct team and other teams in the organization.
Automated Rollback and Rollback speed: Our deployment pipelines should ensure rollbacks happen automatically when alarms fire. It is also essential that an entire region can roll back a deployed change to the previous version within a few hours, i.e., under 3-6 hours. The longer it takes to roll back the changes, the broader the customer impact will be. For cases where a rollback is not possible, which should be rare, a custom operator response should be evaluated with senior leaders and followed if issues are encountered in production. This, again, should be a one-off situation and not the norm.
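A small sketch of an alarm-driven rollback trigger that also tracks the rollback duration against a time budget is shown below. The `fetch_active_alarms` and `trigger_region_rollback` hooks, and the budget value, are hypothetical placeholders for your own monitoring and deployment tooling.

```python
# auto_rollback.py - sketch of an alarm-driven rollback: when production alarms
# fire after a deployment, trigger a region-wide rollback automatically and flag
# it if the rollback exceeds the agreed time budget. Hooks are hypothetical.
import time

ROLLBACK_BUDGET_SECONDS = 3 * 60 * 60  # lower end of an illustrative 3-6 hour budget


def fetch_active_alarms() -> list[str]:
    return []  # replace with a real call to your monitoring system


def trigger_region_rollback(region: str, previous_version: str) -> None:
    print(f"rolling back {region} to {previous_version}")  # stand-in for real tooling


def watch(region: str, previous_version: str, poll_seconds: int = 60) -> None:
    """Wait for the first firing alarm, roll back, and report budget overruns."""
    while not fetch_active_alarms():
        time.sleep(poll_seconds)
    started = time.time()
    trigger_region_rollback(region, previous_version)
    elapsed = time.time() - started
    if elapsed > ROLLBACK_BUDGET_SECONDS:
        print(f"WARNING: rollback took {elapsed / 3600:.1f}h, over the budget")
```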
Improvements to Restoring State of Systems after an outage: A rollback may not always be enough to restore the state of the system. In such cases, it is always useful to ask whether developers need a better way or process to restore the system to its previous safe state, and what action items could improve the speed of that restoration.
Closing Thoughts
In summary, the most essential practice is to ensure there is a culture of continuous evaluation, learning, and improvement. This continuous evaluation is necessary because there will always be a healthy struggle between the forces of fast delivery for customers, ensuring the safety of the system, and the limited manpower available to handle the work. This is well explained by Rasmussen in his paper on Risk Management in a Dynamic Society. A simple explanation of the paper is that there is a safe zone in which to operate a system.
This zone is constantly being pushed by the forces of management pressure (the boundary of economic failure), limited manpower (the boundary of unacceptable workload), and safety campaigns that keep the system safe (pushing away from the boundary of failure). Once the safe zone drifts toward the boundary of failure, we see an outage, and we take action to bring that "safe zone to operate" bubble back toward the center. So, while it is easy to tease a team or company that experiences an outage, it is worthwhile to look in the mirror and verify whether your own "safe zone to operate" bubble is too close to the boundary of failure.
Diagram credit to Rasmussen's paper on Risk Management in a Dynamic Society