Transcript
Tucker: When I started at Netflix, my team was responsible for a library which produced the stream metadata required for playback. We call this the playback manifest. That library got consumed by three different services, each of which was responsible for the full playback lifecycle for a different subset of devices. What that meant for us is that if we needed to roll out a feature, we would build that feature, release our library, and then we would have to wait for that library to be consumed by all of the services before we could flip on the new feature flag. This made it difficult to iterate quickly.
New Off-Box Service
In order to gain the velocity and freedom that we really wanted, we decided that we would create a new microservice and move the traffic from our library to this off-box call. I was on point for doing that. We had a pretty solid contract in that library and a lot of good test coverage. We decided that we could create a new remote version of that library that maintained the same API, which allowed us to gradually dial traffic from one to the other without affecting the upstream callers. As part of this project, I also added the ability to route just a single user device. That made it possible for me to go into the device lab, pick a variety of devices, like one mobile device, one TV, and one web browser, and test with them to make sure everything was looking as expected.
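To make the dial pattern concrete, here is a minimal sketch in Java of what a library facade like that might look like. The interface, class names, and property mechanism are hypothetical, not the actual Netflix code; the idea is simply that both implementations share one API, a runtime percentage decides which one serves each call, and an allow-list routes individual test devices to the remote path.

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: same API, two implementations, a runtime dial choosing between them.
interface ManifestProvider {
    String getManifest(String viewableId, String deviceId);
}

class DialedManifestProvider implements ManifestProvider {
    private final ManifestProvider local;          // original in-process library
    private final ManifestProvider remote;         // client for the new off-box service
    private final Set<String> alwaysRemoteDevices; // e.g. device-lab devices routed for testing
    private volatile int remotePercent;            // 0-100, adjustable at runtime

    DialedManifestProvider(ManifestProvider local, ManifestProvider remote,
                           Set<String> alwaysRemoteDevices, int remotePercent) {
        this.local = local;
        this.remote = remote;
        this.alwaysRemoteDevices = alwaysRemoteDevices;
        this.remotePercent = remotePercent;
    }

    void setRemotePercent(int percent) { this.remotePercent = percent; }

    @Override
    public String getManifest(String viewableId, String deviceId) {
        if (alwaysRemoteDevices.contains(deviceId)) {
            return remote.getManifest(viewableId, deviceId);   // single-device override for the device lab
        }
        boolean useRemote = ThreadLocalRandom.current().nextInt(100) < remotePercent;
        return useRemote ? remote.getManifest(viewableId, deviceId)
                         : local.getManifest(viewableId, deviceId);
    }
}
```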
Once I was happy with that, I started dialing traffic over in production. While I was doing that, I would monitor a bunch of dashboards, but primarily this one, Global SPS. SPS stands for playback starts per second. This is our key performance indicator of health for the product. Essentially, it represents whether or not users are getting to watch their favorite shows or movies. I would dial and then I would watch, and eventually, I would see the metric deviate from what was expected, indicating a problem with my change. I would turn the dial back to zero and debug the problem: dig through dashboards, dig through logs, figure out what the fix is, deploy it, and wait for it to roll out everywhere. Then I could retest in the device lab, then I could start dialing traffic and monitoring. This cycle would go on until I actually got all the traffic over into production and kept it there.
Testing On Many Different Platforms
This was time consuming and tedious, because Netflix runs on thousands of device types, each of which has its own slightly different variation of the user interface and SDK. I was finding one or two problems at a time. I found little corner cases where the client interpretation of what we returned was slightly different than expected. For instance, we returned the list of audio languages in a different order, and devices broke because they were expecting the languages to be in the same order we had always returned them in, rather than relying on the sort order on the objects themselves. This whole process really made me think that there had to be a better way.
Background
My name is Haley Tucker. I'm a Senior Software Engineer at Netflix. When I started, I was on the playback services team. Now I'm on the resilience engineering team, where I'm responsible for building tooling that enables service owners to experiment on their production services, to understand and change them in a safe manner.
Outline
I'm going to talk through three topics. The first one is our solution to this problem, ChAP, and the methodology behind it. Second, I'm going to talk about use case fit. There's a handful of use cases which I think are a great fit for this and some that are not. Lastly, I'll walk through a recent example where we've seen this provide a lot of value to our users.
Netflix's Solution
First and foremost, ChAP is our platform for running experiments at Netflix. It originally started out as our chaos platform, which is where it got its name. It actually stands for Chaos Automation Platform. Because we were focused on chaos, we put a lot of emphasis on safety mechanisms, guardrails, and reducing blast radius. As we got further into it, we realized that all of these things could provide a lot of value in additional use cases outside of chaos, so we've been working to expand the offerings.
What Is a Canary?
At its core, ChAP is a canary platform. What I mean by that is we have our production systems, and we allocate users to an experiment. We take a fraction of the traffic that is in production and siphon it off: we send 50% of it to a control group, which is treated just like production, and we send 50% of it to the experiment group, which gets some treatment or change in behavior. This allows us to limit the blast radius of any experiment. It also lets us measure the impact: we have two sets of metrics, which we can compare to see if the experiment group is causing any issues relative to the control group. We can also shut off the experiment very quickly, because we're controlling the traffic that is flowing.
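As a rough illustration of that split, here is a small sketch that samples a fraction of traffic and divides it evenly. The group names and sample rate are made up for the example; this is not ChAP's actual allocation code.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: siphon off a small fraction of production traffic and split it
// evenly between a control group and an experiment group.
enum Group { PRODUCTION, CONTROL, EXPERIMENT }

class CanaryAllocator {
    private final double sampleRate;   // e.g. 0.02 = 2% of traffic enters the canary

    CanaryAllocator(double sampleRate) { this.sampleRate = sampleRate; }

    Group allocate() {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        if (rnd.nextDouble() >= sampleRate) {
            return Group.PRODUCTION;   // the untouched majority of traffic
        }
        // Inside the sampled slice: 50% control, 50% experiment.
        return rnd.nextBoolean() ? Group.CONTROL : Group.EXPERIMENT;
    }
}
```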
Traditional Canaries
When we look at allocation strategies, there are two ways of allocating users to an experiment. Traditionally at Netflix, deployment canaries have used a non-sticky approach. This means that new server groups are brought up, and individual requests are routed through the normal load balancing mechanisms. If production uses round robin, your canary also receives round-robin traffic. This means that retries are likely to fall back to a production instance and succeed, which makes the approach relatively low risk to the end user, but it can hide certain types of error signals.
ChAP Canaries
For ChAP, we really wanted to be able to tie the experiment to the end user. In order to do that, we came up with a different allocation strategy, which we call sticky. In this mode, users are allocated to the experiment for the duration of the test. Other dimensions can also be used for filtering, such as device type or member status. You can think of it as a mini product A/B test: it's short-lived, usually around 45 minutes to an hour, and users stay in the experiment for the duration. This means that retries always receive the same treatment, so you don't lose signals. It actually amplifies error signals for you, which makes them easier to see. It also makes this strategy higher risk to the end user, and you really have to account for that when you're building your platform.
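A sticky allocation can be sketched as a deterministic hash of the user and the experiment, so the same user always lands in the same group for the life of the test. This reuses the Group enum from the sketch above; the hashing scheme and filter field are illustrative assumptions, not ChAP's real implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative sticky allocation: hash customer ID together with experiment ID so the
// same user gets the same group for the whole test.
class StickyAllocator {
    private final String experimentId;
    private final double sampleRate;          // fraction of eligible users pulled in
    private final String requiredDeviceType;  // optional filter, e.g. "ANDROID_TV"; null = no filter

    StickyAllocator(String experimentId, double sampleRate, String requiredDeviceType) {
        this.experimentId = experimentId;
        this.sampleRate = sampleRate;
        this.requiredDeviceType = requiredDeviceType;
    }

    Group allocate(String customerId, String deviceType) {
        if (requiredDeviceType != null && !requiredDeviceType.equals(deviceType)) {
            return Group.PRODUCTION;   // filtered out of the experiment entirely
        }
        double bucket = hashToUnitInterval(customerId + ":" + experimentId);
        if (bucket >= sampleRate) {
            return Group.PRODUCTION;
        }
        // Within the sampled slice, the lower half is control, the upper half experiment.
        return bucket < sampleRate / 2 ? Group.CONTROL : Group.EXPERIMENT;
    }

    private static double hashToUnitInterval(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (double) crc.getValue() / 0xFFFFFFFFL;   // deterministic value in [0, 1]
    }
}
```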
Orchestration
What does this look like in practice at Netflix? We have our gray boxes over here, our services. We have ChAP, which is our orchestration platform. ChAP will begin an experiment by spinning up the control and experiment infrastructure, so these would be new clusters. We then use a traffic control plane, which publishes information to the Edge proxy layer. That layer has a servlet filter in it, which performs the allocation that we need and adds headers to every request, indicating which group it's in and what experiment it's part of. As those headers propagate throughout the stack, at some point an IPC client will recognize that there's a header it needs to take action on. In this case, it'll say, "I have a route override, so instead of sending traffic to D, I'm going to send it to D-control or D-experiment." While this is happening, we also have real-time monitoring, which looks at a collection of metrics to decide whether the experiment is having a negative impact. Then at the end of the experiment, when everything is done, we run statistical analysis over a wider range of metrics to look for other deviations in signals. I'd like to call out here that the change in behavior doesn't have to be a route override. In my example, it was: the header said instead of routing traffic to D, route it to D-control or D-experiment. But you can really build in headers that drive any change in behavior that you want to test.
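Here is a minimal sketch of the IPC-client side of that flow: reading the propagated headers and overriding the route to the control or experiment cluster. The header names, types, and cluster naming are hypothetical; the talk doesn't show Netflix's actual header conventions or IPC client.

```java
import java.util.Map;

// Illustrative only: if the propagated experiment headers carry a route override for
// this service, send the call to the control or experiment cluster instead of production.
class RouteOverridingClient {
    static final String GROUP_HEADER = "x-experiment-group";    // "control" | "experiment" (assumed name)
    static final String OVERRIDE_HEADER = "x-route-override";   // target service, e.g. "service-d" (assumed name)

    private final String serviceName;   // the service this client normally calls

    RouteOverridingClient(String serviceName) { this.serviceName = serviceName; }

    String resolveTargetCluster(Map<String, String> propagatedHeaders) {
        String override = propagatedHeaders.get(OVERRIDE_HEADER);
        String group = propagatedHeaders.get(GROUP_HEADER);
        if (override == null || group == null || !override.equals(serviceName)) {
            return serviceName;   // no override for this service: call production as usual
        }
        // Route to the experiment-specific cluster that the orchestrator spun up for this test.
        return serviceName + "-" + group;   // e.g. "service-d-control" or "service-d-experiment"
    }
}
```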
Data Overrides and Fault Injection
For us, we have a couple of others. One of those is data overrides. We have a header that says, instead of pulling information from the normal production keyspace, pull it from a canary keyspace. This can return different responses for those in the experiment group, and continue to do so, so that you actually see whether that new data will break users. Another example is fault injection. This started out as a chaos platform, so we have the ability to inject failure or latency at any given IPC call or cache layer. That way, you can ask things like: show me what it looks like if all users experience a failure in service D.
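As an example of the fault-injection hook, here is a sketch of a wrapper around a remote call that fails or delays the call when an experiment header targets its service. The header name and fault format are invented for illustration; they are not the actual ChAP conventions.

```java
import java.util.Map;
import java.util.concurrent.Callable;

// Illustrative header-driven fault injection at an IPC call site: fail the call or add
// latency before delegating when the (assumed) fault header targets this service.
class FaultInjectingCall<T> {
    static final String FAULT_HEADER = "x-chaos-fault";   // e.g. "service-d:failure" or "service-d:latency:500"

    private final String serviceName;
    private final Callable<T> delegate;   // the real remote call

    FaultInjectingCall(String serviceName, Callable<T> delegate) {
        this.serviceName = serviceName;
        this.delegate = delegate;
    }

    T execute(Map<String, String> headers) throws Exception {
        String fault = headers.get(FAULT_HEADER);
        if (fault != null && fault.startsWith(serviceName + ":")) {
            String[] parts = fault.split(":");
            if ("failure".equals(parts[1])) {
                throw new RuntimeException("chaos: injected failure for " + serviceName);
            }
            if ("latency".equals(parts[1])) {
                Thread.sleep(Long.parseLong(parts[2]));   // injected latency, then proceed normally
            }
        }
        return delegate.call();
    }
}
```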
Safety and Guardrails
The key to making all of this work is safety and guardrails. We have a lot of that. We limit the blast radius. We have alerting in place to catch things that aren't looking good. We cap the traffic so that any one region doesn't get too much experimentation going on at any given point in time. The single most important aspect of safety that we have in place is the KPI monitoring. Here's an example dashboard. SPS is our key performance indicator for the product. What we see here is three columns. The first two columns are the server and client views of the SPS signal, so what the server thinks it's handing out, and then what the clients are actually seeing. The top row is successes and the bottom row is failures. The third column over here is another variation of our key performance indicator, and that is downloads per second. Users that are downloading content rather than streaming it will show up in that column. As you can see, in this particular test, we saw a drop in successes for all three metrics: server SPS, client SPS, and downloads per second. We also saw an increase in errors for the downloads. In this particular case, we shut the experiment down within about two and a half minutes from the start of the test.
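The KPI guardrail can be thought of roughly like this sketch: compare the control and experiment success counts every second and shut the experiment down if the experiment keeps falling behind. The thresholds and interface are assumptions for illustration; the real platform's checks are more sophisticated.

```java
// Illustrative KPI guardrail: per-second comparison of control vs. experiment successes,
// with a shutdown decision after several consecutive bad seconds.
class KpiGuardrail {
    private final double maxRelativeDrop;   // e.g. 0.05 = tolerate up to a 5% relative drop
    private final int requiredBreaches;     // consecutive bad seconds before shutting down
    private int consecutiveBreaches = 0;

    KpiGuardrail(double maxRelativeDrop, int requiredBreaches) {
        this.maxRelativeDrop = maxRelativeDrop;
        this.requiredBreaches = requiredBreaches;
    }

    /** Called once per second with the latest SPS counts; returns true if we should shut down. */
    boolean shouldShutDown(long controlSuccesses, long experimentSuccesses) {
        if (controlSuccesses == 0) {
            return false;   // not enough signal this second to judge
        }
        double drop = (double) (controlSuccesses - experimentSuccesses) / controlSuccesses;
        consecutiveBreaches = (drop > maxRelativeDrop) ? consecutiveBreaches + 1 : 0;
        return consecutiveBreaches >= requiredBreaches;
    }
}
```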
When that happens, the user will see that the test case failed. It will actually get the status of short, meaning that we weren't able to run it to completion. They'll get an explanation of which KPI triggered the shutdown and why. Then they also get a bunch of links that allow them to dig deeper. This could be the S3 logs pulled off the instances themselves, dashboards that show performance metrics, or detailed insights that we capture in Elasticsearch. One of the things that we really wanted to make sure of is that if a test fails, we provide enough context for users to debug the problem without having to rerun the test. If we know it's going to cause problems for users, we don't want to keep running it. We log a bunch of information that allows for deep linking into other debugging tools, trace IDs, session IDs, so users can really dig deep, find stack traces or what's going on with the system, and get enough information to reproduce and solve the problem without rerunning the test. Also, at the end of the test, we run a bunch of additional canary analysis. This particular example is a bunch of system metrics. You can see that in this case, the latency increased. That might indicate what is causing the problem for your users, and so you could dig into that further.
The Benefits of Sticky Canaries - Confidence Building
Overall, this has become a pretty useful tool for developers. It's allowed them to build confidence in changes before rolling them out to production. I think a large part of that is because we are focusing on the business KPIs that matter. We're focusing on our end users rather than a proxy metric. With traditional canaries, it can be easy to sit there, look at the results, and say, "I have a 3% increase in CPU utilization, is this safe to push or not?" It's really hard to decide. If you can tie the change directly to the user and see that they're not negatively affected, then that brings a lot of confidence, and you can always follow up on the performance degradations in other ways.
Reduced Experiment Durations
Besides confidence building, we've also seen a couple of other side benefits associated with sticky canaries. One of those is that we've seen reduced experiment durations. Typical deployment canaries at Netflix took at least two hours, and that's just to get enough statistical power to be able to make a decision one way or the other, from per minute signals and proxy metrics. Whereas with ChAP sticky canaries, we've seen that we can get a pretty solid signal within about 45 minutes. This speeds up deployment pipelines. A large part of that is because we're taking per second snapshots of the customer KPIs. Not only are we looking at non-proxy metrics, we're also getting a much higher volume signal.
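As a rough back-of-the-envelope (not the exact sample counts the analysis uses), the difference in signal volume looks like this:

\[
2\,\text{h} \times 60\,\tfrac{\text{samples}}{\text{h}} = 120 \text{ data points}
\qquad\text{vs.}\qquad
45\,\text{min} \times 60\,\tfrac{\text{samples}}{\text{min}} = 2700 \text{ data points},
\]

so a shorter sticky canary still sees roughly 20 times more KPI data points per metric than a two-hour per-minute canary.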
Reduced Time to Detect Issues
Another side benefit is reduced time to detect issues. Traditional deployment canaries will usually find a problem and shut it down within about 20 minutes. Again, that's due to the per-minute granularity of data and the fact that we're looking at proxy metrics. In this case, we can typically shut it down within two to three minutes.
Use Case Fit - Migrations
I've shown you the platform and the key features that make it up. Now I'd like to talk through use case fit. I have a handful of use cases which are a great fit and a couple that may not be. First of all, migrations. Migrations have been a great example of where sticky canaries can really benefit you. As you see in this picture, these two items were built to the same spec, but they're clearly not compatible. It can sometimes be hard to identify implicit assumptions even if you maintain backwards compatibility in an interface. We've found that sticky canaries are a nice way to pull those implicit assumptions forward during a migration and surface them before you roll out more broadly.
Lifecycle Interactions
The next example is lifecycle interactions. In our case, for playback, we have a manifest, which can affect a license call downstream, which can also affect the start play call from the client perspective. These three things are linked. A sticky canary, because the same users are always getting the same experience, will magnify any problems: if something happens in manifest which causes a problem in license, you're going to see that in the customer KPI metrics. Whereas with a traditional canary, a problem in manifest might cause a problem in license, but when the user retries, they go through just fine because they fall back to a production instance, and the signal is hidden. This allows us to see those lifecycle interactions and how they're affected before rolling out broadly into production.
Data Changes
Third is data changes. This could be configuration. It could be a backing datastore. Regardless, data may originate from one service, but be consumed by several other services and client devices. What that means is, you actually want your users in the test to see the same data regardless of where in the system they're pulling it from. A sticky canary allows that to happen.
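A data override of that kind can be sketched as a tiny keyspace selector that every consuming service applies consistently, so everyone in the experiment path sees the same overridden data. The header and class names here are hypothetical, not Netflix's actual implementation.

```java
import java.util.Map;

// Illustrative data override: requests carrying the (assumed) override header read from a
// canary keyspace instead of production, consistently across every consuming service.
class KeyspaceSelector {
    static final String DATA_OVERRIDE_HEADER = "x-data-override-keyspace";   // assumed header name

    private final String productionKeyspace;

    KeyspaceSelector(String productionKeyspace) { this.productionKeyspace = productionKeyspace; }

    String keyspaceFor(Map<String, String> propagatedHeaders) {
        // Same header, same decision, in every service that consumes this data.
        return propagatedHeaders.getOrDefault(DATA_OVERRIDE_HEADER, productionKeyspace);
    }
}
```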
Chaos Experiments
Lastly, chaos experiments. This is the core of what we originally built. Basically, when a microservice has an outage, it typically fails across the board, or it goes latent across the board. A sticky experiment lets you see what it looks like when that happens. You may be resilient to a failure in one service but not another, and a sticky canary will be able to show you that.
What Doesn't Work Well?
What doesn't work so well with sticky canaries? The first example is low volume services. This is actually a problem with traditional non-sticky canaries as well. If you don't have enough traffic to your service, you're not going to get enough statistical power to make a decision one way or the other. If you add stickiness on top of that, and you can't monitor the experiment well enough to decide whether to shut it down, you're putting those users in a really bad position. I recommend looking at something like black box testing or probing, some other mechanism, for low-volume services. The same goes if you don't have production traffic. We don't have anything today that works this way for services that aren't in production yet: if the traffic doesn't exist, there's nothing to throw at them. Also, if you really want to use synthetic or shadow traffic, that won't work either, because the power of sticky comes from getting a signal about real users and what's happening to them. If you don't have real users, sticky is not a good fit for you. Lastly, services which are not directly related to the KPIs that you're monitoring are not a good fit for sticky until you add that monitoring. You don't want to run an experiment that can fail without you realizing it. For example, we have really good monitoring around SPS, so we can cover all of the services that participate in those workflows. We don't have great coverage for signup flows, so we don't run sticky canaries on those services yet.
Example
Let's walk through a recent example and see how this is actually working in practice. It's been about five years since I did that original migration to move from on-box to off-box calls, which means it's really time for another migration. My former team is in the process of spinning up a new gRPC-based service, and they're going to move traffic from the REST-based call over to the gRPC call. What does this look like in practice? We have our Edge proxy. We have our playback API lifecycle service today, in production, talking to a REST endpoint. They've also now built this playback manifest service with a gRPC interface, and that's what they want to start moving traffic to. In preparation for running an experiment, they have also added a dial mechanism in the playback API lifecycle service's client, which allows them to shift traffic and point it to the gRPC service rather than the REST one. That means we can run an experiment on it and check: we can spin up a control cluster which still points to the REST path, spin up an experiment cluster which points to the gRPC path, and monitor to see how it goes.
What does this look like? The user can go into the tool. They can say, I want to run a sticky canary, pick my region, pick my cluster, and give it a name. If your code is not already out in production, there's also a field where you can provide your AMI or your container ID. In our case, the service is already out, so we don't need to do that. The point is, you don't have to deploy it first; you can load that into the test case. We talked about that dial, which is a property that we return; we can set that on the test case as well. Anything that they put here will be applied to the experiment cluster and not to the control. In this case, they're setting their gRPC dial to 100%.
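Conceptually, applying those overrides only to the experiment cluster might look like this sketch, reusing the Group enum from the earlier example. The property name and configuration mechanism are made-up placeholders, not the real ChAP test-case format.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: property overrides from the test case are applied to the
// experiment cluster only, so the control cluster keeps production behavior.
class ExperimentConfig {
    private final Map<String, String> experimentOverrides;

    ExperimentConfig(Map<String, String> experimentOverrides) {
        this.experimentOverrides = experimentOverrides;
    }

    Map<String, String> propertiesFor(Group group, Map<String, String> productionProps) {
        if (group != Group.EXPERIMENT) {
            return productionProps;               // control cluster keeps production values
        }
        Map<String, String> merged = new HashMap<>(productionProps);
        merged.putAll(experimentOverrides);       // e.g. "manifest.transport.grpc.percent" -> "100" (assumed key)
        return merged;
    }
}
```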
Also, there's no more watching global SPS graphs. The developer can kick off the run and go do whatever they want. They can watch, because they're excited or interested in the results. Regardless, the platform is going to monitor SPS and shut the experiment down if there's a problem. You can see in this case there was a problem, and we shut it down in about a minute. This test failed with a status of short, and it was cut off due to client SPS errors being elevated.
Debugging
If we click into the Elasticsearch dashboard, you can see that we're able to drill into the error breakdown. There were actually three primary error codes that came up: 5.2.100, 5.2.101, and 5.2.102. What this pointed the service owners to is a license problem; all three of those codes correlate with license issues. This turned out to be an example of a lifecycle interaction, where the output of the manifest showed up in failing license calls. As they dug in further, they found that there was some DRM header information returned on the manifest, and that it was a list of items. That list order changed when moving from REST to gRPC, which inadvertently broke the license interaction. Again, it's an example of an implicit contract that we brought forward with the tooling.
Interesting Finding
There's one other interesting finding I want to point out here. As we were digging into the results, we could see that for each single decrease in SPS client successes, we saw roughly four times that number of errors reported. As we looked through the logs and other metrics, we were able to confirm that this is because the clients were retrying on-box and then reporting all four of those retries. It's just a nice little learning about this type of failure mode.
Summary
Remember that cycle I went through earlier: debug, deploy, test, dial traffic, and monitor. With ChAP, that now looks more like this. You still have to do the work to debug the problem, figure out what it is, and code up your solution. Then you can essentially do everything else as part of running the experiment. ChAP has really cut down on the churn for service owners and allowed them to iterate quickly. Overall, we have seen that this has become quite the confidence-building tool for our users. You don't have to take that from me, you can take it from them. It allows them to move quickly, even when the change that they're making is really complex and could be really frightening to do if you didn't have the tooling to support you.
Takeaways
Lastly, I just want to leave you with three key takeaways, which I think have proved successful at Netflix. First, make sure you build the hooks into your platform which enable the types of experiments you need. For us, that's routing overrides, data overrides, and fault injection. Look at the types of things that your users need and build those into your platform. Second, make sure you're building guardrails and safety mechanisms when you're testing in production. This could be blast radius reduction, alerting, real-time monitoring. Those are the things that are actually going to make your users comfortable with running these experiments, and eager and excited to dig in and learn. Lastly, make sure you're monitoring customer KPIs, not proxy metrics. By tying the results to real users, you get the most value out of the tooling and make it easier for users to reason about problems.