InfoQ Homepage Presentations Enhancing Reliability Using Service-Level Prioritized Load Shedding at Netflix

Enhancing Reliability Using Service-Level Prioritized Load Shedding at Netflix

View Presentation

Speed:

Download

50:41

Summary

The speakers discuss Netflix’s architecture for surviving extreme traffic spikes. They explain the mechanics of prioritized load shedding embedded in their Envoy sidecar proxy, allowing user-initiated requests to steal capacity from non-critical traffic. They share automated platform strategies for continuous chaos load testing, config generation, and retry storm mitigation.

Bio

Anirudh Mendiratta is an engineer on the Playback Lifecycle team at Netflix who has been instrumental in launching live streaming at Netflix. Benjamin Fedorka is an engineer on Netflix’s Java Platform, focused on IPC ergonomics and resilience.

About the conference

Software is changing the world. QCon San Francisco empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Anirudh Mendiratta: I'm Anirudh. I work on playback backend at Netflix.

Benjamin Fedorka: I'm Benjamin Fedorka. I focus on platform engineering at Netflix. Think back almost two years. It's 2024 and I'm so excited. My favorite TV shows are back for another season and there's so much to see. Life is good as a platform engineer. I hope the other teams are as laid back and happy as I am.

Anirudh Mendiratta: My team owns Play API, a critical backend service that gets called every time you hit play. It's 2024, we have a very exciting slate of content coming up, but I have a problem. The problem is traffic spikes. We have a typical distribution of traffic which looks like the blue area in this graph, which peaks at around 7 p.m. in U.S. West to when people go home and watch Netflix. The troughs are around 3 a.m. or 4 a.m. at night when not a lot of people are watching. It's a very predictable pattern, but when we launch a popular title or a live title, we see these big, massive traffic spikes. We know the timing of these traffic spikes, but we don't know how many people are going to join in and we don't know the magnitude of these traffic spikes. We're talking about legitimate traffic here. We're not talking about malicious traffic.

Outline

We're going to introduce the problem which is traffic spikes. We'll talk a bit about the theory. How do clusters respond to load in the absence of any load shedding? Then we'll talk about our initial implementation of load shedding. Then we'll talk about our improved version which is called prioritized load shedding. Lastly, we'll talk about the process of tuning, validation, and rolling this out at scale.

The Problem

Going back to the problem, traffic spikes can happen and the traffic can be more than the capacity that we've provisioned. You might have a question, why not just scale up? We do scale up. We do proactive scaling and reactive scaling. Reactive scaling is autoscaling based on some resource utilization measure like CPU. Proactive scaling is when we know there's a live event or a big launch, we scale up our clusters. None of these is ideal. Reactive scaling is slow. It takes a few minutes to kick in. In a cloud environment, you need to determine that you have to scale up. You need to start up instances. This takes of the order of a few minutes. Proactive scaling, you can scale up for your peak demand, what you expect if all the users join, but it's expensive to do that. Scaling up may not always be an option.

Your cloud provider or your data center may not have enough capacity available. You can also see outsized thundering herd. As an example, for Netflix, let's say you're watching a live title. A lot of users experience a rebuffer at the same time. Everyone is going to hit refresh. Your typical traffic patterns may be very different from these outsized thundering herds. You may not have enough capacity to scale up for these large request storms. A good solution for this problem is load shedding, which is that we shed excess traffic to keep latency low for the successful request. Load shedding is not exactly the same as rate limiting, even though they could solve the same problem. With load shedding, we're talking about the total request per second exceeds our currently provisioned capacity. Whereas with rate limiting, we're talking about some fixed limit per user. Rate limiting can help prevent a situation where you need load shedding.

Rate limiting could also be used for monetization, where you have a certain amount of API usage that is allowed for a particular user. It's not the same. With load shedding, we're really talking about the total request volume and the total capacity. What happens if there's no load shedding at all? If we get more than expected traffic on a server and there's no load shedding? The graph on the right shows what could happen. This is latency on the y-axis and time on the x-axis. We keep ramping up traffic to an instance. This is one symptom you can see that the latency for all requests just increases. Both p50 and p99 increases. As a Netflix viewer, what you might see is the spinner and loading, when all requests become latent. This can cause threads to pile up on your server. It can result in your server running out of memory.

In both these cases, whether the instance either goes out of memory, or in general, it just fails its health check, the instance is no longer available to handle traffic. Then, what happens if this happens on a larger scale? When one instance becomes unhealthy, it puts more pressure on the other instance and it can cause a cascading failure. This can happen before autoscaling has time to add more capacity or you just don't have additional capacity available. We need to ensure that in case of more than expected traffic, we can degrade gracefully and our servers can handle some requests instead of failing completely.

Theory - How Do Clusters Respond to Load?

Benjamin Fedorka: It's 2024 and I have a problem. My year might not look like this. Anirudh's challenge of handling traffic spikes isn't unique. I have hundreds of engineers operating clusters who are all struggling with this. These things are also busy. They need to focus on problems that are unique to their service. I need to find a way to help all of them handle traffic spikes. I need the solution to scale. We can't spend hours hand-tuning every cluster. To solve this problem, let's start with a shared understanding of what we're trying to achieve. We're going to go over some basics. How do servers respond to traffic spikes and how do we want them to respond? This graph shows a system which is getting an ever-increasing amount of traffic. The green bar shows the successful requests. The red bar shows those that failed. As the total number of requests continues to increase, so does the system utilization.

Unfortunately, this system reaches a tipping point where the utilization goes to 100% and all of the requests are failing. This is the worst-case scenario and we call it congestive failure. The bad request just filled up the entire server until it failed. Why is this so bad? The system rapidly transitioned from successfully working to failing. If it had continued serving successful requests while failing others, there would have been an opportunity for graceful degradation. Even worse, it never recovered. When the load returned to a level that we could serve, there's still no successful requests. Even if we detected this and restarted the node, we'd have to pay the cost of that restart. We need to make sure that this never happens.

Let's think about our system like a highway. I like that analogy because we're still talking about traffic. Most of the time, there's plenty of room for all of the cars. In fact, there's still room for some small traffic spikes. We call this capacity our success buffer. We can successfully serve additional requests without failure. If you've ever run a system cold, you've created a success buffer. Now, if we get too much traffic, we're going to have a big traffic jam and nobody's going to be able to get home. This is that congestive failure that we saw earlier. Let's start with an easy fix. We put a stoplight on the on-ramp and we only allow on cars when there's excess capacity. We call this a failure buffer. We've reserved some capacity for rejecting requests and we do that gracefully without suffering from the congestive failure. However, some requests are still getting through and we know which requests we're rejecting so we can always respond gracefully.

Success buffer and failure buffer give us a generic way to talk about our clusters and how they're responding to traffic spikes. When traffic does spike, we want to continue serving baseline traffic. If we can, we'd like to serve a little bit more, but we also want to reserve capacity to gracefully reject traffic. This is cheaper than serving it. We can reject many more requests for each one we choose not to serve. This cluster is performing perfectly under a load spike. It's continuing to serve traffic while rejecting the entirety of the spike.

Let's send a little bit more so we can see what happens when it runs out of buffer. We finally overloaded this cluster and we know how much traffic it can take. It can successfully serve 130 RPS while rejecting an additional 120 RPS. This means that it can gracefully handle 250 RPS without any impact to the successful requests. After 250 RPS, we're going to allow the cluster to gracefully degrade. The number of successful requests slowly drops, but we can gracefully reject many more. Once the load drops to a servable level, the successful requests return to normal. The cluster with load shedding responds so much better than the cluster that suffered from congestive failure. It continued serving baseline successful requests and it quickly recovered once the load began to drop. Even under very high load, it degraded gracefully. How much buffer does this cluster have? That depends on the baseline traffic.

Let's send out 100 requests per second. Any traffic that we can serve above our baseline is a success buffer. If we can successfully serve an additional 30 RPS, we have a 0.3x success buffer. Any traffic that we can gracefully reject above the top of our success buffer is our failure buffer. If we can inject an additional 120 RPS of baseline traffic without impacting our success buffer, then we have a 1.2x failure buffer. Combined, this cluster has 1.5x of buffer, which means that it can handle a load spike of 150 RPS and 250 RPS total. Past 250 RPS, we allow the successful requests to slowly drop. We're giving up success buffer to keep the system functional. We would say that this load spike exceeded the failure buffer of the system. How do we get more buffer? This is that same cluster with a traffic baseline of 50 RPS.

Now we can serve 80 RPS above our baseline. We have a 1.6x success buffer. We can still reject 120 RPS above our success buffer, but now that's a 2.4x failure buffer, and we have 4x of total buffer. This gives us a really useful workflow. If we set a goal for our buffer, we can work backwards to the maximum RPS that our cluster can handle to compute the baseline traffic it should take. This is the same workflow as running a cluster colder. We just enhanced it with a little bit of math to help us reason about the successful and failed requests. We can also begin load shedding earlier. That would reduce our success buffer, but greatly increase our failure buffer. The concept of buffer gives us a shared understanding of the problem, and a vocabulary to talk about clusters which are overloaded. When the load spike occurs, we can reason about if the cluster responded as expected, and because the measures are objective, we can experiment until we validate them. I also want to highlight this vocabulary applies to many load shedding implementations, not just the solution we're presenting today. We find these concepts helpful to describe what happens to any of our services, no matter how they failed under load.

The Problem with Load Shedding

Anirudh Mendiratta: Load shedding seems great to protect servers from traffic capacity, but here's the problem. As a Netflix viewer, without load shedding, you might see a spinner that hangs on for a few seconds before your request fails, when you hit play. With load shedding, you might just see an error. Is this any better? It is better from the backend engineer's point of view, because as we established, the situation on the right does not result in a cascading failure. Also, the situation on the right is retryable, so a retry may succeed, and a user may never see this error. How can we make it better for the customers? One key insight here is that all requests are not equal. Talking about Netflix playback, we have two types of requests. One is called a prefetch request, so when you're just browsing Netflix, your browser is optimistically fetching data for titles that the browser thinks you may watch, like if you hover over something.

These requests are not critical, and a failure in these requests is not visible to users, only results in a slight increase in latency when you hit play. The other type of requests are user-initiated requests, when you actually hit play. Those are critical, and if those fail, a failure in those is user-visible. A key insight here is that for us, 50% of playback requests were prefetched. We did an experiment where we ramp up traffic to an instance beyond what it can handle successfully. With equal opportunity load shedding, we have two categories of traffic, prefetch and user-initiated. When we ramp up this traffic, both see an equal drop in availability. As I said, this is still good. This is still better than no load shedding, because it helps us protect the server, and the system will return to 100% availability once the traffic spike goes away.

Prioritized Load Shedding

Can we do better than this? We can shed lower-priority requests first, or what we call prioritized load shedding. This minimizes the user impact, because we have higher availability for the user-initiated requests. In the same experiment, we see that prefetch availability drops significantly, but user-initiated availability remains at 100%. It's not always guaranteed that your critical request availability will be at 100%, because when you keep ramping traffic up, you will see a hit to the critical traffic availability. If you have a mix of different priorities, this guarantees that the higher-priority request will have a higher availability. Let's take a look at a real-world example. This is real data. We had a spike in non-critical traffic. The background here is that we had an infrastructure outage that impacted playback across all devices. During this time, playback requests were failing, and Android devices, in particular, queued up these failing requests to retry them later.

This caused a big backlog of queued requests to be built up on these Android devices. When this outage recovered, we saw a huge spike of these prefetch requests, but we were able to just load shed them and have high availability for critical requests. This could have delayed our outage recovery if we didn't have prioritized load shedding. You might also be wondering that, is it a good idea for clients to queue up non-critical requests? It's probably not. As a backend engineer, we have to handle scenarios where clients are not doing the right thing, and prioritized load shedding helps with that.

Let's take another example. In this scenario, we had a critical request spike that happened during a popular live event. In this scenario, we can't just drop the spike because all of these requests are critical. What happens here is that if you see the pink area, which is critical, that steals capacity from the other traffic. The successful critical requests per second go up, whereas for the lower priorities, the successful RPS goes down. The blue line is the overall RPS, so that gap between the blue line and the area is still the failing request. We weren't able to serve all critical requests here, but if you see the graph on the right, the availability of critical requests was higher than the second priority, which is degraded, which was higher than the best effort. Prioritized load shedding ensures that you shed requests in order of their priority. You shed the lower-priority requests first.

This technique can be generalized to other backend systems, and it's not only for playback. Another example that's Netflix-specific is fetching the home page. Every user has a personalized home page at Netflix, which includes a list of shows and movies that we think you'll watch. We prioritize foreground requests over background. If you're actively browsing Netflix, those are foreground requests. You might just have the TV on in the background, and you haven't interacted with it in a few minutes. Those requests are background requests. If we are constrained for capacity, we would shed background requests before shedding foreground. Here's another example. We have a service called Data Gateway that fronts our databases. For this service, we prioritize writes over reads because failed writes cause data loss, and reads are retryable. We prioritize writes over read to prevent data loss. The other insight here is that read RPS is also typically higher than write RPS. This gives us more buffer when we start shedding requests.

Let's talk about the details of how exactly this is implemented. A simple way of thinking about load shedding is that it's a function of request priority. It's a function that takes in the request priority and the current utilization, and returns a Boolean response which says shed or don't shed. How do we determine request priority? There are a few different ways. The main thought is that you want to determine it cheaply so that you don't spend too many CPU cycles on requests that are going to be shed. The best way we've found is using request header and mapping a specific request header to some priority, because if we shed that request, we don't need to pass the request body. You could also use the request body, but you have to pay the cost of passing that request body, so it takes more resources, but it's a little more flexible.

You could also make a remote call to some service or database to map some request header to a priority, but this is expensive because you're effectively making a remote call before even deciding whether to serve that request or not. Note that load shedding is not the only line of defense you have. You might have a malicious actor who might up the priority of their request, so you always want to complement it with some sort of per-user rate limiting. This is an initial line of defense that if a request advertises that I am low priority, we can shed it before spending too much time on it. How do we measure utilization? We have a few different ways. CPU is a good one to determine when your service itself is overloaded. Another one we use is latency or concurrency, which tend to spike up in a very correlated way, so this helps when a downstream service or database is overloaded.

It might not show up as increased utilization on your own service, but it tells you that something downstream is having trouble. The latency-based shedder is similar to a timeout, but it's just a smarter timeout where you prioritize the requests that are more critical to your user experience, so it's more progressive, whereas timeout is just like a step function.

Let's talk about why prioritized load shedding is better than an alternate solution, which is sharding. In the leftmost diagram, you have a single cluster which is serving both user-initiated and prefetch requests. This has no isolation, so a spike in prefetch requests is going to impact availability for user-initiated. The second diagram, you could also have separate clusters for your critical requests and non-critical requests, and in this case, you get that isolation. If there's a spike in prefetch traffic, you're not impacting user-initiated availability because you just have separate server instances serving that traffic. Prioritized load shedding gives you that isolation and it gives you something else. With prioritized load shedding, the same server instance is serving both user-initiated and prefetch requests, but we are prioritizing requests on the application level. We have application-level isolation. The benefit that this gives in addition to sharding is that now user-initiated requests have the ability to steal prefetch capacity.

When we get a spike, this partition can expand all the way to the right and you can use all your capacity for user-initiated requests. This is challenging in the sharded architecture. Our initial implementation of this prioritized load shedding was at the API gateway, and this made shedding decisions based on request priority, the API gateway utilization, and the error rate from the downstream service. This worked great when the API gateway itself was overloaded, but it didn't have visibility into the service utilization. It had visibility into the error rate, but it's a lagging indicator which shows up only when things have gone somewhat wrong. This also does not work for backend-to-backend communication because those requests are not going over our API gateway. A lot of our low-priority requests are actually generated by some internal batch process, which don't go through our API gateway. This API gateway-based architecture doesn't help with that.

We have augmented this, and our current architecture adds this prioritized load shedding to the individual service level. The benefit of doing this on the service level is that service utilization is a leading indicator. We can start shedding low-priority requests before you have an impact to the error rate. This solves the backend-to-backend use case when, let's say we have a service called Viewing History Service, which is used for both generating some real-time, your continue-watching row, but it's also used for recommendation systems, which are batch processes and lower-priority. Doing prioritized load shedding at the service level helps with the backend-to-backend use case. It allows us to steal non-critical capacity, which was not possible with the API gateway-only solution. It also helps us to run this and test this locally because all of this is within the context of a single service. Our initial implementation was a Java library, but we are now moving this to our sidecar Envoy Proxy, which Benjamin will talk about more.

Here's an example config. Let's say your autoscaling target is 50% CPU utilization. You would want that CPU-based shedding kicks in after your utilization, because if you have a traffic spike, you want to make sure that your system is trying to scale up while you are shedding requests. If your system scales at 50% CPU, you could have non-critical shedding start at 60% CPU, and critical shedding start at 80% CPU. Here's an example of what this looks like. In this experiment, we ramp up traffic to an instance. The blue area is the successful RPS. The purple area is the unsuccessful RPS. As we ramp up traffic, and we go beyond our autoscaling threshold, when we hit that yellow line, that's our non-critical threshold, which is 60% CPU utilization, we start shedding non-critical traffic. When we go beyond that yellow line, the traffic is being ramped up at a similar rate, but the growth of CPU utilization slows down because it's cheaper to reject requests than it is to serve them.

As we keep going up, if the CPU utilization gets to 80% after that point, we start shedding critical requests. In this system, we are able to handle, go up to six times the successful RPS, and it does not impact the latency of the successful requests. The successful requests still have a low latency, and we shed the other requests. Another measure we use for utilization is latency. For latency, we define the normalized utilization as the percent of requests which have a latency greater than our defined SLO. Let's say our latency SLO is 1 second. You could define your non-critical shedding threshold as 10% and critical shedding threshold as 40%. What this means is that if 10% of requests are more latent than 1 second, then you start shedding non-critical traffic. If 40% of requests are more latent than 1 second, then you start shedding critical traffic. This config needs to be tuned per service, which is a somewhat tedious process.

Tuning, Validation, and Productization at Scale

Benjamin Fedorka: Let's scale this up. All that tuning, validation, and even the priority detection is just for one cluster, but we need to do it for hundreds of clusters. We don't just have to do this once. Our teams are constantly adding new features to their services. Their performance profiles, and their request priority distributions are constantly changing. We need to be able to do this on demand and with minimal oversight. Solving a problem centrally so that hundreds of teams can solve their unique problems is a ton of leverage. These problems are why I'm a platform engineer. We need three primary capabilities. First, we need a coordinated determination and an assignment of priorities. Second, we need a prioritized load shedding solution that can be centrally configured. Third, we need a way to determine and validate the load shedding configurations for each cluster. At a high level, we're going to inspect the request at our edge, and we're going to assign a priority to each.

That priority is going to follow the request through the entire call graph. When each server gets a request in, it's going to inspect that priority of the request, and it's going to independently make a determination if it's going to serve it or reject it. Then we're going to automatically configure the load shedding parameters for every cluster, automatically validate that the parameters result in the expected success and failure buffers. Then we're going to automatically apply those passing configurations to the production clusters. Let's look at how. First, I said we have to do it for hundreds of clusters, but we have thousands. How do we determine which clusters are valuable to onboard? There are some easy picks, like clusters which caused an incident when they went unavailable, but we don't want to wait for a problem to happen. Our chaos automation platform, ChAP, helps us here. We have a capability called failure injection testing, which allows us to inject failures into requests at any point within the call graph.

Instead of waiting for an incident to happen, we can actively test the impact to a single request if the cluster was to fail. Next, we can trace the priorities flowing into our services. If a cluster is receiving high-priority traffic, that's the signal that we didn't benefit from load shedding. Finally, we can ask our application owners to validate that information. We ask them to confirm the impact of each cluster going unavailable to various business domains. This helps us close our blind spots, but it also allows us to reason about the risk of each cluster to particular elements of the business. We can then manage that risk by establishing how much buffer each cluster should reserve, and we validate that by having each cluster routinely pass load tests.

What types of work are these servers performing? What are we actually trying to shed? With our newest capabilities, we targeted our most common types of web services. Each type has a few differences, as we see in this table. The simplest is gRPC. We have a very natural unit of work, an individual RPC invocation. Each RPC is discrete, so it's very easy for us to track that latency distribution. Generic REST services also have a natural unit that we can load shed on, the incoming request. However, instrumenting the latency is a little bit more complicated due to the presence of path parameters. Instead of trying to inspect the path directly, we instrument the underlying service implementations which are mapped onto the request. GraphQL is easily the most complicated. Even if we wanted to leverage registered queries, it's easy for different query invocations to have different performance profiles.

Instead of trying to parse that complexity, we instrument the latency of the individual field resolvers. GraphQL also provides a mechanism to provide field-level errors and partial responses. We don't use this for load shedding. Instead, we load shed the entire query or mutation. We don't have enough information to know if the calling service could use a partial response, and we don't want to waste resources on something that can't be used. If the request is already low enough priority to provide a partial response, it's low enough priority for us to shed it completely. While I was preparing this talk, I was asked if these techniques could apply to services which read from a queue. I believe the answer is yes. You'd make load shedding decisions as you select items for processing, and you would instrument the underlying business logic. However, we haven't actually implemented that.

As we discussed earlier, not all requests have the same value. If we had to choose, we'd rather reject a request being used for an internal report than reject something that's critical to a user. We want to track this at a really fine resolution because a cluster can shed low-priority requests to reserve the success buffer for higher-priority requests. When we began this process, we had parties assigned to some of our requests. Writing data was often prioritized above reading data. Traffic from member devices was prioritized above traffic from batch clusters. This was a great start, but many priorities were missing, and most had never been validated. We had asked teams to classify categories of requests, and not all the prioritizations were correct. Some important requests had sufficient fallbacks. Other requests were inadvertently critical. To address this, we designed an experiment to help us validate the priority of each request.

Imagine that we only have two priorities, high and low. A request is high priority if it impacts a metric that we really care about, like if a member was able to successfully play their next episode. All other requests are low priority. This allows us to run an experiment. If we fail a high-priority request, we should expect an impact to that metric. We again do this through our failure injection testing capability, which allows us to annotate requests and force them to fail at particular points in the call graph. My goal is to entertain the world, so I don't really want to stop a member from watching an episode. If we can track if that playback would have failed, we can transparently retry the request without the injection failure. Similarly, we can run an experiment to fail low-priority requests and expect no impact to our metrics. The ability to experiment to validate priorities allows us to be confident that our load shedding is protecting the right traffic and avoiding harm to the service.

This is important because it's possible we might even have to shed some critical requests, and we need to know that we've minimized that impact. Once we know what priority to assign a particular request, we actually need to communicate that internally. Our API gateway, based on the Netflix open-source project, Zuul, inspects incoming requests and uses a rules engine to determine the priority to assign. These rules consider the particular API being called and the state of the device that is invoking it. This priority will follow the request through the entire call graph. We do allow services to downgrade that priority. We allow them to make a non-critical request to augment a critical response, but we don't allow them to upgrade the priority of the request. That indicates there's an issue with original prioritization, and that needs to be corrected at the edge. For requests that don't flow through the edge, such as batch traffic, we can use that same mechanism to apply an initial priority.

Let's look at how we actually communicate the priority with some details. A core capability of our ecosystem is the ability to attach metadata to our request and have that metadata be automatically propagated to any subsequent call. We use this capability to communicate identity information, inject failures, override call writing for canaries, and now for request priorities. The primary implementation for this is through self-propagating headers. When a server receives a request, it reads the priority header and places it into a lookaside local storage. This allows for us to modify the structure of that data without needing to impact every business implementation. In Java, this is as simple as a ThreadLocal. Then any subsequent requests which are made to complete the original work item can read this local storage and set that priority on the outgoing call in the same header. This allows the priority to stay attached to the request as it fans out through the call tree.

When our track lead, Daniel, first saw this, he said, "Benjamin, I want to see one of those headers. What is in there?" Here's an actual header that we use to communicate priority, netflix-contextflow-mesh-bin: CAE=. That's a Base64 encoded Protobuf binary header, so it's really not easy to read. Below is what the header parsers do, requestPolicy priority 1. The data is that simple. Contextflow is our generic mechanism for communicating all types of request metadata, not just the priority.

Now that we have parties assigned to our requests, we're almost ready to make load shedding decisions. I can't see any of our servers in the cloud, but I assume this is what they look like. This system isn't busy at all. It shouldn't be load shedding. This system is overloaded and it crashed. It should have started shedding load earlier. How do we know and communicate how loaded each server is? I already talked about there was like an RPC latency as our main indicator for system utilization. We also support utilizations being provided from other sources, such as our data platform team contributing utilization for connected datastores. With utilizations coming from several sources, we need a standardized way to aggregate and collect them. Luckily, we have a simple solution in open source, the Open Request Cost Aggregation message, also known as ORCA. We have each utilization provider write an ORCA message to a known directory shared between all processes on the instance.

These messages are fully out-of-band communication of load, meaning that there's no impact to individual requests. Our Envoy-based ingress proxy uses an inode notify API to monitor the ORCA messages. inotify is a Linux kernel feature, which provides for an efficient way to monitor the activity for all files in a directory. This allows us to quickly respond to updates without needing to pull the files to check for changes. We know the priority of each request. We know how loaded each server is, and it's time to start making load shedding decisions. Remember that we shed low-priority requests before high-priority requests, but we're eventually going to shed all requests to keep the server functional and offer a quick recovery once that load spike becomes manageable. For each utilization and for each cluster, we define a function that maps the tuple of the current utilization value and request priority to a probability that that request should be shed.

If the utilization is very low, that probability is zero for all request priorities. As the utilization increases, we gradually increase the probability to shed low-priority requests. We do this incrementally by request priority, so that way lower-priority requests will be shed with 100% probability before a higher-priority request is at risk of being shed. If the utilization gets sufficiently high, we're going to shed all the requests with 100% probability. Because each cluster receives a unique mix of priorities, we configure a unique function for each cluster, which is designed to create a smooth throttling response.

Let's look at how an actual call is going to flow through the system. A request comes into the instance and is received by our Envoy-based sidecar proxy. The proxy inspects the request metadata and uses the current utilization from the ORCA files to make a load shedding decision. If the request should be shed, it gracefully rejects it without ever sending it to the application. If the request should be served, the proxy forwards it to an application, which will then execute the request and will use the latency information from the execution to inform updates to the latency utilization provided back into the ORCA files. If any subsequent requests need to be made, the application will then use the Envoy Proxy to send the call to the next service, which will then make an independent decision to shed or serve the request. All these interactions are instrumented and sent to our metrics and tracing databases, so they're fully observable for our teams and tooling.

Unique load shedding function for each cluster. Tracking latency distributions for each RPC. Unique priority mix. Our system has tons of tunable values that we need to adjust for every cluster. Ideally, we wouldn't need this, but we tried for years to track latency targets on a per-service level, but the shifting call patterns we saw during load spikes caused us to both over-respond and under-respond. Additionally, even per-service latency targets are more configuration than I want to impose on my customers. As a platform engineer, my goal is to manage more of my users and reduce the cognitive load. Forcing them to maintain these values isn't the experience I want to provide. Being a platform engineer, we built a new system to do this for us. We integrate signals such as historical RPC latency, RPC volumes, request priorities, and CPUs to generate what we call resilience configuration. Resilience configuration contains all the information that we need to configure the expected ranges for each utilization, load shedding response functions for each utilization priority and cluster, and detailed information for tuning that utilization monitoring. This configuration includes the information we need to track if CPU is within the expected range, if requests are completing on time, and the request priorities that we expect to serve, exactly what we need to make load shedding decisions. The system monitors metrics and generates new configurations for each cluster on demand.

Early in the rollout, including through several large content events, we used a manual process where application owners could request overrides to their resilience configurations. They literally sent Slack messages to our support channel. We would then track those overrides in a JSON file and merge them with the output of our recommendation engine. Today, we offer a specialized UI where teams can both view the current load shedding configuration and propose updates. We deferred investing in the UI until we proved the value of the system, but we now have a simple interface to help our customers view and modify the configuration. This is also a good time to highlight something. We do autoscale our clusters, and how we autoscale is a key consideration for our load shedding goals. This talk is predicated on not relying on autoscaling, but outside of extreme circumstances, our systems are constantly scaling up and down with load.

The information that we need to effectively shed load is the same information we need to scale effectively, so we offer a single configuration for both concerns. If we have capacity, we're going to load shed until we have time to scale up. That's not really a big problem. Help is on the way. It's also the most common case. Something has gone wrong. A cluster is temporarily overloaded. We scale up. We've mitigated the overload within a few minutes. However, there are times when we don't have capacity to scale up, or maybe we try to scale up, and something is preventing the new instances from becoming healthy. In these circumstances, we can depend on load shedding to protect critical traffic for an extended period of time.

I just described a robot that's going to reconfigure how our clusters reject traffic, and that's a little scary. It's almost as scary as allowing humans to reconfigure how our clusters reject traffic, which we also allow. Luckily, we can make both these scenarios less scary with an objective test. For each cluster, we need to run the load shedding experiment that Anirudh described earlier. We automatically generate a resilience configuration, then we use our chaos automation platform to run an experiment to validate that configuration. If it passes, we promote the configuration to production clusters. If it fails, we flag it for human review, and we'll adjust the parameters used in the resilience configuration. Deciding when to run these tests looks a lot like a homework problem for my algorithms class. To isolate problems, we want to run at most one test in a given region at a time.

We need enough traffic routing to that cluster to safely run the experiment, but some clusters only see that traffic load for a few hours each day, and even if we just focus on one cluster, it might not see that traffic load for different times in different regions. Then, finally, we prefer to run the test when the on-call for the cluster is within their working hours. If we inadvertently cause a larger problem, we don't want to wake somebody up. Our chaos automation platform, ChAP, handles the execution and evaluation of these tests. It coordinates creating test and baseline clusters, orchestrates request routing to warm up the clusters to their baseline traffic, and then quickly spins a load spike to the canary cluster. It also judges if the test is a success. Importantly, it's monitoring key metrics to ensure that our test is causing no harm. If we detect any custom or impact during a test, even if we can't attribute the cause, we automatically shut down all running tests.

Once a test has completed, ChAP will automatically judge if it passed. It will check if the cluster received the expected load spike, and that it demonstrated the expected success in failure buffers. We also confirm that all load was gracefully shed, and that when it chose to serve RPCs, that they were staying within the SLOs. These graphs show an experiment that passed perfectly. In the bottom left, we see that the clusters were warmed up to a baseline RPS, and then a 4x total spike, 3x increased spike, is intentionally routed to the canary cluster. The top right shows that the cluster briefly shed load, and then three more servers were added, and in about three minutes, we're successfully serving all calls.

Once we're confident that a new configuration will effectively protect the buffer, we need to move that out to production. We take the configuration used in the passing load test, and we create an automated pull request to the application's codebase. At this point, the configuration deploys through the normal validations that we use for any deployment. It goes to test. It goes through the canary process, and it gradually rolls out to production. The configuration is stored in a file in the application's artifact. This is important for a couple of reasons. First, the pull request makes the change visible to the application owners. Even if that pull request automatically merges, the existence of the change is obvious when reviewing a diff between two deployed versions. Second, the configuration is obviously attached to a specific build. That means it deploys through the same process like any other change. Importantly, if there's any problem, there's no special process-initiated rollback.

We track the configuration as it rolls out to production, and we can validate that all apps are running with the expected settings. I want to admit, I didn't get this right on my first attempt. I wanted application owners to treat this as a managed experience, which means that it should just work and not need to contribute to their cognitive load. I specifically wanted to avoid putting the resilience configuration into their normal application config. To achieve this, I packaged that configuration into a dependency and rolled it out to my applications through an automated dependency update process. The version of the dependency was always increasing even if no effective changes for the cluster. The load shedding configuration files were stored in a ZIP artifact, actually on the application image. I didn't consider the visibility of this to my application owners. Because the changes were hard to observe, it was difficult to determine if the configuration change was associated with a bad build.

Then the actual configuration was difficult for engineers to view. They needed to take the version from the artifact, switch away from their codebase into a different application. It was difficult to associate any given configuration with a specific build. Putting the load shedding configuration into the unique file on the application repository made it both easy to find and understand. Like I mentioned at the start of the talk, we don't just do this once. We do this over and over again. Our systems aren't static, and teams are continuously adding new features and adjusting existing implementations. We run a continuous campaign to validate that systems are running with their expected buffer. Before a high profile content release, we're going to validate that any system not demonstrating their expected buffer gets flagged for review and risk mitigation.

I highly recommend that you do not run automated load tests against your production backends without first warning your on-call engineers. It's likely that they have alerts configured to detect service degradations, and a sudden spike in load shedding is exactly what we want them to investigate during the normal course of business. Your teams also might allow some deviations for canary clusters, but those canary clusters are interacting with other production clusters, which are probably going to notice the impact. Instead of surprising teams with pages, we lean into SlackOps by automatically notifying application owners when tests are scheduled, start, and end. Though we'd prefer not to trigger alerts, and we work with teams to minimize the disruption from the automated tests, we still find it's best to warn our teams before they get paged, and give them a big button to abort the test. One more thing to consider.

What happens when servers begin to reject requests? We operate in the cloud, failures are common, and we design for resilience. Just like load shedding, IPC resilience is a common problem solved by platform engineers. I work on the team responsible for IPC resilience, and I tell you, we design our clients to retry calls. Consider the case when a client sends a call to an overloaded cluster. Our cluster protects itself, and uses load shedding to efficiently fail that call. Now the client's going to retry, and the cluster is still overloaded, so it still fails the call. You probably see where this is going. We created a lot of retry storms. Ideally, retries are performed with a backoff mechanism to ensure that you're only doing retries at one level. Otherwise, you're going to see exponential growth during failures like we see in the top graph. Even with single-layer retries, you can double the volume of traffic to a cluster.

When that cluster's failing due to overload, this is especially bad. It takes resources for us to reject that call, and those are already constrained. Worse, if that client retries, and the retry still fails, we've wasted resources and delayed the response. We could've served the same backoff response later, just faster. An easy solution is to just override the backoff behavior when you're load shedding. If a low-priority call is load shed, don't retry. We can either fail the entire call chain, or we can serve that degraded response from a fallback. We call this prioritized backoff. This approach really isn't ideal. If an individual node goes unhealthy, we could've routed the retry to a new node, and have that request served. If that call succeeded, then that would be better than serving a fallback.

How do we identify when to retry, and when to back off, even for low-priority traffic? Our innovation here is something we call a prioritized attempt budget. We look across all calls going from a particular client node to a cluster, and compute the ratio of retries to the initial attempts. If that ratio is very low, then we allow all calls to retry. It's likely a single node in that server cluster is impacted, and it's going to get replaced soon. As that ratio increases, we progressively throttle calls. So far, this technique isn't new. This is client-side throttling as described in the SRE book. We extended this technique by integrating request priorities. Because the determination call priority is coordinated, it's also available to our clients, and we can use that information to inform their behavior. If a client detects that a server is being overloaded, we progressively stop retrying low-priority calls.

This allows the high-priority calls to retry, while low-priority calls sacrifice their capacity. If the server cluster is aggressively shedding traffic, we don't retry any calls. They're unlikely to succeed. You'll notice that this mirrors the approach that we use for load shedding. This is effectively load shedding on the client side. Another advantage of this approach is that it doesn't require any live updates from a centralized control plane. It operates entirely from configuration that is available at application startup. Just like load shedding, prioritized attempt budgets are implemented in our service mesh proxy. It's 2025, and I'm still a platform engineer at Netflix. I got to enjoy all my favorite shows and discover some new ones along the way.

Recap

Let's recap what we learned over the past few years and what we shared today. First, we quantified the problem using objective measures to describe the amount of success buffer and failure buffer that each cluster has and how much we need. If your organization isn't ready for any of the implementations we discussed, you're at least ready to understand how much buffer you have. Next, we applied prioritized load shedding to allow a cluster to efficiently shed low-priority requests from its failure buffer in order to reserve success buffer for high-priority requests. Finally, we created tooling to automatically configure load shedding for each cluster and validate that we're running with our expected buffer. Of course, it took way more than two engineers to build all of this, and we're standing on the shoulders of giants. I've collected some related talks and blog articles, and you can find all of them at the link in this QR code, java.netflix.io/qconsf25.

Questions and Answers

Participant 1: Do you use the same cost for all the requests for the same service, or do you model different costs depending on the request header or payload?

Benjamin Fedorka: We designed our utilizations to be independent of the request complexity. You can see that with what we do with GraphQL, where we're instrumenting the underlying data fetchers instead of top-level requests. At this point, we are assuming that all requests are a single cost. We can handle priorities of different requests by tracking multiple points on latency distribution. We can say that 60% of our calls need to be under one target, and 90% of our calls need to be under a separate target. We do have some flexibility there. The really important thing is that your utilization needs to be monotonically increasing, so that way your load shedding function handles that well.

Participant 1: At any point in time, you consider them all the same?

Benjamin Fedorka: Yes. We've built our system to allow for costs to be different priorities without having to communicate the cost.

See more presentations with transcripts

Recorded at:

Jul 02, 2026

InfoQ Software Architects' Newsletter