Engineers at the streaming provider Netflix recently described how the company uses prioritized load shedding to maximize streaming reliability, and as a result, maximize the user experience.
Netflix uses its homegrown API gateway, Zuul, to classify incoming requests into priorities. When the system comes under load or is otherwise unstable, Zuul throttles traffic, starting with the lowest priority. It then progressively adjusts to shed load according to the priorities calculated until the system is healthy again.
Zuul's progressive load shedding prevented a major outage only days after Netflix deployed the implementation in 2020. Members' ability to play videos remained consistent until full functionality was restored.
Netflix engineers describe the problem statement as the following:
Getting stuck in traffic is one of the most frustrating experiences for drivers around the world. Everyone slows to a crawl, sometimes for a minor issue or sometimes for no reason at all. As engineers at Netflix, we are constantly reevaluating how to redesign traffic management. What if we knew the urgency of each traveler and could selectively route cars through, rather than making everyone wait?
Netflix's goal is to minimize interruption to users' streaming experience in the face of partial system outages or issues that would otherwise cause all functionality to halt. These outages can occur due to various reasons such as retry storms, under-scaled services, bad deployments, network blips, etc. To achieve this goal, Netflix's engineers put Zuul, Netflix's API gateway, at the heart of their implementation.
Source: https://netflixtechblog.com/keeping-netflix-reliable-using-prioritized-load-shedding-6cc827b02f94
First, the engineers classified all incoming API requests into three categories - non-critical, degraded experience, and critical, based on the requests' importance. Loss of critical requests directly results in playback issues for users, while the loss of non-critical requests does not affect the user experience. When requests arrive at Zuul, they are tagged with a calculated priority score. Typically, all requests are handled as usual, regardless of the priority score. However, once metrics such as CPU usage, failure rate, and latency begin to rise, Zuul drops requests with a priority lower than a calculated threshold.
Zuul either drops requests for a specific back-end service or drops requests globally if Zuul itself is under load. The decision is based on health metrics, which Zuul monitors for its back-end services. Based on these metrics, Zuul calculates a dynamic priority threshold and discards all requests with a lower priority than the calculated threshold.
Source: https://netflixtechblog.com/keeping-netflix-reliable-using-prioritized-load-shedding-6cc827b02f94
To validate that each API request's priority calculation is correct, Netflix's engineers used an internally developed failure injection tool (FIT) to test the effect of shedding each type of request in the face of load on the users' core streaming experience. Chaos engineering, which is the discipline of experimenting on a software system in production, continually tests this priority calculation. Small groups of production users are chosen for staged A/B testing, where some of the users' traffic is throttled by Netflix's chaos automation platform (ChAP). These users' experience is measured, and if it degrades, the team is aware that they should modify the request priority calculation.