In this week’s podcast, Thomas Betts talks with Haley Tucker, a Senior Software Engineer on the Playback Features team at Netflix. While at QCon San Francisco 2016, Tucker told some production war stories about trying to deliver content to 65 million members.
Key Takeaways
- Distributed systems fail regularly, often due to unexpected reasons
- Data canaries can identify invalid metadata before it can enter and corrupt the production environment
- ChAP, the Chaos Automation Platform, can test failure conditions alongside the success conditions
- Fallbacks are an important component of system stability, but they must be fast and lightweight so they do not cause secondary failures (a sketch of this idea follows the list)
- Distributed systems are fundamentally social systems, and require a blameless culture to be successful
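The point about fallbacks being fast and light can be made concrete with a small sketch. The example below is purely illustrative and not Netflix code (the class names, the 200 ms timeout, and the `RemoteManifestService` interface are all hypothetical): the fallback returns a precomputed default rather than making another remote call, so a failing dependency cannot trigger a secondary failure.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: serve playback manifest data with a cheap, static fallback.
public class ManifestClient {

    // A precomputed default built at startup; returning it costs no I/O,
    // so the fallback itself cannot overload another service.
    private static final Manifest DEFAULT_MANIFEST = Manifest.defaults();

    private final RemoteManifestService remote; // assumed dependency interface

    public ManifestClient(RemoteManifestService remote) {
        this.remote = remote;
    }

    public Manifest fetch(String videoId) {
        try {
            // Bound the primary call tightly; a slow dependency is treated as a failure.
            return CompletableFuture.supplyAsync(() -> remote.manifestFor(videoId))
                    .get(200, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            // Fast, light fallback: no retries, no secondary remote calls.
            return DEFAULT_MANIFEST;
        }
    }

    // Minimal stand-ins so the sketch is self-contained.
    interface RemoteManifestService { Manifest manifestFor(String videoId); }
    record Manifest(boolean degraded) {
        static Manifest defaults() { return new Manifest(true); }
    }
}
```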
Show Notes
Fun with distributed systems
1m:24s - Outages at Netflix regularly bear out Leslie Lamport’s observation that “a distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.”
Weird data in the catalog, solved with data canaries
2m:04s - The Video Metadata Service aggregates several sources into a consistent API consumed by other Netflix services.
2m:43s - Several checks and validations were in place within the video metadata service, but it is impossible to predict every way consumers will use the data.
3m:29s - The access pattern used by the playback service differed from the one exercised by the checks, which led to unexpected results in production.
3m:58s - Now, the services consuming the data are also responsible for testing and verifying the data before it rolls out to production. The Video Metadata Service can orchestrate the testing process.
4m:22s - This process has been described as “data canaries”.
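As a rough sketch of the idea (the interfaces and names below are hypothetical, not the actual Netflix implementation): before a new metadata snapshot is promoted, the publisher runs a check registered by each consuming team, and each check reads the candidate data using that consumer's own access pattern.

```java
import java.util.List;

// Hypothetical sketch of a "data canary": consumers register checks that the
// metadata publisher runs against a candidate snapshot before promoting it.
public class DataCanaryRunner {

    // Each consuming service supplies a check that reads the candidate data
    // the same way the consumer would in production.
    public interface ConsumerCheck {
        String name();
        boolean passes(MetadataSnapshot candidate);
    }

    public interface MetadataSnapshot {
        List<String> videoIds();
        String fieldFor(String videoId, String field);
    }

    private final List<ConsumerCheck> checks;

    public DataCanaryRunner(List<ConsumerCheck> checks) {
        this.checks = checks;
    }

    /** Returns true only if every consumer's canary check passes. */
    public boolean safeToPromote(MetadataSnapshot candidate) {
        for (ConsumerCheck check : checks) {
            if (!check.passes(candidate)) {
                System.err.println("Data canary failed: " + check.name());
                return false;   // bad data never reaches production
            }
        }
        return true;
    }
}
```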
A vanishing critical service, prevented by implementing a “kill switch”
5m:07s - A second, related incident involved a thin, lightweight logging service that ran out of memory and crashed because the entire video metadata blob was being loaded into memory.
6m:22s - Isolating the problem was tricky due to different configurations in test and prod.
6m:44s - After setting the test config to match production, the root cause was identified deep within the dependency graph, in a .jar that isn’t actually needed.
6m:59s - Pruning the dependency graph and removing the .jar completely is still a work in progress.
7m:07s - The team has worked with the video metadata team to implement a kill switch, which allows the logging service to stop consuming the metadata service.
7m:56s - There were 14 pages of .jars in the application, which is a remnant of the service being built as part of a legacy monolith.
8m:25s - The kill switch is not a circuit breaker, but a config value to prevent loading the data.
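That distinction can be made concrete with a minimal sketch, assuming a dynamically updatable configuration flag. The property name, interfaces, and class names here are hypothetical; the podcast does not describe the actual implementation or name a config library.

```java
import java.util.Optional;

// Hypothetical kill switch: a dynamic config flag that stops the logging
// service from loading the video metadata blob at all. Unlike a circuit
// breaker, it does not trip automatically; an operator flips the flag.
public class MetadataKillSwitch {

    // Stand-in for a dynamic configuration source (a property that can be
    // changed at runtime without a redeploy).
    public interface DynamicConfig {
        boolean getBoolean(String key, boolean defaultValue);
    }

    private final DynamicConfig config;
    private final MetadataClient metadataClient;

    public MetadataKillSwitch(DynamicConfig config, MetadataClient metadataClient) {
        this.config = config;
        this.metadataClient = metadataClient;
    }

    public Optional<VideoMetadata> loadMetadata(String videoId) {
        // When the switch is on, skip the expensive load entirely.
        if (config.getBoolean("logging.metadata.killswitch", false)) {
            return Optional.empty();
        }
        return Optional.ofNullable(metadataClient.fetch(videoId));
    }

    // Minimal stand-ins so the sketch compiles on its own.
    public interface MetadataClient { VideoMetadata fetch(String videoId); }
    public record VideoMetadata(String videoId) {}
}
```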
9m:16s - When this service went down, it created a cascading failure to the proxy tier. That’s where Failure Injection Testing (FIT) comes into play.
10m:16s - FIT tests are manually run, based on a specifically defined scenario, at small or large scale.
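The mechanics are roughly along these lines (a simplified, hypothetical sketch, not the actual FIT implementation): a scenario names the dependencies to fail and the fraction of traffic to affect, and an instrumented client consults that scenario and simulates the failure instead of calling the real dependency.

```java
import java.util.Set;

// Simplified, hypothetical sketch of failure injection: a scenario names the
// dependencies to fail and the fraction of traffic to affect, and an
// instrumented client consults it before making a real call.
public class FailureInjection {

    public record Scenario(Set<String> dependenciesToFail, double percentOfTraffic) {}

    public static class InjectingClient {
        private final Scenario scenario;

        public InjectingClient(Scenario scenario) {
            this.scenario = scenario;
        }

        public String call(String dependency, String request) {
            // Only requests selected into the experiment see the injected failure.
            boolean inExperiment = Math.random() * 100 < scenario.percentOfTraffic();
            if (inExperiment && scenario.dependenciesToFail().contains(dependency)) {
                throw new RuntimeException("Injected failure for " + dependency);
            }
            return realCall(dependency, request);
        }

        private String realCall(String dependency, String request) {
            return "response from " + dependency; // placeholder for the real RPC
        }
    }

    public static void main(String[] args) {
        // Run a small-scale scenario: fail the metadata dependency for 1% of traffic.
        Scenario scenario = new Scenario(Set.of("video-metadata"), 1.0);
        InjectingClient client = new InjectingClient(scenario);
        try {
            System.out.println(client.call("video-metadata", "getTitle"));
        } catch (RuntimeException e) {
            System.out.println("Fallback exercised: " + e.getMessage());
        }
    }
}
```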
Throttling issues, solved by sharding the service
11m:00s - For playback, anything involving the play button and getting playback started is considered critical. Other services, such as customer experience improvements, are deemed noncritical.
12m:04s - Very regularly spaced latency spikes occurred every 40 minutes. Requests would fail, all retries would fail, and this led to upstream throttling.
13m:15s - The indiscriminate throttling affected both critical and noncritical services. Therefore, the application was sharded into two stacks, allowing different scaling and performance characteristics for critical and noncritical services.
14m:23s - The stack is now two smaller monoliths, which was a relatively simple solution to buy some time for redesigning and re-architecting. The future state will involve individual components being split out into microservices, as appropriate.
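A rough illustration of the split (the routing rule, paths, and class names are hypothetical, not Netflix’s actual configuration): requests on the playback-critical paths go to one stack, everything else to the other, so each stack can be scaled and tuned independently.

```java
import java.util.Set;

// Hypothetical sketch of routing between the critical and noncritical shards.
// Anything involved in pressing play goes to the critical stack; customer
// experience features go to the noncritical stack.
public class ShardRouter {

    private static final Set<String> CRITICAL_PREFIXES =
            Set.of("/playback/start", "/playback/license", "/playback/manifest");

    private final String criticalStackUrl;
    private final String noncriticalStackUrl;

    public ShardRouter(String criticalStackUrl, String noncriticalStackUrl) {
        this.criticalStackUrl = criticalStackUrl;
        this.noncriticalStackUrl = noncriticalStackUrl;
    }

    public String targetFor(String requestPath) {
        for (String prefix : CRITICAL_PREFIXES) {
            if (requestPath.startsWith(prefix)) {
                return criticalStackUrl;
            }
        }
        return noncriticalStackUrl;
    }
}
```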
Embracing failure
16m:05s - The Netflix culture embraces and accepts failure, with new team members being officially welcomed to the company only after they first break something in production.
16m:28s - This goes hand-in-hand with working to correct the root cause and not fail in the same way twice.
16m:38s - Major outages are met with additional resources and a strong focus on resolving the problem.
17m:40s - Tools such as FIT and the data canary architecture often come about as part of a Hackday project.
18m:15s - The Chaos Automation Platform (ChAP), is the next generation of FIT, and allows automated testing of failure scenarios with every code push.
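A minimal sketch of the kind of automated experiment this implies (illustrative only; the metric source, thresholds, and class names are assumptions, not the ChAP implementation): inject a failure for a small slice of traffic, compare a key health metric against an untouched control slice, and stop early if the experiment slice degrades.

```java
// Hypothetical sketch of an automated failure experiment: compare a health
// metric between a control cluster and an experiment cluster that has a
// failure injected, and abort if the experiment degrades.
public class ChaosExperiment {

    public interface MetricSource {
        double successRate(String cluster); // e.g. successful playback starts per second
    }

    private final MetricSource metrics;
    private final double allowedDrop; // e.g. 0.05 = abort if 5% worse than control

    public ChaosExperiment(MetricSource metrics, double allowedDrop) {
        this.metrics = metrics;
        this.allowedDrop = allowedDrop;
    }

    /** Returns true if the failure scenario was tolerated, false if aborted. */
    public boolean run(String controlCluster, String experimentCluster, int checks)
            throws InterruptedException {
        for (int i = 0; i < checks; i++) {
            double control = metrics.successRate(controlCluster);
            double experiment = metrics.successRate(experimentCluster);
            if (experiment < control * (1.0 - allowedDrop)) {
                return false; // the injected failure is hurting real traffic: stop
            }
            Thread.sleep(1_000); // re-check periodically while the experiment runs
        }
        return true;
    }
}
```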
Monitoring and troubleshooting
21m:06s - Tracking how requests and responses for different devices flow through the Netflix tech stack relies heavily on Elasticsearch. Previously, usage data was stored in Atlas, but the data was very coarse-grained.
21m:51s - Elasticsearch has made it much easier to drill down and find the common ground between failures.
22m:42s - Teams at Netflix are working on traceability, which aids in troubleshooting to identify bottlenecks in the system.
23m:13s - A recent fun project for Tucker’s team is a batch loader pipeline that listens to the video metadata, computes what the team needs, and caches it in Cassandra.
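A minimal sketch of the write side of such a pipeline, assuming the DataStax Java driver; the keyspace, table, listener interface, and derivation step are hypothetical stand-ins, since the podcast does not describe the actual schema or feed.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

// Hypothetical sketch: for each metadata update, compute the derived fields the
// playback service needs and cache them in Cassandra.
public class PlaybackMetadataCacheLoader {

    public interface MetadataUpdateListener {
        void onUpdate(String videoId, String rawMetadataJson);
    }

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            PreparedStatement insert = session.prepare(
                    "INSERT INTO playback.cached_metadata (video_id, derived_json) VALUES (?, ?)");

            MetadataUpdateListener listener = (videoId, rawMetadataJson) -> {
                // Compute only what the playback service actually needs ...
                String derived = derivePlaybackFields(rawMetadataJson);
                // ... and cache it so request-time lookups never touch the raw blob.
                session.execute(insert.bind(videoId, derived));
            };

            subscribe(listener); // placeholder: hook into the metadata update feed
        }
    }

    private static String derivePlaybackFields(String rawMetadataJson) {
        return rawMetadataJson; // placeholder for the real computation
    }

    private static void subscribe(MetadataUpdateListener listener) {
        // placeholder: the real pipeline would consume the metadata feed here
    }
}
```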
Languages and Platforms Mentioned
- Amazon S3
- Failure Injection Testing (FIT). For more on this, listen to former Netflix Chaos Engineer Kolton Andrus in last week’s podcast.
- Chaos Automation Platform (ChAP)
- Elasticsearch
- Atlas
- Cassandra