At QCon New York 2015, Kolton Andrus discussed Netflix’s Failure Injection Testing (FIT) platform, which allows arbitrary failure scenarios to be injected and monitored for a targeted group of customers using the Netflix production web services. FIT allows Netflix to maintain an ‘antifragile’ programming culture, which results in the creation of systems that are resilient to failure.
Andrus, a senior software engineer at Netflix, began the talk by stating that software failure testing should be conducted within a live production environment for three primary reasons: it builds immunity to failure, prevents larger outages, and allows correct behaviour to be verified within a realistic production deployment. Andrus suggested that failure testing at scale within a production environment is much like hormesis:
“Failure testing is a form of Hormesis - we imbibe the poison to become immune.”
Andrus introduced Netflix’s Failure Injection Testing (FIT) ‘failure as a service’ platform, which has previously been written about on the Netflix Tech Blog. The traditional approach to failure testing at Netflix has leveraged the Simian Army, but the use of these applications can lead to unwanted problems propagating through to customers under exceptional circumstances. In particular, the effects of the ‘latency monkey’ have occasionally caused unintended cascading failures, and as such, Netflix developers have become cautious in its deployment.
The FIT application provides a web-based user interface (UI) that allows Netflix developers to define a specific failure scope, for example, a single customer or a cohort of customers. This limits the potential ‘blast radius’ of the failure testing. A ‘Halt all Failures’ button is also provided within the UI, which allows any developer to immediately stop all FIT failure testing if customers outside the intended scope are being affected.
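The scoping and kill-switch behaviour described above can be illustrated with a minimal sketch. This is not FIT's actual implementation, which is not public; the names `FailureScenario`, `FailureRegistry`, `active_for` and `halt_all`, and the customer IDs, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FailureScenario:
    """A hypothetical failure definition limited to an explicit set of customers."""
    name: str
    customer_ids: frozenset  # the failure scope, i.e. the 'blast radius'


@dataclass
class FailureRegistry:
    """Holds active scenarios plus a global kill switch (all names are assumptions)."""
    scenarios: list = field(default_factory=list)
    halted: bool = False

    def register(self, scenario):
        self.scenarios.append(scenario)

    def halt_all(self):
        # Analogous in spirit to a 'Halt all Failures' button:
        # every scenario stops applying at once.
        self.halted = True
        self.scenarios.clear()

    def active_for(self, customer_id):
        """Return the scenarios that apply to this customer (none once halted)."""
        if self.halted:
            return []
        return [s for s in self.scenarios if customer_id in s.customer_ids]


registry = FailureRegistry()
registry.register(FailureScenario("recommendations-down", frozenset({"customer-42"})))

in_scope = registry.active_for("customer-42")     # the targeted customer
out_of_scope = registry.active_for("customer-7")  # everyone else is unaffected
registry.halt_all()
after_halt = registry.active_for("customer-42")   # nothing applies after a halt
```

Keeping the scope as an explicit allow-list means that a customer who is not named in a scenario can never be selected for failure, which is one simple way to bound the blast radius.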
Netflix utilise a custom API gateway/proxy named Zuul to perform routing (and other actions) for all inbound traffic to the Netflix web services. The FIT platform supplies failure metadata to Zuul, which allows the incoming requests from the targeted failure scope (customer/cohort) to be identified and marked as candidates for failure injection. Injected failures can include adding latency to a request, returning an arbitrary HTTP status code, or throwing an error. An example of potential failure injection points can be seen in the diagram below:
The FIT UI allows failures to be monitored, and customers and devices to be traced. Andrus provided a live demonstration of FIT on the production Netflix website, injecting a failure scoped to his own Netflix customer account that caused only non-customised film recommendations to be shown on his account home page. After the demonstration was complete, Andrus disabled the failure injection and reloaded his account home page to show that the standard personalised film recommendations were once again visible.
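The flow described above (an edge gateway marks in-scope requests, a downstream injection point applies the failure, and the page falls back to non-personalised content) can be sketched roughly as follows. This is an illustrative model only and does not use Zuul's or FIT's real APIs; `gateway_filter`, `injection_point`, `home_page_rows`, the `Failure` fields and the row titles are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


class InjectedFailure(Exception):
    """Raised when an injection point simulates a thrown error."""


@dataclass
class Failure:
    injection_point: str               # hypothetical name of the call to disrupt
    delay_seconds: float = 0.0         # latency added before the call
    status_code: Optional[int] = None  # arbitrary HTTP status to return instead
    throw: bool = False                # raise an error instead of calling


@dataclass
class Request:
    customer_id: str
    failures: list = field(default_factory=list)  # metadata attached at the edge


def gateway_filter(request, scoped_failures):
    """Mark requests from targeted customers as candidates for failure injection."""
    request.failures = list(scoped_failures.get(request.customer_id, []))
    return request


def injection_point(request, name, call):
    """Run `call`, first applying any failure registered for this injection point."""
    for failure in request.failures:
        if failure.injection_point != name:
            continue
        time.sleep(failure.delay_seconds)
        if failure.throw:
            raise InjectedFailure(name)
        if failure.status_code is not None:
            return failure.status_code, None
    return 200, call()


def home_page_rows(request):
    """Fall back to non-personalised rows when the recommendations call fails."""
    try:
        status, rows = injection_point(
            request, "recommendations", lambda: ["Picked for you"])
    except InjectedFailure:
        status, rows = 500, None
    return rows if status == 200 else ["Popular titles"]


# Only customer-42 is in scope; all other customers see normal behaviour.
scoped = {"customer-42": [Failure("recommendations", status_code=503)]}
targeted = home_page_rows(gateway_filter(Request("customer-42"), scoped))
untouched = home_page_rows(gateway_filter(Request("customer-7"), scoped))
```

In this sketch the targeted customer receives the generic fallback rows while every other customer still receives personalised rows, mirroring the behaviour shown in the demonstration.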
Andrus also referenced Nassim Nicholas Taleb’s notion of antifragility (the opposite of fragility), and suggested that tooling such as FIT could allow the creation of an antifragile software development process:
“Aggressive failure testing creates not just robust programs, but an antifragile programming culture”
Andrus concluded the talk by stating that, based on his experience at Netflix, failure testing is a worthwhile investment, testing in production is sustainable, and the technique can harden systems against failure.
More information about Kolton Andrus’s “Breaking Bad at Netflix: Building Failure as a Service” talk can be found at the QCon New York 2015 website. The Netflix FIT application is not yet available as open source, but additional information can be found within a recent Netflix Tech Blog post.