Adrian Cockcroft, VP of cloud architecture strategy at AWS, recently shared his thoughts on how to build resilient systems that keep operating successfully when failures occur. He covers five tools: concentrating on rapid detection and response, System Theoretic Process Analysis (STPA), lineage driven fault detection, the no Single Point of Failure (SPOF) principle, and risk prioritization. At the recent QCon San Francisco event, he also shared what he considers to be good cloud resilience patterns for building with a continuous resilience mindset.
Cockcroft indicates that there are many possible failure modes within a given system, with each testing a different aspect of resilience. He singles out two as particularly noteworthy: incidents where the system fails silently, and components that fail infrequently. He states:
The system needs to maintain a safety margin that is capable of absorbing failure via defense in depth, and failure modes need to be prioritized to take care of the most likely and highest impact risks.
He adds that a "learning organization, disaster recovery testing, game days, and chaos engineering tools are all important components of a continuously resilient system." Ryan Kitchens, senior site reliability engineer at Netflix, shared that the Netflix team leverages these techniques to ensure the team is prepared to prevent and respond to incidents.
The first technique that Cockcroft recommends is ensuring rapid detection and response. When a failure occurs, quick detection is vital to a prompt resolution. He indicates that engineers should understand the delay built into their observability systems and ensure that some critical metrics are collected at one-second intervals. He stresses the importance of measuring the mean time to repair (MTTR), and of finding a way to track prevented incidents as well. As Kitchens notes, "we need to start realizing that the other side of why things went wrong is understanding how things go right."
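As a rough illustration of the measurement Cockcroft describes, the sketch below computes mean time to detect and mean time to repair from a list of incident records. The Incident record and its field names are hypothetical, and measuring repair time from the start of the failure (rather than from detection) is an assumption, since teams define MTTR in different ways.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Sequence


class Incident(NamedTuple):
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime    # when the failure began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mean_delta(deltas: Sequence[timedelta]) -> timedelta:
    """Average a non-empty sequence of durations."""
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_detect(incidents: Sequence[Incident]) -> timedelta:
    """Average delay between a failure starting and being noticed."""
    return mean_delta([i.detected - i.started for i in incidents])


def mean_time_to_repair(incidents: Sequence[Incident]) -> timedelta:
    """Average delay between a failure starting and being resolved.

    Assumption: repair time is measured from the start of the failure.
    """
    return mean_delta([i.resolved - i.started for i in incidents])
```

Tracking detection time separately from repair time makes the delay contributed by the observability system visible on its own.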
His second technique involves working with the system constraints that need to be satisfied to maintain safe operation. Cockcroft recommends using the System Theoretic Process Analysis (STPA) approach as described in Engineering a Safer World by Nancy G. Leveson. STPA is based on a functional control diagram of the system with safety constraints and requirements identified for each component in the design. Commonly, the control pattern is divided into three layers: a business function, a control system that manages the business function, and the human operators that monitor the control system. This approach emphasizes the connections between the components and how they can be affected by failures.
Lineage driven fault detection is the third technique that Cockcroft suggests. This approach starts with the most important business driven functions of the system and follows the value chain that is invoked when that function is working as intended. Cockcroft explains that this is "effectively a top-down approach to failure mode analysis, and it avoids the trap of getting bogged down in all the possible things that could go wrong using a bottom-up approach."
The fourth technique is to ensure that the system does not contain any single points of failure. Cockcroft recommends that high resiliency systems should have three ways to succeed, and leverage quorum-based algorithms. He notes that "when there are three ways to succeed, we still have two ways to succeed when a failure is present, and if data is corrupted, we can tell which of the three is the odd one out."
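The "three ways to succeed" idea can be sketched as a simple majority vote over three replica replies. This is a minimal illustration and not a description of any particular AWS or Netflix system: the quorum_read function, the QuorumError exception, and the convention that None marks a replica that failed to answer are all assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


class QuorumError(Exception):
    """Raised when too few replicas agree on a value."""


def quorum_read(
    replies: Sequence[Optional[Hashable]], quorum: int = 2
) -> Tuple[Hashable, List[int]]:
    """Return the majority value and the indexes of replicas that disagree.

    A reply of None means that replica failed to respond (an assumption
    of this sketch, not a convention from the article).
    """
    present = [r for r in replies if r is not None]
    if not present:
        raise QuorumError("no replica responded")

    value, votes = Counter(present).most_common(1)[0]
    if votes < quorum:
        raise QuorumError(f"only {votes} replica(s) agree, need {quorum}")

    # Replicas that answered but disagree with the majority are the
    # "odd one out" and are candidates for repair.
    outliers = [
        i for i, r in enumerate(replies) if r is not None and r != value
    ]
    return value, outliers
```

With three replicas and a quorum of two, the read still succeeds when one replica is down, and a single corrupted copy is outvoted and identified by its index.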
The final technique that Cockcroft describes is risk prioritization. For this he recommends leveraging the ISO standard engineering technique of Failure Modes and Effects Analysis (FMEA). This technique involves listing the potential failure modes and ranking the probability, severity, and observability of each on a scale of 1 to 10, where 1 is the best score. Multiplying those three estimates together gives each failure mode a risk priority number (RPN) between 1 and 1000, by which the failure modes can be ranked. As an example, an extremely frequent, permanently impactful incident with low to zero observability would have an RPN of 1000.
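The RPN arithmetic is easy to sketch in code. The example below is a minimal illustration of the scoring described above; the FailureMode class, its field names, and the rank_by_risk helper are hypothetical, and any failure modes and scores passed to it would be the analyst's own estimates.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical FMEA worksheet; each score is 1 (good) to 10 (bad)."""
    name: str
    probability: int    # 1 = rare, 10 = extremely frequent
    severity: int       # 1 = negligible, 10 = permanent impact
    observability: int  # 1 = easily detected, 10 = effectively invisible

    def __post_init__(self) -> None:
        # Reject scores outside the 1 to 10 scale.
        for field_name in ("probability", "severity", "observability"):
            score = getattr(self, field_name)
            if not 1 <= score <= 10:
                raise ValueError(f"{field_name} must be 1..10, got {score}")

    @property
    def rpn(self) -> int:
        """Risk priority number: the product of the three scores (1..1000)."""
        return self.probability * self.severity * self.observability


def rank_by_risk(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Order failure modes from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)
```

Because the three scores are multiplied, a failure mode that is rare but both severe and hard to observe can outrank one that is frequent but minor and easily noticed.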
In Cockcroft's opinion, STPA allows for a top-down focus on control hazards while FMEA facilitates a bottom-up focus on prioritizing failure modes. He feels that STPA tends to produce better failure coverage than FMEA, especially in cases of human controller and user experience issues.
In his recent talk at QCon San Francisco, Cockcroft shared what he considers to be good resilience practices for building cloud-based applications. These include the previously mentioned rule of threes. He also noted that when performing a disaster recovery failover between regions, the goal should be to fail from a region with smaller capacity to one with larger capacity. This helps ensure that the failover is not further impacted by insufficient capacity. Along the same lines, failing from a distant region to a closer, and hopefully lower latency, region can help ensure the failover is more successful.
He continued by stressing the importance of employing a "chaos first" mentality and noted that while he was at Netflix, Chaos Monkey would be the first app introduced into a new region. This forced applications, and the teams that own them, to be prepared to handle failure from day one. Finally, he shared his concept of continuous resilience, which he admits is essentially a rebranding of chaos engineering, intended to reduce hesitation about applying the concept to production workloads. Just as continuous delivery needs test-driven development and canaries, Cockcroft posits that continuous resilience needs automation in both test and production. He stressed the importance of making failure mitigation into a well-tested code path and process.
Cockcroft notes that while it isn't possible to build a perfect system, these techniques can "focus attention on the biggest risks and minimize impact on successful operations." As Charity Majors, CTO of Honeycomb, notes:
It will always be the engineer's responsibility to understand the operational ramifications and failure models of what we're building, auto-remediate the ones we can, fail gracefully where we can't, and shift as much operational load to the providers whose core competency it is as humanly possible.