Doug Barth, from PagerDuty, talked at DevOps Days London about PagerDuty's approach to testing their systems for resilience without dedicating a lot of automation effort upfront. The goal was to quickly start learning about failure points, and to openly discuss how to fix them, within a one-hour-per-week time box.
Automating failure testing with the same coverage as Netflix’s famous Simian Army was not feasible given PagerDuty’s multi-cloud environment, and investing in in-house automation tools would have delayed initial results. They therefore opted for a manual failure testing approach nicknamed “Failure Friday”. It consists of spending one hour each Friday trying out a list of “attacks” (provoked failures) and checking how the “victim” (the system being tested) reacts.
Between attacks the system is put back into a normal working state. Attacks stop if things break badly (for example, requests sent to the victim not being picked up by other service instances after the failure). In that case the session is halted and the system is recovered manually; a permanent fix gets tested the following Friday. Otherwise the attacks continue until the hour-long session is over.
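A minimal sketch of such a time-boxed session loop is shown below. Python is used purely for illustration; the attack and victim objects, their method names, and the five-minute attack duration are assumptions for the example, not PagerDuty's actual tooling (the sessions were run by hand).

```python
import time

ATTACK_DURATION = 5 * 60      # assumed duration of each provoked failure
SESSION_DURATION = 60 * 60    # the weekly one-hour time box

def run_session(attacks, victim):
    """Hypothetical Failure Friday session loop: attack, restore, halt on severe breakage."""
    session_start = time.time()
    for attack in attacks:
        if time.time() - session_start > SESSION_DURATION:
            break                           # the hour-long session is over
        attack.start(victim)                # provoke the failure
        time.sleep(ATTACK_DURATION)
        attack.stop(victim)                 # put the system back to a normal working state
        if not victim.recovered():          # e.g. requests no longer picked up by other instances
            print("Severe breakage: halting session, recovering manually")
            break                           # a permanent fix gets tested the following Friday
```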
Attack strategies ranged from quick failure simulations, such as stopping one Cassandra database instance or rebooting a server instance, to more complicated simulations of network isolation (deliberately misconfiguring iptables to drop packets coming in on specific ports) or slow nodes (using netem network emulation).
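As an illustration, the sketch below shows how such attacks could be provoked and then reverted. The service name, port number, network interface and delay value are assumptions chosen for the example; PagerDuty ran these steps manually rather than through a script like this.

```python
import subprocess

def run(cmd):
    """Run an attack command (shown locally for brevity; requires root privileges)."""
    print("attack>", cmd)
    subprocess.run(cmd.split(), check=True)

# Quick failure: stop a Cassandra instance (service name assumed).
run("sudo service cassandra stop")

# Network isolation: drop inbound packets on a specific port
# (9042 is used here as an illustrative Cassandra client port).
run("sudo iptables -A INPUT -p tcp --dport 9042 -j DROP")

# Slow node: add latency with netem (interface and delay are illustrative).
run("sudo tc qdisc add dev eth0 root netem delay 500ms")

# Restore the normal working state between attacks.
run("sudo tc qdisc del dev eth0 root netem")
run("sudo iptables -D INPUT -p tcp --dport 9042 -j DROP")
run("sudo service cassandra start")
```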
Fixing issues in the system and raising overall awareness of the need to handle and test for failure were among the expected benefits. But Doug also highlighted unexpected benefits, such as how much easier it is to ramp up new on-call people (dev or ops) once they have been exposed to and understand the provoked failures, as opposed to relying on theoretical knowledge that will likely be outdated or inaccurate by the time a real, non-provoked failure happens. Another unplanned benefit was the uncovering of hard-to-simulate component failures, which led to changes in the system architecture that increased its overall testability.
In terms of practical organization, Doug mentioned the importance of keeping logs and recording action times, tracking discoveries and issues, as well as sharing dashboards and metrics. He also recommends not turning off alarms during the session, in order to check that monitoring is working as expected, but announcing the attack sessions to everyone so as to avoid alarm escalation due to the provoked failures.