AWS has announced the upcoming release of its chaos-engineering-as-a-service offering. The Fault Injection Service (FIS) will provide fully managed chaos experiments across a number of AWS services. The service includes pre-built templates that generate disruptions mimicking common real-world events, and it can be integrated into CI pipelines via its API.
The service, announced at re:Invent 2020, is scheduled to be fully available in 2021. At the time of announcement, FIS supports running experiments against EC2, Elastic Kubernetes Service (EKS), ECS, and RDS. Automatic rollbacks and the ability to stop an experiment when certain conditions are reached allow experiments to be run safely in production.
The service supports gradually and simultaneously impairing performance of resources, APIs, services, and geographic locations. Pre-built templates will be provided to generate the disruptions. Disruptions include service latency, database errors, increased CPU, and increased memory.
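To make the idea of a template concrete, the sketch below builds one as a plain Python dictionary: a target selection, a single action, and a stop condition tied to a CloudWatch alarm. The field names are assumptions modeled on the request shape that the boto3 `fis` client accepts for `create_experiment_template`, and every ARN, tag, and name is a placeholder; check them against the current API reference before relying on them.

```python
import json
import uuid


def build_experiment_template(role_arn, alarm_arn, tag_key, tag_value):
    """Return a dict describing one experiment: target, action, stop condition.

    Field names and identifiers are assumptions made for illustration.
    """
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Stop one tagged EC2 instance",
        "roleArn": role_arn,
        # Halt the experiment if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        # Select exactly one instance that carries the given tag.
        "targets": {
            "oneTaggedInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",
            }
        },
        # The action refers to the target above by its name.
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "targets": {"Instances": "oneTaggedInstance"},
            }
        },
    }


template = build_experiment_template(
    role_arn="arn:aws:iam::123456789012:role/example-fis-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-high-error-rate",
    tag_key="chaos-ready",
    tag_value="true",
)
print(json.dumps(template, indent=2))
```

Selecting targets by tag rather than by instance ID keeps the blast radius explicit: only resources that have been opted in can be affected.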
Some actions will not require the installation of any additional agents or software on the servers. However, instance-level faults such as increased CPU or memory will require the SSM agent to be installed. The injected actions are designed to create conditions that closely mimic their real-world equivalents. For example, the CPU utilization action actually consumes CPU resources, and the API throttling action throttles requests at the control plane level.
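The distinction between a fault that really consumes CPU and one that only fakes a metric can be illustrated in a few lines. The following is a minimal sketch, not how FIS implements its action: the hypothetical `burn_cpu` function spins until the current process has used a requested amount of CPU time.

```python
import time


def burn_cpu(cpu_seconds):
    """Spin in a tight loop until this process has used `cpu_seconds` of CPU time."""
    start = time.process_time()
    counter = 0
    while time.process_time() - start < cpu_seconds:
        counter += 1  # pointless work whose only purpose is to occupy a core
    return time.process_time() - start


consumed = burn_cpu(0.2)
print(f"consumed {consumed:.2f}s of CPU time")
```

Because the loop competes for a real core, schedulers, autoscaling rules, and alarms react to it the same way they would to a genuinely busy workload.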
Alongside the included pre-built templates, custom fault types can be created using AWS Systems Manager. FIS is fully integrated with IAM to control which users and resources have permission to access and execute experiments. IAM can also be used to limit which resources and services an experiment is allowed to affect.
Running experiments can be monitored via CloudWatch, or with third-party monitoring tools via EventBridge. Both the console and the API provide visibility into which actions have been executed. Once an experiment is complete, details on which actions were run, which stop conditions were met, and how metrics compared to steady state can be reviewed.
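Since experiment state is exposed through the API, a CI job could start an experiment and fail the build unless it completes cleanly. The sketch below assumes a client with `start_experiment` and `get_experiment` calls and a nested `state.status` field, modeled on the boto3 `fis` client; the `FakeFisClient` class and the IDs are hypothetical stand-ins so the example runs without AWS credentials.

```python
import time

TERMINAL_STATES = {"completed", "stopped", "failed"}


def run_experiment_gate(client, template_id, poll_seconds=10, sleep=time.sleep):
    """Start an experiment, wait for a terminal state, return True on success."""
    started = client.start_experiment(experimentTemplateId=template_id)
    experiment_id = started["experiment"]["id"]
    while True:
        state = client.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATES:
            # "stopped" means a stop condition or a user halted the run.
            return state["status"] == "completed"
        sleep(poll_seconds)


class FakeFisClient:
    """Hypothetical stand-in that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def start_experiment(self, experimentTemplateId):
        return {"experiment": {"id": "EXP-example"}}

    def get_experiment(self, id):
        return {"experiment": {"id": id, "state": {"status": next(self._statuses)}}}


def no_sleep(seconds):
    """Skip real waiting so the example finishes immediately."""


passed = run_experiment_gate(
    FakeFisClient(["running", "running", "completed"]), "EXT-example", sleep=no_sleep
)
halted = run_experiment_gate(
    FakeFisClient(["running", "stopped"]), "EXT-example", sleep=no_sleep
)
print(passed, halted)
```

In a pipeline, the fake would be replaced by a real client and the boolean turned into the job's exit code, so that an experiment halted by a stop condition blocks the deployment.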
There are a number of similar services available, either as paid offerings or as open-source projects. These include AWSSSMChaosRunner, an open-source library from AWS that simplifies failure injection for both EC2 and ECS using the AWS Systems Manager SendCommand API. Currently available failure injections include adding latency to inbound/outbound calls, dropping packets, memory impact, disk space consumption, and CPU usage.
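The SendCommand approach can be sketched as follows. This is not AWSSSMChaosRunner's own code: it is a minimal illustration that builds the keyword arguments for boto3's `ssm.send_command` using the stock `AWS-RunShellScript` document and Linux `tc`/`netem` to add outbound latency. The function name, the `eth0` interface, and the instance ID are assumptions.

```python
def build_latency_command(instance_ids, delay_ms, duration_seconds, interface="eth0"):
    """Build send_command arguments that add network latency, wait, then remove it."""
    commands = [
        # Delay every outbound packet on the interface.
        f"tc qdisc add dev {interface} root netem delay {delay_ms}ms",
        # Keep the fault in place for the length of the experiment.
        f"sleep {duration_seconds}",
        # Roll back by deleting the netem queueing discipline.
        f"tc qdisc del dev {interface} root netem",
    ]
    return {
        "InstanceIds": list(instance_ids),
        "DocumentName": "AWS-RunShellScript",
        "Comment": f"Inject {delay_ms}ms latency for {duration_seconds}s",
        "Parameters": {
            "commands": commands,
            # Allow the script slightly longer than the fault itself.
            "executionTimeout": [str(duration_seconds + 30)],
        },
    }


request = build_latency_command(["i-0123456789abcdef0"], delay_ms=200, duration_seconds=60)
print(request["Parameters"]["commands"])
```

With credentials in place, the request could be sent with `boto3.client("ssm").send_command(**request)`. Note that if the script is interrupted during the `sleep`, the rollback line never runs, so a production-grade tool needs a more robust cleanup path than this sketch has.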
Amazon used AWSSSMChaosRunner to test and validate a new feature for its Prime Video service. The new feature, Prime Video profiles, is backed by a service that is part of a distributed system. To validate its timeout, retry, and circuit-breaker configurations, the team used the DependencyLatency attack from AWSSSMChaosRunner.
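What such a validation checks can be shown with a small simulation. The sketch below is not Prime Video's code; all classes and parameters are hypothetical. A fake dependency has latency injected into it, and the caller is expected to time out, retry a bounded number of times, and then open a circuit breaker so that it stops calling the slow dependency altogether.

```python
class SlowDependency:
    """Hypothetical downstream service with an injectable latency fault."""

    def __init__(self, base_latency=0.05):
        self.base_latency = base_latency
        self.injected_latency = 0.0
        self.calls = 0

    def call(self, timeout):
        self.calls += 1
        # Simulated latency: no real waiting, just a comparison with the timeout.
        if self.base_latency + self.injected_latency > timeout:
            raise TimeoutError("dependency exceeded timeout")
        return "ok"


class CircuitBreaker:
    """Opens after a run of consecutive failures (no half-open state in this sketch)."""

    def __init__(self, failure_threshold):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def record(self, success):
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


def fetch(dependency, breaker, timeout=0.2, retries=2, fallback="fallback"):
    """Call the dependency with a timeout and bounded retries behind the breaker."""
    if breaker.is_open:
        return fallback  # fail fast without touching the dependency
    for _ in range(1 + retries):
        try:
            result = dependency.call(timeout)
        except TimeoutError:
            breaker.record(success=False)
            if breaker.is_open:
                break
        else:
            breaker.record(success=True)
            return result
    return fallback


dependency = SlowDependency()
breaker = CircuitBreaker(failure_threshold=5)
healthy = fetch(dependency, breaker)
dependency.injected_latency = 1.0  # the "latency attack"
degraded = [fetch(dependency, breaker) for _ in range(4)]
print(healthy, degraded, dependency.calls, breaker.is_open)
```

With these settings the breaker opens after five consecutive timeouts, so the dependency receives six calls in total (one healthy, five timed out) even though five requests were made with up to three attempts each. A misconfigured retry count or threshold would show up as a higher call count, which is exactly the kind of finding a latency experiment is meant to surface.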
AWS has also open-sourced a number of SSM Documents for performing chaos experiments. These include randomly stopping EC2 instances, detaching EBS volumes, and introducing CPU stress. Other alternatives for chaos engineering include the open-source Chaos Monkey and the paid offerings from Gremlin.
While the service isn't scheduled to be released until 2021, more information can be found on the features page and in the FAQ.