As InfoQ previously reported, Netflix announced an upgrade of Chaos Monkey, a general-purpose tool that aims to improve the resiliency of Software as a Service by randomly turning off servers and containers during business hours. With this upgrade, Chaos Monkey integrates with Spinnaker, Netflix’s Continuous Delivery platform, which in turn enables integration with a variety of cloud platforms, including Netflix’s own Titus platform for Docker containers.
Chaos Monkey receives its configuration via Spinnaker and uses it to schedule the termination of resources. The new release offers a better UX for scheduling these terminations and lets users group resources by app, stack or cluster. It’s also possible to opt out of Chaos Monkey for a variety of reasons, including during known outages. Finally, resource terminations can be recorded through tracking tools for metering and reporting purposes.
InfoQ caught up with Lorin Hochstein, an engineer at Netflix who worked on the new release, regarding the announcement.
InfoQ: It’s been more than five years since Chaos Monkey was originally announced. Can you talk about the history and motivation for the latest release?
Lorin Hochstein: Before we embarked on the upgrade, the Simian Army project, which houses the previous version of Chaos Monkey, had gotten to a state where it had become difficult for us to make changes to it. The internal version of the Simian Army imported the open source version and included some Netflix-specific behavior in a way that made it difficult for us to reason about the impact of changes we wanted to make. One particular pain point was managing Chaos Monkey application configurations, where the internal Netflix version managed application-specific configuration data quite differently from the open source version.
To complicate things further, responsibility for the Simian Army inside of Netflix was split across teams: the Chaos team owned the Chaos Monkey, and the EngTools team owned Janitor Monkey and Conformity Monkey. We even ran two separate deployments of Simian Army internally: one managed by the Chaos team that had only Chaos Monkey enabled, and another managed by EngTools that had Chaos Monkey disabled.
Because of these issues, the Chaos team felt it was time to do an upgrade so that we could make changes to the codebase more quickly and with confidence that it wouldn’t do unexpected harm to our production environment.
InfoQ: As described in your blog, it’s a significant upgrade. Integration to Spinnaker is one of the features. Can you elaborate on this integration and other upgrades?
Hochstein: Spinnaker integration makes it easier for Chaos Monkey to support different cloud backends. For example, once we had initial Spinnaker support implemented for AWS, it required very little effort to add support for our internal container cloud, Titus, since Titus is a supported Spinnaker backend. In addition to AWS, Spinnaker supports public cloud platforms like Google Compute Engine and Microsoft Azure, and private cloud platforms like Kubernetes and OpenStack. As long as Spinnaker supports the backend, Chaos Monkey should just work out of the box.
Even if you're just using AWS, if you are running in multiple regions or have multiple accounts, the Spinnaker integration makes Chaos Monkey simpler to deploy. The previous version of Chaos Monkey required a separate deployment in each region and in each account, whereas now we can have a single Chaos Monkey deployment that works across accounts and regions.
Internally at Netflix, where we used to have multiple tools for managing different aspects of deployment, we're moving to exposing more and more of this functionality through the Spinnaker UI. Given this trend, it just made sense to allow services teams to manage their Chaos Monkey configurations using the same system they use for creating deployment pipelines or visualizing the state of the current deployment.
Another significant change is allowing users to control what Chaos Monkey considers a logical group of instances. The previous release of Chaos Monkey used AWS auto scaling groups (ASGs) as a logical grouping of instances: when activated, it would randomly kill one instance per ASG. Inside of Netflix, we have a notion of apps, stacks, clusters, and ASGs, where the common case is that a cluster is associated with exactly one ASG. However, when we spoke internally to different service teams inside of Netflix, one thing that emerged was that a team's notion of a logical cluster did not always map on to what Spinnaker considers a cluster. One of the configuration options we added to Chaos Monkey was to allow engineers to choose what they considered a logical cluster: it could map to a Spinnaker cluster, or a Spinnaker stack, or a Spinnaker app. I'm using the modifier "Spinnaker" here, but the notion of app/stack/cluster predates Spinnaker, and was used in Netflix's previous deployment tool, Asgard.
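As an illustration of the configurable grouping described above, here is a minimal Python sketch. It is not Chaos Monkey's actual code: the `Instance` type, the `group_key` and `pick_victims` functions, and the rule of choosing exactly one instance per group are all assumptions made for the example.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical record: an instance labeled with its app, stack and cluster."""
    instance_id: str
    app: str
    stack: str
    cluster: str


def group_key(instance, grouping):
    """Return the logical-group key for an instance under the chosen grouping."""
    if grouping == "app":
        return (instance.app,)
    if grouping == "stack":
        return (instance.app, instance.stack)
    if grouping == "cluster":
        return (instance.app, instance.stack, instance.cluster)
    raise ValueError(f"unknown grouping: {grouping!r}")


def pick_victims(instances, grouping, rng=random):
    """Pick one random instance from each logical group (an assumed policy)."""
    groups = defaultdict(list)
    for inst in instances:
        groups[group_key(inst, grouping)].append(inst)
    return {key: rng.choice(members) for key, members in groups.items()}
```

With three instances spread over one app, two stacks and three clusters, grouping by "app" yields a single candidate, while grouping by "cluster" yields three, which is why the choice of grouping changes how much chaos a team opts into.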
One more new feature is the presence of hooks to allow users to integrate Chaos Monkey with other services. For example, users can define their own "trackers", which are invoked when Chaos Monkey decides to terminate an instance. We use trackers internally for sending metrics to Atlas and for recording terminations to an event logging service. The previous version of Chaos Monkey had to support sending email for notifications, and was tied to Amazon’s Simple Email Service (SES). With the new version, Netflix engineers can just configure email alerts with Atlas. Another example of a hook is one that provides an interface for querying if there is an ongoing outage. If there is an active outage, Chaos Monkey will not terminate instances.
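The two hooks described above, trackers and an outage check, can be sketched roughly as follows. This is an illustrative Python sketch, not Chaos Monkey's real plugin interface: `Termination`, `ChaosRunner` and `maybe_terminate` are hypothetical names, and calling the trackers just before the termination is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Termination:
    """Hypothetical record describing an instance chosen for termination."""
    instance_id: str
    app: str


class ChaosRunner:
    """Ties together user-supplied hooks: trackers and an outage check."""

    def __init__(
        self,
        trackers: List[Callable[[Termination], None]],
        outage_in_progress: Callable[[], bool],
        terminate: Callable[[str], None],
    ):
        self.trackers = trackers
        self.outage_in_progress = outage_in_progress
        self.terminate = terminate

    def maybe_terminate(self, termination: Termination) -> bool:
        """Terminate unless an outage is active; return True if terminated."""
        if self.outage_in_progress():
            return False  # never add chaos on top of an ongoing outage
        for track in self.trackers:
            track(termination)  # e.g. emit a metric or write an event log entry
        self.terminate(termination.instance_id)
        return True
```

Because the hooks are plain callables, a metrics tracker and an event-log tracker can be registered side by side without the runner knowing anything about either backend.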
InfoQ: Being able to only terminate instances, and not inject other faults such as latency, seems restrictive in the latest release of Chaos Monkey. Can you comment on the rationale behind this restriction?
Hochstein: These other types of failure injection were not being used inside of Netflix, and they would have required Chaos Monkey to either have ssh access to the instance or to run some kind of agent process on each instance, and we did not want to add the additional complexity for features we aren't using.
When we announced the upgrade on the Netflix Tech Blog, one of the comments on that blog post was that these other types of failure modes can be even more problematic than a simple instance termination, and that's absolutely true. One of the monkeys that was never open sourced was called Latency Monkey, which randomly injected latency. It did not see widespread use internally because it was considered too dangerous.
Instead of randomly injecting these types of faults, we're taking a more focused approach to testing that our service is resilient to them. There are many different kinds of chaos that can be inflicted on an instance. For example, the previous version of Chaos Monkey supported faults such as increased CPU usage, increased I/O usage, and using up local disk space. From the point of view of a client making an RPC call against an instance suffering from a fault, many of these faults manifest as either an increase in latency, a failure in the RPC call, or both. That means that we can model a large space of faults by injecting latency or failures at the RPC level. Netflix has had the ability to inject these types of RPC failures for a while using an internal system called FIT.
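To make the idea of modeling faults at the RPC level concrete, here is a minimal Python sketch of a wrapper that adds latency, failures, or both to a client call. It says nothing about how FIT works internally: `with_faults`, `InjectedFailure` and every parameter name are hypothetical and exist only for this example.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real RPC error when a failure is injected."""


def with_faults(call, latency_s=0.0, failure_rate=0.0, rng=random, sleep=time.sleep):
    """Wrap an RPC-like callable so it can be delayed or made to fail.

    latency_s    -- extra delay, in seconds, added before every call
    failure_rate -- probability in [0, 1] that a call raises InjectedFailure
    """
    def wrapped(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulates a slow or overloaded instance
        if rng.random() < failure_rate:
            raise InjectedFailure("injected RPC failure")
        return call(*args, **kwargs)

    return wrapped
```

From the caller's side, a wrapped call that is slow or fails looks much the same whether the underlying cause is CPU pressure, I/O contention or a full disk, which is the point made above about collapsing many instance-level faults into two RPC-level ones.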
More recently, the Chaos team has developed tools that allow us to leverage FIT to run more focused failure injection experiments. My colleague, Ali Basiri, is giving a talk on this approach at the upcoming QCon SF, and I'll be talking about this at the IEEE International Symposium on Software Reliability Engineering.
InfoQ: As a Developer or Architect, if I want to use Chaos Monkey, can I use it only with Spinnaker? Any advice to developers and architects for integrating Chaos Monkey with their own cloud or container platforms?
Hochstein: I'm afraid that if you want to use Chaos Monkey and you are not using Spinnaker as your deployment platform, then currently you're out of luck.
The main complexity of Chaos Monkey isn't the termination part. At the last Chaos Community Day, Jesse Newland from GitHub implemented a Kubernetes Pod Chaos Monkey during the meeting. It's a 20-line shell script. The complexity comes in implementing the domain model for the deployment. For us at Netflix, Chaos Monkey understands our notion of app/stack/cluster that all teams in the organization conform to and that is explicitly exposed by Spinnaker. If an organization isn't using Spinnaker, I suspect their deployment wouldn't follow that model, and so our Chaos Monkey implementation wouldn't be a good fit.
I would recommend developers look for a chaos tool that meshes well with their deployment patterns. There are several other tools out there, including pumba, Chaos Lemur, Chaos Lambda, Blockade, Simoorg and even Microsoft Azure's Fault Analysis Service. If nothing matches exactly, I'd suggest either forking the one that’s the best match and customizing it for your needs, or writing your own.
I would also recommend taking a look at the Principles of Chaos Engineering, which captures our thinking about how to apply chaos engineering to ensure your system is resilient.
InfoQ: Is Netflix the sole contributor to Chaos Monkey? Will this likely change with the new release?
Hochstein: Historically, Netflix has been the primary contributor to Chaos Monkey, although the community has contributed code as well. For example, the additional failure modes that had been added to the initial release of Chaos Monkey were contributed by a community member.
With the new release, I believe Netflix will continue to be the primary contributor. As Spinnaker adoption increases, I suspect we’ll see more Chaos Monkey community contributions.
The Chaos Monkey GitHub site provides more information, including how to install the Go tool binary and how to deploy it.