The Gremlin team has announced "Gremlin Free", which provides the ability to run chaos engineering experiments on a free tier of their failure-as-a-service SaaS platform. The current version of the free tier allows the execution of shutdown and CPU attacks on a host or container, which can be controlled via a simple web-based user interface, API or CLI.
At the end of 2017 the Gremlin team announced the first release of their chaos experimentation SaaS product that supported the coordination of multiple attacks on hosts and the associated infrastructure. In 2018 application-level fault injection (ALFI) was also added, which supported running attacks on individual application services or functions. One of the primary attacks throughout the evolution of the product has been the shutdown of instances, which was partly inspired by the Netflix Chaos Monkey -- one of the first chaos engineering tools within the cloud computing domain.
The Gremlin team has argued that although the Chaos Monkey tool is useful, it takes time to learn how to operate it safely. The original tool also only supported AWS (although additional tooling has emerged that offers similar instance shutdown abilities within Azure and Google Cloud Platform). With the launch of Gremlin Free, the Gremlin team is aiming to reduce these barriers to running chaos experiments, and to help teams quickly see the value of doing so.
For engineers looking to explore the new free tier, Tammy Butow, principal SRE at Gremlin, has created a "Shutdown Experiment Pack" that is available on the Gremlin website. This provides a detailed walkthrough for running five chaos experiments that shut down cloud infrastructure hosts and containers on AWS, Azure, and GCP (for which cloud vendor accounts are required), and also shut down containers running locally with Docker.
InfoQ recently sat down with Lorne Kligerman, director of product, and discussed the motivations behind Gremlin Free and the future plans for it.
InfoQ: Hi Lorne, and thanks for joining us. Can I ask what the motivation was for releasing Gremlin Free?
Lorne Kligerman: There were a few motivations. The first is simple: we want to democratize the practice of chaos engineering to further our mission of making the internet more reliable for everyone. People are interested in chaos engineering, but existing solutions don't provide the safety, security, and user interface to make it easy to get started.
Another reason is that we want people to do the right thing for their end users, by experiencing the value and impact of chaos engineering first hand. Gremlin Free will allow anyone to quickly sign up, install our client, run an experiment, and observe the results.
In going through this exercise (which includes working with your existing tools and monitoring) the results you see -- either validating the resiliency of your system or uncovering a bug -- will increase the maturity of your organization.
There is also the option to unlock the full Gremlin suite.
InfoQ: How does the Gremlin Free SaaS offering compare with, say, running my own Chaos Monkey and associated tooling?
Kligerman: This goes back to the first question, where the awareness of chaos engineering comes from various open source solutions, including Chaos Monkey. While open source is a great thing, the cost of setting something up and keeping it up and running is high. Chaos Monkey specifically isn't easy to use, as it only works on AWS and provides only one attack type: randomly shutting down VMs. (Fun fact: our CEO Kolton actually built Netflix's second generation of failure injection tooling.)
Gremlin provides a full SaaS offering, which includes a simple UI along with easy installation. It also offers over a dozen attacks, from simulating CPU spikes to disks filling up to injecting network latency. Whether using the UI, API, or CLI, you always have quick access to a "Halt attack" button that quickly stops all attacks and puts your system back in a healthy state. From the beginning, we've prioritized simplicity, safety, and security.
Finally, we want to distance ourselves from the idea of randomly breaking things. There is a time and place for that -- but what really drives value is the concept of a thoughtfully planned out experiment, where you start with a small blast radius and increase it over time. We believe in forming a hypothesis, and then running the experiment to learn about how the system behaves. From there, we can scale up the impact of the experiment as our confidence in our system grows.
So to your question about Gremlin Free specifically, part of our thought process was basically to give away a better Chaos Monkey. It has the same UI as our enterprise product, can run on any cloud or bare metal, and offers two attacks: Shutdown (Chaos Monkey) as well as CPU.
InfoQ: How do you see Gremlin's platform evolving? We're seeing increasing talk about the importance of observability, particularly in relation to complex distributed systems, and so would Gremlin consider building a related product or providing integration with existing tooling?
Kligerman: We plan to stick to our core competency and not build out a monitoring solution. There are already tons of them out there that do a great job. We have an existing integration with Datadog, we're talking to the folks at New Relic and Dynatrace, and Charity Majors at Honeycomb is super friendly and spoke at our conference last year. So yes, we want to build out strong integrations with all of these players, and agree that observability is crucial to chaos engineering.
In terms of the future of the Gremlin platform, in some ways what we've already built is advanced and ahead of the market -- we launched ALFI (Application-Level Fault Injection) last year for running more granular experiments at the application and request level (which also works on serverless), and the market is now catching up to that.
InfoQ: There are quite a few foundations gaining popularity, or emerging -- e.g. the CNCF, and the new Continuous Delivery Foundation -- and so how do you see Gremlin's relationship to these?
Kligerman: We're an active member of the CNCF, and it's important for us to be a part of the community as much as possible. The new CD foundation is interesting, because we are doing some work in that space and believe strongly that to get the maximum value out of chaos engineering, it should be automated and often built into your CI/CD pipeline. Look out for some related announcements soon!
Access to Gremlin Free requires sign-up via the Gremlin website. There is extensive documentation on the product, and additional help is available via the #support channel in the Gremlin Chaos Engineering Slack.