Gremlin has released enhancements to its Chaos Engineering platform aimed at DevOps engineers who want to test the scalability of services running within Kubernetes clusters by isolating "noisy neighbours". On Kubernetes, the noisy neighbour issue occurs when multiple applications sharing a cluster compete for resources, leading to degraded performance.
Kubernetes can improve infrastructure utilisation because containers have a smaller resource footprint, which enables a much higher tenant density on a host. However, this density can lead to contention for resources. This is the "noisy neighbour" problem, where one service deprives another of resources when they share the same node.
Gremlin's latest upgrades to its Chaos Engineering service enable engineers to isolate resource attacks to single containers, run experiments on both containerd and CRI-O runtimes, and control access to experiments on individual nodes within namespaces in shared cluster environments.
Containerd is an industry-standard runtime designed to be used by Docker, Kubernetes, or any other container platform that wants to abstract away syscalls or operating system (OS) specific functionality in order to run containers on any OS. CRI-O is a lightweight alternative to containerd, designed for fast, stable workloads.
Kubernetes targeting via the UI (source: Gremlin)
By testing individual pod scaling and Kubernetes resource limits, engineers can implement measures to prevent noisy neighbours from causing application failure.
The final feature is aimed at organisations running large-scale clusters in multi-tenant environments. Namespaces in Kubernetes isolate objects between teams: each object exists in only one namespace, and access to each namespace can be controlled. Gremlin's new namespace access control means that administrators can maintain the same logical separation for users who are performing chaos attacks.
Namespace access control (source: Gremlin)
Gremlin is a Chaos Engineering platform that can run experiments on Linux distributions, containers, and cloud platforms. It provides a framework of attacks that inject faults into a system by limiting resources, changing the state of environments, simulating network fluctuations, or impacting individual requests. The service aims to support the discipline of continuous experimentation on complex systems under stress, in order to identify sources of failure before they cause outages.
Kubernetes allows engineers to deploy multiple pods on a single node, and scale out the individual pods without impacting their neighbours. Horizontal pod autoscaling (HPA) improves efficiency by scaling out pods based on their observed CPU utilisation, rather than scaling out the entire application. Kubernetes resource limits prevent containers from disrupting other services on a node by hogging resources.
Gremlin's director of product Lorne Kiligerman comments that without testing these behaviours, it's difficult to determine if your application is decoupled enough to scale out pods independently and to know if noisy neighbours can still break services sharing the same node.
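To make the horizontal pod autoscaling behaviour described above more concrete, the following is a minimal Python sketch of the replica calculation documented for the Kubernetes HPA: the desired replica count is the current count multiplied by the ratio of observed to target utilisation, rounded up. The function name, the parameter names and the example figures are illustrative assumptions and are not taken from Gremlin or from the Kubernetes codebase; the real controller also handles details such as missing metrics and pod readiness, which are omitted here.

```python
import math


def desired_replicas(current_replicas, observed_cpu, target_cpu,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the documented HPA rule (hypothetical helper, not a real API):
    desired = ceil(current * observed / target), clamped to [min, max].
    """
    ratio = observed_cpu / target_cpu
    # When the ratio is close enough to 1.0, the autoscaler leaves the
    # replica count alone to avoid thrashing.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Never scale outside the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Four pods averaging 90% CPU against a 60% target: scale out.
    print(desired_replicas(4, observed_cpu=90, target_cpu=60))
    # Four pods averaging 30% CPU against a 60% target: scale in.
    print(desired_replicas(4, observed_cpu=30, target_cpu=60))
```

A CPU attack isolated to a single container raises the observed utilisation for that workload only, so a calculation like this one gives a rough expectation for how many replicas should appear, which can then be compared with what the cluster actually does.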
Amazon and Netflix developed Chaos Engineering almost a decade ago as part of their work to ensure their complex systems could survive worst-case scenarios as they scaled. Gremlin was founded in 2016 and built its platform on lessons learned at both organisations. The team subsequently launched native Kubernetes Chaos Engineering as a service in 2019.
Netflix's Chaos Monkey is an alternative, open source Chaos Engineering tool, but does not run as a service; it is operated by setting up a cron job that calls Chaos Monkey once a day to create a schedule of terminations.
In related news, AWS recently announced it will launch its own fault injection service in 2021. Doug Campbell, a site reliability engineer from GrubHub, recently discussed how to fit Chaos Engineering and Gremlin into an existing DevOps culture at ChaosConf 2020.