Atlassian released their in-house tool Escalator as an open source project. It provides configuration-driven preemptive scale-up and faster scale-down for Kubernetes nodes.
Atlassian adopted containers and built its own Docker-based PaaS around 2013-2014, using it to run its internal platform and services. Beyond basic scheduling and running of pods, the team's orchestration needs were resilience against cloud hardware failures and the ability to scale up and down quickly in response to load. Kubernetes fit the bill and the team adopted it as their orchestrator. Their build engineering infrastructure - one of the first pieces to be migrated - consisted of workloads that required hundreds of VMs to be provisioned.
Kubernetes has two autoscalers - the horizontal pod autoscaler and the cluster autoscaler. The former scales pods - an abstraction over a container or a set of related containers - up and down, and thus depends upon the availability of underlying compute (usually VM) resources. The cluster autoscaler scales the compute infrastructure itself. Understandably, it takes longer to scale up and down due to the higher provisioning time of virtual machines, and any delay in the cluster autoscaler translates into delay for the pod autoscaler. The asymmetry cuts the other way too: pods can scale down very quickly, but compute VMs take time to terminate. This can lead to significant costs from idle compute VMs, especially at the scale of Atlassian's infrastructure. Atlassian's problem was specific to batch workloads, which have a low tolerance for delay in scaling up and down, so the team decided to write their own autoscaling functionality on top of Kubernetes to solve it.
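To make the interplay concrete, the horizontal pod autoscaler's documented core algorithm computes the desired replica count as the ceiling of the current count scaled by the ratio of observed to target metric value - a minimal sketch in Go (function names here are illustrative, not part of any Kubernetes API):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the horizontal pod autoscaler's documented
// scaling rule: desired = ceil(current * currentMetric / targetMetric).
// If the resulting pods do not fit on existing nodes, the cluster
// autoscaler must then provision VMs - the slower of the two steps.
func desiredReplicas(current int, currentMetric, targetMetric float64) int {
	return int(math.Ceil(float64(current) * currentMetric / targetMetric))
}

func main() {
	// 4 pods averaging 90% CPU against a 60% target: scale out to 6 pods.
	fmt.Println(desiredReplicas(4, 90, 60))
}
```

The pod-level decision is near-instant arithmetic like this; the minutes-long part is booting the VMs the new pods need, which is the gap Escalator targets.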
Escalator, written in Go, has configurable thresholds for the upper and lower capacity of the compute VMs. Some of the configuration properties work by applying a Kubernetes feature called a 'taint'. A VM node can be 'tainted' (marked) with a certain value so that pods are not scheduled onto it unless they tolerate that taint. Tainted, unused nodes would then be brought down faster by the standard Kubernetes cluster autoscaler. The scale-up configuration parameter is a threshold expressed as a percentage of utilization, usually set below 100 so that there is a buffer. Escalator scales up the compute VMs when utilization reaches the threshold, making room in advance for containers that might come up later and allowing them to start quickly.
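The scale-up decision described above can be sketched as follows. This is an illustrative model only, assuming a homogeneous node group with a fixed per-node CPU capacity - the function and parameter names are hypothetical and not Escalator's actual API:

```go
package main

import (
	"fmt"
	"math"
)

// nodesToAdd sketches a threshold-based scale-up decision in the style
// the article describes: when utilization (requested CPU over allocatable
// CPU) reaches a threshold below 100%, grow the node group until requested
// capacity fits back under the threshold, preserving a buffer for pods
// that arrive later. All names and the capacity model are illustrative.
func nodesToAdd(requestedCPU, perNodeCPU float64, nodes int, thresholdPct float64) int {
	allocatable := perNodeCPU * float64(nodes)
	utilization := requestedCPU / allocatable * 100
	if utilization < thresholdPct {
		return 0 // still enough headroom, no scale-up needed
	}
	// Smallest node count at which utilization drops back below the threshold.
	needed := int(math.Ceil(requestedCPU / (perNodeCPU * thresholdPct / 100)))
	return needed - nodes
}

func main() {
	// 10 nodes of 4 cores each with 36 cores requested is 90% utilization;
	// with an 80% threshold, two more nodes restore the buffer.
	fmt.Println(nodesToAdd(36, 4, 10, 80))
}
```

Keeping the threshold under 100% is what lets newly scheduled containers land on already-running VMs instead of waiting minutes for fresh ones to boot.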
Image Courtesy: https://developers.atlassian.com/blog/2018/05/introducing-escalator/
Public cloud providers like AWS have attempted to solve the VM management problem with services like Fargate; however, such services lack the configuration options that Escalator has. Scaling compute up and down is a common problem in the context of containers: the speed advantage of starting a container is lost if the underlying VM takes minutes to start up, while keeping VMs running solves this at the cost of possibly idle VMs. It is worth noting that Escalator solves a very specific problem - that of batch workloads - in the context of the two Kubernetes autoscalers.
Escalator is certified to run against Kubernetes 1.8+ and is built with Go 1.8+. It currently supports only AWS as a cloud provider. Atlassian has previously open sourced other Kubernetes-related projects, such as Smith.