Netflix has announced the open source release of its container management platform, Titus. Titus is built on top of Apache Mesos and runs on AWS EC2.
Netflix, which runs its services on virtual machines on AWS, started moving parts of its systems to containers to take advantage of the benefits of a container-based development and deployment model. Netflix's unique challenges included an already-existing cloud-native infrastructure, which meant that moving to a container model should not involve too many changes. Hybrid deployments of both VMs and containers, a mix of microservices and batch jobs, and ensuring reliability with the additional layer that containers would introduce were some of the technical challenges.
These challenges led Netflix to develop its own container management platform, Titus. Currently, Netflix runs video streaming, recommendations and machine learning (ML), big data, content encoding, studio technology, and internal engineering tools in containers, which add up to half a million containers and 200,000 clusters per day.
Titus was built on top of Apache Mesos, a cluster manager that abstracts underlying resources like CPU and RAM and presents them as a single pool that applications can use. Mesos uses Linux cgroups to isolate running processes, and is "application aware", which is to say that it can handle diverse applications like batch jobs as well as microservices.
The app-specific logic is encapsulated in a Mesos "framework", to which the cluster manager, called the Mesos master, offers portions of the underlying resources. Container-specific orchestration can be done on top of Mesos with a framework such as Marathon, which supports Docker. Even though container orchestrators like Kubernetes and Docker Swarm existed at the time, the Netflix engineering team felt that their specific requirements were better served by a custom implementation on Mesos. Titus grew out of two other projects at Netflix called Titan and Mantis.
Image Courtesy - https://queue.acm.org/detail.cfm?id=3158370
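The resource-offer model described above can be illustrated with a minimal sketch. This is a toy simulation, not the Mesos API: the `Offer` and `Task` classes, the `accept_offers` function, and the first-fit placement rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """A slice of one agent's free resources, as a master might offer it."""
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    """A unit of work the framework wants to run (hypothetical shape)."""
    name: str
    cpus: float
    mem_mb: int


def accept_offers(offers, pending):
    """Framework-side logic: place pending tasks onto offers, first fit.

    Returns the (task name, agent) placements and the tasks that fit nowhere.
    """
    placements = []
    unplaced = []
    # Track what is left of each offer as tasks are placed on it.
    remaining = {o.agent: [o.cpus, o.mem_mb] for o in offers}
    for task in pending:
        for agent, free in remaining.items():
            if task.cpus <= free[0] and task.mem_mb <= free[1]:
                free[0] -= task.cpus
                free[1] -= task.mem_mb
                placements.append((task.name, agent))
                break
        else:
            # No offer had enough CPU and memory left for this task.
            unplaced.append(task)
    return placements, unplaced


offers = [
    Offer("agent-1", cpus=4, mem_mb=8192),
    Offer("agent-2", cpus=2, mem_mb=4096),
]
pending = [
    Task("encode-batch", cpus=3, mem_mb=6000),
    Task("api-service", cpus=2, mem_mb=2048),
    Task("ml-training", cpus=4, mem_mb=4096),
]
placements, unplaced = accept_offers(offers, pending)
print(placements)
print([t.name for t in unplaced])
```

The point of the sketch is the division of labour: the master only decides which resources to offer, while the framework decides which tasks to run on them, which is what lets batch jobs and microservices share one pool.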
Titus’s architecture is based on a master-agent model, with Zookeeper for leader election and Cassandra for storing the master’s state. Jobs are submitted declaratively to the master, which launches containers to run them. This is similar to Kubernetes Deployments, which specify the number of replicas and other metadata for launching containers.
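To make the declarative model concrete, here is a minimal sketch of a master reconciling running containers against a submitted job specification. It does not reflect the actual Titus job API: `JobSpec`, its fields, and the `Master.reconcile` method are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class JobSpec:
    """Desired state declared by the user (hypothetical fields)."""
    name: str
    image: str
    instances: int


@dataclass
class Master:
    """Toy master that tracks running container IDs per job."""
    running: dict = field(default_factory=dict)
    next_id: int = 0

    def reconcile(self, spec):
        """Launch or stop containers until the actual state matches the spec."""
        containers = self.running.setdefault(spec.name, [])
        # Scale up: launch containers until the desired count is reached.
        while len(containers) < spec.instances:
            containers.append(f"{spec.name}-{self.next_id}")
            self.next_id += 1
        # Scale down: stop the most recently launched containers first.
        while len(containers) > spec.instances:
            containers.pop()
        return list(containers)


master = Master()
first = master.reconcile(JobSpec("recs", "example/recs:1.0", instances=3))
second = master.reconcile(JobSpec("recs", "example/recs:1.0", instances=1))
print(first)
print(second)
```

The caller only states how many instances should exist; the master works out which containers to start or stop, which is the sense in which the submission is declarative rather than imperative.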
Titus manages capacity using two "tiers". The first is a "critical" one which ensures that enough VMs are running to launch containers quickly, potentially at the expense of running more VMs than necessary. The second is a "flexible" one that works at the VM level, and initialises or terminates VMs based on demand. This essentially boils down to scaling container clusters up as quickly as possible. Other schedulers solve this in different ways. Kubernetes, for example, uses a combination of the horizontal pod autoscaler (analogous to the critical tier) and the cluster autoscaler (analogous to the flexible tier).
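The difference between the two tiers can be sketched as a simple sizing rule. This is an illustration under stated assumptions, not Titus's actual capacity logic: the `desired_vms` function, the fixed `containers_per_vm` packing density, and the `headroom_vms` parameter are all hypothetical.

```python
import math


def desired_vms(containers, containers_per_vm, tier, headroom_vms=0):
    """Return how many VMs a tier should keep running (toy model).

    Assumption: every VM holds the same number of containers.
    """
    needed = math.ceil(containers / containers_per_vm)
    if tier == "critical":
        # Keep spare VMs running so new containers can start immediately,
        # even though this may mean paying for idle capacity.
        return needed + headroom_vms
    if tier == "flexible":
        # Track demand exactly: start VMs when needed, terminate when idle.
        return needed
    raise ValueError(f"unknown tier: {tier}")


critical = desired_vms(10, containers_per_vm=4, tier="critical", headroom_vms=2)
flexible = desired_vms(10, containers_per_vm=4, tier="flexible")
print(critical, flexible)
```

Under these assumptions the critical tier trades cost for launch latency, while the flexible tier trades latency for cost, since a new VM must boot before its containers can start.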
Since Netflix's infrastructure is based on AWS, a logical question is how tightly Titus is tied to AWS and to Netflix's own microservice platform dependencies such as Eureka, Ribbon and Atlas. According to the ACM paper, Titus had tight integration with Netflix's internal services, and some services had to be refactored to leverage Titus. Existing Netflix apps assumed that they would be running on single VMs on AWS and used AWS-specific security and networking abstractions at the VM level. Titus had to expose additional layers to offer similar functionality at the container level, since there would be multiple containers on a single VM.
Titus also integrates with Netflix's CI/CD tool called Spinnaker. In the course of making it open source, Titus has been refactored to "disconnect it from internal Netflix systems".