Google has published the paper Large-scale cluster management at Google with Borg, revealing details of a technology it had said little about publicly until now.
Borg is a cluster manager that accepts, schedules, starts, stops, restarts and monitors hundreds of thousands of jobs, submitted by thousands of applications on behalf of various services and run on a variable number of clusters, each comprising up to tens of thousands of servers. The purpose of Borg is to make resource management effortless for developers, so they can focus on their work, and to maximize resource utilization across data centers. The following graphic depicts Borg’s main architecture:
The components of this architecture are:
- Cell – a collection of machines treated as a unit. A cell typically contains about 10K servers, though it can be larger, and its machines are heterogeneous in terms of CPU, memory, disk capacity, etc.
- Cluster – generally contains one large cell and sometimes a few small special-purpose cells, some of them used for testing. A cluster lives inside a single data center building, and all machines in a cluster are connected by high-performance networking. A site can have multiple buildings and therefore multiple clusters.
- Job – a unit of work executed within the boundaries of a single cell. Jobs can have requirements attached – CPU, OS, public IP, etc. Jobs can communicate with each other, and a user or a monitoring job can send commands to a job via RPC.
- Tasks – a job consists of one or more tasks, all run from the same executable. Tasks usually run directly on hardware rather than in a virtualized environment, to avoid the cost of virtualization, and they are shipped as statically linked programs to avoid dynamic-linking dependencies at runtime.
- Alloc – a set of resources reserved on a machine for one or more tasks. An alloc can be moved to a different machine together with the tasks running in it. An alloc set is the group of allocs that reserves resources for a job across multiple machines.
- Borglet – an agent running on each machine.
- Borgmaster – a controller process running at cell level and holding state data for all Borglets. The Borgmaster places submitted jobs in a pending queue to be scheduled. The Borgmaster and its data are replicated five times, the data being persisted in a Paxos-based store, and one of the replicas is elected leader.
- Scheduler – monitors the pending queue and schedules jobs based on the resources available on individual machines.
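To make the relationships between these pieces concrete, here is a rough, hypothetical sketch of the data model in Go; the type and field names are illustrative assumptions, not Google's actual code:

```go
// Hypothetical Go types approximating the Borg concepts described above;
// names and fields are illustrative, not Google's implementation.
package borgsketch

// Requirements describes the resources a task asks for.
type Requirements struct {
	CPUCores float64
	RAMMB    int
	DiskMB   int
}

// Task is one instance of a job's (statically linked) binary.
type Task struct {
	Index        int
	Requirements Requirements
	Machine      string // hostname the task is currently placed on, if any
}

// Job is a set of identical tasks running the same executable in one cell.
type Job struct {
	Name     string
	Binary   string
	Priority int // prod jobs outrank non-prod (batch) jobs
	Tasks    []Task
}

// Alloc reserves resources on one machine; an AllocSet groups the allocs
// reserved for a job across multiple machines.
type Alloc struct {
	Machine  string
	Reserved Requirements
}
type AllocSet []Alloc

// Cell is a unit of roughly 10K heterogeneous machines managed by a
// replicated Borgmaster; a Borglet agent runs on every machine and the
// scheduler scans the cell's pending queue.
type Cell struct {
	Name     string
	Machines []string
	Pending  []Job
}
```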
Users of the Borg system submit jobs consisting of one or several tasks that share the same binary and are executed in a cell. Borg mixes two types of workload within a cell: long-running services such as GMail, GDocs, BigTable, etc. that need low-latency responses – µs to hundreds of ms – and batch jobs that have no need to respond immediately and can run for a long time, even a few days. The first type of job is called a prod job (i.e. production job) and receives higher priority than batch jobs, which are considered non-production. Prod jobs are allocated about 70% of a cell’s CPU resources and consume ~60% of the total CPU; they are also allocated about 55% of the memory and consume ~85% of it. The purpose of mixing different types of jobs within cells is to make resource utilization as efficient as possible, saving Google the cost of an entire data center, according to John Wilkes, researcher at Google.
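The paper describes this prioritization in terms of priority bands in which higher-priority (prod) tasks may preempt lower-priority (batch) ones to obtain resources. The snippet below is a deliberately simplified, hypothetical illustration of that eviction rule on a single machine, not Borg's actual scheduling logic; the priority values and task names are made up:

```go
package main

import (
	"fmt"
	"sort"
)

// runningTask is a hypothetical view of a task already placed on a machine.
type runningTask struct {
	name     string
	priority int     // higher = more important; prod > batch
	cpu      float64 // CPU cores the task is using
}

// victimsFor picks lower-priority tasks to evict so that a pending task of
// priority p needing `need` cores fits on a machine with `free` cores unused.
// Tasks of equal or higher priority are never evicted, mirroring the rule
// that prod tasks can displace batch tasks but not vice versa.
func victimsFor(running []runningTask, p int, need, free float64) []runningTask {
	// Consider only strictly lower-priority tasks, lowest priority first.
	var candidates []runningTask
	for _, t := range running {
		if t.priority < p {
			candidates = append(candidates, t)
		}
	}
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].priority < candidates[j].priority
	})

	var victims []runningTask
	for _, t := range candidates {
		if free >= need {
			break
		}
		victims = append(victims, t)
		free += t.cpu
	}
	if free < need {
		return nil // cannot fit even after evicting all eligible tasks
	}
	return victims
}

func main() {
	running := []runningTask{
		{"batch-analytics/3", 100, 4.0},
		{"batch-logs/7", 100, 2.0},
		{"prod-frontend/1", 360, 2.0},
	}
	// A pending prod task (priority 360) needs 5 cores; 1 core is free.
	fmt.Println(victimsFor(running, 360, 5.0, 1.0))
}
```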
According to the paper, some cells have an arrival rate of 10K tasks/min, and a Borgmaster can use 10-14 CPU cores and 50 GB of RAM. The Borgmaster achieves 99.99% availability, and tasks continue to run even if the Borgmaster or their Borglet goes down. Half of the machines run 9 or more tasks, and some run as many as 25 tasks and 4,500 threads. Task startup latency averages about 25 s, roughly 20 s of which is spent installing the task's packages; much of this time is spent waiting on the local disk.
The primary security mechanism employed is the Linux chroot jail, with ssh access provided via the Borglet for debugging tasks. For external software running on GAE or GCE, Google uses hosted VMs, each running as a Borg task inside a KVM process.
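For context, a chroot jail confines a process to a subtree of the filesystem. The following minimal Go sketch shows the underlying system calls on Linux (it must run as root, and the /srv/jail path is a made-up example); a real Borg jail also involves namespaces, resource isolation and more, which are not shown here:

```go
// Minimal illustration of a chroot jail on Linux (must run as root).
// This only shows the basic system calls, not how Borglet actually
// sets up its jails.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	jail := "/srv/jail" // hypothetical directory prepared with the task's files

	// Confine this process's view of the filesystem to the jail directory.
	if err := syscall.Chroot(jail); err != nil {
		fmt.Fprintln(os.Stderr, "chroot:", err)
		os.Exit(1)
	}
	// Move into the new root so relative paths resolve inside the jail.
	if err := os.Chdir("/"); err != nil {
		fmt.Fprintln(os.Stderr, "chdir:", err)
		os.Exit(1)
	}

	// From here on, the process can only see files under /srv/jail.
	entries, _ := os.ReadDir("/")
	for _, e := range entries {
		fmt.Println(e.Name())
	}
}
```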
Google has another cluster manager called Omega, briefly described in the paper as follows:
Omega supports multiple parallel, specialized “verticals” that are each roughly equivalent to a Borgmaster minus its persistent store and link shards. Omega schedulers use optimistic concurrency control to manipulate a shared representation of desired and observed cell state stored in a central persistent store, which is synced to/from the Borglets by a separate link component. The Omega architecture was designed to support multiple distinct workloads that have their own application-specific RPC interface, state machines, and scheduling policies (e.g., long-running servers, batch jobs from various frameworks, infrastructure services like cluster storage systems, virtual machines from the Google Cloud Platform). On the other hand, Borg offers a “one size fits all” RPC interface, state machine semantics, and scheduler policy, which have grown in size and complexity over time as a result of needing to support many disparate workloads, and scalability has not yet been a problem.
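Optimistic concurrency control in this context means that each Omega scheduler reads the shared cell state, computes a placement, and then attempts to commit it; the commit is rejected if another scheduler has changed the state in the meantime, and the loser simply retries. The following toy Go example illustrates the general compare-and-swap idea with a version counter; it is an assumption-laden sketch, not Omega's implementation:

```go
// Toy illustration of optimistic concurrency control over a shared
// cell state, as a stand-in for Omega's central persistent store.
package main

import (
	"fmt"
	"sync"
)

// cellState is a hypothetical, versioned view of the cell.
type cellState struct {
	mu          sync.Mutex
	version     int
	assignments map[string]string // task -> machine
}

// snapshot returns the current version for a scheduler to base its decision on.
func (s *cellState) snapshot() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.version
}

// tryCommit applies a scheduler's proposed assignment only if the state has
// not changed since the scheduler read it (the version still matches).
// Otherwise the scheduler must re-read the state and retry.
func (s *cellState) tryCommit(readVersion int, task, machine string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if readVersion != s.version {
		return false // someone else committed first; optimistic attempt failed
	}
	s.assignments[task] = machine
	s.version++
	return true
}

func main() {
	state := &cellState{assignments: map[string]string{}}

	v := state.snapshot()                          // scheduler A reads the state
	state.tryCommit(state.snapshot(), "b/0", "m7") // scheduler B commits meanwhile

	// Scheduler A's commit now fails and it must retry against fresh state.
	fmt.Println(state.tryCommit(v, "a/0", "m3")) // false
	fmt.Println(state.tryCommit(state.snapshot(), "a/0", "m3"))
	fmt.Println(state.assignments)
}
```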
Some of the lessons Google learned over a decade of running Borg in production have been applied in designing Kubernetes: the ability to group and orchestrate jobs belonging to the same service, multiple IP addresses per machine (one per pod), a simpler job configuration mechanism, pods as the equivalent of allocs, load balancing, and deep introspection that surfaces debugging data to users. Many of the engineers who worked on Borg now work on Kubernetes.