DigitalOcean's network engineering team wrote about their journey of switching from a Layer 2 to a Layer 3-based routing model for traffic to virtual machines in their cloud, using Open vSwitch (OVS) and other tools.
DigitalOcean - which has infrastructure spread across the world - faced performance, reliability and scalability challenges with their Layer 2 (L2) based routing mechanism for directing public internet traffic to their virtual machines. They used label switching and the Border Gateway Protocol (BGP) to address these challenges and move to Layer 3 (L3) routing over a Clos datacenter topology.
Each virtual machine (VM) - called a Droplet in DigitalOcean parlance - has a public IPv4 address, routable from the public internet. Network packets are routed to the VMs via Open vSwitch, a virtual network switch implementation. Since the range of available IPv4 addresses started becoming scarce years ago, it's important to be able to reclaim IPs, say Carl Baldwin and Jacob Cooper in their OVS and OVN 2019 conference talk.

The initial setup at DigitalOcean used L2 routing as it is easy to deploy - most components in the datacenter can communicate over it out-of-the-box. VMs were bridged directly onto the L2 network. The primary challenge with this was scalability as the infrastructure grew. Forwarding over L2 depends on locating other machines using the Address Resolution Protocol (ARP). Since ARP is a broadcast protocol, it generates a lot of traffic, which can flood the network. The team also wanted to avoid refreshing hardware configuration each time there was a routing change, and to minimize the impact of any failure. Both of these were issues with the older L2-based setup.
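For illustration, the sketch below (in Go, not taken from DigitalOcean's code) constructs the raw bytes of an ARP request; the destination MAC is the Ethernet broadcast address, which is why every ARP lookup is delivered to every machine in the L2 domain. All addresses are placeholders.

```go
// Illustrative only (not DigitalOcean code): the raw bytes of an ARP request
// as it appears on the wire. The destination MAC is the Ethernet broadcast
// address, so every machine in the L2 broadcast domain receives the frame.
package main

import "fmt"

func main() {
	arpRequest := []byte{
		// Ethernet header
		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, // destination MAC: broadcast
		0x02, 0x42, 0x0a, 0x00, 0x00, 0x01, // source MAC (placeholder)
		0x08, 0x06, // EtherType: ARP
		// ARP payload
		0x00, 0x01, // hardware type: Ethernet
		0x08, 0x00, // protocol type: IPv4
		0x06, 0x04, // hardware/protocol address lengths
		0x00, 0x01, // operation: request ("who has ...?")
		0x02, 0x42, 0x0a, 0x00, 0x00, 0x01, // sender MAC (placeholder)
		0x0a, 0x00, 0x00, 0x01, // sender IP: 10.0.0.1
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // target MAC: unknown, to be resolved
		0x0a, 0x00, 0x00, 0x05, // target IP: 10.0.0.5
	}
	fmt.Printf("this %d-byte frame is flooded to every port in the broadcast domain\n", len(arpRequest))
}
```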
The solution was to move to L3 routing with OVS. InfoQ reached out to Armando Migliaccio, staff engineer, networking at DigitalOcean, to find out more about the transition.
In the L2 setup, the Droplets' gateway was on the core switches. The common answer to such problems is to "virtualize the network", says Migliaccio, i.e., use a Software Defined Networking (SDN) solution. However, he adds that turnkey solutions would not have fit their requirements. Typical approaches like tunnelling carried a high performance overhead, and VXLAN was not feasible due to the lack of hardware support in most of the network interface cards already deployed.
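To give a sense of the tunnelling overhead mentioned above, the back-of-the-envelope sketch below uses the standard VXLAN header sizes (general VXLAN arithmetic, not figures from DigitalOcean): every encapsulated packet carries roughly 50 extra bytes of headers, and without NIC offload that encapsulation work lands on the hypervisor's CPUs.

```go
// Back-of-the-envelope VXLAN overhead, using the standard header sizes
// (general VXLAN arithmetic, not DigitalOcean-specific figures). Without
// NIC offload, adding and stripping these headers is done in software.
package main

import "fmt"

func main() {
	const (
		outerEthernet = 14 // outer Ethernet header
		outerIPv4     = 20 // outer IPv4 header
		outerUDP      = 8  // outer UDP header
		vxlan         = 8  // VXLAN header (flags + VNI)
		linkMTU       = 1500
	)

	overhead := outerEthernet + outerIPv4 + outerUDP + vxlan
	fmt.Printf("per-packet encapsulation overhead: %d bytes\n", overhead)                    // 50
	fmt.Printf("typical inner MTU on a %d-byte link: %d bytes\n", linkMTU, linkMTU-overhead) // 1450
}
```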
Image courtesy: DigitalOcean (used with permission)
The team finally settled on a combination of label switching - a forwarding technique based on short path labels rather than network addresses - and BGP - an L3 routing protocol widely used across the internet. The new topology was a Clos network, and other organizations have made similar transitions in the past. According to their conference talk, the team had to develop automation to work around the challenges of running BGP in large datacenters. Migliaccio says that "for network automation, we heavily rely on Ansible and Salt among others." GoBGP, BIRD and OVS formed the primary components. Migliaccio explains the considerations behind choosing GoBGP over similar software like OpenBGPD:
Our choice to go with GoBGP was primarily driven by our need to embed the service/library into a bespoke component running on the HV that tracks droplets events (namely creates, destroys, etc). This component is written in Golang and it made sense for us to choose a Golang library we had prior knowledge in which in turn simplified the integration effort.
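Migliaccio's description suggests a small agent on each hypervisor that embeds GoBGP as a library and announces routes as Droplets come and go. The sketch below shows what such an agent could look like, assuming the GoBGP v3 Go API as documented upstream; the ASNs, router ID, peer address and prefix are made-up placeholders, and this is not DigitalOcean's actual component.

```go
// A minimal sketch of a hypervisor-side agent that embeds GoBGP to announce
// a Droplet's public /32 when the Droplet is created. This assumes the
// GoBGP v3 Go API; ASNs, router ID, peer address and prefix are placeholders,
// and this is not DigitalOcean's actual component.
package main

import (
	"context"
	"log"

	api "github.com/osrg/gobgp/v3/api"
	"github.com/osrg/gobgp/v3/pkg/server"
	apb "google.golang.org/protobuf/types/known/anypb"
)

func main() {
	ctx := context.Background()

	// BGP speaker embedded in the agent process.
	s := server.NewBgpServer()
	go s.Serve()

	if err := s.StartBgp(ctx, &api.StartBgpRequest{
		Global: &api.Global{Asn: 65001, RouterId: "192.0.2.10"},
	}); err != nil {
		log.Fatal(err)
	}

	// Peer with the top-of-rack switch (placeholder address and ASN).
	if err := s.AddPeer(ctx, &api.AddPeerRequest{
		Peer: &api.Peer{
			Conf: &api.PeerConf{NeighborAddress: "192.0.2.1", PeerAsn: 65000},
		},
	}); err != nil {
		log.Fatal(err)
	}

	// Announce a Droplet's public /32, e.g. on a "create" event.
	nlri, _ := apb.New(&api.IPAddressPrefix{Prefix: "198.51.100.7", PrefixLen: 32})
	origin, _ := apb.New(&api.OriginAttribute{Origin: 0}) // IGP origin
	nextHop, _ := apb.New(&api.NextHopAttribute{NextHop: "192.0.2.10"})

	if _, err := s.AddPath(ctx, &api.AddPathRequest{
		Path: &api.Path{
			Family: &api.Family{Afi: api.Family_AFI_IP, Safi: api.Family_SAFI_UNICAST},
			Nlri:   nlri,
			Pattrs: []*apb.Any{origin, nextHop},
		},
	}); err != nil {
		log.Fatal(err)
	}
}
```

On a Droplet destroy, such an agent could withdraw the route with the library's corresponding DeletePath call.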
To leverage the existing label switching capabilities in the switches, they created "label switch paths for each hypervisor and routed the public traffic over those instead of pure IP". The transition from the older routing mechanism to the newer one had to be seamless, with no downtime for customers. To achieve this, they added parallel data paths using OVS on both L2 and L3, so that traffic could be rerouted without "evacuating Droplets from the hypervisor" (a rough sketch of this idea follows the quote below). The transition also took "a fair amount of cross-functional communication" between the various teams (both engineering and non-engineering), says Migliaccio, laying out the non-technical details:
The bulk of it was handled with the help of our TPM team (Technical Project Management). As for customer communication, we broke down the rollout process in each region into multiple steps to potentially address unanticipated failures via rollbacks etc and at each step automated communication was sent out for customers whose workload could be potentially impacted. In each region we rolled out we did hit a couple of snags that led to some limited networking downtime for a subset of our customers, but all in all given the sheer scale and complexity of the effort I felt overwhelmingly positive about the outcome.
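On the data-path side, one conceivable way to realize the parallel L2/L3 paths described above is to steer each Droplet's public IP with a high-priority OpenFlow rule on the hypervisor's OVS bridge, repointing it from the legacy path to the routed path at cutover time. The sketch below is hypothetical - the bridge name, port numbers and the agent itself are placeholders rather than DigitalOcean's tooling - and simply shells out to the standard ovs-ofctl utility.

```go
// Hypothetical sketch (not DigitalOcean's tooling): steering one Droplet's
// public IP between a legacy L2 path and the new L3 path by rewriting a
// single high-priority OpenFlow rule on the hypervisor's OVS bridge.
// Bridge name and port numbers are placeholders.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// setPath points traffic for dropletIP at the given OVS output port by
// installing (or replacing) a matching flow via the standard ovs-ofctl CLI.
func setPath(bridge, dropletIP string, outPort int) error {
	flow := fmt.Sprintf("priority=200,ip,nw_dst=%s,actions=output:%d", dropletIP, outPort)
	out, err := exec.Command("ovs-ofctl", "add-flow", bridge, flow).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ovs-ofctl add-flow failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	const (
		bridge    = "br-public"    // placeholder bridge name
		l2Port    = 10             // placeholder port towards the legacy L2 path
		l3Port    = 20             // placeholder port towards the new routed L3 path
		dropletIP = "198.51.100.7" // placeholder Droplet public IP
	)

	// Start on the legacy path, then cut over to L3 without touching the Droplet.
	if err := setPath(bridge, dropletIP, l2Port); err != nil {
		log.Fatal(err)
	}
	if err := setPath(bridge, dropletIP, l3Port); err != nil {
		log.Fatal(err)
	}
}
```

Because an OpenFlow add with an identical match and priority replaces the previous rule, the cutover amounts to a single flow update per IP, which fits the stated goal of rerouting traffic without evacuating Droplets from the hypervisor.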
The transition to L3 for the remaining datacenters is ongoing, and will likely continue throughout 2020.