GitHub.com uses MySQL as a backbone for many of its critical services like the API, authentication and the GitHub.com website itself. GitHub's engineering team replaced its previous DNS- and Virtual IP (VIP)-based setup with one based on Orchestrator, Consul and the GitHub Load Balancer in order to get around split-brain and DNS caching issues.
GitHub runs multiple MySQL clusters for different services and tasks, making it imperative to keep them highly available (HA). GitHub's MySQL infrastructure is spread across multiple data centers, consisting of around 15 clusters, close to 150 production servers and 15 TB of MySQL tables. Each MySQL cluster has a single master, which handles write requests, and multiple replicas, which serve read requests. The master node is thus a single point of failure: without it, writes fail completely. The HA requirements for this setup include automatic detection of master failure, automatic promotion of a replica to master and automatic advertisement of the new master to client applications.
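To make the topology concrete, here is a minimal sketch of the read/write split described above; the hostnames and the naive statement classification are hypothetical and only meant to illustrate the idea of sending writes to the single master and reads to the replica pool, not GitHub's actual client configuration.

```python
import itertools

# Hypothetical endpoints for one cluster: a single write master and a pool of
# read replicas. The real names and client-side routing at GitHub are not
# described in the article.
WRITER = "mysql-cluster1-master.example.internal"
READERS = [
    "mysql-cluster1-replica1.example.internal",
    "mysql-cluster1-replica2.example.internal",
]
_reader_cycle = itertools.cycle(READERS)

WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP"}

def pick_endpoint(sql: str) -> str:
    """Send writes to the master and spread reads across the replicas."""
    verb = sql.lstrip().split(None, 1)[0].upper()
    return WRITER if verb in WRITE_VERBS else next(_reader_cycle)

print(pick_endpoint("SELECT login FROM users LIMIT 1"))           # goes to a replica
print(pick_endpoint("UPDATE users SET name = 'x' WHERE id = 1"))  # goes to the master
```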
GitHub's engineering team has employed several HA strategies over the years, gradually moving towards uniformity across the organization. Since these concerns are not restricted to MySQL, the requirements for an HA solution also include cross-data-center availability and split-brain prevention. There are different possible approaches to MySQL master discovery. Previously, GitHub used DNS and a VIP to discover the MySQL master node. Client applications would connect to a fixed hostname, which DNS resolved to a VIP. A VIP allows traffic to be routed to different hosts over time rather than being tied to a single physical host, and it was always owned by the current master node. However, the VIP acquire-and-release process during failover events could run into problems, including split-brain situations in which two different hosts claim the same VIP and traffic gets routed to the wrong one. In addition, when the new master is in a different data center, a DNS change is also required, and that change can take time to propagate because clients cache DNS responses.
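From a client's point of view, the old scheme boiled down to resolving one fixed name. The sketch below, with a made-up hostname, shows that flow and notes in the comments where DNS caching and VIP ownership could go wrong.

```python
import socket

# Hypothetical fixed writer name; under the old scheme DNS resolved it to the
# VIP currently owned by the master.
MASTER_NAME = "mysql-cluster1-writer.example.internal"
MYSQL_PORT = 3306

def resolve_master_vip(name: str = MASTER_NAME) -> str:
    """Resolve the fixed writer name to the master's VIP.

    Clients and intermediate resolvers may cache this answer, which is why a
    failover that also needs a DNS change (e.g. the new master lives in a
    different data center) can take a while to become visible everywhere.
    """
    addrinfo = socket.getaddrinfo(name, MYSQL_PORT, proto=socket.IPPROTO_TCP)
    return addrinfo[0][4][0]  # first resolved address, i.e. the VIP

# Even when resolution succeeds, the VIP itself must be released by the old
# master and acquired by the new one during failover; in a split-brain
# situation two hosts can claim the same VIP and traffic may land on the
# wrong one.
```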
The latest setup at GitHub includes the Orchestrator toolkit, Consul for service discovery and the GitHub Load Balancer (GLB). In this architecture, a client application looks up the master by name and DNS resolves that name to an Anycast IP address. The advantage of using Anycast is that the name resolves to the same IP address in every data center, while client traffic to that IP is routed to the nearest destination, namely the GLB instance co-located in the client's data center. GLB knows the current active MySQL master backends and routes the traffic to the master.
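As a rough illustration of this routing (names, addresses and data-center labels below are invented, and GLB's real configuration is far richer), every data center hands out the same anycast answer, while each local GLB instance forwards writer traffic to the one backend it currently considers the master:

```python
# Same anycast IP answered by DNS in every data center (hypothetical value).
ANYCAST_WRITER_IP = "10.0.0.10"

# Per-data-center GLB state: each GLB instance points writer traffic at the
# single backend it believes is the current master, which may live in another
# data center.
GLB_WRITER_BACKEND = {
    "dc1": "mysql-cluster1-node3.dc1.example.internal",
    "dc2": "mysql-cluster1-node3.dc1.example.internal",
}

def route_writer_traffic(client_dc: str) -> str:
    """Traffic to the anycast IP reaches the GLB instance in the client's own
    data center, which then proxies it to the current master backend."""
    return GLB_WRITER_BACKEND[client_dc]

print(route_writer_traffic("dc2"))  # a client in dc2 still reaches the master in dc1
```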
Image courtesy: https://githubengineering.com/mysql-high-availability-at-github/
Orchestrator, also an open source project from GitHub engineering, is responsible for master failure detection and the failover process. It uses collective knowledge drawn from all MySQL nodes, including the replicas, to arrive at an informed decision about the master's state. When a write master fails, the Orchestrator leader node detects the failure and starts the failover process to choose a new MySQL master. The rest of the Orchestrator cluster nodes notice this change and update their local Consul daemon with the new master's details. Consul, a service discovery tool from HashiCorp, keeps track of the master nodes by storing them as key-value pairs. Consul can run in a distributed mode across data centers, but in GitHub's case each Consul cluster is independent at the data-center level. On a failover event, GLB is notified of the master change via Consul Template, which queries the Consul clusters and updates the GLB state so that traffic is routed to the new master.
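The hand-off through Consul can be pictured with the KV store's HTTP API. The key name and JSON payload below are invented for illustration, and in GitHub's setup this role is played by Orchestrator's hooks and Consul Template rather than ad-hoc scripts; the sketch only shows the shape of the data flow.

```python
import json
import requests

CONSUL = "http://127.0.0.1:8500"   # local Consul agent in this data center
KEY = "mysql/master/cluster1"      # hypothetical key holding the master's details

def publish_new_master(host: str, port: int = 3306) -> None:
    """Conceptually what happens after promotion: the new master's coordinates
    are written into the local Consul KV store."""
    payload = json.dumps({"host": host, "port": port})
    requests.put(f"{CONSUL}/v1/kv/{KEY}", data=payload).raise_for_status()

def current_master() -> dict:
    """Conceptually what a Consul Template render consumes: the current value
    of the key, from which the GLB backend configuration is regenerated."""
    resp = requests.get(f"{CONSUL}/v1/kv/{KEY}?raw")
    resp.raise_for_status()
    return resp.json()

# publish_new_master("mysql-cluster1-node5.dc2.example.internal")
# print(current_master())
```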
In the article, Shlomi Noach, senior infrastructure engineer at GitHub, mentions that although the new setup provides "between 10 to 13 seconds" of maximum outage time in most cases, some scenarios still need more work, such as data center isolation leading to a split-brain situation or a Consul outage at the time of failover. GitHub's new setup is a move away from traditional networking-based techniques towards ones based on proxying and service discovery. It completely replaces the VIP-based setup, but there is debate around whether it would have been easier to adopt a different approach based on the Border Gateway Protocol (BGP).