WePay's engineering team has written about their new highly available MySQL cluster built with HAProxy, Consul and Orchestrator. It improves upon their previous architecture by reducing failover downtime from around 30 minutes to 40-60 seconds.
MySQL is the main RDBMS at WePay and runs as multiple clusters serving different services. Their application architecture follows a master-replica model: writes go to the master, while reads, including analytics queries, run against the replicas. When a master fails, a new master is chosen and its identity is propagated to the replicas and the write clients. The total downtime in a failover event spans all of these steps.
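As an illustration of this split, the application layer routes each query to either a master or a replica endpoint. The sketch below is hypothetical; the hostnames, credentials and the choice of the pymysql driver are assumptions, not details from WePay:

```python
import pymysql  # illustrative driver choice; any MySQL client library works the same way

# Illustrative endpoints; in practice these resolve to whatever the current
# topology is (see the HAProxy/Consul discussion below).
MASTER_DSN = {"host": "mysql-master.internal", "user": "app", "password": "secret", "database": "payments"}
REPLICA_DSN = {"host": "mysql-replica.internal", "user": "app", "password": "secret", "database": "payments"}

def write(query, params=()):
    """All writes go to the master."""
    conn = pymysql.connect(**MASTER_DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(query, params)
        conn.commit()
    finally:
        conn.close()

def read(query, params=()):
    """Reads, including analytics queries, go to a replica."""
    conn = pymysql.connect(**REPLICA_DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(query, params)
            return cur.fetchall()
    finally:
        conn.close()
```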
Besides the reduction in downtime, the new architecture is resilient against single-zone failures in Google Cloud. Akshath Patkar, staff site reliability engineer at WePay and author of the article, summarizes that in the new architecture they "simplified and decoupled the failover tasks into smaller chunks".
WePay’s previous failover architecture used Master High Availability Manager (MHA) for master failover, whereas the new one uses Orchestrator, an open source tool maintained by GitHub. HAProxy - a TCP/HTTP load balancer - plays a critical role in both architectures, but the way it is used differs. In a typical configuration, HAProxy distributes requests across a backend "pool" of servers. In the previous architecture, WePay ran MHA with a patched version of HAProxy to handle dynamic changes in the pool configuration. For failover of HAProxy itself, they utilized Google Cloud routes, which could redirect client traffic to a healthy HAProxy instance if there was a failure in another.
HAProxy was a single point of failure, and the problem was compounded when Google Cloud had network issues and the routes could not be updated immediately; the minimum downtime was around 30 minutes. In addition, replication lag was calculated by pt-heartbeat - a tool that measures replication delay - and clock skew between the MHA server and the replicas could lead to miscalculations. It is not clear, however, why such skew could not have been avoided by using NTP.
The new architecture adopts the open source Orchestrator tool, maintained by GitHub’s engineering team and used in its datacenters. Orchestrator chooses a new master when it detects that the existing one has failed. GitHub uses Orchestrator in tandem with Consul, as WePay does, and the new master's details are propagated to the local Consul agents. The Consul key-value store holds information about the current master as well as the replicas, and WePay uses consul-template to watch it and push changes out to the HAProxy configuration.
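Consul exposes such topology data through its HTTP key-value API. The sketch below shows how the current master could be read from a local Consul agent; the key path and value format are assumptions for illustration, and in WePay's setup consul-template consumes these values rather than application code:

```python
import base64
import json
import urllib.request

# Hypothetical key path; WePay's actual Consul schema is not described in the article.
CONSUL_KV_URL = "http://127.0.0.1:8500/v1/kv/mysql/clusters/payments/master"

def current_master():
    """Return the current master's address as stored in the local Consul agent's KV store."""
    with urllib.request.urlopen(CONSUL_KV_URL) as resp:
        entries = json.load(resp)  # Consul returns a JSON list of KV entries
    # KV values come back base64-encoded.
    return base64.b64decode(entries[0]["Value"]).decode()

if __name__ == "__main__":
    print(current_master())  # e.g. "10.0.1.12:3306" (illustrative)
```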
In contrast with the previous architecture, HAProxy runs in two layers in the new one. One HAProxy instance runs with each application client, either as a local process or as a sidecar in their Kubernetes deployments. The application connects to MySQL using the local HAProxy. The remote and the local HAProxy instances run in different zones to avoid split brain scenarios. The remote HAProxy instances are aware of the current topology, and the local ones connect to the remote ones to get updated master information.
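From the application's perspective this makes failover transparent: it always connects to the HAProxy next to it, and only HAProxy's backend changes when a new master is promoted. A minimal sketch, assuming the local HAProxy listens on the MySQL port on localhost (port, credentials and the pymysql driver are illustrative):

```python
import pymysql  # illustrative driver choice

def get_master_connection():
    """Connect to MySQL through the local HAProxy (assumed here to listen on 127.0.0.1:3306).

    The local HAProxy forwards to the remote HAProxy layer, which points at the
    current master, so a failover changes nothing in application code or configuration.
    """
    return pymysql.connect(host="127.0.0.1", port=3306,
                           user="app", password="secret", database="payments")
```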
With this approach, both planned (maintenance) and unplanned failovers can be handled, with some differences in the steps involved. The way replication lag is calculated also changed: pt-heartbeat now runs on each replica, inserts a heartbeat row into the master, and then watches for that row to appear on the replica. Since the same host both writes the timestamp and reads it back, clock skew between servers no longer distorts the measurement.
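The principle can be sketched as follows; this illustrates the technique rather than pt-heartbeat itself (a Percona Toolkit tool), and the table schema, hostnames and credentials are made up:

```python
import time
import pymysql  # illustrative driver choice

# Both functions run on the replica host, so the same clock produces the
# timestamp that is written and the time it is later compared against.
MASTER = {"host": "mysql-master.internal", "user": "heartbeat", "password": "secret", "database": "monitoring"}
LOCAL_REPLICA = {"host": "127.0.0.1", "user": "heartbeat", "password": "secret", "database": "monitoring"}

def write_heartbeat():
    """Refresh a heartbeat row on the master; it replicates like any other write."""
    conn = pymysql.connect(**MASTER)
    try:
        with conn.cursor() as cur:
            cur.execute("REPLACE INTO heartbeat (id, ts) VALUES (1, %s)", (time.time(),))
        conn.commit()
    finally:
        conn.close()

def replication_lag_seconds():
    """Lag is how old the replicated heartbeat row is when read back from the local replica."""
    conn = pymysql.connect(**LOCAL_REPLICA)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT ts FROM heartbeat WHERE id = 1")
            (ts,) = cur.fetchone()
        return time.time() - float(ts)
    finally:
        conn.close()
```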