Michael Haggerty, infrastructure engineer at GitHub, has published a blog post explaining how GitHub has engineered Spokes, its replication system, to function over large distances. The work includes reducing round trips, introducing a three-phase commit, optimising reference update performance and various other tweaks.
Haggerty explains that GitHub requires repositories to be replicated across data centers in order to maximise resilience and reduce latency. In the event of a disaster, a replica needs to be available in another geographic location, and in order to improve performance, the closest replica needs to be served to the user.
Spokes is the system developed by GitHub for performing this replication of user repositories and ensuring that they are in sync. It functions as a proxy, transparently performing this replication at the Git application level. Haggerty explains that it originally required replicas to be close together in order to function, but over time this was overcome by reducing latency and optimising reference updates.
One of the original problems with spreading the replicas apart was the increased latency, which limits the maximum rate of Git reference updates that Spokes can sustain. Haggerty points out that whilst this wouldn’t be a problem for most users, certain Git workflows are push-heavy enough to make it one:
“Well, most users don’t push often at all. But if you host nearly 70 million repositories, you will find that some projects use workflows that you would never have foreseen. We work very hard to make GitHub Just Work for all but the most ludicrous use cases.”
He also explains that GitHub generates many reference updates of its own. For example, when the site tells you whether a pull request can be merged or rebased, the answer is based on internal test attempts. If there are many pull requests against a certain branch and that branch is pushed to, those tests must be repeated for every one of the pull requests.
One way of overcoming latency was to reduce network round trips, writes Haggerty. GitHub uses the three-phase commit protocol to update replicas, together with a distributed lock to ensure that updates are applied in the correct order. He explains that whilst this produces four round trips, it’s not too expensive. The engineers also aim to make sure useful work is being done whilst waiting for network calls to complete.
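The general shape of a three-phase commit guarded by a lock can be sketched as follows. This is a minimal, in-memory illustration and not Spokes' actual implementation: the `Replica` class, its method names, the `healthy` flag and `replicate_update` are all hypothetical, and a local `threading.Lock` stands in for the distributed lock. In a real deployment each phase would cost a network round trip to every replica.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Replica:
    """Hypothetical in-memory replica: a map of ref name -> object id."""
    name: str
    refs: dict = field(default_factory=dict)
    healthy: bool = True
    staged: Optional[dict] = None

    def can_commit(self, update: dict) -> bool:
        # Phase 1: vote on whether the update could be applied.
        return self.healthy

    def pre_commit(self, update: dict) -> None:
        # Phase 2: stage the update without making it visible.
        self.staged = dict(update)

    def do_commit(self) -> None:
        # Phase 3: make the staged update visible.
        self.refs.update(self.staged or {})
        self.staged = None

    def abort(self) -> None:
        # Discard anything that was staged.
        self.staged = None


# Stand-in for a distributed lock (assumption: one lock per repository).
repo_lock = threading.Lock()


def replicate_update(replicas: list, update: dict) -> bool:
    """Apply `update` to every replica, or to none of them."""
    with repo_lock:  # serialise updates so replicas see the same order
        if not all(r.can_commit(update) for r in replicas):
            for r in replicas:
                r.abort()
            return False
        for r in replicas:
            r.pre_commit(update)
        for r in replicas:
            r.do_commit()
        return True
```

If any replica votes against the update in the first phase, nothing is staged or committed anywhere, so the replicas stay identical.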
Haggerty also explains that GitHub engineers made open source contributions to the Git project. These included transactions for reference updates, which can be committed or rolled back depending on the replicas' ability to carry out a reference update. Another contribution was a set of speedups to a number of operations related to reference updates.
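The all-or-nothing behaviour of a reference update transaction can be illustrated with a small sketch. It is a toy model over a dictionary, not Git's implementation: `apply_ref_transaction`, `RefTransactionError` and the `(ref, expected_old, new)` tuple format are hypothetical names chosen for the example.

```python
class RefTransactionError(Exception):
    """Raised when a ref does not hold the value the caller expected."""


def apply_ref_transaction(refs: dict, updates: list) -> None:
    """Apply every (ref, expected_old, new) update, or none of them.

    `expected_old` is None when the ref is expected not to exist yet.
    """
    staged = {}
    for ref, expected_old, new in updates:
        # Verify against the current state before touching anything.
        if refs.get(ref) != expected_old:
            raise RefTransactionError(
                f"{ref}: expected {expected_old!r}, found {refs.get(ref)!r}"
            )
        staged[ref] = new
    # Every check passed, so commit all the updates together.
    refs.update(staged)
```

Because nothing is written until every check has passed, a failed transaction leaves the references exactly as they were, which is what makes a rollback across replicas safe.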
To compare replicas, Haggerty explains, Spokes compares a custom checksum of each one. If the checksums match, the replicas contain the same content. The checksums are also cheap to compute because they are updated incrementally rather than calculated from scratch.
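One way to build a checksum that can be updated incrementally is to XOR together a hash of every reference entry, so that changing one reference only requires XORing out its old entry and XORing in the new one. The sketch below assumes that construction purely for illustration; the article does not describe the checksum Spokes actually uses, and `RefChecksum` is a hypothetical name.

```python
import hashlib
from typing import Optional


def _entry_digest(ref: str, oid: str) -> int:
    # Hash one (ref name, object id) pair into a 256-bit integer.
    data = ref.encode() + b"\0" + oid.encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


class RefChecksum:
    """XOR of per-entry hashes; independent of the order of entries."""

    def __init__(self, refs: Optional[dict] = None):
        self.value = 0
        for ref, oid in (refs or {}).items():
            self.value ^= _entry_digest(ref, oid)

    def update(self, ref: str, old_oid: Optional[str], new_oid: Optional[str]) -> None:
        # Constant-time update: remove the old entry, add the new one.
        if old_oid is not None:
            self.value ^= _entry_digest(ref, old_oid)
        if new_oid is not None:
            self.value ^= _entry_digest(ref, new_oid)
```

Two replicas that have applied the same updates end up with the same `value` regardless of how many references they hold. A plain XOR like this is not hardened against deliberately crafted collisions, so it is only a sketch of the incremental idea.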
Bookkeeping updates are also coalesced into fewer transactions. Haggerty explains that this helps in situations where a single commit could cause hundreds of bookkeeping reference updates, as a single transaction can take a third of a second.
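The effect of coalescing can be shown with a short sketch that groups pending updates into batches, each of which would be sent as one transaction. The function name `coalesce_updates` and the `batch_size` parameter are hypothetical, and the batching policy is an assumption rather than a description of what Spokes does.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

RefUpdate = Tuple[str, str]  # (ref name, new object id)


def coalesce_updates(updates: Iterable[RefUpdate], batch_size: int) -> Iterator[List[RefUpdate]]:
    """Yield lists of at most `batch_size` updates, one list per transaction."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(updates)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

With 300 pending bookkeeping updates, a `batch_size` of 1 produces 300 transactions while a `batch_size` of 300 produces one. If each transaction cost around a third of a second, that would be the difference between roughly 100 seconds and a fraction of a second.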
The full blog post is available to read online and explains the work undertaken in more detail.