GitHub has been quietly rolling out DGit, short for “distributed Git”, a new distributed storage system built on top of Git that aims to improve the reliability, availability, and performance of GitHub.
DGit is an application-level protocol that leverages Git’s distributed nature by keeping three copies of every repository on three different, independently chosen servers. This simple architecture has a number of immediate benefits in terms of reliability, availability, and performance, according to GitHub:
- it is highly unlikely that all three servers hosting a repository become unavailable at the same time, assuming the three servers fail independently of one another;
- user requests can be load-balanced across the three servers; since the majority of requests are reads, they can be served immediately by any replica, without any cross-server synchronization, which yields close to a 3x performance improvement (see the sketch after this list);
- “fate sharing” among repositories is greatly reduced. Fate sharing is the performance degradation a repository suffers because it shares a server with another repository that is very popular or very large. Since replicas live on independently chosen servers, it is highly unlikely that all three of them are overloaded at the same time, which makes it possible to serve a request from a less loaded server.
- replica servers need not be close together, so they can be located in different availability zones and/or data centers. Beyond the obvious availability improvement, this also brings better performance for geographically closer users.
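GitHub has not published DGit’s internals, but the replication model described above can be illustrated with a minimal sketch; the `Replica` and `ReplicatedRepo` classes below are purely hypothetical. Any single replica can answer a read on its own, while a write must be applied to all three.

```python
import random

class Replica:
    """Hypothetical server holding a full copy of a repository."""
    def __init__(self, name):
        self.name = name
        self.refs = {}                      # ref name -> commit SHA

    def read_ref(self, ref):
        return self.refs.get(ref)

    def write_ref(self, ref, sha):
        self.refs[ref] = sha
        return True

class ReplicatedRepo:
    """Illustration only: one repository kept on three independent servers."""
    def __init__(self, replicas):
        assert len(replicas) == 3
        self.replicas = replicas

    def read(self, ref):
        # Reads are load-balanced: any one replica can answer on its own,
        # no cross-server synchronization is needed.
        return random.choice(self.replicas).read_ref(ref)

    def write(self, ref, sha):
        # Writes must reach every replica; DGit coordinates this with a
        # commit protocol (see below), here we simply apply it to all three.
        return all(r.write_ref(ref, sha) for r in self.replicas)

repo = ReplicatedRepo([Replica("server-a"), Replica("server-b"), Replica("server-c")])
repo.write("refs/heads/main", "3f5a0c1")
print(repo.read("refs/heads/main"))         # served by whichever replica was picked
```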
DGit allows GitHub to do away with the backup-based scheme it previously used (and still uses while the DGit rollout is in progress). For each active file server, that scheme required a dedicated spare server connected by a cross-over cable, to which data were synchronized using DRBD.
Moving away from this scheme brings additional benefits to GitHub’s overall operations:
- when a server fails, the only measures required are routing all pending requests to a new server and rebooting the failed one (see the sketch after this list);
- additionally, replacing a failed server is no longer urgent, since two working replicas remain and their repositories can quickly be re-replicated to a third server;
- not keeping dedicated spare servers means that GitHub can devote every available CPU and all available memory to serving user requests;
- DGit makes it much simpler to manage GitHub’s infrastructure, e.g., adding new servers or handling repositories that become very large or very popular.
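How GitHub detects failures and selects replacement servers has not been described in detail; the sketch below, with a hypothetical `Server` class and `handle_failure` helper, only illustrates the two steps listed above: requests are re-routed to the surviving replicas immediately, and the third copy is rebuilt in the background without urgency.

```python
class Server:
    """Hypothetical file server holding replicas of many repositories."""
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.repos = set()

def handle_failure(failed, replica_map, spares):
    """Hypothetical failover: replica_map maps repo name -> list of Servers."""
    for repo, servers in replica_map.items():
        if failed not in servers:
            continue
        survivors = [s for s in servers if s is not failed and s.alive]
        # 1. Keep serving: pending requests go to the two surviving replicas.
        # 2. Restore the replication factor in the background: copy the repo
        #    to a fresh server. Not urgent, since two healthy copies remain.
        replacement = min(spares, key=lambda s: len(s.repos))
        replacement.repos.add(repo)          # stands in for the actual data copy
        replica_map[repo] = survivors + [replacement]

# Example: server "b" fails while hosting repo "octocat/hello-world".
a, b, c, d = (Server(n) for n in "abcd")
replicas = {"octocat/hello-world": [a, b, c]}
b.alive = False
handle_failure(b, replicas, spares=[d])
print([s.name for s in replicas["octocat/hello-world"]])   # ['a', 'c', 'd']
```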
DGit, as mentioned, is built on top of Git itself and does not rely on RAID, DRBD, or other replication technologies. Instead, it implements its own algorithms for serializability, locking, failure detection, and resynchronization. In conversation with InfoQ, GitHub explained that it uses the three-phase commit protocol to handle distributed transactions and that “DGit has almost eliminated service interruptions from the Git layer caused by single host or whole rack outages”.
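GitHub has not published its transaction layer, so the following is a minimal, generic sketch of a three-phase commit (can-commit, pre-commit, do-commit) for a ref update across three replicas; all class and function names are illustrative, not GitHub’s actual code.

```python
class Participant:
    """Hypothetical replica taking part in a three-phase commit."""
    def __init__(self, name):
        self.name = name
        self.state = "init"
        self.pending = None

    def can_commit(self, update):
        # Phase 1: vote. A real replica would check locks, disk space, etc.
        self.pending = update
        self.state = "ready"
        return True

    def pre_commit(self):
        # Phase 2: promise to commit; the update is durably staged.
        self.state = "prepared"
        return True

    def do_commit(self):
        # Phase 3: apply the staged update.
        self.state = "committed"
        return self.pending

    def abort(self):
        self.pending = None
        self.state = "aborted"

def three_phase_commit(participants, update):
    """Coordinator: returns True only if every replica applied the update."""
    if not all(p.can_commit(update) for p in participants):      # phase 1
        for p in participants:
            p.abort()
        return False
    if not all(p.pre_commit() for p in participants):             # phase 2
        for p in participants:
            p.abort()
        return False
    for p in participants:                                        # phase 3
        p.do_commit()
    return True

replicas = [Participant(n) for n in ("server-a", "server-b", "server-c")]
ok = three_phase_commit(replicas, ("refs/heads/main", "3f5a0c1"))
print(ok, [p.state for p in replicas])   # True ['committed', 'committed', 'committed']
```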
As mentioned, GitHub has been gradually rolling out DGit over the last few months, starting with its own repositories, then moving to a number of high-profile public repositories once it gained confidence in the new system. Finally, last February GitHub began moving repositories in bulk. Currently, about 60% of all repositories and 98% of all Gists, accounting for 67% of all GitHub data, are already served from DGit, and “import jobs to DGit from the legacy storage architecture are running around the clock”, GitHub told InfoQ.