How GitHub Partitioned Its Relational Database to Improve Reliability at Scale

GitHub has been working for the past several years on partitioning their relational database and moving the data to multiple independent clusters. This effort led to a 50% load reduction and a significant reduction of database-related incidents, explains GitHub engineer Thomas Maurer.

GitHub architecture originally relied on a single MySQL database cluster, known as mysql1.

With GitHub’s growth, this inevitably led to challenges. We struggled to keep our database systems adequately sized, always moving to newer and bigger machines to scale up. Any sort of incident negatively affecting mysql1 would affect all features that stored their data on this cluster.

To improve this situation, GitHub engineers defined a strategy to partition their relational database without impairing the level of service. Key to this was the definition of virtual partitions of database schemes and the use of SQL linters.

Virtual partitions are groups of tables used together in queries and transactions. Identifying table groups that could be separated in the application layer was the first step to a smooth transition. To codify the relation among tables, GitHub engineers introduced the notion of schema domains. Below, you can see an example of schema domain, defined in a simple YAML file:

gists:
  - gist_comments
  - gists
  - starred_gists
repositories:
  - issues
  - pull_requests
  - repositories
users:
  - avatars
  - gpg_keys
  - public_keys
  - users

This notation was the basis for the application of SQL linters.

[Linters] identify any violating queries and transactions that span schema domains by adding a query annotation and treating them as exemptions. If a domain has no violations, it is virtually partitioned and ready to be physically moved to another database cluster.

Linters were applied to both queries and transactions. Query linters ensured that only tables belonging to the same schema domain were referenced in the same query. Transaction linters were used to identify transactions that needed to be reviewed or required a data model update.

Once virtual partitions were identified through schema analysis, the related table could be physically moved to different database clusters. For this final step, it was paramount for GitHub to avoid any downtime.

After an initial investigation of Vitess, a scaling layer running on top of MySQL, GitHub engineers opted to implement their own approach, dubbed write-cutover. This consisted in adding a new replica to a cluster then running a script to stop replication and make that replica independent from the original cluster. Using this approach, GitHub moved 130 of their most critical tables at once.

Thanks to this effort, explains Maurer, GitHub database was spread across several clusters. This made it possible to increase handled queries by over 30% while the average load on each host was halved. Additionally, GitHub engineers observed a significant reduction in the number of database-related incidents.

Maurer's article contains more details than can be provided here, especially concerning their write-cutover procedure. Do not miss his write-up to get the full picture.

InfoQ Software Architects' Newsletter

Write for InfoQ

Rate this Article

This content is in the Relational Databases topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter