The production data loss and hours of downtime at GitLab is an unfortunate and fascinating story about how little things, from spam to engineer fatigue, can coalesce into something more catastrophic.
Anecdotes started to trickle in on January 31st, but a single tweet confirmed that something was amiss at GitLab.com:
We accidentally deleted production data and might have to restore from backup. Google Doc with live notes https://t.co/EVRbHzYlk8
— GitLab.com Status (@gitlabstatus) February 1, 2017
"Deleted production data" are not words any IT worker wants to hear, but it happens and that's why backups are so crucial to the operation of any production service. Unfortunately, as the team toiled through the night to restore service, the bad news got worse.
According to a post outlining what happened, the trouble started with replication issues caused by spammers "hammering the database by creating snippets, making it unstable". Three hours later, the database couldn't keep up anymore and the site crashed.
Working late into the evening hours, an engineer attempting to resolve the problem made an unfortunate mistake and accidentally deleted the data on the primary cluster machine:
At 2017/01/31 11pm-ish UTC, team-member-1 thinks that perhaps pg_basebackup is refusing to work due to the PostgreSQL data directory being present (despite being empty), decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com.
At 2017/01/31 11:27pm UTC, team-member-1 - terminates the removal, but it’s too late. Of around 300 GB only about 4.5 GB is left.
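The root cause was an ordinary destructive command run on the wrong host. As a rough illustration only, not GitLab's actual tooling, here is a minimal sketch in Python of the kind of hostname guard that can catch this class of mistake; the data directory path is a hypothetical assumption, and the host names come from the incident notes:

```python
import socket
import shutil
import sys

# Hypothetical guard, not GitLab's tooling: refuse to wipe a PostgreSQL
# data directory unless we are running on the host the operator named.
EXPECTED_HOST = "db2.cluster.gitlab.com"      # the secondary we intend to reset
DATA_DIR = "/var/opt/gitlab/postgresql/data"  # assumed data directory path

def wipe_data_dir(expected_host: str, data_dir: str) -> None:
    actual_host = socket.getfqdn()
    if actual_host != expected_host:
        # Running on the wrong machine (e.g. the primary): stop immediately.
        sys.exit(f"Refusing to run: on {actual_host!r}, expected {expected_host!r}")
    shutil.rmtree(data_dir)
    print(f"Removed {data_dir} on {actual_host}")

if __name__ == "__main__":
    wipe_data_dir(EXPECTED_HOST, DATA_DIR)
```

A check like this costs one line of intent (the expected host) and turns "I ran it on the wrong box" from a data-loss event into an error message.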
As the team worked to discover what backups were available to restore, each option ended in a dead end.
- LVM (Logical Volume Management) snapshots were only taken once every 24 hours by default
- Regular backups only occurred once every 24 hours, and they weren't working
- Disk snapshots weren't running on the Azure machines running the databases
- Backups to S3 in AWS were empty
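Every one of these safety nets failed quietly, which is why "test your backups" became the refrain afterward. As a rough illustration (not GitLab's setup), a small check like the following would flag empty or stale S3 backups before they are needed; it assumes boto3 with AWS credentials configured, a hypothetical bucket name and key prefix, and a daily backup schedule:

```python
from datetime import datetime, timedelta, timezone

import boto3  # assumes AWS credentials are available in the environment

# Hypothetical backup check, not GitLab's tooling: alert if the newest
# backup object in an S3 bucket is empty or older than the backup interval.
BUCKET = "example-db-backups"   # assumed bucket name
PREFIX = "postgres/"            # assumed key prefix for database dumps
MAX_AGE = timedelta(hours=24)   # backups are expected at least daily

def latest_backup_ok(bucket: str, prefix: str, max_age: timedelta) -> bool:
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    objects = resp.get("Contents", [])
    if not objects:
        print("No backup objects found at all")
        return False
    newest = max(objects, key=lambda o: o["LastModified"])
    age = datetime.now(timezone.utc) - newest["LastModified"]
    if newest["Size"] == 0:
        print(f"Latest backup {newest['Key']} is empty")
        return False
    if age > max_age:
        print(f"Latest backup {newest['Key']} is {age} old")
        return False
    return True

if __name__ == "__main__":
    if not latest_backup_ok(BUCKET, PREFIX, MAX_AGE):
        raise SystemExit(1)
```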
By chance, an engineer had made an LVM snapshot six hours prior to the deletion. Without this serendipity, even more data would have been gone forever.
Throughout the entire event, the GitLab team was completely transparent, posting live updates to a Google doc so the community could follow along. In addition, they had a live video stream of the engineers working through the restore process.
Roughly 18 hours after the database went down, GitLab.com was back online:
https://t.co/r11UmmDLDE should be available to the public again.
— GitLab.com Status (@gitlabstatus) February 1, 2017
The community was both supportive and critical of the team. Some posted messages of condolence and praised GitLab for its transparency. Hacker News user js2 said the feeling was familiar: "If you're a sys admin long enough, it will eventually happen to you that you'll execute a destructive command on the wrong machine." Others were less charitable.
Despite GitLab's loss, the community used the incident as a reminder to test backups, says David Haney, Engineering Manager at Stack Overflow:
GitLab got this part right, and are being heralded as a great example and learning experience in the industry instead of spited for mysterious downtimes and no communication. I promise you that this week, many disaster recovery people are doing extra backup tests that they wouldn’t have thought to do otherwise – all as a direct result of the GitLab incident.
Others teased that February 1st should become Check Your Backups Day.
GitLab started in 2011 as an open source alternative to the dominant player, GitHub. It has a hosted version at GitLab.com as well as self-hosted community and enterprise editions. Only the GitLab.com hosted service was impacted by the failure.