After the damage was done and the air started to clear, GitLab wrote a post wrapping up how the whole incident started, what caused their 18-hour site outage, and how they planned to move forward.
The massive database load was first diagnosed as an influx of spam. On further review, though, it became clear the incident was exacerbated by a troll reporting a GitLab employee for abuse. The flagged account was then accidentally deleted because another employee reviewing abuse reports didn't realize it belonged to one of the team's own engineers:
We would later find out that part of the load was caused by a background job trying to remove a GitLab employee and their associated data. This was the result of their account being flagged for abuse and accidentally scheduled for removal.
The affected engineer wrote in a bug report that his account was deleted because "we received a spam report from a user that was created 10 minutes before the spam report was made. A human error was made and all my projects were deleted".
Due to the increased database load, replication from the primary to the secondary stopped because the primary purged Write-Ahead Log (WAL) segments before the secondary had processed them. Unfortunately, WAL archiving, which would have kept each segment until it had been safely archived, wasn't turned on.
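For reference, continuous WAL archiving is only a few lines of PostgreSQL configuration: with archive_mode on, the primary won't recycle a segment until archive_command reports success. The sketch below uses placeholder paths and is not GitLab's actual setup.

```
# postgresql.conf -- continuous WAL archiving (illustrative paths, not GitLab's setup)
wal_level = replica          # emit enough WAL detail for replication and archiving
archive_mode = on            # segments are not recycled until archive_command succeeds
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
                             # %p = path to the completed segment, %f = its file name
```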
Because replication had stopped, the secondary needed a rebuild. Re-seeding a standby requires starting from an empty data directory, so an engineer manually cleared it out. Only it wasn't the secondary's data directory: he had accidentally removed the primary database's data directory.
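Re-seeding a standby usually looks something like the sketch below, which is exactly why running the cleanup step on the wrong host is catastrophic. Host names and paths are placeholders; this is not GitLab's exact procedure.

```
# On the *secondary*: re-seed the standby from the primary
# (illustrative sketch with placeholder host names and paths)
sudo systemctl stop postgresql
rm -rf /var/lib/postgresql/9.6/main/*      # the step that is catastrophic on the wrong host
pg_basebackup -h primary.example.com -U replication \
    -D /var/lib/postgresql/9.6/main -X stream -P -R
sudo systemctl start postgresql
```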
The loss of the primary database should have taken the site down only for a short period, but things would get much worse for the GitLab team. While working to restore the data, they discovered that their backups weren't working. Their primary backup method, pg_dump, wasn't backing anything up: it was failing because of a version mismatch between the pg_dump utility and the database server. Nobody knew it was failing because the notification emails were blocked by the receiving server for not supporting DMARC.
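For illustration, this is roughly how such a failure surfaces: pg_dump refuses to dump from a server newer than itself, so a mismatched binary makes every scheduled backup fail. The versions and database name below are placeholders, not necessarily the ones involved in the incident.

```
$ pg_dump --version
pg_dump (PostgreSQL) 9.2.18
$ psql -At -c 'SHOW server_version;'
9.6.1
$ pg_dump mydb > backup.sql
pg_dump: server version: 9.6.1; pg_dump version: 9.2.18
pg_dump: aborting because of server version mismatch
```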
Other backup options weren't available for various reasons. For example, the team wasn't using Azure disk snapshots on the database, and even if they had, restoring from one might have taken even longer to get the data back online:
Each storage account has a limit of roughly 30 TB. When restoring a snapshot using a host in the same storage account, the procedure usually completes very quickly. However, when using a host in a different storage account the procedure can take hours if not days to complete.
The only remaining option was to restore from the LVM snapshot taken six hours before the incident started.
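For reference, an LVM snapshot is created, and later merged back into its origin volume, with commands along these lines; the volume group and logical volume names are placeholders rather than GitLab's actual layout.

```
# Create a snapshot of the logical volume holding the database files
# (volume group and LV names are placeholders)
lvcreate --size 20G --snapshot --name pgdata_snap /dev/vg0/pgdata

# Roll the origin volume back to the snapshot state by merging it
lvconvert --merge /dev/vg0/pgdata_snap
# If the origin volume is in use, the merge starts the next time it is activated
```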
The team has come up with 14 issues to fix and improve their recovery procedures, and progress is being made. In the two weeks since the incident, they have already implemented WAL-E to archive WAL segments in real time to AWS S3. In testing, a point-in-time restore from this type of backup took just under two hours. In addition, they are working on a system to automatically test recovery of PostgreSQL backups.
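A typical WAL-E setup is wired up roughly as in the sketch below: each completed WAL segment is pushed to S3 as soon as PostgreSQL hands it off, and a restore replays those segments on top of a periodic base backup. The environment directory, paths, and recovery target shown are placeholders, not GitLab's actual configuration.

```
# postgresql.conf: push each completed WAL segment to S3 via WAL-E
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# Periodic base backup that the archived segments are replayed on top of
envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.6/main

# Point-in-time recovery: fetch the latest base backup, then replay WAL to a target time
envdir /etc/wal-e.d/env wal-e backup-fetch /var/lib/postgresql/9.6/main LATEST
# recovery.conf:
#   restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
#   recovery_target_time = '<target timestamp>'
```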