Going beyond publishing post-mortems of major incidents, GitHub recently introduced the Availability Report. Emphasizing the belief that the industry grows collectively by learning from one another, Keith Ballinger, SVP of engineering at GitHub, published a report covering the incidents of May and June 2020.
Explaining the background of this initiative, Ballinger wrote, "We strive to engineer systems that are highly available and fault-tolerant and we expect that most of these monthly updates will recap periods of time where GitHub was >99% available." On the first Wednesday of every month, the Microsoft-owned company will publish a report summarizing its availability.
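To put the ">99% available" figure in perspective, the arithmetic can be sketched as below. This is only an illustration, assuming availability is measured as the plain share of uptime over a calendar period; the report does not say how GitHub computes the number, and the function names are hypothetical.

```python
def downtime_budget_hours(availability_pct: float, days_in_period: int) -> float:
    """Hours of downtime allowed in a period at a given availability percentage.

    Assumption: availability = uptime / total time, over whole calendar days.
    """
    total_hours = days_in_period * 24
    return total_hours * (1 - availability_pct / 100)


def availability_pct(downtime_hours: float, days_in_period: int) -> float:
    """Availability percentage implied by a given amount of downtime."""
    total_hours = days_in_period * 24
    return 100 * (1 - downtime_hours / total_hours)


if __name__ == "__main__":
    # Downtime budget for a 30-day month at exactly 99% availability.
    print(round(downtime_budget_hours(99.0, 30), 2))
```

Under these assumptions, 99% availability over a 30-day month leaves a budget of about 7.2 hours of downtime. Partial degradation, as opposed to a full outage, would need a weighting that this sketch does not attempt.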
The report will not only describe incidents but also highlight what is being done to advance GitHub's engineering systems and practices. Real-time updates will continue to appear on the status page.
GitHub appears to be placing an increasing focus on transparency. The organization has also released a public roadmap, which provides more information about which features GitHub is working on and when it expects to ship them.
As described in the availability report, there were two incidents in each of May and June 2020. In each month, one of the incidents was related to a MySQL instance. Conversations on Reddit and Hacker News debated whether using Postgres would be beneficial for GitHub. Earlier this year, in February, unexpected database load caused multiple service interruptions for GitHub over eight hours, affecting the main database cluster, mysql1.
GitHub has communicated that it will continue to investigate the CPU starvation issue and to rely on its automated failover systems to reduce recovery time. Participants in the Hacker News discussion also sympathized, noting that the incidents resulted from unexpected edge cases that would have been very difficult to reproduce outside of production (e.g., an overflowing primary key index).
Applauding this initiative, the Twitter account Micro Services tweeted, "We're happy to see @github publish an availability report. The world has come to rely on GitHub for source code management. It's where we all store our DNA as technology companies [...]."
Following through on its commitment, GitHub published its availability report for July 2020, in which Ballinger describes the sequence of events that caused a four-and-a-half-hour outage on July 13th. The outage occurred at the start of the week, and GitHub Issues, Actions, Pages, Packages, and API requests reported "degraded performance" for its duration.