The GitHub engineering team recently blogged about how they redesigned their deployment approach after the rapid growth of the engineering organization exposed problems with the existing tooling.
Julian Nadeau, senior software developer at GitHub, provided an overview of the revamped deployment experience at GitHub. A year ago, GitHub employed a branch-deploy model using ChatOps to release changes before merging them into the main branch. Using a Slack channel (#dotcom-ops), developers added their changes to a queue and monitored the queue's status. As the team grew, so did the volume of deployment-related messages in that same Slack channel.
Using a canary stage, code changes were first released to 2% of GitHub.com traffic and then to 100%. With only a single, small canary stage, it was not always possible to catch all issues before the release reached 100% of traffic; eventually, an incident would be opened, triggering the need to roll back.
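The blog post does not detail how GitHub splits traffic, but percentage-based canary routing is commonly implemented by deterministically hashing a request or user identifier into a bucket. The following TypeScript sketch illustrates the general technique; the names `bucketFor` and `routeToCanary` and the hashing scheme are assumptions for illustration, not GitHub's code:

```typescript
import { createHash } from "crypto";

// Deterministically bucket an identifier into [0, 100) so the same
// user consistently hits the same deployment during a rollout.
function bucketFor(requestId: string): number {
  const digest = createHash("sha256").update(requestId).digest();
  return digest.readUInt32BE(0) % 100;
}

// Route to the canary deployment when the identifier's bucket falls
// below the current rollout percentage (e.g. 2 for a 2% canary).
function routeToCanary(requestId: string, rolloutPercent: number): boolean {
  return bucketFor(requestId) < rolloutPercent;
}

// Example: a 2% canary stage lets through roughly 2 in 100 buckets.
console.log(routeToCanary("user-42:req-1", 2));
```

Hashing an identifier, rather than picking randomly per request, keeps a given user pinned to one version of the code for the duration of a stage.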
With the growing GitHub team, the challenges of this multi-step, information-heavy ChatOps deployment process became more prominent. The GitHub engineering team identified two problem areas: monitoring deployed changes, and catching more issues in earlier release stages to minimize the impact radius of production incidents.
To address these issues, the team came up with a single ChatOps command to initiate the entire deploy sequence and a user interface (UI) to provide an overview of any given deployment.
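GitHub's ChatOps is built on its open-source Hubot bot framework. As a rough illustration of what a single deploy command might look like, here is a hypothetical Hubot-style handler in TypeScript; the command syntax and the `deployer` client are assumptions for the sketch, not GitHub's actual tooling:

```typescript
// Stub deploy client (hypothetical) so the sketch is self-contained.
const deployer = {
  async start(opts: { repo: string; pullRequest: number }) {
    return { id: "deploy-123", dashboardUrl: "https://example.test/deploys/123", ...opts };
  },
};

// Hubot scripts export a function that registers listeners on the robot.
module.exports = (robot: any) => {
  robot.respond(/deploy ([\w.-]+)#(\d+)/i, async (res: any) => {
    const [, repo, pr] = res.match;
    // One command kicks off the whole pipeline (canary stages through
    // production); progress is then tracked in the deployment UI.
    const deploy = await deployer.start({ repo, pullRequest: Number(pr) });
    res.reply(`Deploy ${deploy.id} started: ${deploy.dashboardUrl}`);
  });
};
```

The key design point is that the developer issues one command and receives a link to a dashboard, instead of shepherding each stage through a queue of Slack messages.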
Source: https://github.blog/2021-01-25-improving-how-we-deploy-github/
A second canary stage was introduced at 20% of traffic, reducing the risk of moving directly to the 100% production stage. As shown in the above diagram, the stages are separated by automated five-minute timers, with the ability to pause the deployment to allow more time for testing. When a timer completes, an automated pointer progresses to the next stage.
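A minimal sketch of this staged rollout, assuming a simple timer-driven state machine: a pointer advances through the stages on an automated five-minute timer unless the deploy is paused. The stage names follow the article; the `StagedDeploy` class and its API are illustrative assumptions:

```typescript
const STAGES = ["canary 2%", "canary 20%", "production 100%"] as const;
const STAGE_TIMER_MS = 5 * 60 * 1000; // automated 5-minute timer

class StagedDeploy {
  private index = 0; // pointer into STAGES
  private timer?: ReturnType<typeof setTimeout>;

  start(): void {
    console.log(`Deploying to ${STAGES[this.index]}`);
    this.scheduleAdvance();
  }

  // Pause the automated timer to allow more time to test this stage.
  pause(): void {
    if (this.timer) clearTimeout(this.timer);
    this.timer = undefined;
  }

  // Resume automatic progression after a pause.
  resume(): void {
    this.scheduleAdvance();
  }

  private scheduleAdvance(): void {
    if (this.index >= STAGES.length - 1) return; // fully rolled out
    this.timer = setTimeout(() => {
      this.index += 1;
      console.log(`Advancing to ${STAGES[this.index]}`);
      this.scheduleAdvance();
    }, STAGE_TIMER_MS);
  }
}

new StagedDeploy().start();
```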
The UI below provides an overview of deployments, replacing the heap of messages in the Slack channel. The stages highlighted in the earlier diagram can be seen on the right.
Source: https://github.blog/2021-01-25-improving-how-we-deploy-github/
The system can still be started and monitored from Slack, keeping the change to how developers interact with the deploy system minimal.
The tech community on Hacker News took note of this, sparking an interesting conversation.
One conversation thread mused on GitHub's approach of deploying a pull request (PR) to canary and then production before merging it, as opposed to merging first and then releasing to the public. Brooks Swinnerton, engineering manager at GitHub, responded by explaining GitHub flow and mentioning GitHub's Hubot.
Another thread compared the use of a Slack-based deployment system with CircleCI or Jenkins.
In this thread, user Xorlev commented that "...canary stages are just 5 minutes. Many problems take longer to manifest. That seems like a fairly risky release process." Nadeau (jules2689) provided an elaborate response, highlighting the trade-off between moving quickly and allowing each stage enough time to surface issues.
The GitHub engineering team has received much positive internal feedback on this automated system. The blog post is part of the Building GitHub series, which provides a deep dive into GitHub's engineering organization. InfoQ has previously covered this series.