The recent article about scaling Facebook’s release process covers its flexible methodology of pushing code to production. It focuses on how they moved from a "cherry-picking" to a "push-from-master" strategy over a period of one year. Facebook has shared details of their deployment process earlier too. Chuck Rossi, the author, was the first release engineer at Facebook and is currently engineering director, Release Engineering at Facebook.
Facebook’s release cycle is "quasi-continuous", which is another way of saying that not every commit results in a deployment to production. Instead, tens to hundreds of commits are batched together and pushed every few hours. This tiered manner of releasing makes it easy to roll back any changes.
This new system was rolled out slowly over a period of one year, starting in April 2016. In the previous model, specific changes were selected from the commits in master and put in the release branch, and those changes were pushed to production daily from that branch. The number of such cherry-picked changes ranged from 500 to 1,000 per day. The remaining changes were pushed into a "weekly" release branch. Over time, both the number of commits and the number of engineers involved grew, and the manual effort required from the release engineers became too high to be sustainable.
The key components of this CD system are a controlled approach to who will receive the changes, and automated tools for deployment and measurement. In the first step, changes are pushed out internally to Facebook employees after going through a battery of automated tests. Any regressions discovered at this stage are considered blockers and stop the process. The next step involves a canary deployment which pushes to 2 percent of production. There is continuous monitoring to detect issues. If all goes well, the changes are deployed to 100 percent of production. A tool called Flytrap collects user reports and sends out alerts about any anomalies.
Image Courtesy https://code.facebook.com/posts/270314900139291/rapid-release-at-massive-scale
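The staged rollout described above (employees, then a 2 percent canary, then all of production) can be sketched as a simple gated loop. This is only an illustration under stated assumptions: the `Stage` class, the `run_rollout` function and the `is_healthy` callback are hypothetical names, not Facebook's tooling, and the health check stands in for the automated tests and monitoring the article mentions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Stage:
    name: str      # hypothetical label for the stage
    audience: str  # who receives the change at this stage


# Stage order and the 2 percent / 100 percent figures come from the article;
# the data structure itself is an assumption made for illustration.
STAGES = [
    Stage("employees", "Facebook employees (internal)"),
    Stage("canary", "2 percent of production"),
    Stage("production", "100 percent of production"),
]


def run_rollout(stages: List[Stage], is_healthy: Callable[[Stage], bool]) -> List[str]:
    """Advance through the stages in order and stop at the first regression.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for stage in stages:
        if not is_healthy(stage):
            # A regression is treated as a blocker: nothing later is deployed.
            break
        completed.append(stage.name)
    return completed


if __name__ == "__main__":
    # Simulate a regression detected during the canary stage.
    print(run_rollout(STAGES, lambda stage: stage.name != "canary"))
```

In this sketch a failure at any stage halts the push, which mirrors the article's point that regressions found early are blockers for the stages that follow.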
Facebook's web and mobile products follow two different paths, with native mobile changes being deployed less frequently than web changes. Both are controlled by a system called Gatekeeper, which also separates deployment from release. Such a separation has challenges, including maintaining backwards compatibility.
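The idea of separating deployment from release can be illustrated with a generic feature gate: code ships to production switched off, and is released later by raising a rollout percentage. The sketch below is not Gatekeeper's API, which the article does not describe; `FeatureGate`, `rollout_percent` and the hash-based bucketing are all assumptions chosen for illustration.

```python
import hashlib


class FeatureGate:
    """Hypothetical feature gate: deployed code stays dark until released."""

    def __init__(self, name: str, rollout_percent: int) -> None:
        if not 0 <= rollout_percent <= 100:
            raise ValueError("rollout_percent must be between 0 and 100")
        self.name = name
        self.rollout_percent = rollout_percent

    def is_enabled(self, user_id: str) -> bool:
        # Hash the gate name with the user id so each user lands in a stable
        # bucket from 0 to 99, independent of other gates.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent


if __name__ == "__main__":
    gate = FeatureGate("new_composer", rollout_percent=0)  # deployed, not released
    print(gate.is_enabled("user-42"))  # False: nobody sees the feature yet
    gate.rollout_percent = 100         # released to everyone, no new deployment
    print(gate.is_enabled("user-42"))  # True
```

Because old and new code paths coexist while the gate is partially open, both must keep working against the same data, which is one way to read the backwards-compatibility challenge mentioned above.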
The challenges of mobile continuous deployment are unique due to the nature of tools and deployment options. Web deployments are easier since Facebook owns the entire stack as well as the tools. To get around some of these challenges, Facebook has built tools and libraries focusing on faster mobile development, including Buck, Phabricator, Infer, React and Nuclide. Facebook’s mobile deployment has three layers which run concurrently:
- Builds - All code merged to the mobile master is built, for all affected products (Instagram, Messenger), across chip architectures.
- Static Code Analysis - A combination of linters and a static analysis tool called Infer checks for various issues including resource leaks, unused variables, risky system calls and coding guideline violations.
- Automated Testing - This includes unit, integration and end-to-end tests with tools like Robolectric, XCTest, JUnit and WebDriver.
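The three layers listed above are described as running concurrently. A minimal sketch of that shape, using Python's standard thread pool, is shown below. The functions `build`, `static_analysis` and `automated_tests` are hypothetical placeholders that always succeed; they do not call Buck, Infer or any real test runner.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence, Tuple

Layer = Callable[[str], Tuple[str, bool]]


def build(commit: str) -> Tuple[str, bool]:
    # Placeholder: a real layer would build every affected product here.
    return ("build", True)


def static_analysis(commit: str) -> Tuple[str, bool]:
    # Placeholder: a real layer would run linters and a static analyzer.
    return ("static_analysis", True)


def automated_tests(commit: str) -> Tuple[str, bool]:
    # Placeholder: a real layer would run unit, integration and end-to-end tests.
    return ("automated_tests", True)


LAYERS: Sequence[Layer] = (build, static_analysis, automated_tests)


def check_commit(commit: str, layers: Sequence[Layer] = LAYERS) -> Dict[str, bool]:
    """Run all layers for one commit at the same time and collect the results."""
    with ThreadPoolExecutor(max_workers=len(layers)) as pool:
        futures = [pool.submit(layer, commit) for layer in layers]
        return dict(future.result() for future in futures)


if __name__ == "__main__":
    results = check_commit("abc123")
    print(results, "passed" if all(results.values()) else "failed")
```

Running the layers side by side means the slowest layer, rather than the sum of all three, determines how long a commit waits for feedback.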
The mobile build and test stack runs on every commit as well as multiple times during the life cycle of any code change. There are between 50,000 and 60,000 builds a day for Android alone. The mobile deployment system follows the older web-based model of releasing once per week, with cherry-picked changes. Despite the growth in code delivery velocity and release frequency, engineer productivity has remained constant. However, the criteria mentioned in the article - lines of code and number of pushes - may not be the best measure of productivity.
According to a 2016 IEEE paper and the related talk, Facebook was utilizing a form of CD as early as 2005. The paper's conclusions list the prerequisites for CD to succeed - considerable and continuous investment, highly skilled developers, strong technical management, an empowered culture, risk-reward tradeoff management, objective retrospectives of failures, and smaller, focused teams.
Facebook's quasi-continuous deployment system has several advantages - there is no manual overhead for pushing hotfixes, there is better support for distributed engineering teams, and engineers get faster feedback from users.