Instagram recently published an article about their continuous deployment (CD) pipeline, which lets them push code to production faster, identify bad commits easily, and always be release-ready. Built iteratively over time, the pipeline rests on a few key principles: a high-quality test suite, quick identification of bad commits, visibility at each stage to improve buy-in from stakeholders, and a working rollback plan.
Before Instagram had the system in place, the rollout process was a mish-mash of manual steps and scripts. Deployment was done on one machine first as a sanity check. A rudimentary release tracking application called Sauron, consisting of a UI and a database, was in place for viewing the results of previous rollouts. The rollout itself was scripted in Fabric, an SSH-based automation tool.
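As a rough illustration of what such a scripted rollout can look like, here is a minimal Fabric 1.x task; the host name, paths, and restart command are hypothetical, not Instagram's actual script:

```python
# Hypothetical Fabric 1.x rollout task; the host, paths and restart
# command are made up for illustration.
from fabric.api import env, run, task

env.hosts = ['web-canary.example.com']  # deploy to one machine first

@task
def deploy(commit_sha):
    # Check out the commit to be rolled out on the target machine.
    run('cd /srv/app && git fetch origin && git checkout %s' % commit_sha)
    # Restart the application so the new code is picked up.
    run('sudo supervisorctl restart app')
```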
Facebook acquired Instagram in 2012, and Instagram's infrastructure was subsequently migrated from AWS to Facebook's own data centers. Instagram runs on thousands of machines, according to Michael Gorven, Production Engineer at Facebook and the author of the article, yet they manage to deploy code 30-50 times per day. The large scale of the infrastructure and future growth won't impact the rollout of code changes, according to Gorven, mostly because the rollout is performed with Facebook's distributed SSH system.
Two of the initial obstacles to CD were a flaky test suite and a large backlog of commits awaiting deployment. The former was caused by unreliable and slow tests, the latter by frequent failed deployments, which became more common as the infrastructure grew. The test suite had to be optimized, and a canary stage had to be put in place. Canary deployment is a pattern in which a change is first pushed to a small subset of servers and tested there; depending on the result, the deployment is either rolled back or promoted to the entire server fleet.
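A minimal sketch of the canary pattern, with hypothetical helper stubs standing in for real deploy and monitoring tooling:

```python
# Minimal canary-deployment sketch. The helpers are hypothetical stand-ins,
# not Instagram's implementation.
def deploy_to(hosts, commit_sha):
    print('deploying %s to %d hosts' % (commit_sha, len(hosts)))

def passes_checks(hosts):
    # In reality: compare error rates, latency, crash loops, etc.
    return True

def rollback(hosts):
    print('rolling back %d hosts' % len(hosts))

CANARY_HOSTS = ['web-canary-1']
FLEET_HOSTS = ['web-%03d' % i for i in range(1, 201)]

def canary_deploy(commit_sha):
    deploy_to(CANARY_HOSTS, commit_sha)   # push to a small subset first
    if not passes_checks(CANARY_HOSTS):   # evaluate the canary machines
        rollback(CANARY_HOSTS)            # bad signal: undo and stop
        return False
    deploy_to(FLEET_HOSTS, commit_sha)    # healthy: promote to the fleet
    return True
```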
Instagram added more intelligence to the system so that it could decide which commit to push to production, and also integrated Jenkins to categorize commits as good or bad depending on the test results. These results were pushed to the Sauron app, which, according to Gorven, is also the primary tool for making things visible to stakeholders:
The Sauron UI shows the commits and the rollouts. Rollouts are announced in a chat channel mentioning the authors of new commits being deployed, and authors also receive an email and an SMS so they know that their changes are going out. Lastly they are recorded in an operations event database (similar to Graphite's events mechanism) so they can be overlaid onto graphs.
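One way to picture that good/bad categorization driving the deploy decision is a selector that ships the newest commit whose tests passed; the data shape and function below are hypothetical, not Sauron's actual model:

```python
# Hypothetical sketch: pick the newest commit that Jenkins marked good.
# The list is newest-first; 'pending' and 'failed' commits are skipped.
def pick_commit_to_deploy(commits):
    for commit in commits:
        if commit['tests'] == 'passed':
            return commit['sha']
    return None  # nothing deployable right now

commits = [
    {'sha': 'c3', 'tests': 'pending'},
    {'sha': 'c2', 'tests': 'failed'},
    {'sha': 'c1', 'tests': 'passed'},
]
print(pick_commit_to_deploy(commits))  # -> 'c1'
```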
Database migrations, both of the schema and of the data to new servers, are a key challenge of most CD pipelines. The solution is often specific to the product and the infrastructure, and Instagram is no different: they have their own system for it. To migrate a shard to a new location, the basic principle is to copy it while it is still live, repeating the copy until the deltas are small. A feature toggle then disables the shard, one last copy is made, and the toggle re-enables the shard in its new location.
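That copy-until-converged approach can be sketched as a simple loop; the delta-copy function and feature toggle below are hypothetical stand-ins:

```python
# Sketch of a live shard migration: copy repeatedly while the shard serves
# traffic, then toggle it off, do a final copy, and toggle it on at the
# new location. copy_delta() and set_shard() are hypothetical stand-ins.
_DELTAS = iter([50000, 8000, 900, 40])  # simulated shrinking deltas

def copy_delta(shard, src, dst):
    """Copy rows changed since the last pass; return how many were copied."""
    return next(_DELTAS, 0)

def set_shard(shard, location, enabled):
    print('shard %s at %s enabled=%s' % (shard, location, enabled))

def migrate_shard(shard, old_loc, new_loc, small_enough=100):
    while copy_delta(shard, old_loc, new_loc) > small_enough:
        pass                                  # keep copying while live
    set_shard(shard, old_loc, enabled=False)  # feature toggle: disable
    copy_delta(shard, old_loc, new_loc)       # one last catch-up copy
    set_shard(shard, new_loc, enabled=True)   # re-enable in new location

migrate_shard('shard-42', 'aws', 'fb-dc')
```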
Schema changes are done with feature toggles. Gorven says:
We'll deploy code which is able to read and write to both versions of the schema. We then make the database changes (adding columns etc.), and enable writing of the new schema (in addition to the old), possibly incrementally. If necessary we then run a batch job to update existing data to the new schema. Finally we enable reading of the new schema instead of the old, also possibly incrementally. We'll keep writing to the old schema for a while in case there's a problem.
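Read as code, the sequence Gorven describes might look like the following; the feature flags and row layout are illustrative assumptions, not Instagram's actual schema:

```python
# Hypothetical sketch of the dual-read/dual-write migration from the quote.
# Splitting a 'name' column into 'first_name'/'last_name' is made up.
flags = {'write_new_schema': False, 'read_new_schema': False}

def save_user(row, name):
    row['name'] = name                        # keep writing the old schema
    if flags['write_new_schema']:             # then: dual-write (can be
        first, _, last = name.partition(' ')  # enabled incrementally)
        row['first_name'], row['last_name'] = first, last

def load_user(row):
    if flags['read_new_schema']:              # finally: read the new schema
        return '%s %s' % (row['first_name'], row['last_name'])
    return row['name']                        # until then, read the old one

def backfill(rows):
    # The batch job migrating existing rows to the new schema.
    for row in rows:
        save_user(row, row['name'])
```

The old-schema writes stay in place for a while, matching the quote's point that rollback must remain possible if the new schema misbehaves.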
Improving the detection of bad commits and maintaining the same speed of rollouts are among the next focus areas for the release engineering team at Instagram.