At QCon San Francisco 2022, Tom Wanielista, a staff engineer on Infrastructure at Lyft, presented on adopting Continuous Deployment at his company. The talk was part of the editorial track "Architecting Change at Scale."
Wanielista started his talk by defining bad deployments and noted that everyone has had one here and there. For example, a deployment introduces a memory leak, instances suddenly run out of memory, services go down, and there is an incident. Once that is fixed, another problem may surface the next week or the day after a subsequent deployment. And when the cause isn't apparent, changes are walked back commit by commit to find out what triggered it.
The situation above can lead to what Wanielista calls a deploy freeze: every time a deployment happens, the system becomes unstable, so teams stabilize the situation, deploy less often, and try to get back on track later. Yet in his talk Wanielista argued for the opposite. Instead of freezing deployments, he suggests deploying more often and continuously. What he describes is partially covered in his colleague Miguel Molina's post, "Continuous Deployment at Lyft." The talk goes further behind the scenes: why Lyft needed Continuous Deployment (CD), how they got to CD, the technical design of the CD system, the cultural aspects of the rollout, and the effects of the rollout at Lyft.
First, Wanielista explained the difference between Continuous Deployment and Continuous Delivery, using AWS's definition:
The difference between continuous delivery and continuous deployment is the presence of a manual approval to update to production. With continuous deployment, production happens automatically without explicit approval.
To explain why Lyft needed CD, Wanielista described how deployments at Lyft started: with Jenkins, leveraging its pipelines.
Source: https://eng.lyft.com/continuous-deployment-at-lyft-9b457314771a
A user would click through the steps at the right time to make the deployment happen and, depending on the result, stop the deployment if it failed and possibly roll it back. It was a manual process in which someone took ownership in Slack by "taking the ball": taking a lock on a specific project and deploying it, a best practice at Lyft at the time. In addition, the developer performing the deployment had to watch server metrics and consider the business impact, which is a lot to handle. Moreover, if multiple services or projects needed to be deployed during the day, that caused stress for the developer.
As a result, a deployment took half a day, there were deploy trains (many commits deployed at once), and the many uncoordinated changes made it difficult to find out what happened during incidents. Lyft organized a deployment team six years ago to tackle this, focusing on the deployment experience and reaching Continuous Deployment. Wanielista summarized the goals as follows:
- Continuous Deployment
  - Discrete changes going out individually
  - Shipping often
  - Shipping quickly
  - Shipping any time
- New User Experience
  - No more manual intervention
  - Ambient observability
Furthermore, the design of the new CD system rested on three significant tenets: automated, scalable/pluggable, and responsive.
Based on these design tenets, Lyft knew it would work because of the smaller deployment sizes and quicker time-to-production.
Next, Wanielista continued with the design of their system for Continuous Deployment. It consists of three major components: AutoDeployer (the actual automated deployment system), DeployView (the user interface), and DeployAPI (which stores all deployment state and conducts deployments).
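To make that division of responsibilities concrete, below is a minimal sketch in Python of how such components could fit together. The class and method names (submit, pending, start, tick) and the in-memory store are illustrative assumptions, not Lyft's actual interfaces.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """A single revision of one project moving toward production (illustrative)."""
    project: str
    revision: str
    state: str = "waiting"


class DeployAPI:
    """Stores deployment state and conducts deployments (hypothetical, in-memory stand-in)."""

    def __init__(self) -> None:
        self.deployments: list[Deployment] = []

    def submit(self, deployment: Deployment) -> None:
        self.deployments.append(deployment)

    def pending(self) -> list[Deployment]:
        return [d for d in self.deployments if d.state == "waiting"]

    def start(self, deployment: Deployment) -> None:
        deployment.state = "running"


class AutoDeployer:
    """Polls DeployAPI and drives deployments without manual intervention (hypothetical)."""

    def __init__(self, api: DeployAPI) -> None:
        self.api = api

    def tick(self) -> None:
        for deployment in self.api.pending():
            self.api.start(deployment)


# DeployView, the user interface, would read the same state from DeployAPI so
# developers keep ambient observability without driving deployments by hand.
api = DeployAPI()
api.submit(Deployment("rides-service", "abc123"))
AutoDeployer(api).tick()
print(api.deployments[0].state)  # running
```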
As for the deployment pipeline, Lyft copied the model from Jenkins. The model itself runs as a state machine, without rollbacks modeled and without automated rollbacks. Jobs perform a specific step in a deployment and carry a state. A deployment only proceeds when it successfully passes the checks defined in a gate, such as integration tests passing and no service alerts firing. The state changes from waiting to running, or to skipped if the checks fail; ultimately, the state ends in success or failure.
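As an illustration of the gate idea, here is a minimal, hypothetical Python sketch of a gated deployment job. The gate checks mirror the examples from the talk (integration tests passing, no service alerts firing), but the class and function names are assumptions rather than Lyft's code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class JobState(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    SKIPPED = "skipped"
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class DeployJob:
    """One step in a deployment pipeline, guarded by gate checks (illustrative)."""
    name: str
    gates: list[Callable[[], bool]]  # e.g. integration tests passed, no alerts firing
    state: JobState = JobState.WAITING

    def run(self, deploy_step: Callable[[], bool]) -> JobState:
        # Only deploy when every gate passes; otherwise the job is skipped.
        if not all(gate() for gate in self.gates):
            self.state = JobState.SKIPPED
            return self.state
        self.state = JobState.RUNNING
        self.state = JobState.SUCCESS if deploy_step() else JobState.FAILURE
        return self.state


# Stubbed gate checks, for illustration only:
def integration_tests_passed() -> bool:
    return True


def no_service_alerts_firing() -> bool:
    return True


job = DeployJob("deploy-to-production",
                gates=[integration_tests_passed, no_service_alerts_firing])
print(job.run(lambda: True))  # JobState.SUCCESS
```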
The AutoDeployer monitors all projects for all services at Lyft and prioritizes the most urgent deployments. The prioritization uses a priority-queue concept, where a critical fix gets a higher priority, making the system more responsive.
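The ordering itself can be sketched with a standard heap-based priority queue; the priority values and project names below are made up for illustration and do not reflect Lyft's implementation.

```python
import heapq
import itertools

# Lower numbers deploy first; the counter breaks ties in FIFO order.
PRIORITY = {"critical-fix": 0, "regular": 10}

queue: list[tuple[int, int, str]] = []
counter = itertools.count()


def enqueue(project: str, kind: str = "regular") -> None:
    heapq.heappush(queue, (PRIORITY[kind], next(counter), project))


def next_deployment() -> str:
    _, _, project = heapq.heappop(queue)
    return project


enqueue("rides-service")
enqueue("payments-service", kind="critical-fix")
print(next_deployment())  # payments-service jumps the queue
```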
Wanielista explained that the rollout of the CD system at Lyft was done in stages:
- Replace underlying plumbing
- Integrate into existing projects
- Onboard some projects voluntarily
- Onboard more projects voluntarily
- Change process (remove the Ball)
- Continuous Deployment is now opt-out
- New hire onboarding uses AutoDeployer
- Onboard all projects
The learnings from the rollout were as follows:
- A less delicate rollout would have been fine: there was less resistance than anticipated, and teams can use the approval system and opt out when necessary
- Removing the Ball caused disruption but made some developers excited about continuous deployment
- No automated rollbacks would have been fine
- The pipeline concept is flexible, but ultimately confusing:
  - Rollbacks are hard to model
  - Commits can have multiple jobs
  - "Where is this commit?" can be confusing
The results, in the end, were 90% adoption, fewer failures (and only minor ones), a uniform deployment system, less exposure to Common Vulnerabilities and Exposures (CVEs), and discrete changes deployed one at a time. The only projects not using the new CD system are those with specialized configuration and data pipelines that take significant performance hits when interrupted.
Wanielista also mentioned what lies in the future for the system, such as approaching 99% adoption, detecting lingering issues that span multiple deployments, smarter default gates, and anomaly detection.
And finally, he summarized his talk with the following:
- Continuous Deployment is worth it
- Automation and safety are true force multipliers
- Speed is a feature
- Design gradual on-ramps to introduce it into your organization, but do not overthink it
- Do it for the sanity of your engineering teams
Wanielista told InfoQ after the talk:
It took around two years to get to continuous deployment, but it was most certainly worth it.
Lastly, note that the talk on adopting continuous deployment, like most other conference presentations, was recorded and will be available on InfoQ over the coming months. The next QCon conference is QCon Plus, which takes place online.