Key Takeaways
- Continuous deployment has a significant impact on the business: quality, velocity and satisfaction all improve.
- Like any major change, start with small, incremental steps.
- Several measures need to be in place before applying full continuous deployment; the most important of them is testing.
- Continuous deployment is an evolutionary process - start with something "good enough" and improve along the way.
- Manual scripts and tasks are error-prone and time-consuming (wasteful); aim to automate your full pipeline.
You are probably asking yourself, "another article about Continuous Deployment (CD)?" Well, the answer is yes and no. Yes - the article is about CD; no - we will be discussing the unique CD implementation we have at MyHeritage, which includes canary testing, all developers working on a single branch (trunk), chat bots as part of the deployment pipeline and more.
Introduction
Several years ago, the R&D department at MyHeritage worked with several branches, some of which lived for quite a long time. Whenever a feature was close to ready, its code was merged to the trunk. These merges were a nightmare, with many conflicts that wasted precious R&D time on figuring out what was going on.
The hassle didn't end with the merge. We had weekly deployments in which many components, merged from different branches and never tested together, were scheduled to be released in one "mega" code distribution (service pack). Distributing such service packs was a tedious, error-prone and frustrating process.
After struggling with many such deployments, I decided that we should start switching to a continuous deployment way of working.
Prerequisites: test levels, feature flags, statistics and logging
Switching to continuous deployment in a company that serves 90 million users and holds 2.7 billion tree profiles, 7.7 billion historical records and terabytes of sensitive DNA data must be done in small, incremental steps. We had to ensure that changing the way we work didn't affect the site's stability.
The first step was adding unit tests to newly written code, both back-end and front-end. This alone was not enough, since millions of lines of legacy code were still not covered. We decided to tackle legacy coverage by writing integration tests and end-to-end tests (using Cucumber) for the major components. These are not as efficient as unit tests and are much slower, but they have the major advantage of achieving good coverage with less development effort. Soon enough we reached reasonable coverage for our codebase.
Figure 1: Testing at multiple levels is a key factor in applying CD
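To make the trade-off between the levels concrete, here is a minimal pytest-style sketch (an assumption for illustration, not MyHeritage's actual tests): a fast, isolated unit test that runs on every commit, next to a slower end-to-end check that drives a running environment. The relationship_label function and the E2E_BASE_URL environment variable are hypothetical names.

```python
# Hypothetical pytest sketch of two test levels; the function, URL and
# E2E_BASE_URL variable are illustrative, not MyHeritage code.
import os
import urllib.request

import pytest


def relationship_label(generations_up: int) -> str:
    """Toy domain function: label an ancestor by generation distance."""
    if generations_up == 1:
        return "parent"
    if generations_up == 2:
        return "grandparent"
    return f"{generations_up - 2}x great-grandparent"


def test_relationship_label():
    # Unit level: fast, isolated, no I/O - runs on every commit.
    assert relationship_label(1) == "parent"
    assert relationship_label(3) == "1x great-grandparent"


@pytest.mark.skipif("E2E_BASE_URL" not in os.environ,
                    reason="needs a live environment, e.g. a canary server")
def test_home_page_end_to_end():
    # End-to-end level: slower, but one flow covers many code paths at once.
    with urllib.request.urlopen(os.environ["E2E_BASE_URL"] + "/") as response:
        assert response.status == 200
```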
We developed the ability to control certain behaviors externally to the code, mainly the exposure of features to users via feature flags. This is based on a simple key-value store (a MySQL DB backed by memcached for high performance) and allows controlling exposure both for testing purposes and for real users. All changes to these flags are stored in an audit table so we can track them. We started canary releasing by exposing new code to a small percentage of our users, monitoring production and gradually increasing the exposure if all goes well.
Figure 2: Feature flags system and the triggered email notification
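As a rough illustration of how such a flag can gate exposure by percentage, here is a minimal sketch. It assumes a key-value store object with a get() method (standing in for the MySQL-plus-memcached layer described above); the class, key format and flag names are hypothetical.

```python
# Minimal feature-flag sketch with percentage-based (canary-style) exposure.
# The `store` interface, key format and flag names are assumptions.
import hashlib


class FeatureFlags:
    def __init__(self, store):
        # `store` is any object with get(key) -> str or None,
        # e.g. a memcached client backed by a MySQL table.
        self.store = store

    def exposure_percent(self, flag: str) -> int:
        value = self.store.get(f"flag:{flag}:percent")
        return int(value) if value is not None else 0

    def is_enabled(self, flag: str, user_id: int) -> bool:
        # Hash the user id so each user gets a stable bucket; raising the
        # stored percentage gradually exposes the feature to more users.
        digest = hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.exposure_percent(flag)


# Usage (hypothetical):
#   if flags.is_enabled("new_tree_viewer", user_id):
#       render_new_viewer()
```

Keeping the bucket deterministic per user means that a rollout climbing from 1% to 100% never flips the same user back and forth between the old and new behavior.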
In parallel, we made sure that each feature is released with statistics and sensors that monitor its behavior in production (for instance, the number of successes or failures, response time, etc.). We aimed for the shortest possible sampling interval, especially for critical sensors that can detect serious production issues in a matter of seconds. This goes hand in hand with verbose logs that allow us to quickly analyze production issues as they happen.
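The sketch below shows the kind of sensor we mean: a small wrapper that reports a success/failure counter and a response-time measurement for a block of feature code. The metrics client interface (increment/timing) and the sensor names are assumptions for illustration, not our actual tooling.

```python
# Illustrative "sensor" wrapper: counts successes/failures and reports the
# response time of a feature. The metrics client interface is an assumption.
import time
from contextlib import contextmanager


@contextmanager
def timed_sensor(metrics, name: str):
    """Report duration and success/failure for a block of feature code."""
    start = time.monotonic()
    try:
        yield
        metrics.increment(f"{name}.success")
    except Exception:
        metrics.increment(f"{name}.failure")
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        metrics.timing(f"{name}.response_ms", elapsed_ms)


# Usage (hypothetical):
#   with timed_sensor(metrics, "photo_upload"):
#       handle_upload(request)
```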
As the number of statistics and log messages grew, we adopted several tools to help us understand the status of production, such as New Relic and the ELK stack. We also developed internal tools that allow us to easily track log messages and detect changes in log levels (from a steady state to a problematic one). A report is sent twice a day to the developers responsible for monitoring and fixing issues in their domain of responsibility. A similar tool scans all our statistics and reports any abnormality it finds.
Figure 3: Automatic stats and logs scanning email report
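The idea behind the scanning tools can be sketched roughly as follows: count error-level messages per domain over a window and flag domains whose counts jump versus a baseline. The log format, regular expression and threshold below are assumptions for illustration only, not the internal tool itself.

```python
# Rough sketch of a log scanner: count ERROR/CRITICAL lines per domain and
# flag domains whose error rate jumped versus a baseline. The log format,
# regex and threshold are assumptions.
import re
from collections import Counter

ERROR_LINE = re.compile(r"\b(ERROR|CRITICAL)\b.*?\[(?P<domain>[\w.-]+)\]")


def scan_log(path: str) -> Counter:
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = ERROR_LINE.search(line)
            if match:
                counts[match.group("domain")] += 1
    return counts


def report_anomalies(current: Counter, baseline: Counter, factor: float = 3.0) -> dict:
    """Return domains whose error count grew at least `factor` times."""
    return {
        domain: count
        for domain, count in current.items()
        if count >= factor * max(baseline.get(domain, 0), 1)
    }
```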
Getting started: experimenting with continuous deployment
Once we had all this in place, we were ready to start with an "experiment": a small group of developers would build a new feature in continuous deployment fashion, with frequent commits, no branching, and the developers themselves responsible for distributing code to production.
Distribution to production was a manual process that used the same scripts we had used to distribute the service packs. It was not the ideal way to update production, but the experiment was very successful, and gradually more agile teams switched to working this way. Quite soon the "service pack" was dead.
We realized the benefits of working in a continuous deployment fashion:
- Higher velocity - no more time wasted on merges.
- Higher quality - smaller pieces of code distributed to production, with fewer bugs and easier tracking in case of failures (and much faster recovery).
- Better coding - developers started to design more modular code that could be distributed separately. A side effect was better testability.
- Higher satisfaction - R&D liked the new way of working and our users enjoyed more frequent updates.
Improving along the way: automating the pipeline
One of our R&D values is to always be looking for ways to improve. Obviously, the next step was to improve the manual process of distributing code to production.
We created an agile team whose purpose was to automate the CD pipeline. We wanted to stop wasting time on manual scripts that had to be invoked by one developer at a time; that approach did not scale and wasted time.
After a single sprint, we were in much better shape. We had a solution that allowed developers to commit code from their IDE and trigger a build that eventually went live to 90 million users.
Behind the scenes we developed a Jenkins workflow that is invoked after any commit to trunk (our single branch). As part of the workflow we run the following steps, some of them in parallel (a simplified sketch of the orchestration follows figure 4):
- Parsing the (structured) commit message, which is used to notify the relevant people when the build starts.
- Running unit and integration tests.
- Building an RPM that contains a snapshot of the trunk after JS minification.
- Uploading the RPM to a canary server (a production server that doesn't get traffic from the external world).
- Running end-to-end tests (for the major user flows) on the canary server.
- If all goes well, uploading the RPM to a repository server, from which the MCollective agents installed on each production machine pick up and install the new RPM.
- Sending a notification email about the update to production at the end of the process.
Figure 4: CD flow is composed of many steps that can be easily monitored
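The real flow is implemented as a Jenkins workflow; the Python sketch below only mirrors its step ordering, to show which stages run in parallel and where the canary gate sits. Every step body is a placeholder, and all names are illustrative rather than the actual jobs.

```python
# Sketch of the CD flow's step ordering; each step is a placeholder print.
from concurrent.futures import ThreadPoolExecutor


def step(name: str) -> None:
    """Placeholder for a real pipeline stage."""
    print(f"running: {name}")


def run_pipeline(commit_message: str) -> None:
    # The structured commit message tells us whom to notify at build start.
    step(f"notify committers parsed from: {commit_message!r}")

    # Tests and the RPM build run in parallel.
    with ThreadPoolExecutor() as pool:
        tests = pool.submit(step, "unit + integration tests")
        rpm = pool.submit(step, "build RPM (trunk snapshot, JS minified)")
        tests.result()  # a failure raises here and aborts the flow
        rpm.result()

    step("upload RPM to canary server (production host without live traffic)")
    step("run end-to-end tests for the major user flows on the canary")

    # Only after the canary passes is the RPM published, so the MCollective
    # agents on every production machine can pick it up and install it.
    step("publish RPM to the repository server")
    step("send 'deployed to production' notification email")


if __name__ == "__main__":
    run_pipeline("FEATURE-123: new tree viewer | notify: alice, bob")
```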
Satisfaction with the new "commit equals deploy" process was very high: developers could now take their code from commit to production in 25 minutes, in a completely automated process.
Further improvements
During this time we learned a lot and collected feedback on how to improve even further:
- Integration with Slack bots: the progress of the build is reported to a dedicated channel, with mentions of the relevant people about the status of the build (see the sketch after this list).
- An automatic "scan logs" task is activated after any production update; if there is a high number of error messages in the logs, it is reported in the channel.
- Improving release time: we encountered slowness when deploying RPMs to our large server farm, so we added an HTTP reverse proxy to parallelize the RPM uploads.
- We split our tests into several groups to increase parallelization and shorten the release time even further.
- We decreased the RPM size by removing assets from the codebase and handling them in a dedicated Jenkins flow.
- We added a special "rollback" job in Jenkins to allow fast recovery or emergency updates to production.
- We indexed build information and feature-flag modifications into Elasticsearch, to be used as annotations in Grafana/Graphite, so they are visible alongside all other metrics and allow easier correlation.
- A daily digest email is sent with all relevant updates, keeping all stakeholders in the company in sync on recent changes.
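As an example of the Slack integration, the sketch below posts a build-status message with mentions to a dedicated channel via a Slack incoming webhook. The webhook URL, message format and user IDs are placeholders, not our actual bot.

```python
# Hedged sketch of posting build progress to a Slack channel through an
# incoming webhook. The URL, message content and user IDs are placeholders.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def post_build_status(stage: str, status: str, mentions: list[str]) -> None:
    text = f"{' '.join(mentions)} build {stage}: *{status}*"
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # Fire-and-forget here; a real bot would retry and log failures.
    urllib.request.urlopen(request)


# e.g. post_build_status("canary e2e tests", "passed", ["<@U123ALICE>"])
```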
Figure 5: Typical CD progress report in the CD Slack channel
Figure 6: Detailed message in case of build failure
Conclusion
Continuous deployment was a key factor in improving our quality, velocity and satisfaction. There are several milestones to achieve before switching to working in a CD fashion, but once you have made the change, it will benefit your R&D department and the entire company.
This article is dedicated to the pioneers of the CD way of working in our R&D department, and to the entire R&D and DevOps departments.
About the Author
Ran Levy is VP of R&D at MyHeritage, where he has worked for the last six years. He has twenty years of industry experience as a developer, architect and manager of complex, large-scale systems. Ran is passionate about agile and efficient processes and led the transition to continuous deployment in the company.