Blue-Green Deployment from the Trenches

Key Takeaways

  • Handling breaking changes in a blue-green deployment requires upfront planning and good architectural choices
  • One approach to handling breaking changes is to dependency-order the releases; however, this can complicate the release process
  • Ideally, blue-green deployments in microservice architectures can make use of API versioning to ensure the requests are routed to the correct version
  • Attempting to tack blue-green releases onto an architecture that doesn't support them can lead to more complicated and fragile deployments
  • While a microservice architecture has its benefits, it can make deployments more complicated. If it isn't needed, a monolithic approach could more easily enable blue-green releases

The other week I was driven to withhold a "thumb of approval" on a merge request to our GitLab project. I wasn't keen on the solution being proposed – namely, specific changes in our application code base to support blue-green releases. That raised a warning flag for me: it ties deployment to code, writing code to support the environment when the environment should be invisible and interchangeable. Creating these kinds of dependencies ties us to a specific platform and release method, and the extra code opens up a world of possible bugs and errors that could vary per environment and are therefore extremely difficult to test for.

How did this happen? It’s quite an interesting background, and I wonder how common such things are. It all starts well enough with a desire to improve releases so that we can get changes into production more often.

Our team’s applications are relatively modern and flexible - hosted in Docker containers and deployed automatically to the cloud, unit and component tests run on changes, and deployment can continue automatically once a full suite of automated tests have passed and code quality gates have been met. We have the concept of a "release" as a tagged collection of build artifacts for multiple services which are deployed to a cloud environment.

However, moving these artifacts to "higher environments" (e.g. pre-production, production) requires downtime to restart all the services and must be scheduled out-of-hours, and the releases are run by a separate team. There are manual steps if we wish to run certain types of updates (for instance, database changes which are too complex or slow for Liquibase), and as such these release windows are infrequent and painful for the team – not to mention the exhausting antisocial working hours required. Overall, it was a good candidate for improvement, and blue-green releases should help by removing both the need for downtime and the need to work out of hours.

To recap, the concept of blue-green deployment is to have (at least) two instances of an application running at one time. When a new version is released, it can be released to just one (or some) instances, leaving the others running on the old version. Access to this new version can be restricted completely at first, then potentially released to a subset of consumers, until confidence in the new release is achieved. At this point, access to the instance(s) running the old version can be gradually restricted and then these too can be upgraded. This creates a release with zero downtime for users.
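
As a minimal sketch of that idea (the helper functions and percentages here are illustrative stand-ins for whatever your platform provides, not our actual tooling), a blue-green release flow looks roughly like this:

```python
import time

# Hypothetical helpers -- stand-ins for whatever your platform provides
# (a deployment API, a load balancer or DNS weighting API, monitoring queries).
def deploy(colour, version):
    print(f"deploying {version} to the {colour} instances")

def set_traffic_split(green_percent):
    print(f"routing {green_percent}% of traffic to green, {100 - green_percent}% to blue")

def healthy(colour):
    return True  # in reality: smoke tests, error rates, latency, business metrics

def blue_green_release(new_version):
    # Blue keeps serving all traffic while the new version goes onto green only.
    deploy("green", new_version)
    set_traffic_split(0)

    # Gradually shift traffic to green for as long as it stays healthy.
    for green_percent in (5, 25, 50, 100):
        set_traffic_split(green_percent)
        time.sleep(1)  # in reality: minutes or hours of observation
        if not healthy("green"):
            set_traffic_split(0)  # instant rollback: route everything back to blue
            return

    # Green now takes all traffic; blue can be upgraded and becomes the next standby.
    deploy("blue", new_version)

if __name__ == "__main__":
    blue_green_release("2.0")
```

The important property is that at every step there is still a set of instances running the old version that can take 100% of the traffic again if the new version misbehaves.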

There are, of course, caveats. Any breaking change to data sources or APIs means that old requests cannot be processed by the new version, which rules out a blue-green release. It’s one of my favourite interview questions to ask how one might approach a breaking change in a blue-green environment on the off-chance that someone comes up with a great solution, but it would probably involve some bespoke routing layer to enrich or adapt "old" requests to the "new" system. At which point, you’d have to consider whether it isn’t better just to have some good old downtime. Most software teams do their very best to avoid breaking changes but they are often inevitable.

So, let’s assume the best case and we don’t have any breaking changes. Let’s also assume, as was the case with my project, that we are deploying Docker containers straight onto cloud services - an Azure App Service rather than Kubernetes or another PaaS layer with support for autoscaling and routing. So, how would we go about it?

Our architecture consists of a number of microservices that communicate via REST APIs, deployed as separate artifacts – but currently, all artifacts are in a single Git repository and are deployed at once in a single release. So say we have microservice A and microservice B running version 1.0, and a release (version 2.0) that contains a new interface for A which will be called by a new method in B. And let’s say that we have load-balanced 2 instances of A and 2 instances of B deployed in production; for blue-green one instance of each will be migrated to the new release.

You can instantly see the problem – the instance of B which is on version 2.0 can ONLY call the instance of A which is on version 2.0. If it is directed to the 1.0 endpoint it won’t find the new functionality it requires. Because of this specific routing requirement, service B can’t use the load-balanced endpoint it picks up from service discovery to call service A and instead needs the specific "green" instance address.

This is, in essence, the scenario our team faced.
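
To make the mismatch concrete, here is a small, self-contained sketch (the endpoint and instance details are invented for illustration) of what happens when the 2.0 caller goes through an ordinary load-balanced endpoint:

```python
import random

# In-memory stand-ins for the two load-balanced instances of service A:
# one still on version 1.0, one already migrated to 2.0.
SERVICE_A_INSTANCES = [
    {"version": "1.0", "endpoints": {"/orders"}},
    {"version": "2.0", "endpoints": {"/orders", "/orders/summary"}},
]

def call_service_a(path):
    """Simulates the ordinary load-balanced endpoint from service discovery:
    it picks either instance, with no idea which 'colour' the caller is on."""
    instance = random.choice(SERVICE_A_INSTANCES)
    return 200 if path in instance["endpoints"] else 404

# Service B 2.0 needs the endpoint that only exists in A 2.0.
for _ in range(6):
    print(call_service_a("/orders/summary"))  # a mix of 200s and 404s
```

Roughly half of the calls land on the 1.0 instance and fail, even though nothing is "down" – exactly the class of intermittent routing bug that is so painful to diagnose.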

Let’s look at some of the solutions we could have for this:

1. Dependency-ordered releases

Release APIs before the functionality that calls them. In the above case, if we did one blue-green release for microservice A, checked it was fine, and then made sure both instances of microservice A were migrated to version 2.0, we would then be safe to do a blue-green release of microservice B afterward.

This model is a good and simple way to accommodate incremental, non-breaking API changes, although it does of course result in many more releases, as each dependency needs to be in place before the next service can be released. It also makes it more difficult to answer the question "What version do we have in live?", because your tagged releases now cross multiple microservice versions. But that really is the trade-off with microservices: deployment complexity against compute efficiency. A microservice architecture means that if one particular part of your system requires more resources, you can horizontally scale just that one part rather than having to scale the whole system, but you then have to manage the lifecycles of all the parts individually.
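
One way to keep that ordering honest is a simple pre-release gate. The sketch below (the instance addresses and the /version endpoint are assumptions for illustration, not something our services actually exposed) refuses to release the caller until every instance of the dependency reports the required version:

```python
import requests  # assumes each service exposes a simple version endpoint

# Hypothetical instance addresses; in practice these would come from service discovery.
SERVICE_A_INSTANCES = [
    "http://service-a-1.internal:8080",
    "http://service-a-2.internal:8080",
]

def all_instances_at_least(instances, required):
    """Dependency-ordering gate: only release the caller (service B) once every
    instance of the dependency (service A) reports the required version."""
    for base_url in instances:
        version = requests.get(f"{base_url}/version", timeout=5).text.strip()
        if version < required:  # naive string comparison; use proper semver parsing in practice
            return False
    return True

if __name__ == "__main__":
    if all_instances_at_least(SERVICE_A_INSTANCES, "2.0"):
        print("Safe to start the blue-green release of service B")
    else:
        print("Hold service B until every instance of service A is on 2.0")
```

A check like this can run as a step in the release pipeline before the caller's blue-green rollout is allowed to start.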

2. Versioning in API calls

There are a few ways we could introduce versioning to our API calls. The direct way is to put a version in the actual URL of a RESTful endpoint. An alternative is to represent the version using metadata such as an HTTP header; however, this only really works for service-to-service communication where you control all the callers. Otherwise, you can’t dictate that requests must include the versioning information.
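
Here is a hedged sketch of both styles (the endpoints, field names and the X-Api-Version header are invented for illustration; Flask is used purely as a convenient example framework):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Option A: the version lives in the URL itself.
@app.route("/api/v1/orders/<order_id>")
def get_order_v1(order_id):
    return jsonify({"id": order_id, "status": "SHIPPED"})

@app.route("/api/v2/orders/<order_id>")
def get_order_v2(order_id):
    # V2 adds fields without breaking V1 callers, which V1 keeps serving above.
    return jsonify({"id": order_id, "status": "SHIPPED", "carrier": "DHL"})

# Option B: the version travels as metadata (a header), so the URL stays stable.
# This only works when you control every caller and can require the header.
@app.route("/api/orders/<order_id>")
def get_order(order_id):
    version = request.headers.get("X-Api-Version", "1")
    if version == "2":
        return get_order_v2(order_id)
    return get_order_v1(order_id)

if __name__ == "__main__":
    app.run(port=8080)
```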

If our API endpoints are versioned, how does this help our release? It would allow version 2.0 of service B to handle any HTTP 404 "Not Found" responses if it should happen to send a V2 request to a version 1.0 instance of service A, and it would allow service A to host both V1 and V2 of the endpoint so it could continue to serve the previous version while it is still live. This would result in some work to manage and clean up the V1-mitigation code in service B once every service has migrated.
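
Sketched on the caller's side, that mitigation might look something like this (URLs and field names are again hypothetical, matching the versioned-endpoint sketch above):

```python
import requests

SERVICE_A_URL = "http://service-a.internal:8080"  # hypothetical load-balanced address

def fetch_order_summary(order_id):
    """Service B (2.0) prefers A's new V2 endpoint, but tolerates landing on a
    1.0 instance that doesn't have it yet."""
    response = requests.get(f"{SERVICE_A_URL}/api/v2/orders/{order_id}", timeout=5)
    if response.status_code == 404:
        # We hit a 1.0 instance: fall back to the old endpoint and adapt its shape.
        response = requests.get(f"{SERVICE_A_URL}/api/v1/orders/{order_id}", timeout=5)
        response.raise_for_status()
        old = response.json()
        return {"id": old["id"], "status": old["status"], "carrier": None}
    response.raise_for_status()
    return response.json()
```

It works, but every such fallback is temporary scaffolding that has to be removed once the old version is gone.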

3. Rely on your infrastructure

The cloud-native option. Our team deploys applications to Azure. If you ask Azure how to do blue-green releases, they will point you to their Azure Traffic Manager product. This is a DNS-based load-balancing solution that supplies a weighted round-robin routing method. The weighting can be used to gradually introduce load to the newly migrated servers, and you can also add rules to ensure that "blue" servers only route to other "blue" servers - keeping your blue and green environments separate. It does come at a cost, albeit not a high one.
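
Traffic Manager applies these rules at the DNS level, but the routing behaviour itself is easy to picture. The toy sketch below (the weights and hostnames are made up) shows the two properties we care about: weighted selection of an environment, and colour affinity once a request is inside it:

```python
import random

# Illustrative weights and hostnames only -- Azure Traffic Manager applies rules
# like these at the DNS level rather than in application code.
WEIGHTS = {"blue": 90, "green": 10}   # shift weight towards green as confidence grows
BACKENDS = {
    "blue":  {"front": "http://front-blue.internal",  "api": "http://api-blue.internal"},
    "green": {"front": "http://front-green.internal", "api": "http://api-green.internal"},
}

def pick_colour():
    """Weighted selection (standing in for weighted round-robin): choose which
    environment receives a new session."""
    return random.choices(list(WEIGHTS), weights=list(WEIGHTS.values()), k=1)[0]

def resolve(colour, service):
    """Colour affinity: once a request enters one environment, every downstream hop stays there."""
    return BACKENDS[colour][service]

if __name__ == "__main__":
    colour = pick_colour()
    print(colour, resolve(colour, "front"), resolve(colour, "api"))
```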

Back to our specific problem. We hadn’t built versioned APIs, and as I mentioned we currently deploy all our microservices in one release. I would classify our services as microservices because they can be deployed and scaled separately, but our release process effectively merges them into a BBOM (Big Ball Of Mud – the historical monolith).

For option three, without Azure Traffic Manager (which was deemed too expensive) our team had no way to check or enforce that when the "blue" front end sent a request to the back-end microservices, it would call the "blue" back end. This meant that unless we propagated changes from the back end first (which wasn’t always possible, especially as both blue and green share the same database) we were at risk of routing a request which could not be processed. The workaround that caused me to wince horribly was to include a configuration variable that could be set to either blue or green, and then to set an HTTP header in the requests from our front-end specifying this variable to effectively recreate Azure Traffic Manager functionality in the application code base. Aaaaargh.

The code could use this HTTP header/config variable as a flag when generating routing URLs, to decide whether to generate paths through the green servers or the blue servers. So, for example, the "logout" link would have two configuration variables specified in the front-end config, one for green and one for blue, allowing different logout links to be generated depending on the server "colour" ... had enough yet?
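
For the avoidance of doubt about why this made me wince, here is roughly the shape of the thing (the variable and link names are hypothetical, and this is a simplified sketch rather than our actual code):

```python
import os

# The flag that made me wince: the application now has to know which "colour"
# of environment it is running in. All names here are hypothetical.
DEPLOY_COLOUR = os.environ.get("DEPLOY_COLOUR", "blue")   # "blue" or "green"

# Two configuration values per outbound link, one per colour.
LOGOUT_URLS = {
    "blue":  os.environ.get("LOGOUT_URL_BLUE",  "https://app-blue.example.com/logout"),
    "green": os.environ.get("LOGOUT_URL_GREEN", "https://app-green.example.com/logout"),
}

def build_logout_link():
    # URL generation branches on the environment colour...
    return LOGOUT_URLS[DEPLOY_COLOUR]

def backend_request_headers():
    # ...and every call to the back end carries the colour as a header, so the
    # back end can try to keep the request on the same side. This is routing
    # logic recreated inside the application code base.
    return {"X-Deploy-Colour": DEPLOY_COLOUR}
```

Routing knowledge now lives in application configuration, which is precisely the coupling between code and environment I objected to in that merge request.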

Our team knew this was an awful way to create a blue-green release process; they were forced into it by the usual two demons of budget pressure and time pressure. The request was to create a blue-green deployment process within a month and without using Azure cloud-native services, and given our starting point our options were extremely limited. But we should have seen it coming earlier – API versioning should have been considered as soon as we knew we were building APIs, for example.

We fell foul of the "DevOps gap" because we had two teams with different priorities, a development team whose priority is to get changes into the release pipeline as fast as possible, and a WebOps team whose priority is to ensure repeatability and security of the cloud platform. When a request came in to build microservices, the development team assumed the WebOps team would manage things like blue-green releases and didn’t stop to consider how they should architect their solution to assist them. In the way of such oversights, it came back to bite us in the end.

So where are we now? Currently, we don’t have blue-green releases working using our hardcoded colour approach; as I predicted, we are discovering some pretty nasty routing bugs when we try to use the process we’ve built. What I expect will happen is that we’ll eventually switch to using Azure Traffic Manager. From there we will begin to break down our "Big Ball of Microservices" into multiple deployment pipelines so that we can plan a bottom-up release of new changes. In our original example, the first release will upgrade Service A to 2.0 so that the new endpoint fields are available in the API and the database, and the second release will then update Service B to call Service A’s new endpoint.

It’s been a very valuable learning process for us – bringing developers and WebOps teams closer together and working more closely with the release team to understand how we can help them. When skill sets differ it’s natural for people to delegate tasks that they assume belong to someone else (load-balancing application instances, for example, would be delegated to someone who understood Azure cloud concepts and the various template languages to write infra as code), but we’ve learned to break down these tasks so that both parties understand what the other is doing and can therefore help spot issues with the overall process.

Lessons Learned

In summary, there are a few things we learned from our early attempt at a blue-green setup.

Architect for Change

I am very much against "future-proofing" applications. If you don’t have a performance problem, don’t build a cache. If you don’t have a requirement to, say, delete content then do not implement delete. Chances are your guesses at what will be needed are wrong.

However, you SHOULD be making those future changes possible and easy, right from the word go. This means that when architecting your overall application design you should consider things like how to implement breaking changes at the database level, and how you might add versions to your APIs.
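
On the database side, the pattern that makes breaking changes survivable during a blue-green release is expand/contract (also known as parallel change): add the new shape alongside the old one, migrate the data, and only remove the old shape once nothing reads it. A self-contained sketch, with invented table and column names:

```python
import sqlite3

# Expand/contract lets old and new application versions share one database
# during a blue-green release. SQLite in memory keeps the sketch runnable.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO customers (name) VALUES ('Ada Lovelace')")

# Expand: add the new columns alongside the old one; nothing is removed yet,
# so version 1.0 keeps working while 2.0 writes to (and backfills) the new shape.
db.execute("ALTER TABLE customers ADD COLUMN first_name TEXT")
db.execute("ALTER TABLE customers ADD COLUMN last_name TEXT")
db.execute("""
    UPDATE customers
    SET first_name = substr(name, 1, instr(name, ' ') - 1),
        last_name  = substr(name, instr(name, ' ') + 1)
    WHERE name IS NOT NULL
""")

# Contract: only once no running version reads the old column is it safe to
# drop it -- in a later release, after both blue and green have moved on.
# db.execute("ALTER TABLE customers DROP COLUMN name")

print(db.execute("SELECT first_name, last_name FROM customers").fetchall())
```

Old and new versions can then share the same database throughout the rollout, which was exactly the constraint that boxed us in.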

Don’t Create Microservices for the Sake of It

Microservices needn’t be the default for your design. If there are no pinch points in your architecture, no points that are more likely to be highly loaded than others, and if your components just talk to each other and will be deployed in the same approximate place (same cloud, for instance, or same data centre) then you may not be gaining a great deal from a microservice architecture.

You may gain more from simplifying your deployments by reducing the number of moving parts and also reducing the network latency between component calls. Don’t just follow the zeitgeist, have a good think about what you are trying to achieve.

Watch Your Team Boundaries

With any teams who are working together, be it UX designers and developers, business analysts and QAs, or developers and operations teams, we need to realise that the boundaries where the teams interface are the riskiest areas of the project.

Assumptions will be made all the time by each team – the developers will assume that the UX designers are providing valid HTML prototypes, for example; the business analysts will assume the QA teams have based their automated tests on documented requirements; the operations team will assume they’ve been notified of application dependencies. It’s good to use some techniques to flush out these assumptions whenever two teams start working together – you could take some tools from domain-driven design and run an event-storming workshop, for example.

The earlier in a project that these assumptions are raised as risk areas, the better and safer things will be!
