Infrastructure as a Code—Why Drift Management Is Not Enough

Key Takeaways

  • In software engineering, drift is a major source of code refactoring and slow release cycles.
  • Drift management is critical because drift causes unnecessary toil and ultimately hampers developer productivity.
  • Adopting an Infrastructure as Code (IaC) process helps reduce drift, but it is not enough, because IaC configuration versions are intertwined with application versions.
  • Automatic drift detection and auto-updates are key, but they require integration with both your configuration and your version control.
  • An environment-as-a-service (EaaS) solution can have a huge impact not only on drift management but also on your overall engineering team.
  • A combination of an EaaS solution with IaC, thorough documentation and processes, automated change monitoring, and regular audits will help you manage drift effectively.

What is configuration drift? 

As a company evolves, its software production and delivery systems tend to become more and more complex. Naturally, this leads to frequent changes to their configurations. 

In a perfect world, all of these changes would be tracked in a comprehensive and well-structured manner. But we do not live in a perfect world, and, as things stand today, many of these modifications go unrecorded. If they are benign, then the impact on the systems is minimal. However, if these changes take the systems out of a hardened state, then the owners will be facing what is known as “configuration drift.”

Drift arises in the time between the creation of a new branch and its merge: meanwhile, other changes are committed to the main branch, resulting in conflicts that need to be resolved. In small development teams, a developer can simply tell colleagues about newly committed changes. In larger teams, the number of changes between a fork and a merge can be substantial, resulting in more conflicts and more time wasted resolving them.
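One way to make this kind of code drift visible is to count how far a branch has diverged from the main branch. The following is a minimal sketch, assuming Git is installed and the main branch is called `main`; the function names `parse_ahead_behind` and `branch_drift` are hypothetical and introduced only for illustration.

```python
import subprocess
from typing import Tuple


def parse_ahead_behind(rev_list_output: str) -> Tuple[int, int]:
    """Parse the output of `git rev-list --left-right --count A...B`.

    Git prints two counts separated by whitespace: commits reachable only
    from A (left), then commits reachable only from B (right).
    """
    left, right = rev_list_output.split()
    return int(left), int(right)


def branch_drift(base: str = "main", branch: str = "HEAD", repo: str = ".") -> Tuple[int, int]:
    """Return (behind, ahead) for `branch` relative to `base`.

    behind: commits on `base` that the branch does not have yet
    ahead:  commits on the branch that `base` does not have yet
    """
    result = subprocess.run(
        ["git", "-C", repo, "rev-list", "--left-right", "--count", f"{base}...{branch}"],
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. when the branch does not exist
    )
    return parse_ahead_behind(result.stdout)
```

Calling `branch_drift()` inside a repository returns a pair such as `(12, 3)`, meaning the branch is missing 12 commits from `main` and carries 3 of its own. A growing first number suggests that the eventual merge is likely to be more painful.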

Code drift is probably the most common kind of drift, but with the complexity of today’s software architecture and dependencies, configuration drift is also frequent. A developer might create a new table on staging or pre-production after a branch was created. Maybe a new Lambda function was created or an SQS configuration was updated. If the developer’s environment is drifting, the code might work fine on the old version but fail upon merging into the updated environment. In simple cases the failure might not appear immediately, but later on, as complexity increases and more diverse scenarios are exercised. This leads to lots of debugging and rework, resulting in a longer time to release. In the next couple of sections, we will cover different approaches to code and configuration drift management.

Exhibit 1: Code drift example

The damaging effects of configuration drift 

Code travels through several environments, from personal workstations to shared development, test, QA, staging, and production. Inconsistencies between some of these environments can cause security vulnerabilities and issues at deployment. Furthermore, if you’re dealing with applications and services that require strict regulatory compliance standards, then the entire development process might be at risk.

Making sure that various environments in the software development cycle share a similar configuration is a time-consuming task that involves coordination between several departments. Teams can sometimes work for weeks at a time provisioning different environments for various stages.

Quite often, employees make small changes to their environments without propagating them to the production environment. This type of configuration drift often goes unnoticed, and it can also wreak the most havoc. Having gone unnoticed for so long, this drift can lead to application failure, which may take software engineers many hours to trace back and fix. Instead of spending that time on more productive tasks, they are left troubleshooting code and environments, trying to identify potential causes for the unexpected behavior.

Over time, this can prolong your product development lifecycles. Alongside downtime, this is one of the most commonly listed consequences of configuration drift. A 2014 article by Gartner mentioned that an average IT company loses approximately $5,600 for every minute of downtime.

Furthermore, when these incidents occur, they may be show-stoppers, forcing developers to abruptly stop their work, switch context, and jump in to resolve the incident. This interruption can introduce bugs into the code, because not all of the previous mental context can be restored and some of it may be forgotten. The danger is that a vicious circle is created or perpetuated.

This affects employee satisfaction, causing metrics related to Developer Experience to go down.

Possible approaches to minimizing drift 

Configuration drift is more or less unavoidable. There are, however, a number of ways you can keep configuration drift to a minimum. In the following sections, we will discuss some practical approaches to drift management.

Establishing clear and well-documented processes  

When dealing with configuration drift, your priority should be to establish a clear set of change management policies and procedures. Human error is often the main cause of drift, whether it involves failing to follow procedures or failing to properly communicate with other teams. A well-structured change management policy guarantees that all necessary testing takes place and that someone authorized reviews and assesses the impact of these changes before formally approving them for production, thus minimizing the risk of unwanted side-effects and unknown issues. Furthermore, documenting the changes that should take place, when they should take place, and on what systems is a must.

The number of ways to apply infrastructure changes must be reduced to a minimum; ideally, there is a single channel through which all changes are applied, whether they go to dev, staging, or production.

Besides the channel used to push changes, clear permissions must be enforced, and approval/release privileges should be granted only to a select group of people who are the most experienced and trusted based on their track record.

Anything that does not fit the standard is a vector for potential configuration drift.

Implementing Infrastructure-as-Code (IaC) 

One of the most efficient ways of reducing configuration drift is adopting infrastructure-as-code principles and using solutions such as Terraform.

Instead of manually applying changes to sync the environments, which is inherently an error-prone process, you would define the environments using code. Code is explicit and runs the same way against any number of resources, without the risk of omitting a step or reversing the order of some operations.

By leveraging code versioning (e.g., Git), an infrastructure-as-code platform also provides a detailed record of both present and past configuration, which removes the issue of undocumented modifications and leaves an audit trail as an added bonus. Tools like Terraform, Pulumi, and Ansible are designed for infrastructure provisioning and configuration management and can be used to identify and signal drift, sometimes even correcting it automatically, so you get the chance to make things right before the drift has a real impact on your systems.

As with any tool, the outcome depends on how you’re using it. Using a tool like Terraform does not make your company immune to configuration drift by itself. Processes still need to be in place and followed by everybody; even when all deployments depend on IaC, drift may occur in certain situations (e.g., when remote resources are added, removed, or modified). There is also no way of guaranteeing that all deployments are made using IaC, since in many cases manual deployments are still possible through the CLI, the API, or the web portal.

The easiest way to detect potential drift in Terraform is to re-compute and assess the plan that Terraform would execute to bring the infrastructure to its desired state. If the plan is empty, nothing has changed compared to what you described. If the plan contains steps (and you did not change the code), changes were made through other channels, causing the configuration to diverge. Sometimes this can be fixed automatically and the system brought immediately back to its described state, but you should at least investigate how things came to differ, and adapt processes accordingly to prevent this from happening again.
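This plan-based check can be automated. The sketch below assumes the `terraform` binary is on the PATH and that the working directory has already been initialized. It relies on the `-detailed-exitcode` flag of `terraform plan`, which returns 0 for an empty plan, 1 for an error, and 2 when the plan contains changes. The helper names `classify_plan` and `check_drift` and the status labels are hypothetical.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_STATUS = {
    0: "in-sync",  # plan succeeded and is empty
    1: "error",    # plan could not be computed
    2: "changes",  # plan succeeded and contains steps to apply
}


def classify_plan(exit_code: int) -> str:
    """Map a plan exit code to a status label; unknown codes count as errors."""
    return PLAN_STATUS.get(exit_code, "error")


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report whether reality matches the code.

    A "changes" result indicates drift only if the IaC code itself has not
    been modified since the last apply.
    """
    completed = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan(completed.returncode)
```

Running `check_drift` on a schedule for each environment and alerting on `"changes"` is one possible way to surface out-of-band modifications before they cause an incident.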

Infrastructure-as-code will prove even more useful when sharing and releasing containerized applications. Although a container image includes all the code and software dependencies required to run, once deployed in the cloud, it often requires additional infrastructure elements to allow scalability and increase reliability (e.g. a load balancer, monitoring, logging, etc.).

After the app is successfully deployed in the cloud, you need to make sure it is accessible to the designated audience and runs smoothly. This means you need to recreate all the infrastructure around the container image, and the easiest way to do this is through an IaC template that describes all the necessary configuration.

It’s important to note that the behavior and reliability of containerized applications can be greatly affected by differences between environments, such as development and production. This is due to all the databases, services, and other cloud-native resources, which are external to the app, but critical for it to run properly. In this sense, IaC helps by making changes reproducible and predictable, ensuring that the staging environment closely resembles the production environment, and making deployment of code and infrastructure changes to production a lot less risky and considerably faster.

The pros and cons of rules vs. IaC

A manually executed recipe that different people must follow to the letter, many times over and in quick succession, is an error-prone approach. Incidents are just waiting to happen: it’s not a matter of “if,” but of “when,” “in what way,” and “how often.”

Working with tested code, which runs much faster and is applied consistently every time, eliminates most of these issues, but ultimately this falls under a larger process: change management. Policies need to enforce the use of IaC and block other ways of applying changes, as well as ensure that quality-related processes are followed by all team members. Ultimately, testing, code reviews, impact assessments, and approvals boil down to some buttons clicked within a UI or a command run in a CLI tool, but the underlying work performed before these final actions is what matters most, and it still lies in the hands of people.

IaC gives you the means of doing better and eliminating issues, reducing incidents, and moving faster, but it’s up to you how you leverage these.

Addressing drift with Environment-as-a-Service (EaaS) 

Change management and automation will help you build and scale your company and also create an engineering culture based on simplicity and straightforward processes. But one thing that can help you implement these properly is an Environment-as-a-Service solution.

Earlier, we talked about the detrimental effects of configuration drift on your engineering team: hours upon hours spent on troubleshooting code and environments, trying to identify potential causes for any unexpected behavior. Also, static environments are significantly more prone to configuration drift because they are mutable: to reach a certain state, changes get applied to the current state, but the current state might not always be what we expect. Immutable environments, which are created from scratch, reduce friction and greatly lower the probability of encountering errors.
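The difference between the two models can be illustrated with a toy example. This is a simplified sketch in which an environment is just a dictionary of settings; every key and value below is made up for illustration.

```python
def apply_changes(state: dict, changes: list) -> dict:
    """Return a copy of `state` with each (key, value) change applied in order."""
    new_state = dict(state)
    for key, value in changes:
        new_state[key] = value
    return new_state


# The documented baseline and the changes for the next release.
BASELINE = {"db_engine": "postgres-14", "queue_timeout": 30}
CHANGES = [("queue_timeout", 60), ("cache", "redis")]
expected = {"db_engine": "postgres-14", "queue_timeout": 60, "cache": "redis"}

# Mutable (static) environment: it was modified by hand at some point, so the
# same changes are applied on top of a state that nobody expects.
static_env = {"db_engine": "postgres-13", "queue_timeout": 30, "debug": True}
mutated = apply_changes(static_env, CHANGES)

# Immutable environment: rebuilt from the baseline every time.
rebuilt = apply_changes(BASELINE, CHANGES)

print(mutated == expected)  # False: the hand-made edits survive the update
print(rebuilt == expected)  # True: the result depends only on baseline + changes
```

The update itself is identical in both cases; only the starting point differs. That is why rebuilding from a known baseline removes a whole class of surprises.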

In this sense, an environment-as-a-service solution can have a huge impact on the broader engineering team, granting seamless access to environments for testing and development, while increasing the time spent on actually developing the product. Over time, your engineering team will become more independent and product-focused.

Summary

The reality is that configuration drift will remain unavoidable for the foreseeable future. While drift management methods exist on the market, such as automatically comparing the current configuration of environments against their baselines, these only serve to mitigate the negative effects of drift. An EaaS solution, coupled with an IaC platform and good change management policies, will help you prevent drift and shorten your development cycles.

With proper webhooks, we can identify changes to either the code or the infrastructure. By maintaining the state of each environment, you know whether or not it’s drifting and can decide to trigger an auto-update. Typically, you’d want to avoid drift in any pre-production environment. Production environments serve live customers, typically must meet certain Service Level Agreements (SLAs), and have maintenance windows; hence, those environments would be updated by a manual trigger or through Continuous Deployment with a scheduler.
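The decision logic described above can be sketched in a few lines. This is an illustrative sketch only, not the behavior of any particular EaaS product: it assumes each environment's configuration can be represented as a JSON-serializable dictionary, and the function names and action labels are hypothetical.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Hash a configuration in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_drifting(baseline: dict, current: dict) -> bool:
    """An environment drifts when its state no longer matches the baseline."""
    return fingerprint(baseline) != fingerprint(current)


def next_action(env_kind: str, baseline: dict, current: dict) -> str:
    """Decide what to do with an environment after a change notification."""
    if not is_drifting(baseline, current):
        return "none"
    if env_kind == "production":
        # Production waits for a manual trigger or a scheduled deployment.
        return "await-manual-trigger"
    # Pre-production environments are brought back in line automatically.
    return "auto-update"
```

For example, `next_action("staging", baseline, current)` returns `"auto-update"` as soon as the two configurations differ, while the same difference in a production environment returns `"await-manual-trigger"`.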
