
Reducing Build Time with Observability in the Software Supply Chain

Tools commonly used in production can also be applied to gain insight into the CI/CD pipeline and reduce build time. Ben Hartshorne, engineer at honeycomb.io, gave the presentation Observability in the Software Supply Chain: Seeing into Your Build System at QCon San Francisco 2019.

Honeycomb is a company and product built around understanding the workings of complex production processes, and they chose to apply their own tool to their build process. Hartshorne said that "the visibility into this normally opaque black box has been stunning". He mentioned two areas where adding instrumentation really changed his understanding of how the build was behaving: variety over time and running in containers.

Hartshorne mentioned that he knew their build was getting slower as they added a lot of code. What they didn’t realize was that the slowdown was not just unevenly distributed, but that some areas of the build fluctuated dramatically over the course of months, going both up and down:

As we had added and removed code while building Honeycomb, we shifted frameworks and made architectural changes to code that dramatically influence the amount of time some areas of our build took. My naïve expectation that all build stages got slower over time was wildly inaccurate - some stayed the same, some went down, and others went up by a little, with some going up by a lot. Understanding this variety changed my mental model of how the build was behaving and helped us focus efforts where it would have the most impact.

Hartshorne mentioned that they had switched SaaS providers, with the side effect of changing the build from running in VMs to running in containers. The container change was mostly incidental to their decision to switch providers (the main reason being easier parallelization), but they figured it’d be nice to have the reduced startup time you get in a container:

While our numbers did show that the median time per build step dropped by a bit, the 95th percentile increased dramatically! This was not something we had expected and is not something we would have noticed without powerful tools to visualize performance over time.

We dug into it a little bit and talked with our CI provider, but the culprit for the increased time eluded us. The closest we’ve come is recognizing increased co-tenancy issues due to tighter container packing, but one limitation of keeping instrumentation in user-space was our inability to get numbers to confirm the idea.

Thankfully, the time we gained from running independent steps in parallel far outstripped the increased variance per step and the overall build times still dropped significantly. In this case, the instrumentation identified a previously unknown characteristic of our build system - the wider variance in per-step time - and including that in our model for the process lets us make better decisions about the architecture of the system and future work.

Those two examples illustrate the part of this experiment that’s been the most interesting, said Hartshorne. It’s taking the tools that they normally apply to production infrastructure (whether it’s tracing or metrics or anything else) and using those to influence more of the software supply chain, and more of the build and test processes. Every step along the path from commit to deploy could benefit from using the same toolset that they (as operators) are already experienced with for running complicated applications, Hartshorne said.
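
The median-versus-95th-percentile observation above hints at the kind of distribution analysis this instrumentation enables. The sketch below is purely illustrative, not Honeycomb's tooling: the step names and durations are invented, and it simply shows how a step's median can look healthy while its tail tells a very different story.

```python
# Illustrative only: invented per-step durations (seconds) across many builds.
# The point is that a step's median can stay flat while its 95th percentile
# (the tail) gets dramatically worse - the pattern described above.
from statistics import median, quantiles

step_durations = {
    "checkout":   [4, 5, 5, 6, 5, 5, 6, 5, 40, 5],
    "compile":    [110, 115, 112, 118, 120, 113, 117, 114, 116, 111],
    "unit-tests": [300, 310, 295, 305, 298, 302, 307, 299, 900, 304],
}

for step, samples in step_durations.items():
    p50 = median(samples)
    # quantiles(..., n=20) returns the 5th, 10th, ..., 95th percentile cut
    # points; the last one is the 95th percentile.
    p95 = quantiles(samples, n=20)[-1]
    print(f"{step:10s} p50={p50:7.1f}s  p95={p95:7.1f}s")
```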

InfoQ interviewed Ben Hartshorne about the challenges they faced, what they have done to get insight into the performance of their build process, the benefits they have gained and what they have learned.

InfoQ: What challenges did you face when developing Honeycomb?

Ben Hartshorne: As a young startup, we hit a number of issues along the way that are totally common. We have our automated builds and as any codebase grows, the build slows down. We spend a certain amount of time making them better.

Most of these changes were rather boring (when you’re just starting out even the obvious things get skipped), but to give you a sense of the first steps:

  • Run tests on more capable hosts (scale up)
  • Run independent tests in parallel instead of serially (scale out)
  • Increase parallelism within tests (for languages that support multiprocessor builds)
  • Cache dependent libraries that don’t change between builds
  • Re-use built results instead of rebuilding at each step

Each of those can be done without really understanding why your builds are slow and they’ll almost certainly have a positive impact. Different SaaS providers may make some easier than others, but eventually you’ll exhaust the obvious easy answers and need better data in order to choose where to invest your time.
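
As an illustration of the "scale out" item in that list, a build script might fan independent test suites out to separate processes. The sketch below is a hypothetical example, not Honeycomb's setup; the suite commands are placeholders.

```python
# Hypothetical example of the "scale out" step: run independent test suites
# in parallel processes instead of serially. The suite commands are
# placeholders, not Honeycomb's actual build steps.
import subprocess
from concurrent.futures import ThreadPoolExecutor

TEST_SUITES = [
    ["go", "test", "./api/..."],
    ["go", "test", "./storage/..."],
    ["go", "test", "./ui/..."],
]

def run_suite(cmd):
    # Each suite is a separate OS process; capture output so failures are readable.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return cmd, result.returncode, result.stdout + result.stderr

with ThreadPoolExecutor(max_workers=len(TEST_SUITES)) as pool:
    for cmd, code, output in pool.map(run_suite, TEST_SUITES):
        status = "ok" if code == 0 else f"FAILED (exit {code})"
        print(f"{' '.join(cmd)}: {status}")
        if code != 0:
            print(output)
```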

InfoQ: How did you get a deeper insight into the performance of your build process?

Hartshorne: Some of the tools you might use in the software supply chain export some metrics. GitHub has its commit history, and build systems expose a small amount of timing and status information around builds. We have other, really fancy tools, and as developers and operators we know how to understand very complex systems using these tools - we should take advantage of that.

The key to fitting these pieces together is to insist that our vendors have APIs that expose this data, both in realtime and after the fact. APIs are the glue that let us push a commit to GitHub that triggers a build in CircleCI that pushes release artifacts that trigger a deploy… APIs are how we create a software supply chain - and they will be how we instrument it as well. The APIs that expose timing and performance data are less complete than those that provide direct functionality because not enough people want them.

As our industry realizes that data about how code moves from development to production correlates closely with a business’s ability to respond quickly to a changing environment, I think more people will want access to the numbers that let you see these processes go by. Two of the factors the DORA State of DevOps report identifies as correlating with high-performing teams are directly tied to how long it takes code to move from concept to production (deploy frequency and change lead time) - there’s no question in my mind that this will be an area of growth in the near future.
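
As a concrete example of the API-driven instrumentation Hartshorne describes, a build step's timing can be pushed to an observability backend over HTTP. The sketch below assumes Honeycomb's single-event HTTP endpoint (POST /1/events/<dataset>) and the X-Honeycomb-Team write-key header; treat these, along with the field, dataset, and environment-variable names, as assumptions to check against the current API docs.

```python
# Hedged sketch: send one build-timing event to Honeycomb over HTTP.
# The endpoint (/1/events/<dataset>) and X-Honeycomb-Team header are assumptions
# about the events API - verify against the current docs. Field names are
# arbitrary examples, and the dataset/env-var names are made up for this sketch.
import os
import requests

DATASET = "build-events"                       # assumed dataset name
WRITE_KEY = os.environ["HONEYCOMB_WRITE_KEY"]  # injected by the CI environment

def send_build_event(step, duration_s, exit_code, build_id):
    event = {
        "service": "ci",
        "step": step,
        "duration_s": duration_s,
        "exit_code": exit_code,
        "build_id": build_id,
    }
    resp = requests.post(
        f"https://api.honeycomb.io/1/events/{DATASET}",
        headers={"X-Honeycomb-Team": WRITE_KEY},
        json=event,
        timeout=5,
    )
    resp.raise_for_status()

# Example: report that the unit-tests step took 312 seconds and passed.
send_build_event("unit-tests", 312.4, 0, os.environ.get("CI_BUILD_ID", "local"))
```

The same shape works for any backend that accepts structured events; the important part is that every step emits its name, duration, and outcome.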

InfoQ: What benefits have you noticed?

Hartshorne: By hooking these processes together and understanding patterns in the data flow, we can really improve the lives of our own developers and the efficiency of our development process. This is an exciting area because there hasn’t been a whole lot of work focused there. I’m not really sure why this focus has been missing, but hazarding a guess, I’d say that the CD portion of the CI/CD world seems to be the part that’s hanging on to custom patchwork setups the hardest.

Seeing developers hook up their code to a continuous testing environment is standard fare these days. Automatically moving the artifacts built by that continuous process into production seems to be far less standardized. Folks that work in PaaS environments might be the first to see progress here (in something like Heroku or Jenkins X or Kubernetes, the deploy step is much more likely to be easily integrated with the test step), but we’ll see.

InfoQ: What have you learned?

Hartshorne: First, it became clear that build systems are not as impenetrable as they appear to be. The purpose of a build system is to run a bunch of commands, and by hooking into that with some ordinary operating-system and process tricks, you can get an enormous amount of insight into your build processes very easily. Second, you can use that insight to focus the work you put into maintaining your build system. And third, there are huge areas of the software supply chain outside of build and test that will benefit from the same kind of analysis.
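
A minimal sketch of that first point, assuming nothing more than the ability to wrap the commands a build runs: time each step with ordinary process tools and append a structured record that can later be shipped to whatever observability backend is already in use. The step commands and log path below are illustrative, not Honeycomb's actual build.

```python
# Illustrative sketch: wrap the commands a build runs, time each step, and
# append a structured record to a local file that can later be shipped to an
# observability backend. Step commands and the log path are placeholders.
import json
import subprocess
import time

def run_step(name, cmd, log_path="build-timings.jsonl"):
    start = time.monotonic()
    result = subprocess.run(cmd)
    record = {
        "step": name,
        "cmd": " ".join(cmd),
        "duration_s": round(time.monotonic() - start, 3),
        "exit_code": result.returncode,
        "finished_at": time.time(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return result.returncode

# Example build: every step is timed, whether it succeeds or fails.
for name, cmd in [("deps", ["go", "mod", "download"]),
                  ("compile", ["go", "build", "./..."]),
                  ("unit-tests", ["go", "test", "./..."])]:
    if run_step(name, cmd) != 0:
        raise SystemExit(f"step {name} failed")
```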
