Honeycomb is a tool for introspecting and interrogating production systems. The team has been a long-time pioneer of infrastructure-as-code (IaC) and is currently using HashiCorp Terraform for their configuration-as-code management. They recently made a push to bring the rigor from their binary release process to their infrastructure configuration releases.
InfoQ took the opportunity to ask the team about their move to centralised push-on-green deployment, how that has reduced their reliability risk, and how painful it was.
InfoQ: You suggest in your original article that by continuously deploying infrastructure as code, you’ve de-risked your operations. Was the juice worth the squeeze?
Honeycomb team: Yes, it's enabled us to move quite a bit faster, going from two to four Terraform deployments per week at a steady state to two-plus deployments per day, and we’re taking advantage of some of the newer AWS and Terraform features.
Our infrastructure engineering manager worries that we may now be at the other end of the spectrum, exposing ourselves to more risk on net now that we're moving faster, but the risk has certainly scaled at a lower rate than the number of changes we're making. His main concern is that we've made it so routine for people to make large Terraform changes that they're getting less scrutiny than they should... but we'd also counter that perhaps these changes were already being made manually in people's Terraform clients without being run through a pre-flight gauntlet.
InfoQ: Did this help you achieve any infrastructure cost savings?
Honeycomb: Like any SaaS analytics startup, our cloud bill is a significant fraction of our overall expenditure. As part of this shift, upgrading our Terraform instance meant that we could use an AWS feature released in November 2018, which lets us combine the autoscaling groups that we were already using with Spot requests, both to cover peak load and to replace or reschedule instances automatically.
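As a rough illustration of what such a setup can look like in Terraform 0.12 syntax: the interview does not name the AWS feature, so this sketch assumes it is the Auto Scaling group mixed instances policy, whose timing matches. Every name, size, and instance type below is a hypothetical placeholder, not Honeycomb's configuration.

```hcl
# Sketch only: all names and values are hypothetical placeholders.
variable "subnet_ids" {
  type = list(string)
}

variable "ami_id" {
  type = string
}

resource "aws_launch_template" "worker" {
  name_prefix   = "worker-"
  image_id      = var.ami_id
  instance_type = "m5.large"
}

resource "aws_autoscaling_group" "worker" {
  name_prefix         = "worker-"
  min_size            = 2
  max_size            = 10
  vpc_zone_identifier = var.subnet_ids

  mixed_instances_policy {
    instances_distribution {
      # Keep a small On-Demand baseline; serve everything above it from Spot.
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 0
      spot_allocation_strategy                 = "lowest-price"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.worker.id
        version            = "$Latest"
      }

      # Offering several instance types gives the group more Spot pools to
      # draw from when one pool is interrupted or unavailable.
      override {
        instance_type = "m5.large"
      }

      override {
        instance_type = "m5a.large"
      }
    }
  }
}
```

With a policy like this, the group itself replaces interrupted Spot instances, which is the automatic replace/reschedule behaviour described in the answer above.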
InfoQ: I see that you’ve used HashiCorp’s Terraform CI tool. Would you recommend this approach to others?
Honeycomb: Yes, especially given that Terraform Cloud is much cheaper than our previous tooling. Our existing CI tooling didn't manage Terraform state, and also we didn't want to give it access to mutate production. Having something out-of-the-box that understood Terraform and a subscription which included HashiCorp support and prioritization of our bugs was a definite win. Previously, I'd have said that it was challenging to get proper CI/CD for Terraform without Terraform Enterprise's price tag. We'd been quoted tens of thousands of dollars, which seemed a bit steep for us given we have literally four people in the company who touch Terraform.
InfoQ: What’s next?
Honeycomb: We have a lot of work that needs to happen to support Honeycomb's feature launches, ongoing operations and scaling up over the next year. I know that centralising some of our Chef management is on our wishlist for 2020, because while we use Terraform to handle our AWS infrastructure as code, we use Chef to bootstrap individual nodes, and that process is less smooth than we would like it to be.
InfoQ: Do you have any plans to mitigate the higher risk profile that your infrastructure engineering manager mentioned, or is that risk accepted?
Honeycomb: We know we'd like to write more [policy as code] Sentinel scripts to enforce guard rails against both availability and cost risks. It just hasn't hit our top-five most important things to do, because infrastructure should always serve business needs rather than being built too far ahead of the train's path.
InfoQ: What were your main challenges in bringing your infrastructure as code into a CI/CD pipeline, and what can others learn from your experience?
Honeycomb: Honestly, it was pretty simple and painless. "backend remote {}" is easy to swap in for "backend s3", but we obviously had to get people provisioned with Terraform Cloud accounts as part of the onboarding workflow, now that the workflow no longer runs locally on their machines. The worst pain was probably doing the non-mandatory but still useful upgrade of Terraform. Previously we were stuck on version 0.10, and moving to Terraform 0.12 introduced not only language changes but also some rather large code diffs from their migration utility.
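For readers who have not made this change, the backend swap mentioned above might look roughly like the following. This is a minimal sketch, assuming Terraform Cloud's default hostname; the bucket, key, region, organization, and workspace names are all hypothetical placeholders.

```hcl
# Before: state stored in S3 (placeholder values).
# terraform {
#   backend "s3" {
#     bucket = "example-terraform-state"
#     key    = "prod/terraform.tfstate"
#     region = "us-east-1"
#   }
# }

# After: state and runs managed through the remote backend.
terraform {
  backend "remote" {
    hostname     = "app.terraform.io" # Terraform Cloud's default hostname
    organization = "example-org"      # hypothetical organization name

    workspaces {
      name = "example-prod" # hypothetical workspace name
    }
  }
}
```

After a backend change like this, running terraform init should detect the new configuration and offer to copy the existing state across. Each engineer also needs Terraform Cloud credentials first, which is the account-provisioning step described above.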