Transcript
Jenson: Roads might be the most expensive thing you touch every day, with a typical road costing around $3 million per mile. The only way that this immense cost could possibly be worth it is if we share the roads. Everybody shares the same road. Whether you drive a Toyota or a Tesla, you expect the same quality. You have a contract with that road. There are enough lanes for the required traffic. You can drive safely at or above the speed limit. The car you buy from a manufacturer is allowed to be driven on the road. You probably only even notice a road if it breaks that contract. The road is not wide enough for traffic. It's not safe enough with all these potholes. The drainage is broken so the road is flooded. It's the same with software infrastructure. If there are many different paths to production and your applications don't share a common road, it will be more expensive to provide a good, reliable experience. Imagine a company where every project uses a different source control system, a different build and CI tool, a different deploy pipeline. You would need to hire an infrastructure engineer to support each project. That would be like every car manufacturer building roads that only their cars can drive on.
Where the Rubber Hits the Road
A developer shows up at your company and wants to get a new application into production. What are the steps they have to go through to get their code deployed? How is the application developed? How is it built and tested? How are the resources provisioned? How is the application configured? How is it deployed? These questions need answers because they make up the contract between application and infrastructure engineering. It's the surface area between the two. It's where the rubber meets the road. The goal is to reduce this area into a single lane with clear direction, a paved road.
Background
I'm Graham Jenson, a staff engineer at Coinbase. I've seen Coinbase grow from 40 to 400 engineers, from tens to thousands of projects over the 5 years I've worked there. My main job has been making tools that take developer source code, build it, package it, and ship it to production safely and securely. I'm like the reverse delivery guy, I take your packages and then send them to Amazon. Coinbase, being a company that deals with billions of dollars in cryptocurrency, has a specific focus on security. We've spent a lot of effort making our deploy pipeline secure. Our developers also want to get their code deployed quickly. My job is to remove that friction between security and developer productivity.
Outline
Today, I could be talking about how much Coinbase has gotten from its deploy pipelines, how we deploy thousands of servers across hundreds of projects per day, to serve our millions of customers and their billions in assets. Although I'm super proud of where we are today, that's not the whole story. Instead of talking about where Coinbase is now, I want to talk about the journey we took to get here. Specifically, I want to look at our paved roads and how they've had to change over time in response to our company growing.
Let's Hit the Road
Coinbase was founded in 2012. It started out as a Rails app deploying onto Heroku. Heroku is a platform as a service. It's a pretty fast way to get an application spun up and seen by the world. To deploy, you git push to a remote branch. This will kick off a buildpack, which is like a Dockerfile that builds and containerizes your application and launches it into the cloud. The application is configured on Heroku's admin panel with information like the number of servers to run and the environment variables. At a point in time when the company priority is to get your product in front of people, why would you want anything more complicated? The paved road to Heroku then looks like, create a Git repository for your project. Actually, create the application. Add the project to the CI server. Ask Bob to create the Heroku application. Ask Bob to create the Heroku resources. Ask Bob to create the Heroku configuration. Go through security and code review, git push to deploy.
Bob is a blocker, he has the Heroku login, which means if we want to create a new application or change a configuration, we have to ask him. Getting back to our metaphor, Heroku is like a toll road. We are paying sometimes a lot to use someone else's paved road. We don't control quality, security, or safety, and it introduces blockers like Bob. On the other hand, we don't have to spend a lot of time learning to pour concrete by building our own deploy pipelines.
The Gravel Road
Heroku isn't perfect. Although a lot of companies start there, they soon move off of it, either because of the cost, the security, or the lack of control over the environment. In late 2014, when Coinbase had a few projects and engineers numbering in the teens, we started the discussion around what it would look like to replace Heroku with our own paved road. There were a few things we didn't want to lose when moving off of Heroku. Using Git commits as the primary index for what is deployed means you can find out the exact code that's in production. We also wanted to keep parts of the Twelve-Factor App philosophy, like stateless processes, one project per repository, and explicitly declared dependencies. This will help us build scalable and resilient applications. We also wanted to add some more features, like ensuring that deploys are properly scanned and reviewed before going out, the ability to scale horizontally in response to dynamic events, and allowing our users to control their own configuration.
We decided to move to AWS, for the scale it can give us spinning up hundreds of machines in minutes, and the security controls it has over its many products. Taking further inspiration from projects like Etsy's Deployinator, we started to have discussions about the culture that we wanted to build. We wanted to default to open. Let developers contribute and deploy to any project at the company. We wanted to deploy on the first day. We wanted to remove the fear of deploying by having every new developer deploy coinbase.com on their very first day at the company. We wanted to work hard to be dumb by showing relevant information, answering questions before they're asked, and being intuitive to use. On the technology side, containerizing our applications made sense. We decided to use Docker and Docker Compose to define the application processes that need running. These processes would be split up and deployed to many AWS Auto Scaling groups, which give declarative descriptions of how to manage the lifecycle and scaling of many cloud instances. We also wanted better control over our sometimes very sensitive environment variables. This was before HashiCorp Vault was created. We wanted something similar, a secure store.
Codeflow
What we really liked about Heroku was the easy to use and informative interface. We ended up building our own interface and calling it Codeflow. Codeflow is a pretty simple CRUD Ruby application. A user would log in using GitHub authentication, find the project they wanted to deploy, select it, and be presented with a list of commits and the history of deploys. You can then configure the project by selecting the deploy target and editing the Docker Compose, the environment variables, or the Auto Scaling group configuration. The user can select a commit, which shows whether its tests and security scanners have passed, if it's been properly reviewed, and if the build is ready to be deployed. Each deploy target has a button, which when clicked, will create a deploy. When deployed, Codeflow will bundle up the configuration files, build an initialization script, and spin up all of the new Auto Scaling groups with attached security groups, IAM roles, and load balancers. These Auto Scaling groups will create new compute instances. Once those instances become healthy, the old instances are detached and deleted, finishing the deploy.
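The deploy sequence described here (launch new groups, wait for them to become healthy, then retire the old ones) can be sketched as a small simulation. This is not Codeflow's code and it makes no AWS calls; the Group class, the deploy function, and the is_healthy callback are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Group:
    """Hypothetical stand-in for the Auto Scaling group of one release."""
    release: str
    instances: int
    attached: bool = True  # whether the group is attached to the load balancer


def deploy(live: List[Group], release: str, instances: int,
           is_healthy: Callable[[Group], bool], max_checks: int = 10) -> List[Group]:
    """Launch a new group, wait for it to become healthy, then retire old groups.

    If the new group never becomes healthy, it is discarded and the old
    groups keep serving traffic untouched.
    """
    new_group = Group(release=release, instances=instances)
    # A real deployer would sleep between health checks; omitted here.
    for _ in range(max_checks):
        if is_healthy(new_group):
            break
    else:
        # Health checks exhausted: abort without touching the old groups.
        return live
    for old in live:
        old.attached = False  # detach from the load balancer first
    return [new_group]  # old groups are dropped once detached
```

The point of the ordering is that the old groups are only touched after the new one is known to be healthy, so a bad release never takes traffic away from a working one.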
With Codeflow, our paved road now looks like, create a Git repository for your project. Create the application. Write a Dockerfile to containerize that application. Add the project to the CI server. Ask Bob to create AWS resources like security groups, IAM roles, and load balancers. Add the project to Codeflow with the Docker Compose, the Auto Scaling group configuration, and all the environment variables. Go through security review and code review. Click to deploy. If a developer follows these guidelines, they'll get their project into production. The only place Bob still has a job is creating AWS resources. This is only needed infrequently. Since we codified all of our resources with Terraform, this is not so much of a blocker yet. There are many other aspects of our infrastructure, which are not part of the paved road. Things like building our hardened AMI, or how our deployers manage the lifecycle of Auto Scaling groups. These details are the substrate of the road. The aspects that our developers shouldn't need to worry about. The only time they'll ever see them is if they hit a pothole.
Moving the Road
By March 2015, we had a prototype of Codeflow deploying a few small projects. By July, we were ready to move coinbase.com from Heroku onto Codeflow. This was very stressful. We made lists. Lists of things we needed to do before, lists of things we needed to do during, and lists of cleanup. Lists are great. They make it much easier to collaborate when you have many stakeholders all working together on a single project. They also highlight risky areas and clarify what to do if something goes wrong. Coinbase.com was founded in 2012, and was on Heroku until July 2015. It took six months of planning, development, and execution to get us onto our own paved road. My point here is that there's no rush. Roads are difficult and expensive to build, even more so to change. Doing lots of groundwork here will save you in the long run.
Desire Paths
Desire paths are created by people walking a route, over and over, eroding the soil and eventually creating a new trail. They appear in places where there are no or inefficient existing paths. You can stop the erosion by building fences or you could learn from them and pave a more usable path. Coinbase was a Ruby company but people wanted to start using Golang. There's really no blocker to using Golang. If you could containerize your application, then our deploy pipelines would work just fine. Also, there was no hard rule saying you couldn't use Golang, especially if you thought it was the right tool for the job. The only thing we had to convey is that it wouldn't be supported. They'd be going off-road. As more projects started using Golang, developers were working out patterns and solutions. The path was becoming more well-trodden. As this happened, we started to offer more support. After a few years of work, the road was paved and the developers were happy. In fact, Golang is now the preferred language at Coinbase. As road makers, we don't really decide where the roads go. Every company will have a different path. My tool might not be right for you, because your problems aren't the same as mine. The team building the road just needs to look where people are going and pave underneath their feet.
Single Lane Road
In 2017 to 2018, Coinbase exploded with new users, new engineers, and new projects. When you get hundreds of projects, all building and deploying at the same time, you quickly find out which of your paved roads don't have enough lanes. Bob being the only one creating AWS resources became a big blocker. Even though we had all of our resources codified using Terraform, adding to them or changing them became a huge pain. To remove Bob from this workflow, we created a project called Terraform Earth. Terraform Earth allows developers to submit a change to our codified resources. These resources are in a format we created called GPS. GPS compiles to Terraform but simplifies it to encourage developers to manage their own resources. Once developers submit a change, Terraform Earth comments a plan of the changes that will happen once merged. Once the plan and the code have been reviewed, the change can be merged. Then, Terraform Earth will apply that change to the cloud. This removes Bob as a blocker, and has enabled us to scale to hundreds of changes from hundreds of contributors per week. Our Auto Scaling group deployer also became a big blocker. It started out as nothing more than an infinite while loop listening to a queue. We found this design difficult to scale reliably and safely, and deploy times were getting painful during peak work hours.
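The submit, plan, review, apply workflow can be illustrated with a toy model in which resources are plain dictionaries. This is only a sketch of the idea, not Terraform Earth's implementation and not Terraform's real plan format; the plan and apply functions and the resource names in the usage note are hypothetical.

```python
from typing import Dict, List

Resources = Dict[str, dict]  # resource name -> its attributes


def plan(current: Resources, desired: Resources) -> List[str]:
    """Return readable lines describing what applying `desired` would change."""
    lines = []
    for name in sorted(desired.keys() - current.keys()):
        lines.append(f"+ create {name}")
    for name in sorted(current.keys() & desired.keys()):
        if current[name] != desired[name]:
            lines.append(f"~ update {name}")
    for name in sorted(current.keys() - desired.keys()):
        lines.append(f"- destroy {name}")
    return lines


def apply(current: Resources, desired: Resources) -> Resources:
    """After review and merge, the live state is made to match the desired state."""
    return {name: dict(attrs) for name, attrs in desired.items()}
```

For example, with current = {"sg-web": {"port": 80}, "role-old": {}} and desired = {"sg-web": {"port": 443}, "lb-web": {}}, the plan lists one create, one update, and one destroy. Posting that plan on the change is what lets a reviewer see the effect before anything touches the cloud.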
Teams were also wanting to deploy to other locations, like serverless with Lambda and API Gateway. What we saw was an explosion, not only in deployments but in different types of deployers. In response to that, we created bifrost, our paved road for building deploy pipelines. Bifrost uses AWS Step Functions, which are basically state machines that orchestrate AWS Lambda functions and come with strong guarantees, error and retry handling, and built-in scale. This let us build deployers quickly, replacing our old ad hoc deployers with ones that are more reliable, scalable, and share a similar API. Around this point, we started building an internal framework called Coinbase Service Framework, or CSF. This is a multi-language library that contains much of the boilerplate needed to build a service at Coinbase. The paved road with Terraform Earth and CSF looks like, create a Git repository for your project. Create the application using CSF. Write the Dockerfile to containerize your application. Add your project to the CI server. Submit a code change to create the AWS resources that you need. Add the project to Codeflow with all of its configurations. Go through security and code review, and click to deploy.
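To make the idea of a deployer as a state machine with error and retry handling concrete, here is a minimal in-process sketch. It does not use AWS Step Functions (which defines machines in the Amazon States Language) and is not bifrost's code; the run function, the state tuple layout, and the state names in the usage note are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each state is (task, next state on success, next state once retries are exhausted).
State = Tuple[Callable[[dict], None], Optional[str], Optional[str]]


def run(states: Dict[str, State], start: str, context: dict,
        max_attempts: int = 3) -> List[str]:
    """Run states from `start`, retrying a failing task up to `max_attempts` times.

    Returns the names of the states visited, in order.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    visited: List[str] = []
    name: Optional[str] = start
    while name is not None:
        task, on_success, on_failure = states[name]
        visited.append(name)
        for attempt in range(1, max_attempts + 1):
            try:
                task(context)
                name = on_success
                break
            except Exception:
                if attempt == max_attempts:
                    name = on_failure  # retries exhausted: take the failure path
    return visited
```

A deploy could then be described as states such as "Validate", "Deploy", and "Rollback", where a transient failure in one step is retried and a persistent failure routes to the rollback state instead of leaving the deploy half finished.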
The Highway
From 2018 to now, Coinbase has grown enormously. The biggest problem when you get thousands of projects and hundreds of engineers is that any ad hoc process, any tribal or siloed knowledge becomes a massive pain point. The only way to scale is to document everything. We needed to build maps of all of our existing roads and where they went. Another issue is the increased amount of work that goes into upgrading shared components like CSF, across all of the projects that depend on them. Before, if a component was used by a dozen projects, it might take an hour to upgrade all of them to a new version. When it's 500 projects, that can be weeks' worth of work. Components become a victim of their own success. The more projects that use them, the more difficult they become to maintain. This is one of the reasons why in late 2019, we started a monorepo at Coinbase. A monorepo is a single large repository that contains code for many projects. Having all components and their dependencies in a single location, with examples and tests, makes broad multi-project updates possible with a single commit.
With a monorepo we had to throw out some of our old assumptions. First, and most obviously, a project is no longer one to one with the repository. Second, configuration like Docker Compose will now live with the code. Also, although we still use Docker, we don't allow each project to define its own Dockerfile. Instead, we have a shared base container for each different language. For our monorepo, we use Bazel, an open source build tool based on Google's internal monorepo tool, Blaze. Bazel has some pretty strict constraints, like explicitly listing all the inputs and outputs for every build. This creates a queryable dependency graph from every file to every output. In the monorepo, when we see a new commit, we can calculate the exact tests to run and the exact artifacts we need to build based on the changed files in that commit. This, together with really good caching, will allow us to expand our monorepo to thousands of projects.
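Working out what to test and build from the changed files is a reverse walk over the dependency graph. The sketch below shows the idea on a hand-written graph; it is not how Bazel is implemented (Bazel's own query language offers an rdeps function for this kind of question), and the affected_targets function and the target names in the usage note are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def affected_targets(deps: Dict[str, List[str]],
                     changed_files: Iterable[str]) -> Set[str]:
    """Return every target that depends, directly or transitively, on a changed file.

    `deps` maps each target to the files and targets it lists as inputs.
    """
    # Invert the graph: input -> targets that consume it.
    consumers: Dict[str, List[str]] = {}
    for target, inputs in deps.items():
        for item in inputs:
            consumers.setdefault(item, []).append(target)

    affected: Set[str] = set()
    queue = deque(changed_files)
    while queue:
        node = queue.popleft()
        for target in consumers.get(node, []):
            if target not in affected:
                affected.add(target)
                queue.append(target)
    return affected
```

With a graph in which //api:server depends on //lib:csf and //api:server_test depends on //api:server, a change to a file under lib marks all three targets as affected, while an unrelated target such as //web:app is skipped. This only works because every input is declared explicitly, which is the strict constraint mentioned above.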
To deploy a project in the monorepo, we don't have a UI yet. Instead, we use a CLI tool called Artifact-Shipper, or Ash. To deploy from the monorepo, you just run, monorepo> ash deploy, followed by the project. This will give you a list of artifacts to deploy and where they can be deployed to. Once selected, Ash will send that artifact to one of our bifrost deployers. You can also use Ash to get deploy history, follow a current deploy, or perform a bunch of other related tasks. To create a new project in the monorepo, we built a bootstrapping tool, where you run, monorepo> bazel run new, and answer a few simple questions before it creates all the files in the right locations.
Our paved road in the monorepo now looks like, execute bazel run new, to create a new project with CSF. Submit a code change to create your AWS resources. Go through security and code review. Ash deploy. This has removed and simplified so many steps compared to where we were just a few years ago. Our monorepo is still in its early stages. It contains dozens of projects and supports four languages: Python, Ruby, Go, and Node. It continues to grow daily. The next big project, though, will be moving coinbase.com into the monorepo. At the moment, we're just laying the foundations for that big move.
Fork in the Road
The United States Interstate Highway System is a massive grid of roads that crisscrosses every state. It would cost around $500 billion to build today, but since one quarter of all the miles driven in a year in the States is on those highways, it's probably worth it. Imagine the way in which interstate roads were built before this was in place. Each state would have to negotiate with its neighbors about how and where to build the roads. Sure, Hawaii and Alaska would have it easy, but it would be a nightmare for states like Tennessee and Missouri, which each have eight neighbors. For drivers, the safety, the rules, and the quality of the roads would change from state to state, making it a horrible experience. Having a single entity set the standards by which these roads operate removes the need for n-squared negotiations. This is the same at any organization. If you have to negotiate with every team and project how they're going to build and deploy, it's going to be very difficult to build a shared paved road. This means less quality for developers and infrastructure engineers. If you want to build a shared paved road, the very first step has to be getting buy-in from your organization.
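The n-squared remark can be made concrete with a little arithmetic, assuming the worst case in which every party has to negotiate with every other party. The two function names below are hypothetical and exist only to show the growth rate.

```python
def pairwise_negotiations(parties: int) -> int:
    """Distinct pairs among `parties`, which is n * (n - 1) / 2."""
    return parties * (parties - 1) // 2


def negotiations_with_standard(parties: int) -> int:
    """With one shared standard, each party only has to agree with it once."""
    return parties
```

For 10 teams that is 45 pairwise agreements against 10, and for 100 teams it is 4,950 against 100, which is why the gap gets worse as an organization grows.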
End of the Road
Building a shared paved road lets you focus on the specific problems developers have in getting their code into production. The easiest place to start is to write a list of all of the steps a developer needs to go through to get a new application into production. Once you have that list, work on reducing the number of variations, eliminating steps, simplifying decisions made by developers, and removing any blockers like Bob.