Transcript
Today, I'm going to talk to you a little bit about Coinbase's journey towards self-service infrastructure. When we talk about Coinbase, a lot of the time we talk about scale, right? We have this great story of going through November and December, experiencing cryptomania and having explosive growth: 10x, 100x, huge request volumes, tons of users. It was a really tough time. I think we had a lot of foresight coming out of that moment to realize that this wasn't going to last forever and that we had to build more of a market. After all, we do have a mission of building an open financial system for the world.
The powers that be decided, "Hey, we're going to diversify the business. We're going to start doing lots of new things." All of a sudden, our scale of users turned into a scale of complexity. I'm going to walk you through a little bit of our journey, and I have to give you some background. There are more or less three components to this: where we were, where we're going, and where we are today, of course. To give you some background, we had some tools, we weren't prepared for this, and really, this is a growing-up story for Coinbase's infrastructure team.
To dive right in, we have one important component that we build off of at Coinbase. It's a wrapper around Terraform. It's Ruby, it's in code, it's Terraform underneath; if you're familiar with Terraform, it's a great way to describe your resources in Amazon or GCP or whatever you want to use. In our case, we're an Amazon shop. We don't use it for much beyond that, but there are a lot of complexities that go along with Terraform. If you have a lot of engineers, you have to manage your state files, right? And it gets really easy to stomp on each other. It doesn't support for loops; well, they're adding it, but at the time it didn't. So, about two years ago, we decided to build something like this.
GeoEngineer simply turns Terraform on its head before it starts running. We go fetch all the resources from Amazon, we build up our state files, and then we plan against those changes. It looks a lot like HCL, has some nice features on top of it, and more or less allowed us to scale a team while people were still interacting with AWS directly. If you were to look at actually applying resources, it wasn't a very advanced process. Someone would hop on their laptop, they'd assume-role into an account, they'd get GeoEngineer, they'd check out the repository that contains all our resources, and they'd go apply and hopefully hit every environment. If you aren't familiar, assuming a role is just a simple way for us to not have an account in all of our VPCs; you just assume role into one, you have the right permissions, and you can go make those changes.
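To give you a flavor, a resource definition in this world is just Ruby. Here's a rough sketch, loosely adapted from GeoEngineer's public examples; the exact method names and values are illustrative, not our real resource repository.

```ruby
# Rough sketch of a GeoEngineer-style resource file, loosely adapted from the
# project's public examples. Names and values are illustrative only.
environment = GeoEngineer::Environment.new("staging") {
  account_id "123456789012"   # placeholder account
}

# A project groups the AWS resources a single service owns.
project = GeoEngineer::Project.new("org", "example_service", environment)

project.resource("aws_security_group", "allow_http") {
  name        "example_service_allow_http"
  description "Allow inbound HTTP to example_service"
  tags { Name "example_service_allow_http" }
}
```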
We also had one cool tool, Terraform Mars. When you wanted to see what was happening, or a developer wanted to make changes themselves, they would need some way of knowing what their changes to AWS resources would actually do. Of course, not everyone at Coinbase has AWS permissions. So how do you give them access? Essentially, they would put up a PR, it would come into GitHub, Terraform Mars would pick up a webhook and it would say, "Hey, I'm going to go use all these AWS access keys, get into each account, plan the changes you made, and post the plan back to the PR." Great, everyone can use infrastructure, we're done. Well, there are a lot of problems that go into this. First of all, we have kind of a complicated setup. Like I was mentioning, we have a scale of complexity. We also have a scale of engineers. That was a little bit of a new story: we're rapidly hiring, and how do you manage all these people?
The Problem
Let's dive into the journey a little bit. This is what engineering at Coinbase looks like. Maybe you have 100 developers. Of those, maybe 12 are infrastructure, including engineering managers. Of those, we have three secondaries, people who have root privileges. I, as an infrastructure engineer, can do a fair amount of things, many more than most of our developers can. But I can't do everything. Sure, I can terminate an instance. I can change our auto-scaling groups to get people more capacity. But I can't edit our security groups. I can't edit our load balancers. I can't get into S3 buckets or modify RDS instances.
There's kind of a push and pull here. Our developers wanted to build lots of stuff really fast. There are fewer infrastructure engineers, so all their problems get magnified on a small group of people, and then, of course, we can't do everything. We magnify those problems onto a group of three people. As you can imagine, before long, we end up in a situation like this. Infrastructure was cracking at the seams, there was too much work to do, everything was disparate. We had all these different products coming up. We didn't know all of them, and we have to somehow make heads or tails of it, build good PRs. Sometimes they would send us PRs, and we'd have to do really complicated code reviews. And hopefully, if we can apply it, we bundle that up to something nice and hand it off to our secondaries. Except that's not always how it worked. Eventually, they were on fire too. As you can imagine, being on fire sucks in an engineering organization. If you're burned out, you don't want to work. It's not a good place to be.
As a result of feeling burned out, what if they want to take a vacation? Now the entire organization stopped, and we've lost every advantage we have as a startup. We're not moving agilely. We're not moving nimbly. We're not building new product. We're just some slow, big company, and that sucks. No one wants to develop like that. So, we have to change something.
Like I said earlier, we have a scale of complexity. When I was talking about business diversification, I was pretty serious about it. We started with coinbase.com and our exchange, GDAX, now Coinbase Pro. But we wanted to do so much more. We had our merchant tools, we had our wallet. We were building an index fund. We wanted custody. I mean, the list goes on and on, and each of these use cases is completely different. We in infrastructure try to build good platforms for people, but it's not always easy. You're not in the code or the database. How are you supposed to know what everyone needs?
Our team's not built like an operations team. Operations at Coinbase used to be very small, and so it made sense for infrastructure to kind of hold on to operations. It was a small ask, so we'd go do it. But as that one really big ask of staying reliable turns into many, many, many small asks, all of a sudden operations becomes the entirety of your job. That also sucks for us as developers.
We had an error-prone process. Someone goes on their laptop and checks out code. Did you remember to pull? Since this controls all of our resources, you could roll back an ELB, you could change security group settings. You could close off a database on a live production instance, and all of a sudden we're down. It wasn't because of load, it was just because of stupid mistakes. What if you forgot to apply in every environment? Then we have this communication problem of, "Hey, I thought my changes were applied. Why aren't they applied?" "Oh, of course. I forgot about production."
We're growing and everyone wants access. Everyone wants to be able to see what they're deploying to. Developers like knowing things; the more you try to hide it, the more they want to know. They get credentials, and they want to be able to read things, and that scares our security team. They're tasked with making sure we don't lose our crypto, and the more access we have to our crypto, the scarier it gets. And a sleepless security team is a bad security team. Make sure you let your security team sleep; otherwise, you're going to make your life very hard.
And the last point here is, infra has always had a huge dream of actually segregating all these resources. How do we get everything separate? So our coinbase.com resources live in a separate place from our crypto resources, from our CR resources, from our exchange. We can have this beautiful ecosystem of services that you can access but that are separated.
Introducing Terraform Earth
We were in a hole, and we had to do something. We had incredibly limited capacity. This is just a tough place to be in. When you're here, you don't have many options. You do things in a crappy way, you try to do things fast, and hopefully smart in the process, but you make mistakes. So, do you try new things? We resorted to things we already knew. We built a tool called Terraform Earth. First of all, we're not very good at naming. Like I said, we use things we know.
If you were to really dive into this, it probably looks a lot like Terraform Mars at the end of the day. There's one extra component here: that's Heimdall over there. Heimdall is a PR approval bot, a code review bot. GitHub didn't have this at the time; it does today. GitHub's approvals still aren't super great; if anyone's here from GitHub, there's some sweet stuff we can talk about. But the key part here is that we require MFA on almost everything we do at Coinbase. We want people to verify their identity. We want people to prove they are who they say they are. GitHub's API isn't super clean; it provides a lot of information. We needed something really simple, and we wanted to simplify that for developers as well.
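Heimdall itself is internal, but the consensus check it does is conceptually simple. Here's a hypothetical sketch of what checking a PR for approvals against the GitHub reviews API might look like using the octokit gem; the two-approval threshold and repository name are made up, and the real bot also ties approvals back to MFA-verified identities.

```ruby
# Hypothetical consensus check against the GitHub reviews API using octokit.
# Threshold and repo name are illustrative, not Heimdall's real logic.
require 'octokit'

def consensus?(client, repo, pr_number, required_approvals: 2)
  reviews = client.pull_request_reviews(repo, pr_number)

  # Keep only each reviewer's latest review, then count approvals.
  latest = reviews.group_by { |r| r.user.login }.map { |_, rs| rs.last }
  latest.count { |r| r.state == 'APPROVED' } >= required_approvals
end

client = Octokit::Client.new(access_token: ENV.fetch('GITHUB_TOKEN'))
puts consensus?(client, 'example-org/infrastructure', 123)
```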
And most importantly, no single system in our ecosystem should be able to tear everything down. So someone who has administrative access to GitHub shouldn't be able to deploy malicious code to Coinbase, or push things out without approval. This goes for a lot of our systems. Take Google Apps for Business, also a really scary one; you don't want people taking over your email addresses. These are all things you have to think about and build protections around.
How does it work? It looks almost exactly like Terraform Mars. There is one key difference here: it's this instance profile right there. If you are familiar with AWS at all, there are two ways to access resources. You can use an access key and a secret key, which you essentially have to vend to everyone. They're really hard to roll, because if you delete them, the service is down. Then there are instance profiles. Instance profiles are sweet because they give your EC2 instance access to resources exactly as you define them. We also use this assume-role technique: we have multiple AWS VPCs, and it can assume-role into each one to be able to make changes. This allows us to have some granularity between permissions, the ability to lock down some environments more than others.
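The assume-role hop itself is just an STS call. A minimal sketch with the Ruby AWS SDK, assuming the instance profile supplies the base credentials and using a placeholder role ARN:

```ruby
# Sketch of the assume-role hop. The instance profile provides the base
# credentials automatically; STS then vends temporary credentials for a role
# in another account/VPC. The role ARN below is a placeholder.
require 'aws-sdk-core'  # provides Aws::STS::Client
require 'aws-sdk-ec2'

sts = Aws::STS::Client.new(region: 'us-east-1')

resp = sts.assume_role(
  role_arn: 'arn:aws:iam::111111111111:role/terraform-earth-apply',
  role_session_name: 'terraform-earth'
)

creds = Aws::Credentials.new(
  resp.credentials.access_key_id,
  resp.credentials.secret_access_key,
  resp.credentials.session_token
)

# Any client built with these temporary credentials now operates inside the
# target account, which is what lets one deployment reach every VPC we own.
ec2 = Aws::EC2::Client.new(region: 'us-east-1', credentials: creds)
puts ec2.describe_vpcs.vpcs.map(&:vpc_id)
```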
Most importantly, it allows us to have a single production deployment. A lot of the ways this has been done in the past is, "Hey, we spun up a new VPC." If you want infrastructure in there, you've got to deploy X, Y, and Z service. You need your deployer in there, you need your caching, you need your AWS resources, you need your logger. The way we like to do it at Coinbase is everything lives in the infrastructure VPC, and we just assume-role into the clouds we own.
This is a great strategy. It makes life super simple. All of a sudden, you're only worrying about one deployment, and you're not killing yourself wondering, "Hey, this one thing's not reporting the metrics I need," only to realize, "Oh, we forgot to deploy it." We deploy once, and we change our IAM policies. And when you change your IAM policies, all of a sudden it just works in a new VPC. So, onboarding a new cloud is really simple. You can start seeing how we're resolving the issues we have on the infra team; we're making ourselves a little bit more efficient.
Flow Diagram
How does it work? It's really simple. Honestly, I'm here to share a journey with you more than how it actually works; any of y'all could build this yourselves. There's not much going on here. Essentially, a PR comes into GitHub and GitHub sends off a webhook. Does it have consensus, Heimdall, yes or no? If not, fire the alarm bells: someone merged something to master without getting approval, something's wrong. Comment back on the PR. If yes, then for each environment: take out a lock in that environment, go try to plan and apply, and if anything fails, comment back on the PR. That's more or less it, it's really simple. Anyone can do this. And, like I said, we were on fire, so we needed simple solutions.
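In pseudocode-ish Ruby, the whole flow fits on a slide. Every helper here (heimdall_consensus?, with_environment_lock, checkout_approved_sha, plan_and_apply, comment_on_pr, page_security_team) is a hypothetical stand-in, not the real Terraform Earth internals:

```ruby
# Hypothetical sketch of the webhook handler flow described above.
# All helper names are stand-ins for the real pieces.
ENVIRONMENTS = %w[development staging production].freeze

def handle_push(webhook)
  sha = webhook.fetch(:head_sha)

  unless heimdall_consensus?(sha)
    page_security_team(sha)   # alarm bells: master changed without approval
    comment_on_pr(webhook, "No consensus recorded for #{sha}; refusing to apply.")
    return
  end

  ENVIRONMENTS.each do |env|
    with_environment_lock(env) do      # one apply at a time per environment
      checkout_approved_sha(sha)       # the approved SHA, never a branch tip
      result = plan_and_apply(env)
      comment_on_pr(webhook, result.summary) unless result.success?
    end
  end
end
```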
There's some nuance. Why take out locks? Well, you're probably fine not doing it, but you don't want to mess up. Concurrent changes can stack up sometimes. It's very rare that people operate on the same resource, but let's say you want to move from one subnet to another. Well, first you've got to move to a transitional subnet, then you've got to move to the one after. And let's say both those PRs come in at the same time. Well, guess what? You might not be where you thought you were, and then production's down. What if you want to make two changes to a database and they have to happen in a particular order? If you do it wrong, you might end up in a bad place. I recommend locking. You could probably get away without it, but it's nice to have.
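One simple way to implement that with_environment_lock helper from the sketch above is a conditional write, for example against a DynamoDB table; the table and attribute names here are illustrative:

```ruby
# One possible per-environment lock: a conditional write to DynamoDB.
# Table and attribute names are illustrative.
require 'aws-sdk-dynamodb'

DDB = Aws::DynamoDB::Client.new(region: 'us-east-1')

def with_environment_lock(environment)
  # Fails (raises ConditionalCheckFailedException) if the lock already exists.
  DDB.put_item(
    table_name: 'terraform-earth-locks',
    item: { 'environment' => environment, 'locked_at' => Time.now.to_i },
    condition_expression: 'attribute_not_exists(#env)',
    expression_attribute_names: { '#env' => 'environment' }
  )
  begin
    yield
  ensure
    # Only release the lock we actually acquired.
    DDB.delete_item(
      table_name: 'terraform-earth-locks',
      key: { 'environment' => environment }
    )
  end
end
```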
The second is: why commit SHAs? If any of y'all are familiar with GitHub, most people push to master, they push to the branch. Why do we check out a commit SHA? Well, this is the real crux here. When you're building an automated system like this, and you're giving a system, essentially, root access over all your clouds, you have to do things in a very correct way. In our case, Heimdall records consensus, and we trust master because master has consensus. But what if someone's tampering with it? Well, Heimdall's response says, "This SHA has consensus, I trust this code. Everything that's here is safe, you can trust it."
It's really important to check out your commit SHAs. We're worried about timing attacks, specifically. Sometimes applying resources can take a while, and you merge your PR with all your safe code and everything's good. And then someone force pushes to master while it's processing it. And we clone master and now someone stole all our bitcoin. Not quite that simple, but you get the idea.
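Concretely, the checkout step pins the working tree to the approved SHA and verifies it, rather than trusting whatever master happens to point at. A small sketch:

```ruby
# Sketch: clone, check out exactly the SHA Heimdall approved, and verify
# that's what we ended up on before planning anything.
require 'open3'

def checkout_approved_sha(repo_url, sha, dir)
  system('git', 'clone', '--quiet', repo_url, dir) or raise 'clone failed'
  Dir.chdir(dir) do
    system('git', 'checkout', '--quiet', sha) or raise 'checkout failed'
    head, = Open3.capture2('git', 'rev-parse', 'HEAD')
    raise "expected #{sha}, got #{head.strip}" unless head.strip == sha
  end
end
```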
The last part is failure. Everyone loves failure. Every system works forever, until it fails. Here's where it starts breaking down. Infrastructure takes a lot of manual touching. You want to make sure everything's correct. At the end of the day, you want to give white-glove service to everything, and that's not always possible. You have to be a little bit careful here. So, we have some options. We can just retry the apply. But what if that retry changes state? What if your retry is not idempotent? What if you just broke everything, or you added too many tags to something and now your deploy system's not working? That's a little bit scary.
To make matters worse, AWS has rate limits. I don't know if anyone's ever been burned by this, but there's literally no amount of money you can pay them to adjust those rate limits. It's incredibly frustrating. One day you will have to go multi-VPC, sorry. AWS has failures. Everyone remembers the last S3 outage, or the last EC2 outage, or any time the internet goes down; it was probably something like that. We could queue and retry, take our classic software development tools. Again, you have the same problems there if your changes aren't idempotent, and you also have a lack of visibility into what's going on.
This is a system no one interacts with; it's just a PR bot. You could manually replay the webhook, but that really sucks because you're giving everyone administrative access to your PRs, which is not so much fun. You could open it up to the world and let people interact with it; security is not very happy about that. No one likes this, and there aren't great solutions. Do you have any answers? Please, please let me know. I would love to make this a better system.
There is also a whole host of other gotchas here. If the system you build plans the changes too, you have to worry about incoming pull requests that are malicious. What if someone adds code that changes your AWS resources, because the plan is executing in a privileged context? So you've got to split everything up. The details are complicated, but the system itself is simple.
Lastly, you need to be able to test your code. I think most infrastructure kind of suffers from this problem of, "Hey, we have a production system but we don't have a development system." The common adage at Coinbase is, "Your development is our production," because devs need to deploy, and that's a production system for us. We try hard. It's not always possible, but build a dev logging system. You want staging, you want to be able to test changes. Same here: make a staging environment. In our case, we allow a staging deploy to actually create an S3 bucket in dev. That's it. That's all it's allowed to do, just to test the flow and make sure everything works. We have a Lambda that periodically goes and cleans up that S3 bucket, so the next person that comes into staging is not left with a dirty environment. The testing procedure kind of sucks, but you know, it works. This is a system that other developers can use.
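The cleanup Lambda is basically "empty the staging test bucket so the next run starts clean." A rough sketch with an illustrative bucket name:

```ruby
# Rough sketch of the staging cleanup job: empty the test bucket so the next
# staging run starts clean. Bucket name is illustrative.
require 'aws-sdk-s3'

BUCKET = 'terraform-earth-staging-test'

def handler(event:, context:)
  s3 = Aws::S3::Client.new

  # Delete everything the staging run created, one listing page at a time.
  loop do
    contents = s3.list_objects_v2(bucket: BUCKET).contents
    break if contents.empty?

    s3.delete_objects(
      bucket: BUCKET,
      delete: { objects: contents.map { |obj| { key: obj.key } } }
    )
  end
end
```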
Great, we've solved the problem. I think coming out of this, it was a really big win for the team. Everything moved a lot quicker. Infra, in general, was empowered to make changes. We could more broadly help the rest of the organization. We basically gave ourselves the ability to assume root privileges with the power of quorum, and this helped speed us up in a lot of ways. It took pressure off our secondaries. It allowed us to launch more business units. It allowed us to move faster. But in practice, I think this looks a lot like a backed-up sink, your classic queuing example: you've unblocked one area, and then something else goes down. Your database has now crashed.
Team Scaling
So, if we were to look at the headcount we needed to support a system before this, it might look like this. Most people are familiar with the 20%, 30% rule with most organizations. You want 20% to 30% of your organization to be infrastructure. They take care of fundamental needs across the organization, and as it grows, you just need more and more headcount to support this. We were at a really high ratio. We were severely understaffed and we thought we needed way more headcount. So, we made Terraform Earth. And after that, we maybe did this.
We improved our coefficient. But in turn, we were moving faster, and now devs knew they could rely on us more, and they knew they could ask us for more stuff, and that we could do more stuff, and that every single service someone wanted to launch, we could do, because we have this awesome Terraform Earth. In practice, this put us back at square one. We were just as on fire as we used to be, and we had to go figure out how to get out of that again.
What we'd really like is something like this, where we're scaling sub-linearly with the rest of the organization. As we add a thousand engineers, we don't need 200 infrastructure engineers. I can't actually imagine my team as 200 infrastructure engineers; it's a lot of people. We're just under 20 now, and that's weird for me. I think I'd like to keep it that way; the tight-knit cohesion is always nice.
Resource Configuration Today
Let's talk about some future, aspirational steps. We're still in the hole. This is something we still have to solve; this is by no means a done battle. How does it look? We basically have the same system we had before; replace GitHub with a laptop, and it's more or less the same thing. Every time a change comes in, an engineer might still put up the PR themselves because they get more frequent feedback, but we're still going to have to code review it. We're still going to have to make changes. It's a confusing system.
Our repository that configures all our resources is complicated at best. There's a lot of room for improvement there, as many of you are familiar with. Business use cases change. You accrue all this cruft. It's just like a code base where you had the best intentions when you started, and everything was clean, and everything looked the same way. And then the real world struck: you have to support an outage, or you didn't right-size everything, or you have a database that is way over-scaled for a temporary spike. You have all these weird ghosts in the corners of your code base.
Of course, we have to be the custodians of this. We have to protect the devs from this. I think that's really the mindset we have to get out of, the idea of protecting your devs. It's really easy in this industry to say, "Infrastructure is mystical. Infrastructure is something I don't understand. I just do feature development." I would say that's a bad attitude. It's something we have to divorce ourselves from. We have to demystify it. We have to hand it back to devs. Everyone should be able to make these changes. If anyone's familiar with Kubernetes, a lot of it is founded on that idea. It's simple to use. It uses the same mindset you have when you build systems: "Hey, there's a load balancer, I've got a database that does this thing, and I have an auto-scaling group." Distributed systems, they're great.
Ownership
So, how do we get there? Well, this is kind of what our process looks like today. We've got infrastructure saying, "Hey, you've got a new service? You want to go build something? Want to take a stab at the PR for your resources?" "Sure, I'll take a stab at it." Then we end up with a ton of configuration, and they think they know what's going on. Some have a better idea than others, and there are some better examples than others, of course; not all code bases are made equal. We'll take it, we'll try to manipulate it. We don't actually understand what they're doing. There's no way we can be abreast of every single service that's coming up at Coinbase.
We're rapidly approaching a service-oriented architecture. And the second you get to the hundreds range, it's a lost cause. You've got to do better than just trying to know everything, and talking to that one person who knows everything and you're like," Hey, can you help me debug what's going on over here?" So, they get it, and then they deploy their code to it. In their minds, they're deploying to a black box. And we don't know what their code is doing, but inevitably something breaks because things always break.
They come to us and they say, "Hey, my deploy failed." Or, "Hey, I'm getting a crash loop for some reason." Or, "Hey, my service isn't coming up." For us, it can be the same song and dance. "Oh, are you passing your health checks?" "What's a health check?" Or, "I didn't know I needed a health check." Or, "Hey, did you remember to reset your database password?" Or, "How do I get my database endpoint?" Or, "How do I connect to my Redis cache?" Or, "How do I do all these things that infrastructure ...?" We know; we've internalized these things, we've dealt with them. But it's really easy to lose that perspective and forget that your developers who are working on features might not know that, and you might not have built the systems they need to interact with these things. So, we've got to do something better.
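Most of that back and forth starts with the health check, which for the developer boils down to exposing an endpoint the load balancer can poll. A minimal Rack sketch, with an illustrative path and no real dependency checks:

```ruby
# config.ru - a minimal health-check endpoint a load balancer can poll.
# The path is illustrative; a real service would also verify its dependencies.
run lambda { |env|
  if env['PATH_INFO'] == '/health'
    [200, { 'content-type' => 'text/plain' }, ['OK']]
  else
    [404, { 'content-type' => 'text/plain' }, ['Not Found']]
  end
}
```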
This is a sneak peek, or I guess not really a sneak peek; this is what it looks like today. You build a project. We are not a monorepo; in fact, each of our services lives in its own GitHub repository. You say, "Hey, here's my resource definition for Heimdall, our quorum bot for code reviews, and it's deployed to these AWS accounts." "Okay, cool. I've got a service with an ELB." If you're on infrastructure, you know what that means: I've got a security group, I can go to our deploy tool, Codeflow, and put it in there and everything lines up, and it has a load balancer, an ELB with the base configuration.
You can also pass in the kitchen sink of options. What's your TCP timeout? What's your health check? What ports are you using? How long are you willing to wait on open connections? What's your queue size? And, "Oh, I want an RDS instance. I'm going to make it a db.m3.xlarge. I'm going to open it up to these security groups and these subnets, and I'm going to change the name of it. And I want an SQS queue, I want IAM permissions," and all these things that we think about because we're in infrastructure, but our developers don't really care about. And honestly, at the end of the day, does it really matter?
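In the style of the earlier sketch, a service definition in the current repository ends up enumerating AWS-level knobs like these; this is a caricature with illustrative names and values, not our actual DSL:

```ruby
# Caricature of the "kitchen sink" style: every AWS-level knob is exposed.
# Names and values are illustrative, not our real resource repository.
project.resource("aws_elb", "heimdall") {
  idle_timeout                60
  health_check_target         "HTTP:8080/health"
  health_check_interval       30
  connection_draining_timeout 300
}

project.resource("aws_db_instance", "heimdall") {
  instance_class    "db.m3.xlarge"
  allocated_storage 100
  engine            "postgres"
}

project.resource("aws_sqs_queue", "heimdall_events") {
  visibility_timeout_seconds 120
}
```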
I would argue no. We should build a tool that makes sense, that uses their mental model, that uses the language they know, that uses simple terms like, "Hey, I have a CPU-intensive workload. I have a memory-intensive workload. I'm write-heavy, I'm read-heavy. I need a ton of servers. I don't need a database. I just want to talk from one service to another. How do I do that?" So, make your language easy.
The list goes on. We made maybe a good choice at the time, but a bad choice for today, in deciding to describe our infrastructure as code instead of configuration. Code comes with all its problems. No one writes code the same way. Everyone can walk into a code base and say, "Hey, I'm going to go reorganize this. This makes more sense. Are we using tabs or spaces? What's our indentation level? Oh, you need a space before your bracket." Or, "You should list out all your options ahead of time and pass in the hash and keep all these temporary variables."
Like I said earlier, it doesn't match their mental model. We have almost zero opinion in this code base. It's the kitchen sink of options. You can do literally anything you want. At the time, it made sense. We were building all these different business units. They had to do different things. We had to support every single option. And that's just not a good system. It's not a good abstraction. You're just basically providing someone the ability to change any variable they want. So, we have to lock it down a little bit.
Resource Configuration Tomorrow
This is closer to the sneak peek I wanted to show you. It's by no means set in stone, and it's something we're heavily debating today, because it's hard to make this decision. This is a big undertaking, and a big-bang rewrite is always really scary and generally a mistake. We're a little bit concerned about it, but we want to try to do things the right way. We want to hand this back to our developers. Developers interact using mental models they already know. We've already taught them one system with the deployment tools; why make them learn more?
How can we do that? Well, we stop talking about things as RDS instances. It's a database: I want Postgres, I want Redis, I want a queue. It's no longer SQS or Amazon MQ; it's just a queue that you can talk to, and then we build the library around it. And it's no longer an ELB; it's just load balanced. Maybe you have cool target groups, maybe you have other stuff. And when you want people to talk to you, you just refer to the service; people know common service names. "Hey, this service should be able to talk to me." Great. Now you, the developer, are in control of which services are your subscribers. If you've ever been a service owner in a microservices organization, the worst thing is when people just decide to start using your service without telling you, and then subject you to this horrible workload that you never intended in the first place, because every API is open. Well, great, now you're a code owner for your own infrastructure, and it makes sense to you, and you're like, "Oh, let's have a talk before you start using our service. Maybe I can give you some ground rules: this load, this rate limiting, this kind of stuff."
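Very roughly, and purely as a mock-up rather than a settled format, the developer-facing vocabulary could look something like this:

```ruby
# Mock-up of the developer-facing vocabulary; not a settled or real format.
service "heimdall" do
  load_balanced true          # not "an ELB", just "load balanced"

  database engine: "postgres" # not "an RDS db.m3.xlarge in these subnets"
  cache    engine: "redis"
  queue    "events"           # not "SQS vs. Amazon MQ", just a queue

  # The service owner decides who may talk to them.
  allow_from "coinbase-web", "exchange-api"
end
```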
We think if we make these changes, and it's still an open discussion, but we think if we make these changes, we can go from something like this, where we're provisioning, we understand what's going on, we're using arcane magic to provision it, and our developers have no idea what's going on and are just deploying to some black box, to this, where the developer provisions and they deploy their code. They know they have a load balancer in front of it, because that's a common tool developers are taught. We know it's a load-balanced distributed system. We know we need auto-scaling; when we're under CPU-intensive, heavy workloads, we know we need more servers. I've already built my stateless application and I obey all the 12 factors, so I'm good. I just know I want a Postgres instance, and I'm going to access it in this way. It should just be wired up, and it just works, and I don't have to configure everything, and everything's great.
There's a little bit more to it. How do you hand it off? We need to have permissions. Do we break everything up? Where infrastructure really wants to move is not necessarily being the custodian of all of the infrastructure, because there's a lot of it, and it's going to continue to grow. Instead, we want to own good patterns. We are building a reliable system that they can trust. We're building templates and ways to classify our infrastructure, but using language that the developers are familiar with. They should own their infrastructure. I'm not on call for coinbase.com; why should I own their infrastructure? Why should I be operating it? I don't want to be woken up at 2:00 in the morning for a coinbase.com outage. I should give them the tools they need to terminate their instances because they stayed around too long, or right-size their database, or add a new service.
Design Considerations
We think with this, everything's going to look a little bit better. Of course, there are a lot of considerations here. Do you have a monorepo? Do you have multiple repos? In our case, we're leaning towards a monorepo; that's where we are today. It makes sense if you have a lot of services that need to interact with each other. My usual adage: if you need more than one PR to make a change, you probably have the wrong design boundary. In this case, I want to launch a new service and open up access to it. That should happen in one place, and a monorepo kind of makes sense there.
But you've got a trade-off: all of a sudden, locking it down becomes way more complicated. GitHub code owners are a great new feature, but they're complicated to use. Do you want to build any automated workflows? We started with Terraform Earth, but how can you do even better? How can you groom your backlog? How can you make sure PRs don't linger? And if they are lingering, maybe you just need to close them. I think everyone's really afraid of losing code, or losing Jira tickets, and no one ever prunes their backlog. You've just got to let go a little bit. There's a lot of complication in the world, so you've just got to let go.
Then the last part is exposing this as a public service. We're configuration now; we can describe this. We're no longer code. That means maybe your deploy systems don't need configuration either. Maybe your deploy systems can just hook into this and use it to describe everything. Make it all just a little bit simpler. So, have we solved everything? Not quite. We fixed management, and we fixed provisioning, and we made it a faster operation, so not everyone has to suffer and struggle through it. And now everyone can reason about their infrastructure just a little bit better. We're doing pretty well as a result. But the big thing, the big elephant in the room that we still haven't solved: how do you handle operations?
Account Stewardship Today
When I have an outage, how do I terminate an instance? How do I easily scale an auto-scaling group without redeploying it? How do I go through a database migration and change its password, or do all those things? Well, it comes back to the magnification problem: infra is bottlenecking everyone else. Infrastructure knows the resources, so they have elevated permissions. Everyone queues up all their requests and sends them to us. This, again, sucks, because developers are kept at arm's length from the infrastructure they're developing on. You've made it easy for them to provision, but nothing's really improved here. We still have this big operational ask, and our infra team is still turning into an operations team.
Before long, it's the same old song and dance. Infra's on fire. We're handling most of the operations. What do you do? You've got to get out of this. I'm sure you're all familiar with the multi-VPC strategy. Most companies that have the fortune to grow and get big, develop lots of cool products, and be successful end up going multi-VPC at some point or another. A lot of this talk is about taking one big workload on a small set of cores and parallelizing it. It's the same story here. The infra team focuses on what sits in between your VPCs. Let them focus on network connectivity, the shape of a VPC, the classification so we can reason about it, the compliance, the stuff that the infra team wants to focus on.
But hand back the easy stuff, hand back your operations, hand back the capability. Concentrating privileges doesn't actually make you more secure; I would argue it makes you less secure, because think about what happens in a dangerous moment: your infra team usually sits together, and something horrible could happen. So, we want to build multiple VPCs with a dev-prod setup where everyone's separated, where every engineer has greater access but in a limited context, in a way that we can trust what they can do, but they can only edit their own resources. They can't edit someone else's resources. And then you lock down the scary privileges. Don't let them change the VPC, but let them terminate their instances, let them change their auto-scaling group, let them modify their RDS instance. Sure, maybe you lock away your data, but you don't have to lock away everything.
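That "hand back the day-to-day, lock away the scary stuff" idea maps naturally onto a scoped IAM policy per team. Here's a sketch with the Ruby AWS SDK; the role name, team tag, and the specific action lists are illustrative assumptions, not our actual policies:

```ruby
# Sketch: a per-team policy that allows day-to-day operations on the team's
# own tagged resources but denies VPC-level changes. Names are illustrative.
require 'aws-sdk-iam'
require 'json'

policy = {
  Version: '2012-10-17',
  Statement: [
    {
      Sid: 'TeamOps',
      Effect: 'Allow',
      Action: %w[ec2:TerminateInstances autoscaling:UpdateAutoScalingGroup rds:ModifyDBInstance],
      Resource: '*',
      Condition: { StringEquals: { 'aws:ResourceTag/team' => 'payments' } }
    },
    {
      Sid: 'NoNetworkChanges',
      Effect: 'Deny',
      Action: %w[ec2:DeleteVpc ec2:ModifyVpcAttribute ec2:CreateRoute ec2:DeleteSubnet],
      Resource: '*'
    }
  ]
}

Aws::IAM::Client.new.put_role_policy(
  role_name: 'payments-team-ops',
  policy_name: 'team-scoped-operations',
  policy_document: JSON.generate(policy)
)
```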
And then developers can build scripts for their sudo access. Each team now has a sudo member, and each team has a set of infra ops. Every team now looks like our infrastructure team, where you have a group of people with elevated permissions, you have a group of people who have general permissions and can build scripts, and then you can double-check those. Honestly, we don't have a good solution with quorum on how to solve this, but going multi-VPC is a pretty good approach.
Complications
It's not all easy. Multi-VPC comes with the same problems that microservices come with. Anyone who's moved to a microservice architecture knows that finding the right domain boundaries is probably impossible. All of a sudden, you have network latency between all your services, even more than you used to have, because now you're in different VPCs. So, great, add 10 milliseconds, which is way worse. How much access is enough access is a pretty open question if we hand it out to people. It's difficult. I don't think there are a lot of great answers here. But, yes, try it out. Don't do it too early. Make sure you really need it. We were on fire, we're still on fire, and it's something we're working on improving. Ask yourself if you really need it first.
I think if we do these things: if we go multi-VPC, if we fix the language we use, if we fix our resources, if, instead of focusing on everything, we own templates and let those drive everything else, then we can go from this kind of situation, where we need 50% of the org's engineering headcount to service everyone, to something like this, where, as the org grows, we don't have to. We can stay a small team. We can still be a two-pizza team, or go out to dinner and have a nice meal or something like that. The dev org can keep growing, and we're not going to get asked all the time, "Hey, my service didn't deploy." Or, "Hey, my service didn't come up." Or, "My service crashed and it can't come up."
The Future
Instead, teams have the tools they need to solve their problems themselves. Don't try to be the arbiter for everything. I think this is a common infra mistake: "I have to own it because I know it." Instead, be better. Build good tools for everyone else. Build the platforms, teach. If we do this, we can go from this situation where everyone's on fire, where a few people taking vacation stops the org and we're no longer agile, no longer a startup, not having fun; where developers are angry because they can't get anything out to production; where they can't launch a new microservice because there are hundreds of them and only a few people who can launch them, and you need to go talk to them; and where, if you want to debug something, you've got to go talk to Steve over there, because they know what's going on, or when a queue backs up, they know what to do. And we can just become developers again.
Everyone's together. Why separate it? Everyone is good at what they do. We hire talented people, so trust them, let them build good things. We can all be a big, happy developer family. Hopefully, we can scale too. We can go from 100 engineers, I think it's about 150 today, to 1,000. I truly think this system can continue on this way; I have a lot of faith in it. We'll see when we actually get there, but I think we can take vacation, we can grow, we can go have fun, we can go sip Mai Tais on the beach, turn off our laptops, close Slack, and not worry about it when we go on vacation, because we've parallelized the company. Everyone can build everything. Everyone has the permissions they need to operate. We've built tools that everyone understands. They understand our deploy infrastructure, they understand how to provision new hardware, and the world doesn't stop when you're gone, which is great because everyone loves a vacation.
This is our talk on our journey towards secure infrastructure for developers, or self-service infrastructure, or whatever you want to call it. But maybe I should have called it infrastructure with vacation. Thank you, everyone.