Transcript
Vasani: My name is Soam Vasani. I work in the Developer Environments team at Stripe. I'm going to talk about building and scaling developer environments. Stripe is building payments infrastructure for the internet. Millions of businesses worldwide use Stripe to accept payments, do payouts, and manage their business. Last year, more than half of UK adults bought something on Stripe. In terms of engineering, the software powering all of this is built by on the order of thousands of engineers who are also distributed across the world. This is millions of lines of code.
What Do We Aim For in Our Dev Environment?
Something to think about is that Stripe's software is extremely critical to its users: because we process their payments, everything that we ship is very business critical to them. In developer productivity, we ask how that translates into requirements for our developer tools. One thing we talk about is accuracy. This is the DevOps idea of shifting left: it's much better to find a problem before it hits production than after. If there's a code change that would break something in production, how likely are we to catch it in the software lifecycle in general, and in the dev environment specifically? How well is a potential bug caught in dev, and vice versa, how well do problems caught in dev represent the problems that you would face in production? That's accuracy. We care a whole lot about fast feedback. This has all kinds of advantages. Of course, having fast feedback means that you go through the iterations of making code changes and testing them faster, and therefore you get faster at shipping features. It also means that, for example, developers stay in their flow state as they are iterating through their code changes. Fast feedback also means that you're faster at fixing problems, so if there's any breakage or incident, the developer environment shouldn't be getting in the way of you fixing that. Finally, a maybe less obvious point is that we want our tools to help our users become experts at the software they're working with. This is both because Stripe is growing, and there are lots of new people, but also because the set of software is quite large, and you might be working on something unfamiliar on any given day. How do the tools help you better comprehend the software that you're working with? This is the idea of building expertise.
These are some big goals. I want to share a quick story about small beginnings. Way back when Stripe was much smaller, and there wasn't much of a dev productivity organization, the dev environment consisted of something very low tech. Essentially, there was a set of EC2 instances that people would SSH into. They would find some way to get their code changes in there, start the services that they need, do whatever builds they need to, and test everything manually. These were pretty big instances, so they were shared across a lot of people in the company. We coordinated using Slack channels. There was a Slack channel per box, and you went to the box number five Slack channel and said, "I want to test the API server. Is anyone else using it?" Someone else said, "No, hold on, wait for some time," and so on. It was very low tech. Obviously, Stripe grew out of it very quickly. I just want to emphasize that when you're small, it's perfectly good and fine to start very low tech. If it solves the problems that you have, then do it, and then think about building the next things.
Sorbet
I'm going to fast forward a little bit from that story to Sorbet. Stripe uses a lot of Ruby; the majority of Stripe source code is Ruby. In 2019, Stripe open sourced a type checker for Ruby called Sorbet. I'm not going to go into a whole lot of depth on Sorbet because there are other talks about it; go to sorbet.org and check them out if you're interested. It's also open source on GitHub. Since we're going to talk a bit about dev environments here, the other bits of Sorbet are the interesting ones. Sorbet isn't just a type checker, it also has a server mode. This is a language server that essentially makes editors smart about the Ruby source code that you're working on. It gives you features like jump to definition and find references, and it can even do simple refactorings for you. We're going to go a little bit into this.
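To make the language-server idea a bit more concrete, here is a minimal sketch (not Stripe's code) of the kind of JSON-RPC message an editor sends when you ask "where is this method defined?". The framing helper, file URI, and cursor position are illustrative assumptions; in practice the editor also performs the LSP initialize handshake first, and Sorbet's server mode is typically launched with something like `srb tc --lsp`.

```python
import json

def frame(message: dict) -> bytes:
    """Frame a JSON-RPC message with the LSP Content-Length header."""
    body = json.dumps(message).encode("utf-8")
    return b"Content-Length: %d\r\n\r\n" % len(body) + body

# What an editor sends to ask "where is this method defined?".
# The file URI and cursor position are made-up examples.
definition_request = {
    "jsonrpc": "2.0",
    "id": 42,
    "method": "textDocument/definition",
    "params": {
        "textDocument": {"uri": "file:///home/dev/stripe/lib/charges.rb"},
        "position": {"line": 117, "character": 10},  # zero-based line/column
    },
}

# The server replies with a location the editor can jump to, roughly:
#   {"jsonrpc": "2.0", "id": 42,
#    "result": [{"uri": "file:///home/dev/stripe/lib/charge_service.rb",
#                "range": {"start": {...}, "end": {...}}}]}
print(frame(definition_request).decode("utf-8"))
```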
Dev Environment Infrastructure
To recap, we talked a little bit about Stripe, and what we're looking for abstractly from a dev environment. We looked at Sorbet. Really, what we want to spend the bulk of the time talking about is one layer below that: the infrastructure that runs Sorbet, editors, and the other tools that are needed. Essentially, how do we scale up that inner loop of development? I'm going to start this progression with the infrastructure that existed as of about a year ago, and then talk about our evolution through the year. Every developer gets an EC2 instance assigned to them. This instance works in concert with their laptop. We call this EC2 instance a devbox. Roughly speaking, the partitioning is as follows. The laptop contains the source code repo and the editor, plus of course the browser and whatever CLIs or client tools they might be using. The devbox runs the Sorbet server we just spoke about. It has a full copy of the source code that's rsynced from the laptop. You can start various backend services on this devbox. It runs a database instance that's preloaded with some test data. Finally, there is infrastructure to manage all of these. There's one laptop and one devbox per developer, and there's infrastructure to hand these out, manage their lifecycle, and so on.
Laptop + EC2 Instance
Let's talk in a little more detail about how the laptop coordinates with the devbox. The left side is the laptop, the right side is the devbox. Let's start from the bottom of the picture and go upwards. For the source code on the laptop, you do a regular Git clone to get your repo, and you start the client side of our tools. This finds the devbox, makes a connection to it, and does an rsync. When devboxes are new, they're brought up with a copy of the source code, so you generally don't have to rsync a whole lot, just recent changes. On the devbox, there's a dev service runner, which looks at the source code and runs the services that you've configured that you want running. Coming back to the left side, the editor makes changes to the source code. We have a tool continuously watching these files on the laptop side and calling rsync. Again, there's an rsync. The service runner on the devbox side sees the change to the files and restarts services as needed. Perhaps there's a build in the middle; it depends on the exact languages and frameworks being used. To close the loop, the user can use whatever tools they use to hit the APIs of those backend services. Maybe they're using a test copy of the Stripe dashboard, which makes API requests over the web into this devbox. Or they may be using something like Postman, which makes HTTP requests, or they're using some kind of gRPC client, and so on. Of course, the Sorbet server runs on the devbox, and there's a protocol that the editor can use to talk to the Sorbet server and ask it questions like, where is such and such method defined? Then the Sorbet server sends the response back and the editor can navigate to that spot. That's a simplified picture of the laptop and devbox coordination.
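Stripe's actual tooling is built on Watchman and rsync; as a rough illustration of the watch-then-sync loop described above, here is a minimal sketch using the Python watchdog library as a stand-in for the watcher. The paths, devbox hostname, and debounce interval are made up.

```python
import subprocess
import time

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

# Hypothetical paths and host; Stripe's real tool uses Watchman and its own config.
LOCAL_REPO = "/Users/dev/stripe"
REMOTE = "devbox-1234:/home/dev/stripe"

def sync() -> None:
    # Mirror the working tree, excluding Git metadata; --delete keeps removals in sync.
    subprocess.run(
        ["rsync", "-az", "--delete", "--exclude", ".git/",
         LOCAL_REPO + "/", REMOTE + "/"],
        check=True,
    )

class SyncOnChange(FileSystemEventHandler):
    def __init__(self) -> None:
        self.dirty = False

    def on_any_event(self, event) -> None:
        # Debounce: just mark dirty here, the main loop does the actual sync.
        self.dirty = True

if __name__ == "__main__":
    handler = SyncOnChange()
    observer = Observer()
    observer.schedule(handler, LOCAL_REPO, recursive=True)
    observer.start()
    sync()  # Initial sync so the devbox starts from the current tree.
    try:
        while True:
            time.sleep(0.5)
            if handler.dirty:
                handler.dirty = False
                sync()
    finally:
        observer.stop()
        observer.join()
```

On the devbox side, the service runner plays the mirror role: it watches the synced tree and restarts the affected services, possibly after a build.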
What Worked Well
That's our dev environments as of a year or so ago, and largely a lot of that picture has remained, working well. Let's look at what worked well. We talked about accuracy, this idea of how well dev mimics production. Devboxes are much more accurate in predicting production problems than laptops would be, because the operating system is the same. Also, because there are all these other frameworks that our services depend on, like configuration, feature flags, secrets, and so on, we can make devboxes mimic those systems much more easily than laptops. In terms of feedback speed, this system works pretty well. There's lots of available CPU and memory, so Sorbet, for example, can consume it when it needs to. The services that need to be run also have lots of CPU and memory available to them, so they can respond quickly and they can restart quickly if they need to. One thing that's really nice is that these boxes are stateful, so you can, at the end of the day, disconnect your laptop, switch it off, walk away. The next morning when you reconnect, everything is exactly where you left it. The test data is in the same shape. The services that you started are still running, and so on. Finally, these are all isolated from each other, so if I break something horribly on my devbox, it doesn't affect anybody else's devbox. That's great, because you want to be able to work without fear of disrupting something.
What Did Not Work Well
What didn't work well? If you think about this model of syncing source code, it works really well when there are small changes. Think about what happens when you do a Git checkout of a branch, or worse, you're working on two branches in the same day, and you're making changes in both of them, so you switch repeatedly between these branches. That triggers the worst case behavior of sync, because you have to sync a whole lot of files across the network to get up and running. Because there's a File Watcher pointed at the source code, when a big branch is synced over, that File Watcher notices a lot of code changes, and it restarts a whole lot of services. You take a pretty big interruption if you're multitasking. People who like to multitask a lot and have three or four branches that they work on every day were pretty unhappy with these interruptions.
Identifying a Goal
What did we do about it? We looked at this problem and said, we do want to keep devboxes. We're getting a lot out of them, but the usage model is not quite right. The old model was that one developer gets one devbox, devboxes are long-lived, and the developer maintains them. They rsync source code from the laptop. We decided we want to switch to a new model, one that's much more ephemeral. A devbox is per branch instead of per developer, so a developer can work on as many as they want at the same time. We want these devboxes to be short-lived, tied more to the lifecycle of a branch. We wanted to eliminate the idea of rsyncing completely, because, as we'll see in a bit, we can just have editors edit code on devboxes directly. Essentially, the idea of remote dev. The old model is one devbox per dev, long-lived, rsynced from the laptop. The new model is much more ephemeral, tied to branches, and eliminates syncing.
Towards Per-Branch, Ephemeral, Remote
That's a pretty big, complex change. How do we get there? This is a problem that any infrastructure software team faces when they have a big platform that lots of people are already using. It's one thing to come up with ambitious goals, but it's another to be able to iterate and deliver value while you're making progress towards those goals, so we needed a concrete roadmap.
1 of 4: Multiple Devboxes per Developer
I'm going to talk a little bit about how we divided this plan into four steps, how each of those delivered a little bit of value to users, and how we finally ended up with remote dev. The first one is to simply replicate the picture and allow developers to have more than one devbox. This isn't a big change in the usage model: you still have a long-lived devbox, and it isn't per branch. The user has to decide which devbox they'll use for which branch. In fact, they have multiple copies, multiple working trees of source code on their laptop, because rsync cares about things like timestamps on the laptop, so you have a separate checkout. Getting this set up correctly is a bit tricky, but once you do, you can have two editor windows, simply command-tab between them, and have two concurrent dev experiences. This was a pretty good win. What it did for our long-term goal was that it broke that one-to-one constraint between developers and devboxes. We had to change routing here and there; there were infrastructure things that we had to change to make this work. It shipped. It works. It's used by the power users of devboxes within Stripe.
2 of 4: Preview Devboxes
Next, we wanted to break the constraint that devboxes must be tied to laptops. Recall that we're trying to go towards this goal where editors running on the laptop can directly edit code on devboxes. We want there to be a way for a devbox to exist without a laptop. One really valuable feature that we built is the ability to build a devbox that follows a pull request. We call these preview devboxes, and we've integrated them into the pull request flow. When you open a pull request, say for some frontend change to the dashboard, you can make a comment that triggers a webhook. This triggers our devbox infrastructure to do a Git pull of the branch in that pull request. That's the bottom of the picture here. The rest of the infrastructure is the same. The service runner looks at the source code and runs the services you've requested. Users can use their browser or other clients to talk to those services. Similarly, as the user pushes changes to that branch on GitHub, we use webhooks from GitHub to call the devbox infrastructure. That causes the devbox to pull the branch again. Essentially, this preview always follows the latest state of the pull request. A reviewer can use this devbox to look at the services as they would be after that pull request is merged. There are a couple of nice advantages to this for users. One is, of course, that they get preview functionality. The other is that they can non-disruptively look at somebody's code review. It's pretty common, especially in the frontend world, for people to check out somebody else's branch while they're reviewing, set it up on their own devbox, click around through the UI, and then say, "Ok, ship it. That interaction looks good to me." With this, they can avoid disrupting their own workflow. It makes reviews much smoother, less disruptive, and ultimately faster. That's preview devboxes.
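The webhook plumbing here is Stripe-internal, but the shape of it is roughly a small handler that reacts to GitHub events by checking the pull request's branch out on the preview devbox, after which the service runner does its usual thing. Below is a hedged sketch using Flask; the endpoint path, repo location, and the choice to key off `pull_request` events (rather than the comment trigger described above, whose payload doesn't carry the head branch) are assumptions made for brevity.

```python
import subprocess

from flask import Flask, request

app = Flask(__name__)
REPO_DIR = "/srv/devbox/stripe"  # hypothetical checkout on the preview devbox

def track_branch(branch: str) -> None:
    # Fetch the PR's head branch and reset the working tree to it; the dev
    # service runner sees the changed files and restarts services as needed.
    subprocess.run(["git", "-C", REPO_DIR, "fetch", "origin", branch], check=True)
    subprocess.run(["git", "-C", REPO_DIR, "checkout", "-B", branch,
                    f"origin/{branch}"], check=True)

@app.route("/github-webhook", methods=["POST"])
def github_webhook():
    # Stripe's flow is triggered by a PR comment; this sketch keys off
    # "pull_request" events, whose payload carries the head branch directly.
    if request.headers.get("X-GitHub-Event") == "pull_request":
        payload = request.get_json(force=True)
        if payload.get("action") in ("opened", "synchronize"):
            track_branch(payload["pull_request"]["head"]["ref"])
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```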
3 of 4: Stateless Devboxes
The third one is very much an infrastructural thing. There isn't a whole lot of change in the user visible model here. We realized that if we are going from a world where one developer has one devbox, to a world where one developer might have five devboxes, we would like to do it without multiplying our cost by that factor, by whatever the average number of devboxes ends up being. We figured that we want to take out the stateful parts of the devbox and put them on a volume, a persistent EBS Volume in this case. This allows us to switch devboxes off. If you look at this picture, this is when a user is actually using it; you can simply switch the instance off but keep the EBS Volume. When the user wants to work with this branch again, they simply switch back on and we reattach the volume to that user's new instance. There's a bit of tracking of volumes and instances and users here that we need to do. What we gain is the ability to give users much more in terms of devboxes, without essentially spending that much more. The EBS Volume adds some cost, but we can also save money by switching devboxes off on weekends and things like that. There are good savings here. There isn't a whole lot of user visible benefit, other than small things like the volume being able to grow independently of instance size: if your test dataset grows very large, we can grow the volume without thinking about what exact instance size we need, and all that. That's stateless devboxes.
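The "switch the instance off, keep the volume" mechanics map fairly directly onto EC2 primitives. Here is a hedged sketch with boto3; the instance and volume IDs and the device name are placeholders, and the real system also has to track which volume belongs to which user and branch and provision replacement instances.

```python
import boto3

ec2 = boto3.client("ec2")

def suspend_devbox(instance_id: str, volume_id: str) -> None:
    """Stop the instance but keep the EBS volume that holds the stateful bits."""
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.detach_volume(VolumeId=volume_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    # At this point the instance itself is disposable and could be terminated.

def resume_devbox(volume_id: str, new_instance_id: str) -> None:
    """Attach the user's volume to a freshly provisioned instance and start it."""
    ec2.attach_volume(VolumeId=volume_id, InstanceId=new_instance_id, Device="/dev/sdf")
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
    ec2.start_instances(InstanceIds=[new_instance_id])

if __name__ == "__main__":
    # Placeholder IDs, purely for illustration.
    suspend_devbox("i-0123456789abcdef0", "vol-0123456789abcdef0")
```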
4 of 4: Remote Dev
Now we come to the part that gets us to the final step, which is remote development. Recall that in the first picture, we had the source code on both sides, the laptop and the devbox side. We used rsync and file watching to make a copy that was always in sync. Here, the idea is that the editor simply edits the files directly on the devbox. It does that generally by using SSH. There's support in editors for doing this. VS Code has pretty good remote dev support. It's a little more complicated than just editing files, because VS Code has extensions. I haven't drawn the Sorbet server in this picture, but there's also a Sorbet extension, which is also open source. That extension runs in something called an extension host, which also runs on the devbox. There's a little more complexity for remote dev than what's shown here. The general idea is that the source of truth for the source code moves from being on the laptop to being on the devbox. This also means that the box has to be able to actually access Git, which is something we had to solve. That's remote dev.
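On the client side, the moving parts can be as small as an SSH config entry that VS Code's Remote-SSH support picks up. The host name, user, and paths below are invented, purely to show the shape; the launch command is roughly what the VS Code CLI accepts.

```
# ~/.ssh/config -- hypothetical devbox entry
Host devbox-1234
    HostName devbox-1234.dev.example.com
    User dev
    ForwardAgent yes   # lets Git on the devbox use your local credentials

# Then open the remote workspace, roughly:
#   code --remote ssh-remote+devbox-1234 /home/dev/stripe
```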
To recap, when we have all of this, we get devboxes that are ephemeral: we can switch them off. The state of the source code remains on the volume, which means it's now cost acceptable to give users lots of devboxes. Because we're not syncing anymore, we don't have to think about the tricky branch switching problem. We didn't optimize it, we just eliminated it. You don't branch switch anymore, you simply spin up a new devbox and delete the one that you're done with. Along the way, we also got people preview devboxes.
Where We Are Right Now
Where are we right now? Multiple devboxes have been in use for a while at Stripe. Because they're a bit tricky to use, they tend to be used by power users, but those users do use them a lot. There's a small contingent of frequent users. Preview devboxes are very popular with frontend and full stack teams, because that review flow is just more useful when you're doing frontend code. Stateless devboxes are shipped and in use. Again, there's not a whole lot of user facing value, but it allows us to keep our budget within limits. Remote dev is at the early stages of a beta. Everyone's very excited about it. It hasn't rolled out completely yet, so we're still in the world of multiple, rsynced, preview devboxes, but it's in progress.
What We're Working on Next
I want to talk a bit about what the dev environment team is working on next, besides finishing remote dev and rolling that out. One thing I didn't cover much is that Stripe also has a shared test environment. This is one environment that the company uses, and that you can deploy software into. It isn't a dev environment; there's no inner loop of development. You have to merge a change and use the deploy tooling. There are ad hoc integrations of the dev environment with this test environment, and we're working on making them much more systematic. The Envoy proxy is used widely throughout Stripe, and we're using the same mechanisms of the service mesh to integrate the dev environment there. We're also working on helping developers configure the production service mesh better, by surfacing config-related errors in the dev environment itself.
Secondly, we're working on improving accuracy. We said that devboxes look a lot like the production environment when you compare them to laptops, but they look quite a bit different from the production environment when you compare them in absolute terms. For example, the resource usage is quite different. The available CPU and memory can depend on what else you're running on the devbox, and so on. We're working on all these ideas, and on making the compute platform of the dev environment more production-like. The third one is related to Stripe scaling as a company, which is that, in general, we're thinking much more of the dev environment as a platform to run services on. How do we empower users to make their services run well on that platform? How can they observe how their services are doing in the dev environment, make them more reliable, and so on? We also encountered the problem that there are small teams that want custom dev experiences, either for a new language or a framework, and so on, and we can't always prioritize them. We want to think about what the extension points of the dev environment are, and whether a team can use those extension points to do just enough work to build a custom dev experience for their users.
Scaling Dev Environments with Stripe
Finally, I want to zoom out and talk about a process that you can follow to make big changes, to execute on ambitious roadmaps in a dev tools team. I've pasted one of my favorite bits of feedback after shipping something. It says this is a great sign that Stripe is quickly addressing the dev productivity feedback. The process I'd like to recommend is a very simple one. You start by listening to your users. Surveys and user interviews are a good, inexpensive thing to do. You can sample your set of users, send them a simple Google form, and that's a good start. You can also interview them on a regular cadence. Something that's much more impactful, I've found, is to actually embed in a team. Find the team that has the most pain in using your tools, either through metrics or by talking to them, and spend a couple of weeks doing their tasks, doing their day-to-day work using your tools. The lessons you learn from that are invaluable. Next, take those lessons and make ambitious plans. Figure out what the really big things are that you could change. If you had infinite time and resources, what would you do? Make optimistic plans. Then, and this is the crucial part, execute incrementally on those plans. Your users want to see change. It's just much better to have a regular cadence of small improvements than it is to have a very irregular, rare cadence of huge improvements. Listen, learn ambitiously, and execute incrementally.
Questions and Answers
Synodinos: You talked about Sorbet, the open source server. All the work that Stripe is doing with the devbox, is some of that also open source? Are you using some other open source tools to put it together?
Vasani: None of that is open source. Unfortunately, a lot of it ties into other Stripe infrastructure bits, so it's really hard to extract out and make open source. I'd love to attempt that someday, though.
Synodinos: For teams that would like to build similar infrastructures, what tools should they be looking at?
Vasani: I talked about Sorbet. That's one of the big reasons we have large CPU and memory requirements for these devboxes. If you're not using a language like that, and your language servers can run locally for users, you can have different infrastructure for running your backend services. We do actually run both containerized and non-containerized services on these machines, and we also talk to stuff outside the machine.
What tools would you use if you wanted to build infra like this? There are some open source tools we're using. We're using rsync and Watchman. These are age-old tools. Somebody else has built a modern rsync alternative, which is more configurable. Tools like that, which help you sync files, make sense if you are still in that file syncing model. If you're on remote dev, and I glossed over this, VS Code is the editor with really good remote dev support. If you are using other IDEs, they don't necessarily have the same support. I know that the IntelliJ IDEs are working on this, but it's still very early for them. You'll have to make sure your IDE supports whatever infra you build for this.
Synodinos: You also talked about incremental execution. Not all attendees necessarily work in organizations with an engineering team as big as Stripe's. Having gone through that evolution, and learning from some of the mistakes that you might have made in the past, what would be the optimal evolution path for an organization that is currently growing and evolving? Is there a Pareto rule, like you can get 80% of the benefit if you really focus on some basic thing? This has come up in a few discussions with attendees coming from smaller companies that didn't necessarily have the budget Stripe would have to build everything. What would be a minimal version?
Vasani: I think some of our oldest infrastructure is the minimal version. To be clear, it took Stripe to a pretty good size. Our team is less than 2 years old; Stripe is much older than that. The developer environment was built on the side by a few people in the dev productivity team, who were busy with various other things. The core idea of syncing a repo and having a quick little service supervisor on the dev side is something you can build incrementally. I'd say that's the minimal version. You end up with some pretty rough edges around branch switching, as we talked about, but you can live with them if you don't want to invest too much in that.
Synodinos: What additions are the power users clamoring for?
Vasani: This is something we've been working on for the last few months: better control over how you run a service mesh, and over what services are in dev. We actually have a shared environment that you can deploy to, but that's deployment. That's slow. It goes through CI, and it's a shared environment. This is our QA, staging, whatever you call it, the non-production environment. You can deploy into that. The feature of that is that it's slower to deploy to, but it's more production-like. The feature of development is that it's really fast. You edit a file, and it's there in seconds. The service restarts quickly. Now users actually have to choose which services to run, because you typically don't want to run every single service in your graph on one machine. You mostly just can't. This story has so far grown incrementally at Stripe, and it's very messy for users to configure. This is what we're working on: a better way of dealing with a big graph of services. That's been top of mind for the last couple of months, something we're addressing right now.
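To make the "which services do I actually need?" problem concrete, here is a toy sketch that walks a declared dependency graph and computes the transitive set of services a devbox would have to run for one entry point. The graph and the service names are invented; Stripe's real configuration is much richer than a flat mapping.

```python
# Toy dependency graph: service -> services it calls. All names are invented.
GRAPH: dict[str, list[str]] = {
    "dashboard": ["api-gateway"],
    "api-gateway": ["charges", "customers"],
    "charges": ["authz", "ledger"],
    "customers": ["authz"],
    "authz": [],
    "ledger": [],
}

def services_to_run(entry: str, graph: dict[str, list[str]]) -> set[str]:
    """Transitive closure of dependencies for one entry-point service."""
    needed: set[str] = set()
    stack = [entry]
    while stack:
        svc = stack.pop()
        if svc in needed:
            continue
        needed.add(svc)
        stack.extend(graph.get(svc, []))
    return needed

if __name__ == "__main__":
    # Working on "charges"? You still need everything downstream of it, plus
    # (in practice) something upstream to drive requests into it.
    print(sorted(services_to_run("charges", GRAPH)))
    # -> ['authz', 'charges', 'ledger']
```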
When using the remote devbox scenario, do devs face UI latency in a typical interaction?
No. This is where it matters what IDE you use. If you're using VS Code remote dev, the IDE actually hides the latency. There is a local buffer, and that buffer is asynchronously synced over SSH to the remote end, where VS Code runs its own components automatically. In other words, VS Code SSHes in and runs its stuff on the other end of the connection. All this is hidden from the user, even from the administrator. VS Code just needs a config file, and then it does all that. If you're not using VS Code, for example I tried this with Emacs, and Emacs has something called TRAMP, which is a way to edit files over SSH, then yes, you see the latency, absolutely. It's quite annoying. We're thinking about ways to fix this. The important thing about remote dev is that the source of truth is remote, on the devbox. It doesn't mean that every file has to live only there. You could imagine some virtual file system that exposes those remote files locally, nearer to your IDE, under some sync. TBD, but we're exploring some file system based solution there to reduce latency for editors that don't support remote natively.
Synodinos: I remember in the past people were using things like macFUSE.
Vasani: There is some churn in the world of macFUSE, because as I understand it, macOS is trying to deprecate the APIs that FUSE depends on. NFS still exists, so that protocol can be used for exposing arbitrary file systems. That may be a future area. This is all very hand-wavy, very researchy work that we need to make some progress on someday. That's the stage that it's at right now.
Synodinos: Would you like to elaborate a little bit on the complexities? You briefly mentioned the complexities that come with the fact that you have many services: you cannot have everything in the devbox, so there's something behind it that the devboxes talk to. You probably have hundreds, thousands of services that you need to worry about?
Vasani: We do need to worry about a lot of services. That order of magnitude sounds right. The general complexity, first of all, is resources. There is some limit to how much budget we want to spend per developer, whatever it may be; it may be different for each org. There is generally some budget, and you don't want to explode that by just running every service for every developer. This comes to the second point, which is that you don't really need to. Most developers are changing something very local, and you want to be able to locally reason about that. The testing difficulty comes from the fact that there is some complex graph. What's happening in the dev environment is not just testing, but also comprehension and understanding of your own software. You're often changing code in an area where you're not an expert, or you're trying to become an expert. You don't know exactly how to invoke your service if it's deep in the graph of services. We see this commonly. People say, I want to run the API gateway and this API service and that thing, and finally, when that whole graph is set up, then my service gets called. If your service is pretty far downstream, you don't necessarily know the exact dance to get it into the state that you want.
The question is, if you're using a shared environment, and there are all these per-user environments, we can't really have the shared environment call into the per-user environment without some kind of per-service routing. I know Jake from Lyft talked about per-service routing. That's one interesting area. The other is that, for efficiency purposes, you just want to mock certain services, because some of them just don't make sense to run. If you have a fairly complex authorization service, and you're testing something completely unrelated to authorization, you don't need to run that service, you just want to return true or something. You want very minimal mocks in certain cases. We want the ability to, at the service mesh layer, add a fake, which is something Google has written a lot about in their software engineering book. It's something I'm very interested in, and we're looking at that as well.
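As a deliberately trivial illustration of the kind of fake described above, here is a stub authorization service that always allows. The port, endpoint, and response shape are made up; in a real setup the service mesh would route authorization calls to a stub like this instead of the full service.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlwaysAllow(BaseHTTPRequestHandler):
    """Fake authorization service: every check succeeds.

    Useful when the change under test has nothing to do with authorization
    and running the real service would just burn devbox resources.
    """

    def do_POST(self):
        # Drain the request body (the real service would inspect it).
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)

        body = json.dumps({"allowed": True}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Port is a placeholder; the mesh would point authz traffic here.
    HTTPServer(("127.0.0.1", 9901), AlwaysAllow).serve_forever()
```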
Synodinos: How do you keep your mocks up to date when APIs change?
Vasani: I'm running a little bit ahead of myself, because we haven't actually put these in production yet. When I say in production, I mean in use among our users internally. I don't have a good answer there other than: the mocks need tests. You need to be able to know when the mocks are out of date. We're looking at being able to generate mocks from some record-replay solution, maybe. Then you'd version control those just like anything else. They might have some rudimentary smoke tests of their own, and when they break, you'll have to regenerate them. Again, a super hand-wavy answer; it hasn't been borne out by actual use.