How GitHub Copilot Serves 400 Million Completion Requests a Day

Summary

David Cheney discusses the intricate architecture of GitHub Copilot's code completion service, explaining the challenges of achieving low-latency responses for millions of daily requests. He delves into HTTP/2 optimizations, global scaling strategies, and the critical role of their internal proxy.

Bio

David Cheney is an open source contributor and project member for the Go programming language. David is a well-respected voice within the tech community, speaking on a variety of topics such as software design, performance, and the Go programming language.

About the conference

Software is changing the world. QCon San Francisco empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Cheney: GitHub Copilot is the largest LLM-powered code completion service in the world. We serve hundreds of millions of requests a day with an average response time of under 200 milliseconds. The story I'm going to cover in this talk is the story of how we built this service.

I'm the cat on the internet with glasses. In real life, I'm just some guy that wears glasses. I've been at GitHub for nearly five years. I've worked on a bunch of backend components, which none of you know about, but you interact with every day. I'm currently the tech lead on copilot-proxy, which is the service that connects your IDEs to LLMs hosted in Azure.

What is GitHub Copilot?

GitHub is the largest social coding site in the world. We have over 100 million users. We're very proud to call ourselves the home of all developers. The product I'm going to talk to you about is GitHub Copilot, specifically the code completion part of the product. That's the bit that I work on. Copilot does many other things: chat, interactive refactoring, and so on. They broadly use the same architecture and infrastructure I'm going to talk to you about, but the details vary subtly. GitHub Copilot is available as an extension. You install it in your IDE. We support most of the major IDEs: VS Code, Visual Studio, obviously, the pantheon of IntelliJ IDEs, Neovim, and we recently announced support for Xcode. Pretty much, you can get it wherever you want. We serve more than 400 million completion requests a day. That was the number when I pitched this talk. I had a look at it recently, and it's much higher than that these days.

We peak at about 8,000 requests a second during the overlap between the European afternoon and the U.S. work day. That's our peak period. During that time, we see a mean response time of less than 200 milliseconds. Just in case there is one person who hasn't seen GitHub Copilot in action, here's a recording of me just working on some throwaway code. What you'll see here is that we have the inbuilt IDE completions, those are the ones in the box, and Copilot's completions, which are the gray ones, which we notionally call ghost text because it's gray and ethereal. You can see as I go through here, every time that I stop typing or I pause, Copilot takes over. You can write a comment describing the function you want, and Copilot will do its best to write it for you. It really likes patterns. As you see, it's figured out the pattern of what I'm doing. We all know how to make prime numbers. You pretty much get the idea. That's the product in action.

Building a Cloud Hosted Autocompletion Service

Let's talk about the requirements of the service that powers this on the backend, because the goal is interactive code completion in the IDE. In this respect, we're competing with all of the other interactive autocomplete built into most IDEs: anything LSP-powered, your Code Sense, your IntelliSense, all of that stuff. This is a pretty tall order, because those things running locally on your machine don't have to deal with network latency. They don't have to deal with shared server resources. They don't have to deal with the inevitable cloud outage. We've got a pretty high bar that's been set for us. To be competitive, we need to do a bunch of things.

The first one is that we need to minimize latency before and between requests. We need to amortize any setup costs that we incur, because this is a network service. To a point, we need to avoid as much network latency as we can, because that's overhead that our competitors that sit locally in IDEs don't have to pay. The last one is that the length, and thus the time to generate, a code completion response is very much a function of the size of the request, which is completely variable. One of the other things that we do is, rather than waiting for the whole response to be generated and then sending it back to the user, we work in a streaming mode. It doesn't really matter how long the response is; we start streaming it back as soon as the model starts producing it. This is quite a useful property because it unlocks other optimizations.
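To make the streaming mode concrete, here is a minimal Go sketch of a handler that flushes completion tokens to the client as they arrive, instead of buffering the whole response. It is illustrative only: the tokenStream channel stands in for output coming back from the model, and none of the names are Copilot's actual code.

    package example

    import (
        "fmt"
        "net/http"
    )

    // streamCompletion writes completion tokens to the client as they arrive,
    // rather than waiting for the whole completion to be generated.
    // tokenStream is a hypothetical stand-in for tokens coming back from the model.
    func streamCompletion(w http.ResponseWriter, r *http.Request, tokenStream <-chan string) {
        flusher, ok := w.(http.Flusher)
        if !ok {
            http.Error(w, "streaming unsupported", http.StatusInternalServerError)
            return
        }
        w.Header().Set("Content-Type", "text/event-stream")
        for {
            select {
            case <-r.Context().Done():
                return // the client went away or cancelled the stream
            case tok, more := <-tokenStream:
                if !more {
                    return // the model has finished
                }
                fmt.Fprint(w, tok)
                flusher.Flush() // push the token out immediately
            }
        }
    }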

I want to dig into this "connection setup is expensive" idea. Because this is a network service, we use TCP. TCP uses the so-called 3-way handshake: SYN, SYN-ACK, ACK. On top of that, because this is the internet and it's 2024, everything needs to be secured by TLS. TLS takes between 5 and 7 additional legs to do that handshaking and negotiate keys in both directions. Some of these steps can be pipelined. A lot of work has gone into reducing these setup costs and overlapping the TLS handshake with the TCP handshake. These are great optimizations, but they're not a panacea. There's no way to drive this network cost down to zero. Because of that, you end up with about five or six round trips between you and the server and back again to make a new connection to a service.

The duration of each of those legs is highly correlated with distance. This graph that I shamelessly stole from the internet says about 50 milliseconds a leg, which is probably East Coast to West Coast time. Where I live, on the other side of an ocean, we see round trip times far in excess of 100 milliseconds. When you add all that up, doing five or six of them makes connection setup really expensive. You want to do it once, and you want to keep it open for as long as possible.

Evolution of GitHub Copilot

Those are the high-level requirements. Let's take a little trip back in time and look at the history of Copilot as it evolved. When we started out, we had an extension in VS Code. To use it, the alpha users would go to OpenAI, get an account, get their key added to a special group, then go and plug that key into the IDE. This worked. It was great for an alpha product. It scaled to literally dozens of users. At that point, everyone got tired of being in the business of user management. OpenAI don't want to be in the business of knowing who our users are. Frankly, we don't want that either. What we want is a service provider relationship. That kind of thing is what you're used to when you consume a cloud service. You get a key to access the server's resources. Anytime someone uses that key, a bill is generated. Who is allowed to do that, and under what circumstances, is entirely your job as the product team. We're left with this problem of, how do we manage this service key? Let's talk about the wrong way to do it.

The wrong way to do it would be to encode the key somehow in the extension that we give to users, in a way that it can be extracted and used by the service but is invisible to casual or malicious onlookers. This is impossible. This is at best security by obscurity. It doesn't work, just ask the rabbit r1 folks. The solution that we arrived at is to build an authenticating proxy which sits in the middle of this network transaction. The name of the service is copilot-proxy, which is just an internal name; I'm going to call it the proxy for the rest of this talk. It was added shortly after our alpha release to move us from that period of user-provided keys to a more scalable authentication mechanism.

What does this workflow look like now? You install the extension in your IDE just as normal, and you authenticate to GitHub just as normal. That creates a kind of OAuth relationship, where there's an OAuth key which identifies that installation on that particular machine for that person who's logged in at that time. The IDE can now use that OAuth credential to go to GitHub and exchange it for what we call a short-lived code completion token. The token is just like a train ticket: it's an authorization to use the service for a short period of time, and it's signed. When the request lands on the proxy, all we have to do is check that that signature is valid. If it's good, we swap the token out for the actual API service key, forward it on, and stream back the results. We don't need to do any further validation. This is important because it means that for every request we get, we don't have to call out to an external authentication service. The short-lived token is the authentication.
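As a rough sketch of the flow just described, the following Go code shows an authenticating reverse proxy that validates a signed short-lived token locally and swaps it for the real service key before forwarding. The upstream URL, header layout, and verifySignature helper are illustrative assumptions, not GitHub's implementation.

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
        "strings"
    )

    // Hypothetical model endpoint; Copilot's real upstream is not public.
    var upstream, _ = url.Parse("https://example-model-endpoint.invalid")

    func newProxy(serviceKey string) *httputil.ReverseProxy {
        p := httputil.NewSingleHostReverseProxy(upstream)
        orig := p.Director
        p.Director = func(r *http.Request) {
            orig(r)
            // Swap the user's short-lived token for the real API service key.
            r.Header.Set("Authorization", "Bearer "+serviceKey)
        }
        return p
    }

    func authMiddleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            token := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
            // The token's signature and expiry are checked locally; no call
            // out to an external authentication service per request.
            if !verifySignature(token) {
                http.Error(w, "invalid or expired token", http.StatusUnauthorized)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    // verifySignature is a placeholder: in reality this would validate a
    // signed, short-lived completion token (e.g. an HMAC or public-key
    // signature plus an expiry claim).
    func verifySignature(token string) bool {
        return token != ""
    }

    func main() {
        log.Fatal(http.ListenAndServe(":8080", authMiddleware(newProxy("real-service-key"))))
    }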

From the client's point of view, nothing's really changed in their world. They still think they're talking to a model, and they still get the response as usual. This token's got a lifetime in the order of minutes: 10, 20, 30 minutes. This is mainly to limit the liability if, say, it was stolen, which is highly unlikely. The much more likely and sad case is abuse, where we need the ability to shut down an account and therefore stop generating new tokens. That's generally why the token has a short lifetime. In the background, the client knows the expiration time of the token it was given, and a couple of minutes before that, it kicks off a refresh cycle, gets a new token, swaps it out, and the world continues.
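Here is a minimal sketch of what that background refresh cycle can look like on the client side. The fetchToken callback and the two-minute lead time are illustrative assumptions, not the extension's actual behavior.

    package example

    import (
        "context"
        "time"
    )

    type token struct {
        value     string
        expiresAt time.Time
    }

    // refreshLoop keeps a valid completion token available, renewing it a
    // couple of minutes before the current one expires. fetchToken is a
    // hypothetical call that exchanges the OAuth credential for a fresh
    // short-lived token; onNewToken hands it to whoever makes requests.
    func refreshLoop(ctx context.Context, fetchToken func() (token, error), onNewToken func(token)) {
        for {
            t, err := fetchToken()
            if err != nil {
                time.Sleep(10 * time.Second) // simple retry backoff
                continue
            }
            onNewToken(t)
            // Wake up two minutes before expiry to fetch the next token.
            wait := time.Until(t.expiresAt) - 2*time.Minute
            if wait < 0 {
                wait = 0
            }
            select {
            case <-ctx.Done():
                return
            case <-time.After(wait):
            }
        }
    }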

When Should Copilot Take Over?

We solved the access problem. That's half the problem. Now to talk about another part of the problem. As a product design, we don't have an autocomplete key. I remember in Eclipse, you would use command + space, things like that, to trigger the autocomplete or to trigger the refactoring tools. I didn't use that in the example I showed. Whenever I stop typing, Copilot takes over. That creates the question of, when should it take over? When should we switch from user typing to Copilot typing? It's not a straightforward problem. One of the solutions we could use is just a fixed timer. We hook the key presses, and after each key press, we start a timer. If that timer elapses without another key press, then we say, we're ready to issue the request and move into completion mode.

This is good because it provides an upper bound on how long we wait, and that waiting is additional latency. It's bad because it also provides a lower bound: we always wait at least this long before starting, even if that was the last key press the user was going to make. We could try something a bit more science-y and use a tiny prediction model to look at the stream of characters as they're typed and predict, are they approaching the end of a word or are they in the middle of a word, and nudge that timer forward and back. We could just do things like a blind guess: any time there's a key press, we can just assume that's it, no more input from the user, and always issue the completion request. In reality, we use a mixture of all these strategies.
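As an illustration of the fixed-timer strategy, here is a minimal Go sketch of a debounce loop: every key press resets a quiet-period timer, and if the timer fires without another key press, a completion request is triggered. The 75 millisecond value is an arbitrary illustrative choice, not Copilot's actual tuning, and as the talk notes, the real client mixes several strategies rather than relying on a single timer.

    package example

    import "time"

    // debounce triggers a completion request once the user has stopped typing
    // for a quiet period. keyPresses carries one value per key press.
    func debounce(keyPresses <-chan struct{}, triggerCompletion func()) {
        const quiet = 75 * time.Millisecond // illustrative value
        var timer *time.Timer
        var timerC <-chan time.Time // nil until the first key press
        for {
            select {
            case _, ok := <-keyPresses:
                if !ok {
                    return
                }
                // Restart the quiet-period timer on every key press.
                if timer != nil {
                    timer.Stop()
                }
                timer = time.NewTimer(quiet)
                timerC = timer.C
            case <-timerC:
                // The user has paused: issue a completion request, and don't
                // arm the timer again until the next key press.
                timerC = nil
                triggerCompletion()
            }
        }
    }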

That leads us to the next problem, which is, despite all the work and tuning that went into this, around half of the requests that we issue are what we call typed through. Don't forget, we're doing autocomplete. If you continue to type after we've made a request, you've now diverged from the data we had, and our request is now out of date. We can't use that result. We could try a few things to work around this. We could wait longer before a request. That might reduce the number of times that we issue a request and then have to discard the result. But that additional latency, that additional waiting, penalizes every user who had stopped and was waiting. Of course, if we wait too long, then users might think that Copilot is broken because it's not saying anything to them anymore. Instead, what we've built is a system that allows us to cancel a request once it's been issued. Cancelling an in-flight request over HTTP is potentially novel. I don't claim it to be unique, but it's certainly the first time I've come across it in my career.

Canceling a HTTP Request

I want to spend a little bit of time digging into what it means to cancel a HTTP request. You're at your web browser and you've decided you don't want to wait for that page; how do you say, I want to stop, I want to cancel that? You press the stop button. You could close the browser tab. You could drop off the network. You could throw your laptop away. These are all very final actions. They imply that you're canceling the request because you're done. Either you're done with using the site or you're just frustrated and you've given up. It's an act of finality. You don't intend to make another request. Under the hood, they all have the same networking behavior: you reset the TCP stream, the connection that we talked about setting up on previous slides. That's on the browser side. If we look on the server side, either at the application layer or in your web framework, this idea of cancellation is not something that is very common inside web frameworks.

If a user using your application on your site presses stop in the browser, or if they Control-C their cURL command, that underlying thing translates into a TCP reset of the connection. On the other end, in your server code, when do you get to see that signal? When do you get to see that they've done that? The general times that you can spot that the TCP connection has been reset are either early in the request, when you're reading the headers and the body, or later on, when you go to start writing your response back.

This is a really big problem for LLMs, because the cost of the request, that initial inference before you generate the first token, is the majority of the cost. That happens before you produce any output. All that work is performed. We've done the inference. We're ready to start streaming back tokens. Only then do we find that the user closed the socket and they've gone. As you saw, in our case, that's about 45% of the requests. Half of the time, we'd be performing that inference and then throwing the result on the floor, which is an enormous waste of money, time, and energy, which in AI terms is all the same thing.

If this situation wasn't bad enough, it gets worse. Because cancellation in the HTTP world is the result of closing that connection. In our case, the reason we canceled that request is because we want to make another one straightaway. But to make that request straightaway, we don't have a connection anymore. We have to pay those five or six round trips to set up a new TCP plus TLS connection. In this naive model, cancellation occurs on every other request on average. This would mean that users are constantly closing and reestablishing their TCP connections, closing to signal they want to cancel and then reestablishing a connection to make a new request. The latency of that far exceeds the cost of just letting the request that we didn't need run to completion and then ignoring it.

HTTP/2, and Its Importance

Most of what I said on the previous slides applies to HTTP/1, which has this one-request-per-connection model. As you're reading on this slide, HTTP version numbers go above 1; they go up to 2 and 3. I'm going to spend a little bit of time talking about HTTP/2 and how that was very important to our solution. As a side note, copilot-proxy is written in Go because it has a quite robust HTTP library that gave us the HTTP/2 support and control that we needed for this part of the product. That's why I'm here, rather than a Rustacean talking to you. This is mostly an implementation detail. HTTP/2 is more like SSH than good old HTTP/1 plus TLS. Like SSH, HTTP/2 is a tunneled connection. You have a single connection and multiple requests multiplexed on top of it. In both SSH and HTTP/2, they're called streams. A single network connection can carry multiple streams, where each stream is a request. We use HTTP/2 between the client and the proxy because that allows us to create a connection once and reuse it over and over again.

Instead of resetting the TCP connection itself, you just reset the individual stream representing the request you made. The underlying connection stays open. We do the same between the proxy and our LLM model. Because the proxy is effectively concatenating requests, fanning them in from thousands of clients down onto a small set of connections to the LLM model, we use a connection pool, a bunch of connections to talk to the model. This is just to spread the load across multiple TCP connections, avoid networking issues, avoid head-of-line blocking, things like that. Just like on the client side, these connections between the proxy and our LLM model are established when we start the process, and they live, assuming there are no upstream problems.

They remain open for the lifetime of the process until we redeploy it, so minutes to hours to days, depending on when we choose to redeploy. By keeping these long-lived HTTP/2 connections open, we get additional benefits at the TCP layer. Basically, TCP has this trust thing: the longer a connection is open, the more it trusts it, and the more data it allows to be in flight before it has to be acknowledged. You get these nice, warmed-up TCP pipes that go all the way from the client through the proxy, up to the model and back.
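For flavor, here is a minimal sketch of the kind of long-lived HTTP/2 client configuration being described, using Go's standard library. The specific knob values are illustrative; the actual pool sizing used by copilot-proxy isn't public.

    package example

    import (
        "net/http"
        "time"
    )

    // newUpstreamClient returns a client for talking to an upstream model
    // endpoint. With TLS and ForceAttemptHTTP2, Go's transport negotiates
    // HTTP/2 via ALPN and multiplexes requests as streams over a small number
    // of connections that stay open between requests.
    func newUpstreamClient() *http.Client {
        return &http.Client{
            Transport: &http.Transport{
                ForceAttemptHTTP2:   true,
                MaxIdleConns:        100,
                MaxIdleConnsPerHost: 100,
                // Keep idle connections around so we don't pay the TCP plus
                // TLS handshake again between requests.
                IdleConnTimeout: 90 * time.Second,
            },
        }
    }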

This is not intended to be a tutorial on Go, but for those who do speak it socially, this is what basically every Go HTTP handler looks like. The key here is this req.Context object. Context is effectively a handle. It allows efficient transmission of cancellations, timeouts, and that kind of request-specific metadata. The important thing here is that the other end of this request context is effectively connected out into user land, to the user's IDE. When, by continuing to type, they need to cancel a request, that stream reset causes this context object to be canceled. That makes it immediately visible to the HTTP server, without having to wait until we get to the point of actually trying to write any data to the stream.

Of course, this context can be passed up and down the call stack and used by anything that wants to know whether it should stop early. We use it in the HTTP client: when we make the call onwards to the model, we pass in that same context. The cancellation that happens in the IDE propagates to the proxy and into the model effectively immediately. This is all rather neatly expressed here, but it requires that all parties speak HTTP/2 natively.
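The slide code isn't reproduced in the transcript, so here is a minimal sketch of the pattern being described: a handler whose request context is cancelled when the client resets its stream, and which passes that same context into the onward call so the upstream request is cancelled too. The upstream URL is a placeholder, not Copilot's real endpoint.

    package example

    import (
        "io"
        "net/http"
    )

    const upstreamURL = "https://example-model-endpoint.invalid/v1/completions" // hypothetical

    func completionHandler(w http.ResponseWriter, r *http.Request) {
        // r.Context() is cancelled as soon as the client resets its HTTP/2 stream.
        ctx := r.Context()

        upstreamReq, err := http.NewRequestWithContext(ctx, http.MethodPost, upstreamURL, r.Body)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        upstreamReq.Header.Set("Content-Type", "application/json")

        resp, err := http.DefaultClient.Do(upstreamReq)
        if err != nil {
            // Includes the case where ctx was cancelled before or during the call.
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        defer resp.Body.Close()

        // Stream the model's response straight back to the client; if the
        // client cancels, the copy stops because both ends share the context.
        io.Copy(w, resp.Body)
    }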

It turns out this wasn't all beer and skittles. In practice, getting this end-to-end HTTP/2 turned out to be more difficult than we expected, despite HTTP/2 being nearly a decade old. General support for it just wasn't as good as it could be. For example, most load balancers are happy to speak HTTP/2 on the frontend but downgrade to HTTP/1 on the backend. This includes most of the major ALBs and NLBs you get from your cloud providers. At the time, it included all the CDN providers that were available to us. That fact alone was enough reason for us to do this project ourselves. There were also other weird things we ran into.

At the time, and I don't believe it's been fixed yet, OpenAI was fronted with nginx. nginx just has an arbitrary limit of 100 requests per connection. After that, it just closes the connection. At the request rates that you saw, it doesn't take long to chew through 100 requests, and then nginx will drop the connection and force you to reestablish it. That was just a buzzkill.

All of this is just a long-winded way of saying that the generic advice of, yes, just stick your app behind your cloud provider's load balancer and it will be fine, didn't work out for us out of the box. Something that did work out for us is GLB. GLB stands for GitHub Load Balancer. It was introduced eight years ago, and it's one of the many things that has spun out of our engineering group. GLB is based on HAProxy under the hood. HAProxy turns out to be one of the very few open-source load balancers that offers exquisite HTTP/2 control. I've never found anything like it. Not only does it speak HTTP/2 end-to-end, it offers fine-grained control over the whole connection. GLB sits in front of everything that you interact with on GitHub, and it actually owns the client connection. The client connects to GLB, and GLB holds that connection open. When we redeploy our proxy pods, their connections are gracefully torn down and then reestablished to the new pods. GLB keeps the connection to the client open. The clients never see that we've done a redeploy; they're never disconnected during that time.

GitHub Copilot's Global Nature

With success and growth come yet more problems. We serve millions of users around the globe. We have Copilot users in all the major markets: APAC, where I live, Europe, the Americas, EMEA, all over the world. There's not a time that we're not busy serving requests. This presents the problem that even though all this HTTP/2 stuff is really good, it still can't change the speed of light. The round-trip time of just sending the bits of your request across an ocean or across a long geographic boundary can easily exceed the actual mean time to process that request and send back the answer. This is another problem. The good news is that Azure, through its partnership with OpenAI, offers OpenAI models in effectively every region that Azure has. They've got dozens of regions around the world. This is great. We can put a model in Europe, we can put a model in Asia. We can put a model wherever we need one, wherever the users are. Now we have a few more problems to solve.

In terms of requirements, we want users to be routed to their "closest" proxy region. If that region is unhealthy, we want them to automatically be routed somewhere else so they continue to get service. The flip side is also true, because if we have multiple regions around the world, this increases our capacity and our reliability. We no longer have all our eggs in one basket, in one giant model somewhere, let's just say in the U.S. By spreading them around the world, we're far less likely to be in a situation where the whole service is down. To do this, we use another product that spun out of GitHub's engineering team, called octoDNS. octoDNS, despite its name, is not actually a DNS server. It's a configuration language to describe the DNS configurations that you want. It supports all the good things: arbitrary weightings, load balancing, splitting, sharing, health checks. It allows us to identify users in terms of the continent they're in, the country.

Here in the United States, we can even work down to the state level sometimes. It gives us exquisite control to say, you over there, you should primarily be going to that instance; you over there, you should primarily be going to that other instance, and to do a lot of testing to say, for a user who is roughly halfway between two instances, which is the best one to send them to so they have the lowest latency? On the flip side, each proxy instance is looking at the success rate of the requests that it handles. If that success rate drops below the SLO, those proxy instances use the standard health check endpoint pattern. They set their health check status to 500. The upstream DNS providers, which have been programmed with those health checks, notice that.

If a proxy instance starts seeing its success rate drop, it votes itself out of DNS. It goes and takes a little quiet time by itself. When it's feeling better, when it's above the SLO again, it raises its health check status and brings itself back into DNS. This is now mostly self-healing. It turns a regional outage, when we're like, "All of Europe can't do completions", into just a reduction in capacity, because traffic is routed to other regions.
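A minimal sketch of that self-exclusion pattern: a health endpoint that reports 500 when the recent success rate falls below an SLO, so DNS health checks route traffic elsewhere, and reports 200 again once the instance recovers. The 99% threshold and the lifetime counters are illustrative simplifications; a real implementation would use a sliding window.

    package example

    import (
        "net/http"
        "sync/atomic"
    )

    type healthState struct {
        total   atomic.Int64
        success atomic.Int64
    }

    // record is called once per completion request with its outcome.
    func (h *healthState) record(ok bool) {
        h.total.Add(1)
        if ok {
            h.success.Add(1)
        }
    }

    func (h *healthState) healthy(slo float64) bool {
        total := h.total.Load()
        if total == 0 {
            return true // no traffic yet, assume healthy
        }
        return float64(h.success.Load())/float64(total) >= slo
    }

    // handler is the health check endpoint polled by upstream DNS providers.
    func (h *healthState) handler(w http.ResponseWriter, r *http.Request) {
        if h.healthy(0.99) { // 0.99 is an illustrative SLO
            w.WriteHeader(http.StatusOK)
            return
        }
        // Vote ourselves out of DNS: the health checkers see a 500 and stop
        // sending new traffic to this region until we recover.
        w.WriteHeader(http.StatusInternalServerError)
    }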

One thing I'll touch on briefly is one model we experimented with and then rejected, because it just didn't work out for us: the so-called point of presence model. You might have heard it called PoP. If you're used to working with big CDNs, they will have points of presence. Imagine every one of these dots on this map is a data center they're serving from. The idea is that users will connect, and do that expensive connection setup, as close to them as possible, and speed up that bit.

Then those CDN providers will cache that data, and if they need to, they can call back to the origin server. In our scenario, where I live in Asia, we might put a point of presence in Singapore. That's a good, equidistant place for most of Asia. A user in Japan would be attracted to that Singapore server. There's a problem, because the model is actually still hosted back here on the West Coast. We have traffic that flows westward to Singapore only to turn around and go all the way back to the West Coast. The networking colloquialism for this is traffic tromboning. This is ok for CDN providers, because their goal is to cache as much of the information as possible, so they rarely call back to the origin server. Any kind of round tripping or hairpinning of traffic isn't really a problem for them.

For us doing code completions, every request is always going back to a model. Where we ended up after a lot of experimentation was that the idea of having many regions calling back to a few models just didn't pay for itself. The latency wasn't as good, and it carried with it a very high operational burden. Every point of presence that you deploy is now a thing you have to monitor, and upgrade, and deploy to, and fix when it breaks. It just didn't pay for us. We went with a much simpler model, which is simply: if there is a model in an Azure region, we colocate a proxy instance in that same Azure region, and we say that is the location that users' traffic is sent to.

A Unique Vantage Point

We started out with a proxy whose job was to authenticate users' requests and then mediate them towards an LLM. It turns out it's very handy to be in the middle of all these requests. Some examples I'll give of this: we obviously look at latency from the point of view of the client, but that's a very fraught thing to do. It's something I caution you about: it's ok to track that number, just don't put it on a dashboard, because someone will be very incentivized to take the average of it, or something like that. You're essentially averaging the experience of everybody on the internet's requests, from somebody who lives way out in the bush on a satellite link to someone who lives next to the AMS-IX data center, and you're effectively trying to take the average of all their experiences. What you get when you do that is effectively the belief that all your users live on a boat in the middle of the Atlantic Ocean.

This vantage point is also good because, while our upstream provider does give us lots of statistics, they're really targeted at how they view running the service, their metrics. They have basic request counts and error rates and things like that, but they're not really at the granularity we want. More fundamentally, the way that I think about it, to take food delivery as an example: you use the app, you request some food, and about 5 minutes later you get a notification saying, "Your food's ready. We've finished cooking it. We're just waiting for the driver to pick it up". From the restaurant's point of view, their job is done, they did it, their SLO, 5 minutes, done. It's another 45 minutes before there's a knock on the door with your food. You don't care how quickly the restaurant prepared your food. What you care about is the total end-to-end time of the request. We handle that by defining in our SLOs that the point we are measuring is at our proxy. It's ok for our service provider to have their own metrics, but we negotiate our SLOs at the point in the data plane where the proxy sits.

Dealing with a Heterogeneous Client Population

You saw that we support a variety of IDEs, and within each IDE, there is a flotilla of different client versions out there. Dealing with the long tail of client versions is the bane of my life. There is always a long tail. When we do a new client release, we'll get to about 80% of the population within 24 to 36 hours. That last 20% will take until the heat death of the universe. I cannot understand how clients can still be using such old software. The auto-update mechanisms are so pervasive and pernicious about getting you to update, I don't quite understand how they can do this, but they do. What this means is that if we have a bug or we need to make a fix, we can't do it in the client. It just takes too long, and we never get to the population that would make rolling out that fix successful. This is where it's good to have a service that sits in the middle, the proxy, where we can hopefully do a fix-up on the fly.

Over time, that fix will make it into the client versions and roll out to a sufficient population. An example of this: one day, out of the blue, we got a call from a model provider that said, you can't send this particular parameter. It was something to do with log probabilities. You can't send that because it'll cause the model to crash, which is pretty bad, because this is a poison pill. If a particular form of request will cause a model instance to crash, it will blow that one out of the water, and that request will be retried and it'll blow the next one out of the water, and keep working its way down the line. We couldn't fix it in the client because it wouldn't be fast enough. Because we have a proxy in the middle, we can just mutate the request quietly on the way through, and that takes the pressure off while our upstream provider gets a real fix in place so we can restore that functionality.
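A minimal sketch of that kind of on-the-fly fix-up, assuming a JSON request body; the field name "logprobs" is used purely as an illustration of a parameter being quietly dropped by the proxy.

    package example

    import (
        "bytes"
        "encoding/json"
        "io"
        "net/http"
    )

    // scrubRequest rewrites the completion request body on its way through the
    // proxy, dropping a field the upstream model can't currently handle.
    func scrubRequest(r *http.Request) error {
        body, err := io.ReadAll(r.Body)
        if err != nil {
            return err
        }
        r.Body.Close()

        var payload map[string]any
        if err := json.Unmarshal(body, &payload); err != nil {
            return err
        }
        // Quietly remove the problematic parameter; the client never needs to know.
        delete(payload, "logprobs")

        rewritten, err := json.Marshal(payload)
        if err != nil {
            return err
        }
        r.Body = io.NopCloser(bytes.NewReader(rewritten))
        r.ContentLength = int64(len(rewritten))
        return nil
    }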

The last thing that we do is, when we have clients that are very old and we need to deprecate some API endpoint or something like that, rather than just letting them get weird 404 errors, we actually have a special status code which triggers logic in the client that puts up a giant modal dialog box. It asks them very politely, would they please just push the upgrade button?

There's even more that we can do with this, because logically the proxy is transparent. Through all of the shenanigans, the clients still believe that they're making a request to a model and they get a response. The rest is transparent. From the point of view of us in the middle who are routing requests, we can now split traffic across multiple models. Quite often, the capacity we receive in one region won't all be in one unit. It might be spread across multiple units, especially if it arrives at different times. Being able to do traffic splits to combine all that together into one logical model is very handy. We can do the opposite. We can mirror traffic. We can take a read-only tap of requests and send that to a new version of the model that we might be either performance testing, or validating, or something like that.

Then we can take these two ideas, mix and match them, stack them on top of each other, and make A/B tests, experiments, all those kinds of things, all without involving the client. From the client's point of view, it just thinks it's talking to the same model it was talking to yesterday.
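Here is a minimal sketch of those two primitives, a weighted split across upstream endpoints and a fire-and-forget mirror, with illustrative names and weights rather than anything Copilot-specific.

    package example

    import (
        "bytes"
        "math/rand"
        "net/http"
    )

    type upstreamTarget struct {
        url    string
        weight int
    }

    // pickUpstream chooses an upstream in proportion to its weight, combining
    // several capacity units into one logical model. It assumes at least one
    // target with a positive weight.
    func pickUpstream(targets []upstreamTarget) string {
        total := 0
        for _, t := range targets {
            total += t.weight
        }
        n := rand.Intn(total)
        for _, t := range targets {
            if n < t.weight {
                return t.url
            }
            n -= t.weight
        }
        return targets[len(targets)-1].url
    }

    // mirror fires a read-only copy of the request at a candidate model and
    // throws the response away; the client only ever sees the primary response.
    func mirror(req *http.Request, candidateURL string, body []byte) {
        go func() {
            copyReq, err := http.NewRequest(req.Method, candidateURL, bytes.NewReader(body))
            if err != nil {
                return
            }
            copyReq.Header = req.Header.Clone()
            resp, err := http.DefaultClient.Do(copyReq)
            if err != nil {
                return
            }
            resp.Body.Close()
        }()
    }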

Was It Worth the Engineering Effort?

This is the basic gist of how you build a low-latency code completion system with the aim of competing with IDEs. I want to step back and just ask, as an engineering effort, was this worth it? Did the engineering effort we put into this proxy system pay for itself? One way to look at this is, for low latency, you want to minimize hops. You certainly want to minimize the number of middlemen, the middleware, anything that's in that request path adding value but also adding latency. What if we just went straight to Azure instead, and had clients connect straight to Azure? This would have left authentication as the big problem, as well as observability. Those would really have been open questions. It would have been possible to teach Azure to understand GitHub's OAuth token. The token that the IDE natively has from GitHub could be presented to Azure as an authentication method. I'm sure that would be possible. It would probably result in Azure building effectively what I just demonstrated here.

Certainly, if our roles were reversed and I was the Azure engineer, I would build this with an authentication layer in front of my service. If some customer is coming to me with a strange authentication mechanism, I'm going to build a layer which converts that into my real authentication mechanism. We would probably have ended up with exactly the same number of moving parts, just with more of them behind the curtain on the Azure side. Instead, by colocating proxy instances and model instances in the same Azure region, we have, to a large extent, ameliorated the cost of that extra hop. The intra-region traffic is not free, it's not zero, but it's pretty close to zero. It's fairly constant in terms of the latency you see there. You can characterize it and effectively ignore it.

War Stories

I'm going to tell you a few more war stories from the life of this product, just to emphasize that the value of having this intermediary really paid for itself over and over. One day we upgraded to a new version of the model, which seemed to be very attracted to a particular token. It really liked emitting this token. It was some end-of-file marker, and it was something to do with a mistake in how it was trained that made it really like to emit this token. We can work around this in the request by saying: in your response, this very particular token, weight it down, give it a negative affinity, we never want to see it. If we didn't have an intermediary like the proxy to do that, we would have had to do it in the client. We would have had to do a client rollout, which would have been slow and ultimately would not have reached all the users.

Then the model would have been fixed and we'd have to do another client change to reverse what we just did. Instead, it was super easy to add this parameter to the request on the fly as it was on its way to the model. That solved the problem immediately and it gave us breathing room to figure out what had gone wrong with the model training and fix that without the Sword of Damocles hanging over our head.

Another story is that one day I was looking at the distribution of cancellations: for a request that was cancelled, how long did it live until it was cancelled? There was this bizarre spike at 1 millisecond, effectively immediately. It was saying that a lot of requests come from the clients and are immediately cancelled. As in, you read the request and then instantly afterwards the client is like, I'm sorry, I didn't mean to send that to you, let me take it back. The problem is, by that time we've already started the process of forwarding that to Azure and they're mulling on it. We immediately send the request to Azure and then say to them, sorry, I didn't mean to send that to you, may I please have it back? Cancellation frees up model resources quicker, but it's not as cheap as simply not sending a request that we already know we're going to cancel.
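A minimal sketch of what "not sending a request we already know is cancelled" can look like at the proxy, using the request context; the surrounding names are illustrative rather than Copilot's actual code.

    package example

    import (
        "errors"
        "net/http"
    )

    var errAlreadyCancelled = errors.New("request cancelled before forwarding")

    // forwardIfLive checks, just before forwarding to the model, whether the
    // client has already cancelled the request; if so, it skips the upstream
    // call entirely rather than making it and retracting it.
    func forwardIfLive(w http.ResponseWriter, r *http.Request, forward func(*http.Request) error) error {
        if err := r.Context().Err(); err != nil {
            // The IDE reset the stream almost immediately after sending the
            // request; don't waste an inference on it.
            return errAlreadyCancelled
        }
        return forward(r)
    }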

It took us some time to figure out what exactly was happening in the client to cause this fast cancellation behavior, but because we had the proxy in the middle, we could add a little check just before we made the request to the model: has it actually been cancelled? There were mechanisms in the HTTP library to ask that question. We saved ourselves making and then retracting that request. Another point about metrics: from the metrics that our upstream model provider provides us, we don't get histograms, we don't get distributions, we barely get averages. There would be no way we would have been able to spot this without our own observability at that proxy layer. If we didn't have the proxy as an intermediary, we still could have had multiple models around the world.

As you saw, you can have OpenAI models in any Azure region you want. We would just not have a proxy in front of them. We probably would have used something like octoDNS to still do the geographic routing, but it would have left open the question of what do we do about health checks. When models are unhealthy or overloaded, how do we take them out of DNS? What we probably would have had to do is build some kind of thing that's issuing synthetic requests or pinging the model or something like that, and then making calls to upstream DNS providers to manually thumbs up and thumbs down regions. HTTP/2 is critical to the Copilot latency story. Without cancellation, we'd make twice as many requests and waste half of them. It was surprisingly difficult to do with off-the-shelf tools.

At the time, CDNs didn't support HTTP/2 on the backend. That was an absolute non-starter. Most cloud providers didn't support HTTP/2 on the backend either. If you want to do that, you have to terminate TLS yourself. For the first year of our product's existence, TLS, the actual network connection, was terminated directly on the Kubernetes pod. You can imagine our security team were absolutely overjoyed with this situation. It also meant that every time we did a deploy, we were literally disconnecting everybody and they would have to reconnect, which goes against the goal of keeping these connections open for as long as possible.

GitHub's Paved Path

This is very GitHub specific, but a lot of you work for medium to large-scale companies, and you probably have a blessed tech stack; at GitHub we call it the paved path. It is the blessed way, the way that you're supposed to deploy applications inside the company. Everything behind GLB and everything managed by octoDNS helped our compliance story. You can imagine, we're selling this to large enterprise companies. You need to have your certifications. You need to have your SOC 2 tick in the box. Using these shared components really made that compliance story much easier. The auditors say, this is another GLB-hosted service, using all the regular stuff. It's not exactly an automatic tick in the box, but it got us a long way towards solving our compliance story. The flip side is that, because these are shared components, rather than every individual team knowing every detail of terminating TLS connections on pods hosted in Kubernetes clusters that they run themselves, we delegate that work to shared teams who are much better at it than we are.

Key Takeaways

This is the story of what made Copilot a success. It is possible that not all of you are building your own LLM-as-a-service. Are there broader takeaways for the rest of you? The first one is, definitely use HTTP/2. It's dope. I saw a presentation by the CTO of Fastly, and he viewed HTTP/2 as an intermediate step. He says HTTP/3 is the real standard, the one that they really wanted to make. From his position as a content delivery partner whose job is just to ship bits as fast as possible, I agree completely with that. Perhaps the advice is not "use HTTP/2"; the advice would probably be something like, use something better than HTTP/1. If you're interested in learning more, look that up on YouTube: there's a presentation by Geoff Huston talking about HTTP/2 from the point of view of application writers and clients, and how it totally disintermediates most of the SSL and middlebox nonsense that we live with day to day in current web stuff.

The second one is a Bezos quote: if you're gluing your product together from parts from off-the-shelf suppliers and your role in that is only supplying the silly putty and the glue, what are your customers paying you for? Where's your moat? As an engineer, I understand very deeply the desire not to reinvent the wheel, so the challenge to you is: find the place where investing your limited engineering budget in a bespoke solution is going to give you a marketable return. In our case, it was writing an HTTP/2 proxy that accelerated one API call. We're very lucky that copilot-proxy as a product is more or less done, and has been done for quite a long time, which is great because it gives our small team essentially 90% of our time to dedicate to the operational issues of running this service.

The last one is, if you care about latency, be wary when your cloud provider sings you the siren song that they can solve your latency issues with their super-fast network backbone. That can be true to a point, but remember the words of Montgomery Scott: you cannot change the laws of physics, no matter what your engineering title is. If you want low latency, you have to bring your application closer to your users. In our case that was fairly straightforward, because code completion, at least in the request path, is essentially stateless. Your situation may not be as easy. By having multiple models around the globe, we turn SEV1 incidents into just SEV2 alerts. If a region is down or overloaded, traffic just flows somewhere else. Those users, instead of getting a busy signal, still get a service, albeit at a marginally higher latency. I've talked to a bunch of people, and I've said that we would fight holy wars over 20 milliseconds. The kind of latencies we're talking about here are in the range of 50 to 100 milliseconds, so really not noticeable for the average user.

 


 

Recorded at:

Mar 24, 2025
