
Navigating LLM Deployment: Tips, Tricks, and Techniques


Summary

Meryem Arik shares best practices for self-hosting LLMs in corporate environments, highlighting the importance of cost efficiency and performance optimization. She details quantized models, batching, and workload optimizations to improve LLM serving. Insights cover model selection and infrastructure consolidation, emphasizing the differences between enterprise and large-scale AI lab deployments.

Bio

Meryem Arik is a recovering physicist and a co-founder of TitanML. The TitanML platform automates much of the difficult MLOps and Inference Optimization science to allow businesses to build and deploy state-of-the-art language models with ease. She has been recognized as a technology leader in the Forbes 30 Under 30 list.

About the conference

Software is changing the world. QCon San Francisco empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Arik: I've called this, navigating LLM deployment: tips, tricks, and techniques 2.0. I could also rename it, how to deploy LLMs if you don't work at Meta, OpenAI, Google, Mistral, or Anthropic. I'm specifically interested in how you deploy LLMs if you're not serving them as a business, but serving them so you can build applications on top of them. You end up deploying them in fairly different ways if you work at a normal company versus one of these guys.

Hopefully you're going to get three things out of this session. Firstly, you're going to learn when self-hosting is right for you, because you're going to find out it can be a bit of a pain, and it's something you should only do if you really need to. Secondly, you'll understand the differences between your deployments and the deployments of AI Labs. Then, also, I'm going to give some best practices, tips, tricks, and techniques, a non-exhaustive list, for how to deploy AI in corporate and enterprise environments. For context, we build infrastructure for serving LLMs.

Evaluating When Self-Hosting is Right for You

Firstly, when should you self-host? I'll just clarify what I mean by self-hosting. I distinguish self-hosting from interacting with LLMs through an API provider. With an API provider, they do all of the serving and hosting for you. It's deployed on their GPUs, not your GPUs. They manage all of the infrastructure, and what they expose is just an API that you can interact with. That's what an API-hosted model is. All of those companies I mentioned at the beginning host these models for you. Versus being self-hosted: when you self-host, you're in control of the GPUs. You take a model from Hugging Face or wherever you're taking that model from, and you deploy it and serve it to your end users. That's the broad difference. It's essentially a matter of who owns the GPUs and who's responsible for the serving infrastructure. So why would you ever want to self-host, when having someone else manage things for you makes your life a little bit easier?

There are three main reasons why you'd want to self-host. The first one is decreased costs. You get decreased costs when you're starting to scale. At the very early stages of POCing and trying things out, you don't have decreased costs; it's actually much cheaper to use an API provider where you pay per token, and the per-token price is very low. Once you get to any kind of scale where you can fully utilize a GPU, or close to it, self-hosting becomes much more cost efficient. The second reason you'd want to self-host is improved performance. This might sound counterintuitive, because on all the leading benchmarks, the GPT models and the Claude models are best-in-class on those benchmarks.

However, if you know your domain and you know your particular use case, you can get much better performance when self-hosting. I'm going to talk about this a bit more later. This is especially true for embedding models, search models, and reranker models: the state of the art for most of them is actually in open source, not behind the API providers. If you want the best-of-breed models, you'll end up with a combination of self-hosting for some models and using API providers for others. You can get much better performance by self-hosting. The third reason is privacy and security. I'm from Europe, and we really care about this. We also work with regulated industries here in the U.S., where you have various reasons why you might want to deploy within your own environment. Maybe you have a multi-cloud environment, or maybe you're still on-prem. This aligns with the data that a16z collected: there are broadly three reasons why people self-host, which are control, customizability, and cost. It is something that a lot of people are thinking about.

How do I know if I fall into one of those buckets? Broadly, decreased cost is relevant to me if I'm deploying at scale, or if I'm able to use a smaller specialized model for my task rather than a very big general model like GPT-4. If I care about performance, I will get improved performance if I'm running embedding or reranking workloads, or if I'm operating in a specialized domain that might benefit from fine-tuning. Or, if I have very clearly defined task requirements, I can often do better self-hosting rather than using these very generic models.

Finally, on privacy and security: if you have legal restrictions, you'll obviously have to self-host. You might also have region-specific deployment requirements. We work with a couple of clients who, because of the AWS and Azure regions available to them, have to self-host to make sure they're maintaining sovereignty in their deployments. Then, finally, if you have multi-cloud or hybrid infrastructure, that's normally a good sign that you need to self-host. A lot of people fall into those buckets, which is why the vast majority of enterprises are looking into building up some kind of self-hosting infrastructure, not necessarily for all of their use cases, but it's good to have as a sovereignty play.

I'm going to make a quick detour, a quick public service announcement. I mentioned embedding models, and I mentioned that the state of the art for embedding models is actually in the open-source realm, or at least the open-source ones are very good. There's another reason why you should almost always self-host your embedding models: you use your embedding models to create your vector database, and you've indexed vast amounts of data. If the embedding model that you're using through an API provider ever goes down or is ever deprecated, you have to reindex your whole vector database. That is a massive pain. You shouldn't do that. They're very cheap to host as well. Always self-host your embedding models.

When should I self-host versus when should I not? I've been a little bit cheeky here. Good reasons to self-host: you're building for scale; you're deploying in your own environment; you're using embedding models or reranker models; you have domain-specific use cases. Or my favorite one, you have trust issues. Someone hurt you in the past, and you want to be able to control your own infrastructure. That's also a valid reason. A bad reason to self-host is that you thought it was going to be easy. It's not necessarily easier than using API providers. Another bad reason is that someone told you it was cool. It is cool, but that's not a good reason. That's how you should evaluate whether self-hosting is right for you. If you fall into one of the buckets on the left, you should self-host. If not, you shouldn't.

Understanding the Difference between your Deployments and Deployments at AI Labs

Understanding the difference between your deployments and the deployments at AI Labs. If I'm, for example, an OpenAI, and I'm serving these language models, I'm not just serving one use case, I'm serving literally millions of different use cases. That means I end up building my serving stack very differently. You, on the other hand, are probably hosting in an enterprise or corporate environment: maybe you're serving 20 use cases in a more mature enterprise, maybe just a couple for now. Because of that difference, you're able to make different design decisions when it comes to your infrastructure. Here are a couple of reasons why your self-hosting regime will be very different to the OpenAI self-hosting regime. The first one is, they have lots of H100s and lots of cash.

The majority of you don't have lots of H100s, and are probably renting them via AWS. They're more likely to be compute bound because they're using GPUs like H100s, rather than things like A10s. They have very little information about your end workload, so they're essentially just trying to stream tokens out for arbitrary workloads. You have a lot more information about your workload. They're optimizing for one or two models, which means they can do things that just don't translate to regular people self-hosting. If I'm deploying just GPT-4, I'm able to make very specific optimizations that only work for that model, which wouldn't work anywhere else.

You, when you're self-hosting, are likely using cheaper, smaller GPUs. You're probably also using a range of GPUs, so not just one type, maybe a couple of different types. You want to be memory bound, not compute bound. You have lots of information about your workload. This is something that is very exciting: for most enterprises the workloads actually look similar, normally some kind of long-form RAG or maybe an extraction task, and you can make decisions based on that. You'll have to deal with dozens of model types; you just don't have the luxury those AI Labs have of optimizing for one or two. Those are some differences between the serving that you'll do and the serving that the AI Labs have to do.

Learning Best Practices for Self-Hosted AI Deployments in Corporate and Enterprise Environments

Best practices: there are literally an infinite number of best practices that I could give, but I've tried to boil them down to what is now six non-exhaustive tips for self-hosting LLMs. This isn't a complete guide to self-hosting, but hopefully there are some useful things that we've learned over the last couple of years that might be useful to you. The first one is, know your deployment boundaries and work backwards. Quantized models are your friend. Getting batching right really matters. Optimize for your workload; that goes back to what I said earlier about being able to do optimizations that they just can't. Be sensible about the models you use. Then, finally, consolidate infrastructure within your org. These are the top tips that I'm going to talk through for the rest of the session.

1. Deployment Boundaries

Let's start with deployment boundaries. I'm assuming you don't have unlimited compute; even AI Labs don't have unlimited compute. It's a really scarce resource at the moment. You need to be very aware of what your deployment requirements are, and then work backwards from what you know about them. If you know you only have certain hardware available, maybe you're, for example, CPU bound, deploying completely on-prem with no GPUs, then you probably shouldn't be looking into deploying a Llama 405B-class model. Knowing that boundary is important from the get-go. You should have an idea of your target latency as well. You should have an idea of your expected load.

If you have all of these together, you can construct a sentence like: I would like to deploy on an instance cheaper than x, which will serve y concurrent users at an average latency of less than z. If you can form that sentence, everything else you have to do becomes much easier, and you won't hit that bottleneck of, "I built this great application. I just have no idea how to deploy it". I can use that information to work backwards and figure out what kind of models I should be looking at, and how much effort I should be putting into things like prompting and really refining my search techniques, rather than upgrading to bigger models.
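To make the boundary exercise concrete, here is a rough back-of-envelope sketch, not from the talk, of the kind of check that sentence implies; the 30% KV-cache headroom and the example GPU size are illustrative assumptions only.

```python
def fits_deployment(params_billion: float, bits_per_weight: int,
                    gpu_memory_gb: float, kv_headroom: float = 0.3) -> bool:
    """Rough check: do the (possibly quantized) weights plus KV-cache headroom fit?

    1B parameters at 8 bits is roughly 1 GB of weights; kv_headroom is an
    illustrative fudge factor for the KV cache and activations, not a measured value.
    """
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * (1 + kv_headroom) <= gpu_memory_gb

# "Deploy on an instance cheaper than x, serving y users at latency < z" starts with checks like:
print(fits_deployment(70, 4, 48))   # 70B at 4-bit: ~35 GB of weights, just about fits a 48 GB card
print(fits_deployment(70, 16, 48))  # 70B at fp16: ~140 GB of weights, clearly does not fit
```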

2. Quantize

Which leads me on to my second tip: you should pretty much always be using quantized models, for a couple of different reasons. The first reason is that a quantized model's accuracy is pretty much always better than that of a model whose native size matches the memory footprint you've quantized down to, so you retain a lot of accuracy. The second reason you should pretty much always quantize is that the accuracy loss compared to the original model isn't that large. I'll reference two pieces of research that show this. There was a paper that came out in 2023 by Tim Dettmers, amongst others, who's a bit of a legend in the field, called "The case for 4-bit precision". What he showed in this paper, in the headline figure, is that for a fixed total model size in bits, the accuracy of the model is far higher if you're using a quantized model.

We know that when we have a model with more parameters, as I scale up the parameters, the accuracy of the model goes up. What's interesting is, if I take one of those large models and quantize it down to a natively smaller size, it retains a lot of the accuracy it had to begin with, which is very good. I'm pretty much always going to get better performance using a quantized 70 billion parameter model than I will with a model that is natively that smaller size. This leads on to some great research that Neural Magic did. They showed, firstly, that if I quantize models, and here they measure accuracy recovery, which is when they take original-precision models and then quantize them, the quantized model pretty much retains all of the accuracy of the original, which is great. You get something like 99% accuracy recovery, which is awesome. You can see that even though there are slight dips in performance, for example, if I look at this 405B here with one of the quantized variants, there's a slight dip compared to the original, like a couple of basis points.

It's still far higher than the 70 billion parameter or the 8 billion parameter model. It retains much of the accuracy that you were looking for. If you know your deployment boundaries, and you don't have unlimited amounts of compute to call on, using the best quantized model that will fit within those boundaries is a great piece of advice. If I know this is the piece of infrastructure I'm working with, this is the GPU I'm working with, I can then say, ok, which is the biggest model that, when quantized down to 4-bit, is going to perform the best?

This example is at least a couple of months old now. Mixtral is not really a thing anymore, but you get the idea: I would rather use the Mixtral 4-bit, which is 22 gig, than the Llama 13B, because it's retained a lot of that performance. I've put at the bottom places where you can find quantized models. We maintain an open-source bank of quantized models that you can check out. TheBloke also used to be a great source of these, although he's not doing it as much recently.
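As a hedged illustration of running a quantized model, here is one common approach using Hugging Face Transformers with bitsandbytes 4-bit (NF4) quantization; the model name is only an example, and pre-quantized GPTQ/AWQ checkpoints from the sources above are an equally valid route.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-70B-Instruct"  # example only; pick what fits your hardware

# Quantize weights to 4-bit NF4 on load, computing in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across whatever GPUs are available
)

prompt = "Summarize the following clause: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```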

3. Batching Strategies

The third thing I'm going to talk about is batching strategies. This is something that's very important to get right, because it's a very easy way to waste GPU resources. When you're first deploying, you're probably not using any batching, which means you end up with GPU utilization that looks like this. It's not great. We then see people going straight to dynamic batching. What this means is I have a set batch size, so let's say my batch size is 16: I will wait until 16 requests have come in to process, or I'll wait a fixed amount of time, and you end up with this spiky GPU utilization, which is still not great either. What's far better if you're deploying generative models, and this is a very big piece of advice, is to use something like continuous batching, where you can get consistently high GPU utilization.

It's a state-of-the-art technique designed for batching generative models, and it allows new requests to interrupt long in-flight requests. The batching happens at the token level rather than at the request level. I could, for example, have my model generate the 65th token of one response, and then the fifth token of another response, and I end up with utilization that is far more even. This is just one example of the inference optimizations you can do that make a really big difference. Here, I think we've gone from about 10% utilization to near enough 80%. Really significant improvements. If you're in a regime where you don't have much compute resource, this is a very valuable thing to do.
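In practice you rarely implement continuous batching yourself; open-source engines such as vLLM do it for you at the token level. A minimal sketch, with an example model name, might look like this:

```python
from vllm import LLM, SamplingParams

# vLLM's scheduler admits and retires requests at token granularity (continuous batching),
# so a pile of prompts keeps the GPU busy without waiting for a fixed-size batch.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [f"Summarize document {i} in one sentence: ..." for i in range(64)]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```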

4. Workload Optimizations

Next, I'm going to talk about workload optimizations. This is something that we've been actively researching a lot, and we think it's really promising. You know something that the AI Labs don't know, which is what your workload looks like. That means you can make a lot of decisions based on your workload that it would just never make sense for them to make. I'm going to give you a couple of examples of techniques you can use that make sense if you know what your workload looks like, but don't make sense if you're serving big multi-tenant environments and multiple user groups. One of them is prefix caching. This is probably one of the things I'm most excited by that happened this year.

One of the most common use cases we see from our clients is situations where they have really long prompts, and often these prompts are shared between requests. Maybe I have very long-context RAG over a set document, where I just pass in the whole document and ask questions of it. Or maybe I have a very long set of user instructions to the LLM. These are instances where I have very long prompts. Traditionally, I'd have to reprocess that prompt on every single request, which is super inefficient. If you know what your workload looks like, and you know you're in a regime where you have a lot of shared prompts, or very long shared prompts, you can use something called prefix caching, which is essentially where you pre-compute the KV cache of the shared text and reuse it whenever that context comes up again in your next generation.

My LLM doesn't need to reprocess that long prompt every single time; it can process just the difference. If I have a very long shared prompt and then slightly different things each time, it can just process that difference and return. On the right, I have some results that are fresh off the press. What we have here is our server with prefix caching turned on and turned off. The green lines have it turned off: the light green line is with two GPUs, and the dark green line is with one GPU. The blue line has prefix caching turned on. What you can see is a very significant throughput improvement. It's throughput on the y-axis and batch size on the x-axis. That's about 7x higher throughput, which is very significant. It means you can process many more requests, or you can process those requests much more cheaply.
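The results above are from TitanML's own server, but the same idea is exposed in open-source engines too. A sketch with vLLM, where the flag name and model are assumptions you should check against your version:

```python
from vllm import LLM, SamplingParams

# Reuse the KV cache for the shared prefix (a long system prompt or a fixed document)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)  # example model

shared_context = "You are a contracts analyst. Here is the full agreement:\n<very long document>"
questions = ["What is the termination clause?", "Who are the counterparties?"]

params = SamplingParams(max_tokens=128)
outputs = llm.generate([f"{shared_context}\n\nQuestion: {q}" for q in questions], params)
# After the first request, only the differing suffix (the question) needs fresh prefill.
```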

Prefix caching would have basically no impact if you didn't know that you had a long shared prompt, and it doesn't really help if I'm serving dozens of different users at any one time. It only really works if I know what my use case looks like. Here's another example of caching that we're really excited by. We call it SSD internally; it's a form of speculative decoding, but you can also think of it like caching, and it only makes sense if you know what your workload looks like. This is for use cases where your model predicts similar tokens between tasks. Let's say I'm doing a task where I have phrases that are very frequently repeated; what I can do is essentially cache those very frequently repeated phrases.

Instead of computing the whole phrase every single time, token by token, I can just take a cache hit and run inference on that. We did benchmarking on this, and for a JSON extraction workload we got about a 2.5x latency decrease, which is pretty significant. Again, this would have literally no impact if I were using the OpenAI API, because there's no way they could cache your responses and 75 million other people's responses. It only really makes sense if you're deploying in your own environment.
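TitanML's SSD implementation isn't public, but the underlying idea, drafting frequently repeated phrases from a cache and letting the model verify them in one pass, can be sketched in toy form. Everything below is illustrative: verify_tokens and generate_one stand in for a real model's batched verification and single-token decoding.

```python
from typing import Callable, Dict, List

def cached_speculative_decode(prompt: List[int],
                              phrase_cache: Dict[int, List[int]],
                              verify_tokens: Callable[[List[int], List[int]], int],
                              generate_one: Callable[[List[int]], int],
                              max_new: int = 64) -> List[int]:
    """Toy sketch of cache-based speculative decoding.

    phrase_cache maps a trigger token to a frequently repeated continuation.
    verify_tokens(context, draft) returns how many draft tokens the real model
    agrees with in a single pass, so accepted phrases cost one forward pass
    instead of one pass per token.
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft = phrase_cache.get(out[-1], [])
        if draft:
            accepted = verify_tokens(out, draft)   # one batched verification pass
            out.extend(draft[:accepted])
            if accepted == len(draft):
                continue                           # whole cached phrase accepted
        out.append(generate_one(out))              # fall back to normal token-by-token decoding
    return out
```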

5. Model Selection

I have two more that I'm going to go through. This one is model selection. I think everyone knows this by now, but the only real reason I put it up is because last time people really liked the image. This is something I still see people getting wrong: they get their large language models, which are the most difficult and most expensive things to deploy, to do pretty much all of the workload. That's not a good idea, because you then have more of this difficult, expensive thing to manage. There are a lot of parts to an enterprise RAG pipeline. In fact, the most important part of your enterprise RAG pipeline is not the LLM at all; the most important part, by far, is your embedding and retrieval stage, and that part doesn't need GPT-4-level reasoning, obviously.

If you have really good search and retrieval pipelines, you can actually use much smaller models and get the same results. We've seen clients move from GPT-4-type setups to things as small as the 2 billion parameter class of Llama models, once they get the retrieval part right, which means you can run the whole application very cheaply. I would advise people to check out the Gemma and the small Llama models for this. We think they're very good, and very good for extractive workloads as well, especially if you're using structured JSON output via something like Outlines or LM-Format-Enforcer (LMFE). Pick your application design carefully and only use big models when you really need to.
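As a hedged example of pairing a small open model with structured JSON output, here is roughly what that looks like with Outlines (the 0.x-style interface; the API has shifted between versions, and the model name and schema are just illustrations):

```python
from pydantic import BaseModel
import outlines  # Outlines 0.x-style interface; newer releases expose a different API

class Invoice(BaseModel):
    vendor: str
    invoice_number: str
    total_usd: float

# A small open model is often enough for extraction once the output is constrained to a schema
model = outlines.models.transformers("meta-llama/Llama-3.2-3B-Instruct")  # example model
extract = outlines.generate.json(model, Invoice)

doc = "Invoice #4821 from Acme Corp, total due $1,240.00 ..."
invoice = extract(f"Extract the invoice fields from this text:\n{doc}")
print(invoice.vendor, invoice.total_usd)
```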

6. Infrastructure Consolidation

Finally, I'm going to talk about infrastructure consolidation. In traditional ML, most teams unfortunately still work in silos: each team deploys its own model and is responsible for serving it. In GenAI, it does not make sense to do that, and that's why these API providers are making an absolute killing, because it does make sense to deploy once, manage centrally, and then provide the model as a shared resource within the organization. If you do this, it enables your developers to focus on building at the application level rather than building infrastructure. When I speak to my clients, sometimes we talk about building an internal version of the OpenAI API. That's the experience you want to be able to provide to your developers, because otherwise they'll have to learn what prefix caching is every single time they want to deploy a model, which is just not an efficient use of time.
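Concretely, the "internal OpenAI API" experience usually means exposing an OpenAI-compatible endpoint from whatever serving stack the central team runs, and letting application teams use the standard client against it. A sketch, where the gateway hostname, token, and model alias are all hypothetical:

```python
from openai import OpenAI

# Point the standard OpenAI client at the internally hosted, OpenAI-compatible gateway.
# (vLLM and several other servers can expose this API; "llm-gateway.internal" is made up.)
client = OpenAI(base_url="http://llm-gateway.internal/v1", api_key="internal-token")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # whatever alias the central team publishes
    messages=[{"role": "user", "content": "Summarize this incident report: ..."}],
)
print(response.choices[0].message.content)
```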

I think this kind of centralization is a really exciting opportunity for MLOps teams to take ownership within the org. It still means you can understand the kinds of workloads you work with and optimize for, for example, long-context RAG situations. That's still something you can do. You can actually end up making a cheaper and better version of something like the OpenAI API. Serving is hard. Don't do it twice, or even three times. We've seen clients who do it over and over again: you'll have one team deploying with Ollama, another team deploying with vLLM, each trying to do it themselves. It's much better to have a central team manage that. GPUs are very expensive, don't waste them.

I can talk about what this would actually look like in practice. With this central consolidation of infrastructure, you can think of it as having an internal Bedrock, or an internal OpenAI API, or something like that. Individual ML teams can work from these APIs and then build applications on top of them using techniques like LoRA, which is a form of fine-tuning, or things like RAG, plugging into it the same way they would use the OpenAI API. We have a quick case study: a client of ours did this. They started off with a bunch of different models. Each app was essentially attached to its own model and its own GPU setup. This is obviously not ideal because, at the time, it was a Mixtral, and we were deploying it multiple times, over and over again, wasting GPU resources.

Instead, what we did is pool it into one and deploy that Mixtral on centrally hosted infrastructure, which meant we could use much less GPU resource than we were using otherwise. A pattern I like, if you're doing this central deployment and central hosting, is to give your teams access to a couple of different models of different sizes; give them optionality. Don't say something like, you can only use this particular model, because they want to be able to try things. Deploying a large, a medium, and a small model that you've checked out and are happy with, along with a bunch of those auxiliary models, so table parsers, embedding models, reranker models, is a really good place to start. If you do that, you give your teams optionality, and you still keep the benefit of being able to consolidate that infrastructure and make those optimizations.

Those are my six non-exhaustive tips for self-hosting LLMs. Know your deployment boundaries and work backwards; if you know them, everything else becomes much easier. Quantized models are your friends. They might seem scary because they change the model, but on the whole they're actually much better. Getting batching right really matters for your GPU utilization. Optimize for your particular workload. You have a superpower that the API providers do not: you know what your workload will look like. Be sensible about the models you use. Then, finally, consolidate infrastructure within your organization.

Summary

We've gone through why you should self-host. There are a bunch of infrastructural reasons why self-hosting is a great idea: deploying at scale, using embedding models and reranker models, or, for example, using domain-specific models. All very good reasons to self-host. Your deployment needs will be different to the needs of AI Labs and mass AI API providers, because you're using different types of hardware and your workloads are different. I've also given six tips, best practices that we've learned over the last couple of years, for deploying LLMs in this kind of environment.

Questions and Answers

Luu: For those tips that you gave, if I have a bursty workload, are any of those tips more relevant than others?

Arik: If you have a bursty workload, it's rough, because there's always going to be some cold start problem when you're deploying these. Bursty workloads are slightly less well suited to self-hosting, unless you're deploying smaller models. If you're deploying smaller models, scaling up and down is much easier, but if you're deploying bigger models, consider whether you actually need to self-host or whether you can use other providers. Within our infrastructure, we have obviously built Kubernetes resources that can do that scaling up and scaling down.

If you have bursty workloads with very low latency requirements, that is just challenging. What you could do, and what we have seen people do, is migrate to smaller and cheaper models during periods of high burst and then go back to the higher-performance models when things are a bit more chill. You could imagine a regime where you have a 70 billion parameter model servicing most of the requests, and then, in periods of very high load, while you're waiting to scale up, you move to something like an 8 billion parameter model or something slightly smaller. You take a small accuracy hit, and then go from there. That's something we've seen work as well.

Participant 1: How would you architect a central compute infra if each downstream application requires different fine-tuning, where you can't really have an internal OpenAI API-like serving infrastructure?

Arik: That's very difficult if you want to do something like full fine-tuning, which is what we've been used to over the last couple of years, where you essentially fine-tune the whole model. What that means is, every single time you fine-tune the model, you have to deploy a separate instance of it. Fortunately, over the last two years there have been advancements in PEFT (parameter-efficient fine-tuning) methods, which let you fine-tune just a small subsection of the model rather than the whole thing. Let's say I'm centrally hosting and I have 10 different teams, and every single team has made a fine-tuned version of a Llama model, for example. What I used to have to do is go in and deploy each of those on its own GPUs or shared GPUs. What I can do now instead is deploy one centrally hosted Llama model, and then deploy on the same GPU, in my same server, all of these different LoRA fine-tuned versions, the small adapters that we add to the base model.

At inference time, we can hot swap them in and out. It's like you're calling separate models, but actually it's one base model and these different adapters. That's really exciting, because it means we can deploy hundreds if not thousands of fine-tuned, domain-specific models with the exact same infrastructure and resources we need to deploy one. That's the way I would go about it: PEFT methods. We prefer LoRA as a method, but there are others available.
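As one concrete, non-TitanML illustration of this pattern, vLLM can hold a single copy of the base model and apply a different LoRA adapter per request; the adapter names and paths here are hypothetical:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One copy of the base model weights on the GPU, many per-team adapters layered on top
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)  # example base model
params = SamplingParams(max_tokens=128)

legal_adapter = LoRARequest("legal", 1, "/adapters/legal-lora")        # hypothetical adapter paths
support_adapter = LoRARequest("support", 2, "/adapters/support-lora")

print(llm.generate(["Flag risky clauses in: ..."], params, lora_request=legal_adapter))
print(llm.generate(["Draft a reply to this ticket: ..."], params, lora_request=support_adapter))
```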

Participant 2: You're speaking a lot about the models themselves, which is awesome, but in your diagrams, when you're referring to GPUs and all the work that you did, should we assume most of this is based on NVIDIA H100 or B100 Blackwell hardware, or did you actually go across Gaudi, and Instinct from AMD, and so on? Just curious, because we're looking to build something, and support for all models is not ubiquitous right now. Where did you stop and start with how far your suggestions go here, at the hardware level?

Arik: We very rarely see clients deploying on H100s or B100s. The kinds of clients we work with are enterprises and corporates, which might have access to previous generations of GPUs or maybe slightly cheaper GPUs. We tend to do most of our benchmarking on those kinds of GPUs. We support NVIDIA and AMD, and we're rolling out support for Intel as well. We have seen good results on AMD; we think they're a really interesting option. For Intel, we're waiting for the software support to catch up. We also know that Inferentia is doing interesting things, and TPUs are doing interesting things as well. There's definitely more out there than the B100s and H100s that you see.

For most enterprises, you really don't need them; you'll probably be fine with previous generations of NVIDIA hardware. My recommendation would be to stick with NVIDIA, mainly because the software stack is a bit more evolved, but to build your infrastructure and applications such that you can move to AMD or other providers in a couple of years, because that gap is going to close pretty quickly. You don't want to tie yourself to an NVIDIA-native everything.

Participant 2: We're using OpenAI off of Azure, so we're using NVIDIA, but I'm looking to build something out locally, because we'll balance loads between cloud and on-prem for development, or even just spill over extra load. I'm keeping my mind open to everything.

Arik: One of our selling points is that we are hardware agnostic, so we can deploy on not just NVIDIA, but others as well. The conversation always goes something like, can I deploy on AMD? We're like, yes. Do you use AMD? No, but I might want to. It's a good thing to keep your options open there.

Participant 3: Could you elaborate on some of the batching techniques?

Arik: Let's assume we're not going to do no batching, and we're going to do some kind of batching. With dynamic batching, the batching happens at the request level. Let's say my batch size is 16: I'm going to wait for 16 requests to come in, and then I'm going to process them. If I'm in a period of very low traffic, I'll also have a timeout, so maybe either 16 requests or a couple of seconds, and then I'll process that batch. What that means is I get this spiky workload where I'm waiting for long requests to finish, and so on. What we can do instead is continuous batching, which allows you to interrupt very long responses with other responses. This means we can do essentially token-level processing rather than request-level processing.

I might have a situation where I have a very long response, and I can interrupt that response every so often to process shorter responses as well. It means my GPU is always at very high utilization, because I'm constantly feeding it new things to do. We end up with continuous batching, which gives much better GPU utilization.

Luu: Just token level.

Participant 4: How does interrupting and giving it a different task improve GPU utilization? Why not just continue with the current task? That gives you peak utilization at all times.

Arik: Essentially, it's because otherwise you have to wait for that whole request to finish, and the GPU is able to process things in parallel, so it doesn't actually affect the quality of the output at all. It just means that on each pass, I can have it predict different tokens from different requests. It's constantly being fed with new things, rather than waiting for the longest response to finish and only working on that one response while we wait for a new batch.

Participant 4: You're suggesting I can put multiple requests in at the same time, in a single request to the GPU.

Arik: Exactly. What you end up with is that your latency, on average, is a little bit higher, but your throughput is much better. If I batch these requests together and process them together, I get much better throughput than if I were to process them one by one.

Participant 4: When I do give it a long-running request, what is the GPU doing if it's not being utilized 100%? What is taking that long? Why is the request taking that long while the GPU is not being fully utilized? What is going on in that setup?

Arik: When the GPU is not being utilized?

Participant 4: Yes. You're saying, I'm putting in a long-running request, so it is running. What is running if the GPU is not being utilized at the same time to maximum capacity?

Arik: It's essentially doing each request one by one, so I have to wait until the next one. There's a fairly good article, I think Baseten wrote it, which explains this well.

 


 

Recorded at:

Mar 28, 2025
