Transcript
Arik: I'm Meryem. I'm co-founder and CEO of TitanML. My background: I was originally a physicist, turned banker, then turned to AI, but I've always been really interested in emerging tech. At TitanML, we build the infrastructure that makes serving LLMs efficiently much easier. I'm going to frame today through a conversation that I had at a wedding last summer. No one really understands what we do, at least they didn't before ChatGPT came out. They're starting to now. I always find myself having to have this conversation over again. Fortunately, it wasn't actually me having this conversation. It was my co-founder who I was at the wedding with; we're all university friends. This is the conversation. Russell, he's also a friend of mine from university. He's a data scientist at a hedge fund. Really smart guy. This is Jamie. He is my co-founder. He's our chief scientist. He is essentially the person that makes our inference server really fast.
Outline
What I'm going to do is firstly explain why LLM deployment is hard, because a lot of people don't necessarily appreciate that it is. Then I'm going to give an assortment of tips, tricks, and techniques, seven that I landed on, for better LLM deployments.
Why is LLM (AI) Deployment Hard?
We'll start with this conversation. Typically, it's like, what have you been up to? Then he's like, I've been working on making LLM serving easier. Then he says, is LLM deployment even hard, don't I just call the OpenAI API? Then he's like, sort of. Because everyone, when they think of LLMs, just thinks of OpenAI. APIs are really easy to call. You might be like, why is she even here talking? Everyone here knows how to call the OpenAI API. However, there is more than one way to access LLMs. You can use hosted APIs. I have a bunch of them here: OpenAI, Cohere, Anthropic, AI21 Labs. These are all situations where they've done the hosting for you and they've done the deployment for you. All you have to do is call into them. I don't want to minimize it too much, because there's still complexity there. You still have to do things like hallucination reduction, but they've done a lot of the heavy lifting. For a lot of use cases, though, you might want to self-host. This is when you're calling into a Mistral, or you're hosting a Llama, or one of the others. Essentially, you're hosting it in your own environment, whether that's a VPC or an on-prem environment.
He's like, but why would I want to self-host anyway? To which we say, lots of reasons. There's broadly three reasons why you might want to self-host. Firstly, there's decreased cost at scale. It is true that if you're just doing proof of concepts, then OpenAI API based models are much cheaper. If you're deploying at scale, then self-hosting ends up being much cheaper. Why does it become much cheaper? Because you only have one problem to solve, which is your particular business problem. You're able to use much smaller models to solve the same problem. Whereas OpenAI, they're hosting a model that has to solve both coding and also writing Shakespeare, so they have to use a much bigger model to get the same output.
At scale, it's much cheaper to use self-hosted models. The second reason why you might want to self-host is improved performance. When you're using a task-specific LLM, or you've fine-tuned it, or you've done something to make it very narrow to your task, you typically end up getting much better performance. I have a couple of snippets here from various blogs; I think they're a bit old now, but the point still stands. Then the third reason, which is why most of our clients self-host, is privacy and security. If you're part of a regulated industry, maybe for GDPR reasons, or because of your compliance team, then you might have to self-host as well. These are the three main reasons why you would self-host. If these aren't important to you, use an API.
Typically, we find that the reasons why enterprises care about open source, and I have a couple of graphs here from a report by the VC a16z, are control, customizability, and cost. The biggest one by far is control: being able to have that AI independence, so that if OpenAI decides to fire its CEO again, you will still have access to your models. That's important, especially if you're building really business-critical applications. The majority of enterprises also seem to agree that these reasons are important to them. The vast majority of enterprises, all but 18%, expect to shift to open source, either now or when open source matches the performance of a GPT-4 quality model. If you are looking to self-host, you are very much not alone, and most enterprises are looking to build up that self-hosted capability.
Russell, he works at a hedge fund, he's like, privacy is really important for my use case, so it makes sense to self-host. How much harder can it really be? I hear this all the time, and it infuriates me. The answer is a lot harder. You really shouldn't ignore the complexity that you can't see. When you call an API based model, you benefit from all of the hard work that their engineers have done under the hood to build that inference and serving infrastructure. In fact, companies like OpenAI have teams of 50 to 100 managing this infra. Things like model compression, like Kubernetes, batching servers, function calling, JSON forming, runtime engines, are all the things you don't have to worry about when you're using the API based model, but you do suddenly have to worry about when you're self-hosting.
He's like, but I deploy ML models all the time. You might have been deploying XGBoost models or linear regression models in the past. How much harder can it really be to deploy these LLMs? To which we say, do you know what the first L stands for? It's way harder to deploy these models. Why? The first L in LLM stands for large. I remember when we started the company, we thought a 100 million parameter BERT model was large. Now a 7 billion parameter model is considered small, but that is still 14 gig, and that is not small. GPUs are the second reason why it is much harder. GPUs are much harder to work with than CPUs. They're also much more expensive, so using them efficiently really matters. It doesn't matter so much if you don't use your CPUs super efficiently, because they're a couple orders of magnitude cheaper.
That cost, latency, performance tradeoff triangle that we sometimes talk about is really stark with LLMs in a way that it might not have been previously. The third reason why it's really hard is that the field is evolving crazy fast. Half of the techniques that we use to serve, deploy, and optimize models didn't exist a year ago. Another thing that I don't have here, but is worth mentioning, is the orchestration element. Typically, with these large language model applications, you have to orchestrate a number of different models. RAG is a perfect example of this. In the very classic setup, you have to orchestrate an embedding model and a generation model. If you're doing state-of-the-art RAG, you'll probably need a couple of models for your parsers, maybe an image model and a table model, and then you'll need a reranker. Then you end up with five or six different models. That gets quite confusing. Plus, there are all the other reasons why deploying applications is hard, like scaling and observability.
Tips to Make LLM Deployment Less Painful
He then says something like, that sounds really tricky. What can I do? Then Jamie says, "Luckily, Meryem has some tips and tricks that make navigating LLM deployment much easier." That's exactly what he said. We'll go through my tips to make LLM deployment less painful. It'll still suck, and it'll still be painful, but it might be less painful.
1. Know Your Deployment Boundaries
My first tip is that you should know your deployment boundaries. You should know your deployment boundaries when you're building the application. Typically, people don't start thinking about their deployment boundaries until after they've built an application that they think works. We think that you should spend time thinking about your requirements first. It'll make everything else much easier. Thinking about stuff like, what are your latency requirements? What kind of load are you expecting? Are you going to be deploying an application that might have three users at its peak, or is this going to be the kind of thing like DoorDash, where you're deploying to 5 gazillion users? What kind of hardware do you have available? Do you need to deploy on-prem, or can you use cloud instances? If you have cloud instances, what kind of instances do you have to have?
All of these are the kind of things that you should map out before. You might not know exactly, so it's probably a range. It is acceptable if my latency is below a second, or above X amount. It's just good things to bear in mind. Other things that I don't have here is like, do I need guaranteed JSON outputs? Do I need guaranteed regex outputs? These are the kinds of things that we should bear in mind.
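As a concrete sketch, these boundaries can be written down as data before any model is even chosen. The field names below are hypothetical, not from any particular framework, and the values are just the kind of ranges discussed above:

```python
from dataclasses import dataclass

@dataclass
class DeploymentBoundaries:
    """Hypothetical checklist of deployment requirements, filled in
    before application development starts."""
    max_latency_s: float           # e.g. "acceptable if below a second"
    peak_concurrent_users: int     # three users, or DoorDash scale?
    vram_gb: int                   # hardware actually available to you
    on_prem_required: bool         # compliance may force this
    guaranteed_json: bool = False  # structured-output requirement

    def fits_model(self, model_vram_gb: float) -> bool:
        # A model is only a candidate if its weights fit the hardware budget
        return model_vram_gb <= self.vram_gb
```

Writing the boundaries down like this makes the later model-selection step mechanical: any candidate model either fits the budget or it doesn't.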
2. Always Quantize
If you have these mapped out, then all of the other decisions will be made much easier. This goes on to my next point, which is, always quantize. I'll tell you why it links to my first point. Who knows who Tim Dettmers is? This guy is a genius. Who knows what quantization is? Quantization is essentially model compression. It's when you take a large language model and you reduce the precision of all of the weights to whatever form you want. 4-bit is my favorite form of quantization, going down from FP32. The reason why it's my favorite is because it's got a really fantastic accuracy-compression tradeoff. You can see here we have accuracy versus model bits, so the size of the model. Let's say the original is FP16. It's actually not, it's normally 32.
That's your red line there. We can see that when we compress the model down, for a given resource size, the FP16 red line is actually the worst tradeoff. You're way better off using an FP8 or an INT4 quantized model. What this graph is telling you is that for a fixed resource, you're way better off having a quantized model of the same size than the unquantized model. We start with the infra and we work backwards. Let's say we have access to an L40S, and we have that much VRAM. Because I know the resources that I'm allowed, I can look at the models that I have available to me, and then work backwards. I have 48 gigs of VRAM. I have a Llama 13 billion, so that's 26 gigs. That's all good. That fits. I have a Mixtral, which is the current state of the art for open-source models. That's not going to work.
However, I have a 4-bit quantized Mixtral which does fit, which is great. I now know which models I can even pick from, and which I can start experimenting with. That graph from Tim Dettmers that I showed you earlier tells me that my 4-bit model will probably be better performing. Say my Llama was the same size: my 4-bit Mixtral will be better performing than my Llama model, because it retains a lot of the accuracy from when it was really big, before being compressed down. We start with our infra and work backwards. We essentially find the resources that we can fit in, and then find the 4-bit quantized model that'll fit in those resources. The chances are that's the best accuracy you can get for that particular resource budget.
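The arithmetic behind "13 billion, so that's 26 gigs" generalizes: at b bits per weight, an N-billion-parameter model needs roughly N × b / 8 gigabytes just for weights. Here is that rule of thumb as a sketch; the optional overhead factor for activations and KV cache is an assumption, and real usage varies with batch size and sequence length:

```python
def estimated_vram_gb(params_billion: float, bits_per_weight: int,
                      overhead: float = 1.0) -> float:
    """Weights-only VRAM estimate: 1B params at 8 bits is roughly 1 GB.
    Pass overhead > 1.0 to budget for activations and KV cache."""
    return params_billion * bits_per_weight / 8 * overhead

# Worked through the talk's 48 GB L40S example (Mixtral 8x7B has ~46.7B
# total parameters):
print(estimated_vram_gb(13, 16))    # Llama 13B at FP16: 26 GB, fits
print(estimated_vram_gb(46.7, 16))  # FP16 Mixtral: ~93 GB, doesn't fit
print(estimated_vram_gb(46.7, 4))   # 4-bit Mixtral: ~23 GB, fits
```

Starting from the resource budget and running this backwards over candidate models is exactly the "infra first" workflow described above.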
3. Spend Time Thinking About Optimizing Inference
Tip number three: spend a little bit of time thinking about optimizing inference. The reason why I tell people to spend just a little bit of time optimizing inference is because the naive things that you would do when deploying these models are typically completely the wrong things to do. You don't need to spend a huge amount of time on this, but just spending a little bit of time can make multiple orders of magnitude of difference to GPU utilization. I can give one example of this: batching strategies. Essentially, batching is where multiple requests are processed in parallel. The most valuable thing you have when you're deploying these models is your GPU utilization. GPUs, as I said earlier, are really expensive, so it's very important that we utilize them as much as we can. If I'm doing no batching, then this is more or less the GPU utilization that I'll get, which is pretty bad. The naive thing to do would be either no batching or dynamic batching.
Dynamic batching is the standard batching method for non-Gen AI applications. It's the kind of thing that you might have built previously. The idea is that you wait a small amount of time before starting to process a request, group any of the requests that arrive during that time, and then process them together. With generative models, this leads to a drop in utilization. You can see that it starts really high and then goes down, because users get stuck in the queue waiting for longer generations to finish. Dynamic batching is something that you might try naively, but it actually tends to be a pretty bad idea.
This is a GPU utilization graph that we got maybe a couple of weeks ago. This is the state-of-the-art batching technique, designed for generative models. You let incoming requests interrupt in-flight requests in order to keep GPU utilization really high. You get much less queue waiting, and much higher resource utilization as well. You can see that going from there to there is maybe one order of magnitude difference in GPU costs, which is pretty significant. And I've not done anything to the model; nothing there will impact accuracy.
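A toy simulation can make the difference concrete. Assume each request needs a different number of decode steps and the GPU can run at most `capacity` requests per step; the numbers are purely illustrative, not benchmarks:

```python
def dynamic_batching_steps(lengths, capacity):
    """Dynamic batching: a batch is formed once and runs until its
    longest request finishes, so short requests hold a slot idle."""
    steps = 0
    for i in range(0, len(lengths), capacity):
        steps += max(lengths[i:i + capacity])  # batch blocked by its longest member
    return steps

def continuous_batching_steps(lengths, capacity):
    """Continuous batching: as soon as a request finishes, a queued
    request takes its slot on the very next step."""
    queue, running, steps = list(lengths), [], 0
    while queue or running:
        while queue and len(running) < capacity:
            running.append(queue.pop(0))             # admit new work immediately
        steps += 1
        running = [r - 1 for r in running if r > 1]  # one decode step each
    return steps

# One long generation plus nine short ones, two slots on the "GPU":
workload = [10] + [1] * 9
print(dynamic_batching_steps(workload, 2))     # short requests wait behind the long one
print(continuous_batching_steps(workload, 2))  # slots are refilled as they free up
```

In the dynamic case the nine one-step requests are repeatedly stuck behind the ten-step request; in the continuous case the freed slot is refilled every step, which is exactly the mechanism that keeps the utilization graph flat at the top.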
Second example I can give you is with parallelism strategies. For really large models, you often can't inference them on a single GPU. For example, a Llama 70 billion, or a Mixtral, or a Jamba, they're really hefty models. Often, I'll need to split them across multiple GPUs in order to be able to inference them. You need to figure out how you're going to do that multi-GPU inference. The naive way to do this, which is actually probably the most popular way, and what common inference libraries like Hugging Face's Accelerate do, is to split the model layer by layer. Say it's a 90-gigabyte model: I have 30 on one GPU, 30 on the second, and 30 on the third. At any one time only one GPU is active, which means that I'm paying for essentially three times the number of GPUs that I'm actually using at any one time.
That's just because I split them in this naive way, because my next GPU is having to wait for my previous GPU. That's really not ideal. This is what happens in the Hugging Face Accelerate library, if you want to look into that. Tensor parallelism is what we think is the best approach, which is, you essentially split the model lengthwise so that every GPU can be fully utilized at the same time for each layer. It makes inference much faster, and you can support arbitrarily large models as well, given enough GPUs. Because at every single point, all of your GPUs are firing, you don't end up paying for that extra resource. In this particular example, for this particular model, we got a 3x GPU utilization improvement. Combining that with the order of magnitude we had before, that's a really significant GPU utilization improvement. It doesn't take a huge amount of time to think about this, but if you just spend that little bit of time, then you might end up massively improving what you can put on those GPUs.
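The "split lengthwise" idea can be sketched in a few lines of plain Python: shard the weight matrix of one linear layer by columns, let each "device" compute its slice of the output, and concatenate the slices, which in a real system would be an all-gather. This shows only a single column-parallel layer; production tensor parallelism (Megatron-LM style) interleaves column- and row-parallel layers to minimize communication:

```python
def matmul(x, W):
    """Reference dense matmul y = x @ W over nested lists."""
    return [[sum(xi * wij for xi, wij in zip(row, col)) for col in zip(*W)]
            for row in x]

def column_shards(W, n):
    """Split W's columns into n near-equal shards, one per 'device'."""
    cols = list(zip(*W))                 # transpose: list of columns
    k, r = divmod(len(cols), n)
    shards, start = [], 0
    for i in range(n):
        size = k + (1 if i < r else 0)
        shard_cols = cols[start:start + size]
        shards.append([list(row) for row in zip(*shard_cols)])  # back to row-major
        start += size
    return shards

def tensor_parallel_matmul(x, W, n_devices):
    """Each shard's matmul would run on its own GPU simultaneously;
    concatenating per row stands in for the all-gather."""
    partial = [matmul(x, s) for s in column_shards(W, n_devices)]
    return [sum(rows, []) for rows in zip(*partial)]
```

The key property is that every shard is busy for every layer, unlike the layer-by-layer split where two of three GPUs sit idle at any moment.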
4. Consolidate Infrastructure
What have I done so far? I've done, think about your deployment requirements, quantize, inference optimization. The fourth one is, consolidate your infrastructure. Gen AI is so computationally expensive that it really benefits from consolidation of infrastructure, and that's why central MLOps teams like the one Ian runs make a lot of sense. For most companies, ML teams tend to work in silos, and therefore are pretty bad at consolidation of infrastructure. That wasn't really a problem for previous generations of ML workloads. LLM deployment is really hard, so it's better if you deploy once, have one team managing that deployment, and then maintain it, rather than having each team deploy individually, because then each team has to rediscover all of these tradeoffs for itself. What this allows is for the rest of the org to focus on application development while the infrastructure is taken care of.
I can give you an example of what this might look like. I will have central compute infrastructure, and maybe as a central MLOps team, I've decided that my company can have access to these models: Llama 70B, Mixtral, and Gemma 7B. I might periodically update and improve the models. For example, when Llama 3 comes out, I might swap it in for Llama 2. These are the models that I'll host centrally. Then all of those little yellow boxes are my application development teams. They're my dispersed teams within the org. Each of them will be able to get access to my central compute infrastructure, and personalize it in the way that works for them. One of them might add a LoRA, which is essentially a little adapter that you can add to your model when you fine-tune it. It's very easy to train, and also very easy to add into inference. Then maybe I'll add RAG as well. RAG is when we give the model access to our proprietary data, in a vector store, for example.
I have each of my application teams building LoRAs and RAG pipelines. Maybe I don't even need LoRAs, and I can just do prompt engineering, for example, and my central compute is all managed by one team, and it's just taken care of. The nice thing about this is that you're giving your organization the OpenAI experience, but with private models. If I'm an individual developer, I don't think about the LLM deployment. Another team manages it. It sits there, and I just build applications on top of the models we've been given access to. This is really beneficial. Things to bear in mind: make sure your inference server is scalable. LoRA adapter support is super important if you want to allow your teams to fine-tune. If you do all of this, you'll get really high utilization of GPUs. Because, remember, GPU utilization is literally everything. I say literally everything. There's your friends, and there's your family, and then there's GPU utilization. If we centrally host this compute, then we're able to get much higher utilization of those very precious GPUs.
I can give you a case study that we did with a client, RNL, a U.S. enterprise. What they had before was four different Gen AI apps. They were pretty ahead at the time. They built all of this last year. Each app was sitting on its own GPU, because they were all different applications. They've all got their own embedders, their own thing going on. They gave them each their own GPUs, and as a result got really poor GPU utilization, because not all the apps were firing all the time. They weren't all firing at capacity. What we did with them is something like this. It doesn't have to be Titan, it can be any inference server. They had Mixtrals and embedders, essentially, is all they had. We hosted a Mixtral and an embedder on one server and exposed those APIs. The teams then built on top of those APIs, sharing that resource. Because they were sharing the resource, they could approximately halve the number of GPUs that they needed. We were able to manage both the generative and the non-generative models in one container. It was super easy for those developers to build on top of. That's the kind of thing that a central MLOps team lets you do, and you end up saving a lot of that GPU time.
5. Build as if You Are Going to Replace the Models Within 12 Months
My fifth piece of advice is, build as if you're going to replace the models within 12 months, because you will. One of our clients deployed their first application with Llama 1 last year. I think they've changed the model about four times since. Every week they're like, this new model came out. Do you support it? I'm like, yes, but why are you changing it again? Let's think back to what state of the art was a year ago. A year ago, maybe Llama had just come out; before that, it might have been the T5s. The T5 models were the best open-source models. What we've seen since is this amazing explosion of the open-source LLM ecosystem. It was all started by Llama and then Llama 2, and loads of businesses have built on top of that.
For example, the Mistral 7B was actually built with the same architecture as Llama. We had Falcon out of the UAE. We had Mixtral by Mistral. You have loads of them, and they just keep on coming out. In fact, if you check out Hugging Face, which is where all of these models are stored, and look at their leaderboard of open-source models, the top model changes almost every week. Latest and greatest models keep coming out. These models are going to keep getting better. This is the performance of all models, both open source and non-open source, as you can see from the license, proprietary or non-proprietary. The open-source models are just steadily climbing that leaderboard. We're starting to get close to parity between open source and non-open source. Right now, the open-source models are there or thereabouts with GPT-3.5. That was the original ChatGPT that we were all amazed by.
My expectation is that we'll get to GPT-4 quality within the next year. What this means is that you should really not wed yourself to a single model or a single provider. Going back to that a16z report that I showed you earlier, most enterprises are using multiple model providers. They're building their inference stack to be interoperable, so that if OpenAI has a meltdown, I can swap it out for a Llama model. Or, if Claude is now better than GPT-4, as it currently is, I can swap them really easily. Building with this interoperability in mind is really important. I think one of the greatest things that OpenAI has blessed us with is not necessarily their models, although they are really great. Counterintuitively, they have democratized the AI landscape, not because they've open sourced their models, because they really haven't, but because they've provided uniformity of APIs to the industry. If you build with the OpenAI API in mind, then you'll be able to capture a lot of that value and swap models in and out really easily.
What does this mean for how you build? API and container-first development makes life much easier. It's fairly standard things. Abstraction is really good, so don't spend time building custom infrastructure for your particular model. The chances are you're not going to use it in 12 months. Try and build more general infra if you're going to. We always say that at this current stage where we're still proving value of AI in a lot of organizations, engineers should spend their time building great application experiences rather than fussing with infrastructure. Because right now, for most businesses, we're fortunate enough to have a decent amount of budget to go and play and try out this Gen AI stuff.
We need to prove value pretty quickly. We tend to say, don't work with frameworks that don't have super wide support for models. For example, don't work with a framework that only works with Llama, for example, because it'll come back to bite you. Whatever architecture you pick or infrastructure you pick, making sure that when Llama 3, 4, 5, Mixtral, Mistral comes out, they will help you adopt it. I can go back to this case study that I talked about before. We built this in a way, obviously, that it's super easy to swap that Mixtral for Llama 3, when Llama 3 comes out. For example, if a better Embedder comes out, like a really good Embedder came out a couple weeks ago, we can swap that out easily too.
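One way to make the swap a one-line change is to keep model choice out of application code entirely and behind a small registry. The sketch below is a hypothetical pattern, not any particular framework's API; the gateway URL and model names are placeholders:

```python
# Application code asks for a logical role ("chat", "embed"); which
# backend serves that role lives in config. Swapping Mixtral for
# Llama 3 later means editing MODEL_REGISTRY, not the application.
MODEL_REGISTRY = {
    "chat":  {"base_url": "http://llm-gateway.internal/v1", "model": "mixtral-8x7b-4bit"},
    "embed": {"base_url": "http://llm-gateway.internal/v1", "model": "bge-large"},
}

def resolve(role: str) -> tuple:
    """Return (base_url, model_id) for a logical role."""
    cfg = MODEL_REGISTRY[role]
    return cfg["base_url"], cfg["model"]
```

Because most self-hosted inference servers expose an OpenAI-compatible endpoint, the same client code can then be pointed at whatever `base_url` the registry returns, regardless of which model currently sits behind it.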
6. GPUs Look Really Expensive, Use Them Anyway
My sixth one: GPUs look really expensive, but you should use them anyway. GPUs are phenomenally well suited to Gen AI workloads. Gen AI involves doing a lot of calculations in parallel, and that happens to be the thing that GPUs are incredibly good at. You might look at the sticker price and think, it's 100 times more expensive than a CPU. Yes, it is, but if you use it correctly and get the utilization you need out of it, then you'll end up processing orders of magnitude more, and per request, it will be much cheaper.
7. When You Can, Use Small Models
When you can, use small models. GPT-4 is king, but you don't get the king to do the dishes. What do I mean by that? GPT-4 is phenomenal. It's a genuinely remarkable piece of technology, but the thing that makes it so good is also that it is so broad in terms of its capabilities. I can use the GPT-4 model to write love letters, and you can use it to become a better programmer, and we're using the exact same model. That is mental. That model has so many capabilities, and as a result, it's really big. It's a huge model, and it's very expensive to inference. What we find is that you tend to be better off using GPT-4 for the really hard stuff that none of the open-source models can do yet, and then using smaller models for the things that are easier. You can massively reduce cost and latency by doing this. Remember the latency budget, or those resource budgets, we talked about earlier: you can go a long way to maximizing them if you only use GPT-4 when you really have to.
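A sketch of this idea is a cost-aware router that sends a request to a small self-hosted model unless the request looks hard. The difficulty heuristic below (length plus a keyword check) is deliberately crude and purely illustrative, and both model names are placeholders; real routers often use a classifier or a cheap model's own confidence:

```python
# Hypothetical markers of "hard" requests that justify the big model.
HARD_HINTS = ("prove", "multi-step", "legal analysis")

def route(prompt: str, hard_hints=HARD_HINTS) -> str:
    """Return the model to use: the expensive frontier model only when
    the request looks hard, a small local model otherwise."""
    hard = len(prompt) > 2000 or any(h in prompt.lower() for h in hard_hints)
    return "gpt-4" if hard else "small-local-7b"
```

Even a rough router like this keeps the bulk of traffic on the cheap, low-latency model, and only the residual hard cases pay frontier-model prices.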
Three common examples. The first is RAG Fusion: your query is rewritten into multiple queries by a large language model, all of those queries are searched against, and then the results are reranked to improve the search quality. For something like that, you can get very good results without GPT-4, only reaching for GPT-4 when you have to. Second, you might, with RAG, use a generative model just to do the reranking, so just to check at the end that the thing my embedder said was relevant really was relevant. Third, small models, especially fine-tuned ones, are really good for things like function calling. One of the really common use cases for function calling is needing my model to output something like JSON, or text matching a regex. There are broadly two ways I could do this: either I could fine-tune a much smaller model, or I could add controllers to my small model. A controller is really cool. A controller is essentially when, if I'm self-hosting the model, I ban my model from emitting any tokens that would break a JSON schema or a regex pattern that I've specified. For stuff like that, which actually covers the majority of enterprise use cases, you don't necessarily need those API-based models, and you can get really immediate cost and latency benefits.
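The controller idea can be sketched as a filter over candidate tokens: at each decode step, any token that would break the required format is banned. The toy below stands in for a real JSON-schema or regex engine by only allowing digit output, and it scans a short candidate list for clarity; real constrained-decoding systems instead mask logits over the whole vocabulary before sampling:

```python
import re

def constrained_pick(candidates, prefix, pattern=r"[0-9]*"):
    """Return the highest-ranked candidate token whose addition keeps
    the output matching the pattern, or None if every candidate is
    banned. Candidates are assumed sorted by model score."""
    for tok in candidates:
        if re.fullmatch(pattern, prefix + tok):
            return tok
    return None

# The model's top pick "abc" would break the digits-only format,
# so the controller falls through to the next valid token.
print(constrained_pick(["abc", "4", "2"], prefix="1"))
```

Because the constraint is enforced mechanically at decode time, even a small model can be guaranteed to emit valid structured output, which is exactly why this pairs so well with the "use small models" advice above.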
Summary
Figure out your deployment boundaries and work backwards. Because you know your deployment boundaries, you know to pick the model that, once quantized down, is the right size. Spend time thinking about optimizing inference, because that can make a difference of genuinely multiple orders of magnitude. Gen AI benefits from consolidation of infrastructure, so try to avoid having each team responsible for its own deployments, because it will probably go wrong. Build as if you're going to replace your model in 12 months. GPUs look expensive, but they're your best option. When you can, use small models. Then we said all of this to Russell, and he was like, "That was so helpful. I'm so excited to deploy my mission critical LLM app using your tips." Then we said, "No problem, let us know if you have any questions".
Questions and Answers
Participant 1: You said, build for flexibility. What are the use cases for frequent model replacements? Will the time and effort we have spent on custom fine-tuning, on custom data, have to be repeated? Do you have any tips for that in case of frequent model replacements?
Arik: When would you want to do frequent model replacement? All of the time. With the pace of LLM improvement, it's almost always the case that you can get better performance literally just by swapping out a model. You might need some tweaks to prompts, but typically, just doing a one-to-one switch works. For example, if I have my application built on GPT-3.5 and I swap it out for GPT-4, even if I'm using the same prompt, the chances are my model performance will go up, and that's a very low effort thing to do. How does that square with things like the engineering effort required to swap? If it's a months-long process and the improvement isn't significant, then you shouldn't make the switch. What I would suggest is trying to build in a way where it's not a months-long process and can actually be done in a couple of days, because then it will almost always be worth the switch.
How does that square as well with things like fine-tuning? I have a spicy take, which is, for the majority of use cases, you don't need to fine-tune. Fine-tuning was very popular in the deep learning era of a couple of years ago. As the models get better, they also get better at following your instructions. You tend not to need to fine-tune for a lot of use cases, and can get away with things like RAG, prompt engineering, and function calling. Speaking of swapping models, if you are looking for your first LLM use case, a really good one is to just try and swap out your existing NLP pipelines. A lot of businesses have preexisting NLP pipelines. If you can swap them for LLMs, typically you'll get a boost of multiple points of accuracy.
Participant 2: How do you see the difference, for on-prem hardware, between enterprise-grade hardware and maxed-out consumer hardware? I chose to go for maxed-out consumer hardware because you can get up to 6000 mega-transfers per second on the memory, and the PCIe lanes are faster.
Arik: Because people like him have taken all the A100s, when we do our internal development, we actually do it on 4090s, which is consumer hardware. They're way more readily accessible, much cheaper as well than getting those data center hardware. That's what we use for our development. We've not actually used consumer grade hardware for at-scale inference, although there's no reason why it wouldn't work.
If it works for your workload, use it. We use it as well. We think they're very good. They're also just much cheaper, because they're sold as consumer grade rather than data center grade.
Participant 3: You're saying that GPU utilization is the most important thing. I'm a bit surprised, but maybe my question will explain why. I made some proofs of concept with small virtual machines with only CPUs, and I got quite good results with a few requests per second. I didn't ask myself about scalability. At how many requests per second should we switch to GPUs?
Arik: Actually, maybe I was a bit strong on the GPU stuff, because we've deployed on CPU as well. If the latency is good enough, and latency is typically the first complaint people have, then CPU is probably fine. It's just that when you're looking at economies of scale and scaling up, CPUs will almost always be more expensive per request. If you have a reasonably low number of requests, and the latency is fine, then you can get away with it. I think one of our first proofs of concept with our inference server was done on CPU. One thing to note is that you'll be limited in the size of model that you can go up to. For example, if you're doing a quantized 7 billion model, you can probably get away with CPU as well. I think GPU is better if you are starting from a blank slate. If you're starting from a point where you already have a massive data center filled with CPUs and you're not otherwise using them, it is still worth experimenting to see whether you can utilize them.
Participant 4: I have a question regarding the APIs that are typically used, and of course, it's OpenAI's API that is typically used by applications. I also know a lot of people who do not really like the OpenAI API. Do you see any other APIs around? Because a lot of people are either emulating it or just using it, but no one really likes it.
Arik: When you say they don't like it, do they not like the API structure, or don't like the models?
Participant 4: It is about the API structure. It is about documentation. It is about state, about a lot of things that happen that you can't fully understand.
Arik: We also didn't really like it, so we wrote our own API that our inference server exposes, and then we have an OpenAI-compatible layer, because most people are using that structure. You can check out our docs and see if you like it better. I think because OpenAI's was the first one to really blow up, it's what the whole industry converged to when it comes to API structure.