Key Takeaways
- Businesses decide to self-host for three main reasons: privacy & security, improved performance, decreased cost at scale.
- Self-hosting is hard for three reasons: model size, expensive GPUs, and a rapidly evolving field.
- To address model size, quantize. For a fixed model size budget, you will almost always get better performance by using larger models that have been quantized down to 4-bit, rather than using a full-precision version.
- Optimizing inference by using batching and parallelism can provide significant GPU efficiency gains.
- Future-proof your application by using abstractions, tools, and frameworks that treat LLMs as building blocks that can be interchanged easily.
This article is part of the Practical Applications of Generative AI article series, where we present real-world solutions and hands-on practices from leading GenAI practitioners.
When most people think of Large Language Models (LLMs), they may think of one of OpenAI’s models. These are large, capable models which are hosted on OpenAI’s servers and invoked using a web-based API. These API-based models are a way to quickly experiment with LLMs.
However, it is also possible for a business to deploy its own LLM. Deploying, or self-hosting, an LLM is challenging. It's not as simple as calling the OpenAI API. The first question you might ask is: if self-hosted LLM deployments are so difficult, why bother? Businesses decide to self-host for three main reasons:
- Privacy & Security: Deploying within their own secure environment (VPC or on-premise).
- Improved Performance: State-of-the-art models for many domains require self-hosting, especially in the Retrieval-Augmented Generation (RAG) domain.
- Decreased Cost at Scale: While API-based models might seem cheap initially, self-hosting can be much more cost-effective for large-scale deployments.
A report by A16Z revealed that 82% of enterprises intend to build self-hosted capabilities. Given this, the next question is: why is self-hosting so hard? Here are three reasons:
- Model Size: Large language models are massive. A seven-billion-parameter model is considered small, but it really isn't: at 16-bit precision (two bytes per parameter) it consumes roughly 14GB of memory.
- Expensive GPUs: GPUs are a scarce and costly resource, making efficient utilization crucial.
- Rapidly Evolving Field: The field is advancing quickly, requiring future-proof deployments.
Given these challenges, here are seven tips and tricks for developing and deploying self-hosted LLM applications:
1. Figure out your production requirements and work backwards
We often find that teams really struggle to put their AI-powered applications into production because they don’t consider the production stage until the very end. However, we suggest to our clients that they think about their requirements up front and then evaluate the best way to build given those constraints.
Specifically, we suggest that our clients evaluate:
- Latency Requirements: Real-time or batch processing?
- Expected Load: Serving 10 or 10,000 concurrent users?
- Hardware Availability: Specific hardware needs, especially for on-premise deployments.
With these requirements well understood, teams can then make transparent decisions about the best way to build within those constraints.
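To make this concrete, here is a deliberately rough sketch of working backwards from hardware to a model-size budget. The GPU memory size, overhead factor, and model sizes are illustrative assumptions, not recommendations; plug in your own numbers.

```python
# Back-of-the-envelope sizing: does a given model fit on a given GPU?
# All numbers below are illustrative assumptions.

def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate memory for the weights alone: parameters x bytes per parameter."""
    return params_billion * 1e9 * (bits / 8) / 1e9

GPU_MEMORY_GB = 24   # e.g. a single 24 GB card; replace with your actual hardware
OVERHEAD = 1.2       # rough allowance for KV cache and activations

for params_b in (7, 13, 70):
    for bits in (16, 4):
        need = weight_memory_gb(params_b, bits) * OVERHEAD
        fits = "fits" if need <= GPU_MEMORY_GB else "does not fit"
        print(f"{params_b}B @ {bits}-bit needs ~{need:.0f} GB -> {fits} in {GPU_MEMORY_GB} GB")
```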
2. Always quantize
Given that you don’t have unlimited hardware availability (most businesses don’t) and you will have production requirements (most teams do), you will almost always be better off using quantized versions of LLMs rather than the unquantized versions. In his December 2022 paper "The case for 4-bit precision: k-bit Inference Scaling Laws", Tim Dettmers showed that for a fixed model-bits budget, you will almost always get better performance by using a larger model quantized down to 4-bit than by using a smaller model at full precision.
Figure 1. Bit-level scaling laws for mean zero-shot performance across four datasets for 125M to 176B parameter OPT models. Zero-shot performance increases steadily for fixed model bits as we reduce the quantization precision from 16 to 4 bits. At 3-bits, this relationship reverses, making 4-bit precision optimal. Source: Dettmers, The case for 4-bit precision: k-bit Inference Scaling Laws
Figure 2. Representation of how to select which model to try out given your hardware requirements
Given that you have figured out your hardware requirements (see tip 1), you can work backwards to find the largest model that, once quantized to 4-bit, fits within your boundaries. To find 4-bit quantized versions of popular LLMs, check out TitanML’s page on HuggingFace.
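As a minimal sketch of what this looks like in practice, the snippet below loads a model in 4-bit using Hugging Face transformers with bitsandbytes. It assumes a CUDA GPU and the bitsandbytes package are available, and the model name is only an example - substitute the quantized model you selected for your budget.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit (NF4) quantization config; assumes a CUDA GPU and bitsandbytes installed.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # example model; swap in your choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",   # places the quantized weights on the available GPU(s)
)

inputs = tokenizer("Self-hosting an LLM is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```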
3. Spend time thinking about optimizing inference
GPUs are expensive, so deploying self-hosted language models can look like a very expensive endeavor. However, with a little bit of effort thinking about how to optimize inference, deployment can become a lot easier, GPU utilization can be a lot higher, and compute costs a lot lower. There are many different ways that you can optimize inference, but I will give just two examples to demonstrate what a huge impact it can have on GPU utilization.
Batching
There are various batching strategies that you can implement when deploying Generative AI.
The most naive approach is to have no batching at all: this is what we see in inference frameworks like Ollama, which are optimized for hacker-style chatbot applications, and it leads to awful GPU utilization. Once teams decide to batch, many try dynamic batching: waiting a short window between requests and processing them in groups. However, this produces a very spiky GPU utilization curve, which is not ideal.
The best way to batch for generative models is continuous batching, which allows incoming requests to be slotted into in-flight batches in order to keep utilization high; this is what we do in our inference stack (Titan Takeoff). Just by implementing a different batching strategy, without any impact on the model, we can see around a 5x improvement in GPU utilization, which then provides a much better user experience.
Figure 3. Batching Strategies and their GPU Utilization
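To make the scheduling idea concrete, here is a toy, CPU-only sketch of a continuous-batching loop. The "decode step" is simulated with a sleep and the constants are illustrative; a real inference server (Titan Takeoff, for example) runs this kind of loop against the GPU.

```python
import asyncio
import random

# Toy sketch of continuous batching. A "decode step" is simulated with a short sleep;
# real inference servers run this loop against the GPU. All numbers are illustrative.

MAX_BATCH = 8

async def client(queue: asyncio.Queue, i: int) -> None:
    await asyncio.sleep(random.random())   # requests arrive at random times
    done = asyncio.Event()
    await queue.put({"id": i, "tokens_left": random.randint(5, 20), "done": done})
    await done.wait()                      # wait for our generation to finish

async def engine(queue: asyncio.Queue) -> None:
    batch: list[dict] = []
    while True:
        # Admit new requests between decode steps instead of waiting for the batch to
        # drain: this is what keeps utilization flat rather than spiky.
        while len(batch) < MAX_BATCH and not queue.empty():
            batch.append(queue.get_nowait())
        if batch:
            await asyncio.sleep(0.01)      # one fused decode step for the whole batch
            for req in batch:
                req["tokens_left"] -= 1
            for req in [r for r in batch if r["tokens_left"] == 0]:
                req["done"].set()          # finished requests leave immediately,
            batch = [r for r in batch if r["tokens_left"] > 0]   # freeing slots for waiting ones
        else:
            await asyncio.sleep(0.001)

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    engine_task = asyncio.create_task(engine(queue))
    await asyncio.gather(*(client(queue, i) for i in range(20)))
    engine_task.cancel()

asyncio.run(main())
```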
Parallelism
Another example of how spending time thinking about inference optimization can have a big impact is with multi-GPU deployments. These are used when a model will not fit onto a single GPU and needs to be split across multiple GPUs.
Let's say we have a 90GB model and GPUs that are only 30GB. This model won’t fit in a single GPU but if it is split, it will fit on 3 different GPUs. There are two ways that I can split my model up. The naive way (used in the HuggingFace Accelerate library) is layer splitting, which splits the model up layer by layer.
However, with layer splitting only one GPU is active at any point in the forward pass, which leaves the other GPUs idle for a significant amount of the inference time - not good for efficiency. The better way to split models across GPUs is tensor parallelism - this is what we do in Titan Takeoff. Tensor parallelism splits each layer across all of the GPUs, so every GPU stays busy throughout the inference process.
Figure 4. Parallelism Strategies and their GPU utilization
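The sketch below illustrates the core idea of tensor parallelism on a single linear layer: the weight matrix is split column-wise across shards standing in for GPUs, every shard works on the same input at the same time, and the partial outputs are concatenated. It is a conceptual illustration, not a distributed implementation.

```python
import numpy as np

# Conceptual sketch of tensor parallelism for one linear layer. Each "GPU" (here just an
# array shard) holds a slice of the weight matrix and works on the same input at the same
# time, so no device sits idle waiting for another layer to finish.

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 1024))       # activations for one token
W = rng.standard_normal((1024, 4096))    # full weight matrix of the layer

n_gpus = 4
shards = np.split(W, n_gpus, axis=1)     # column-wise split: each shard is 1024 x 1024

partials = [x @ W_i for W_i in shards]   # all shards compute concurrently in a real system
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_parallel, x @ W)    # identical result to the unsplit layer
```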
In just the two examples I’ve provided, you can see that there are very significant efficiency gains to be made just by thinking a little about your inference optimization and GPU utilization. So either spend that time yourself, or work with an inference stack that has already done that work for you.
4. GenAI benefits from the consolidation of infrastructure
Unlike previous generations of data science, because LLMs are so computationally expensive, there is a large incentive to consolidate infrastructure centrally.
Deploying open-source language models is hard, much harder than just building with the OpenAI API. So this should be done centrally with good tooling, rather than being left for individual ML teams to navigate. The central team can offer an OpenAI-like API for the rest of the company to work with. This will result in:
- Reduced inference cost because of higher GPU utilization
- Easier and faster development
- More scalable applications
As a result, it is important that teams work with an inference stack they trust, one that supports LoRA adapters and achieves very high GPU utilization.
Figure 5. Representation of deploying LLMs on unconsolidated vs consolidated infrastructure
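In practice, an "OpenAI-like API" usually means an OpenAI-compatible endpoint, so application teams keep using the standard openai client and only change the base URL. The URL, API key, and model name below are placeholders; the exact details depend on the inference server your central team runs.

```python
from openai import OpenAI

# Application teams point the standard client at the centrally hosted, OpenAI-compatible
# endpoint. The base_url, api_key, and model name here are placeholders for illustration.
client = OpenAI(
    base_url="http://llm-gateway.internal.example.com/v1",
    api_key="internal-token",
)

response = client.chat.completions.create(
    model="self-hosted-llm",   # whatever model the central team exposes under this name
    messages=[{"role": "user", "content": "Summarize our self-hosting guidelines."}],
)
print(response.choices[0].message.content)
```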
5. Build as if you are going to replace your model in 12 months
This tip should go without saying, but given the pace of progress we are seeing not only in the AI space but specifically in the open-source space, we should build as if the models we are working with today will look relatively "stupid" compared with what we will be building with in 12 months.
Figure 6. Source: Everypixel - model releases of 2023
What does this mean for how you should build? It means that abstraction is good: you should build with tools and frameworks that treat LLMs as building blocks that can be interchanged easily.
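One lightweight way to get that interchangeability, shown as a sketch below with made-up class and environment-variable names rather than any particular framework's API, is to have application code depend on a tiny interface and select the concrete backend through configuration.

```python
import os
from typing import Protocol

class TextGenerator(Protocol):
    """The only thing application code is allowed to know about the model."""
    def generate(self, prompt: str) -> str: ...

class EchoStub:
    """Placeholder backend; a real one would wrap a self-hosted endpoint or a hosted API."""
    def generate(self, prompt: str) -> str:
        return f"[stub reply to: {prompt[:40]}]"

# Swapping next year's model is a registry entry plus a config change, not a rewrite.
BACKENDS: dict[str, type] = {"stub": EchoStub}

def make_generator() -> TextGenerator:
    return BACKENDS[os.getenv("LLM_BACKEND", "stub")]()

def summarize(document: str, llm: TextGenerator) -> str:
    # Business logic never names a specific model or vendor.
    return llm.generate(f"Summarize in one sentence:\n{document}")

print(summarize("Self-hosted LLM deployments benefit from abstraction.", make_generator()))
```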
6. GPUs look expensive, but they’re your best option
Quite often we see our clients concerned about the sticker price of GPUs, correctly identifying that per hour they are considerably more expensive than CPUs. However, GPUs are the clear choice for generative AI workloads due to their superior performance and efficiency compared to CPUs.
Generative AI models require massive computational power to process vast amounts of data and generate human-like text, images, or code. GPUs are specifically designed to handle these types of complex and data-intensive tasks, as they have thousands of cores that can perform calculations in parallel.
On the other hand, CPUs are better suited for tasks that require sequential processing. So despite the higher sticker price, the per-token price of GPUs is significantly lower than that of CPUs. Therefore, provided you can achieve high GPU utilization, you should always use GPUs for generative AI workloads.
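To see why the per-token economics favor GPUs, here is a toy calculation. Every number in it is a made-up assumption for the sake of the arithmetic; substitute your own cloud prices and measured throughput.

```python
# Toy per-token cost comparison. All prices and throughputs are illustrative assumptions,
# not benchmarks: substitute your own cloud rates and measured tokens/second.

def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    return dollars_per_hour / (tokens_per_second * 3600) * 1_000_000

print(f"GPU: ${cost_per_million_tokens(4.00, 1000):.2f} per 1M tokens")  # well-batched GPU server
print(f"CPU: ${cost_per_million_tokens(0.40, 10):.2f} per 1M tokens")    # many-core CPU VM
# Despite a ~10x higher hourly price, the GPU works out ~10x cheaper per token here.
```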
7. When you can, use small models
Large models are impressive, but smaller models often suffice for many enterprise use cases and are easier to deploy. Consider non-generative models or smaller LLMs like Llama3-8B for parts of your application.
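For example, a classification or routing step often needs no generative model at all; a small encoder model handles it in milliseconds on modest hardware. The sketch below uses the Hugging Face pipeline API with a commonly used example checkpoint - any small fine-tuned classifier would do.

```python
from transformers import pipeline

# A small non-generative model covering one piece of the application (here, sentiment),
# leaving the large generative model for the steps that genuinely need it.
# The model name is just a commonly used example checkpoint.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("The rollout went smoothly and latency dropped."))
```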
Conclusion
Deploying LLMs is tough, but it's a challenge worth tackling. Self-hosting offers big benefits in privacy, performance, and cost-efficiency, making it a smart move for many businesses despite the hurdles.
Start by figuring out your deployment boundaries early on, use quantized models, and focus on optimizing inference. These steps will help you achieve high GPU utilization and save costs. Centralizing your infrastructure and staying ready to swap out models as the tech evolves will keep your deployments scalable and adaptable. Always opt for GPUs for their performance edge, and consider smaller models when they fit the bill.
The field of AI is moving fast, and being agile and informed is essential. I hope these tips and tricks will help you stay ahead of the curve and get the most out of self-hosted LLMs.
By following these practices, you can build AI applications that are not only efficient and cost-effective but also scalable and future-proof. This way, your organization can fully leverage the power of LLM technology.
This article is part of the Practical Applications of Generative AI article series, where we present real-world solutions and hands-on practices from leading GenAI practitioners.