Transcript
Cheptsov: My name is Andrey. I'm the founder of dstack, where we are building an alternative to Kubernetes, a container orchestrator for managing AI infrastructure. Since it's used by companies that develop AI models, especially large language models, we're quite passionate about open-source LLMs, and of course we want to help more companies understand how to adopt them without feeling overwhelmed, and to see more value in them. The talk isn't really about covering all the best practices or dirty hacks for deploying open-source LLMs in production. That would hardly be possible. Instead, the goal is to give you a rough sense of what to expect from using open-source LLMs, when and why they might be the right choice, and what the development process might feel like.
Predictions (Closed Source vs. Open Source)
The maturity of AI remains a debated topic, with annual predictions from experts ranging from skepticism about LLMs' utility beyond chatbots to concerns about AGI and the existential risks it might pose. Open-source AI is also the subject of frequent speculation, with ongoing questions about whether it can ever rival the quality of proprietary models such as those from OpenAI and Anthropic. Clem Delangue, the co-founder of Hugging Face, the leading platform for publishing open-source models, actually predicted last year that open-source models would match the quality of closed-source ones by 2024. Now that we are in the second half of 2024, do you think that open source will catch up? You'll be surprised to learn that in just six months, Meta, the company behind Facebook, released Llama 3.1, an open-source model that for the first time matched the quality of closed-source models.
This chart, carefully compiled by Maxime Labonne, an ML engineer at Liquid AI, shows the release timelines of both closed-source and open-source models over the last two years. The chart tracks each model's score on the MMLU benchmark, the Massive Multitask Language Understanding benchmark, which is one of the most comprehensive measures of a model's performance across a wide range of tasks. For the open-source models, you'll notice that the model names include the number of parameters.
Generally, the higher the benchmark score, the better the model is, and the larger the model, the more parameters it has. The model that finally achieved parity with closed-source models has 405 billion parameters; we'll talk more about what that really means. In the upper-right corner of the chart, near Llama 3.1 405B, you can also spot Qwen 2.5 72B, which was released soon after. Despite its significantly smaller size, it nearly matches the performance of Llama 3.1 405B, the largest model by Meta. This Qwen model was released by the team at Alibaba Cloud that trains their foundation models.
This is basically the current best open-source model, competing directly with the closed-source ones. As you see, despite early doubts, open-source models are keeping pace with closed-source ones, going head-to-head in quality. Meta also announced Llama 3.2, which some of you probably already know: a multimodal model that surpassed the best closed-source multimodal models in performance. A multimodal model means that it not only generates text, but also understands pictures, and can generate pictures as well.
Benchmarks - Llama 3.1 Instruct
Let's take a closer look at Llama 3.1, why it's significant, and how it performs on various benchmarks. Llama 3.1 is available in three sizes: 8 billion parameters, 70 billion parameters, and 405 billion parameters. It supports eight languages and offers a 128k-token context window, which covers the length of the text it can accept plus the length of the text it can generate. The longer the context window, the larger the text the LLM can understand and generate. It also allows you to use the model commercially.
Basically, the license allows you to use Llama 3.1 in commercial projects, and not only for inference, but also for fine-tuning. It is also capable of generating synthetic data and doing all sorts of knowledge distillation, which we'll probably talk about in later slides. On the left-hand side you can see several benchmarks that assess the performance of each model on different tasks. One benchmark I already mentioned, MMLU, is one of the most common benchmarks. It measures model performance across a range of language tasks. There are two variants of this benchmark. One is known as MMLU and focuses on general knowledge understanding. The other is called MMLU-Pro, also reported as 5-shot; it assesses reasoning capabilities in addition to general knowledge.
Another key benchmark here is HumanEval. This benchmark evaluates a model's code generation skills. Code generation is a valuable use case for LLMs, not only for code completion in your IDE, which you've probably heard of, but also because it enables the use of tools and helps in building automated agents with LLMs. As shown, the largest version of Llama 3.1 scores highly on both of these benchmarks compared to other models, including proprietary ones like GPT-4 and Claude 3.5. It's worth noting that Llama 3.1 is more than just a model; it's basically a developer platform. Meta actually refers to it as Llama Stack, a developer stack which includes not only the model itself in different sizes, but also numerous tools and integrations: tools that help you make the model safer, tools for building agents, and tools for evals and fine-tuning. It's a lot of different tools, not just a model.
Qwen 2.5 Instruct
Here we see Qwen 2.5, which I mentioned when we looked at the chart of models. This is the newest version of Qwen. We had Qwen 1.0 and 2.0, and this is the most recent one, released a couple of months ago. As I said, it's created by the team at Alibaba Cloud that specializes in foundation models. Besides demonstrating strong capabilities in fundamental knowledge, reasoning, and code generation, it also speaks 29 languages, compared to the 8 languages supported by Llama 3.1. It knows more languages, which is great. Notably, it delivers impressive performance while being more than five times smaller: the model has 72 billion parameters, versus 405 billion for the largest Llama 3.1 model. Still, it comes really close to the quality of Llama 3.1 405B. Qwen models come in different sizes, and the majority of them are available under the Apache 2.0 license, which is the most permissive license. This is great.
There are basically no conditions, except probably for the largest one, 72 billion parameters, which comes with a proprietary license. However, it's still an open-weight model, and it's allowed to be used commercially if your number of monthly active users is below a specific threshold. It's a pretty big number; if you are not Google, you certainly don't have to be concerned about it. You actually can use this Qwen model for commercial purposes. Later, we'll of course examine how model size greatly influences its practical applications, because, as you might guess, the smaller the model is, the much easier it is to use.
When to Use Open-Source Models
We've finally arrived at the key question: when should we use open-source models, and what are the reasons behind it? Based on my numerous conversations with people, what I see is that most companies tend to strategically underestimate the importance of open-source models and fail to recognize why relying on closed-source models is not a viable long-term option for the industry. That's a pretty strong statement, and I'd love to elaborate on it, because I think this is important. As we're witnessing today, much like the internet or computers before, GenAI is becoming integral to nearly every service, product, or human-computer interaction. It basically transforms how we work, how we communicate, how we interact. As we enter the GenAI era, it significantly impacts both economics and the distribution of competitive advantage.
Before GenAI, companies gained their competitive edge through technical talent and through the proprietary technologies they own. In the GenAI era, a company's competitive advantage will increasingly stem from how AI is leveraged and applied to its specific use cases and data. We are yet to see this transition from the tech talent and technology a company has to how AI is applied in very specific use cases. If a company aims to maintain its competitive edge, in my personal view, it must certainly take GenAI seriously and avoid outsourcing this to external companies, just as they did previously with software development.
Hardware Requirements
Now that we've talked about theory a bit, let's look under the hood at what it actually feels like to use open-source models. Don't expect me to go into specific applications or frameworks; I'd rather spend more time talking about the most fundamental aspects of using LLMs. If you are already pretty deep into LLMs, you might not find significant insights here. For those who are curious about the adoption of open-source LLMs, not necessarily the research itself but the use of its results within companies, this might be super helpful, because then you understand what the main constraints are. Here's the slide with the hardware requirements.
You've probably heard many times that in order to use LLMs you basically need GPUs, that they are very expensive, and that you need a lot of them. We'll talk about that. If we look at Llama 3.1, which comes in three different sizes, we see that the larger the model is, the more parameters it has, and the bigger the number of gigabytes of GPU memory you need in order to run inference. The column is called FP16, which means the model's weights, these tensors, are stored in 16-bit floating point. This is also known as half precision, half of 32 bits. This is mostly how LLMs are stored once they are trained. We can see that in order to run inference on the smallest model, we would need at least 16 gigabytes. This is pretty close to one of the smallest GPUs, if we talk about, for example, NVIDIA GPUs, the most used GPUs today, like the A100, which comes with 40 or 80 gigabytes, or the H100, which has 80 gigabytes as well.
If you are not into this yet, all you need to know right now is that the more GPUs you have, the more memory you have in total. That is your constraint, and it determines which models you can actually run for inference. However, it's not as simple as that. If you are, for example, using your model in production, you have multiple users concurrently accessing your model, and you need to scale, it means that you would need to run more instances of your model concurrently, which means you would need more GPUs. For example, if you look at the largest model here, 405B, you see that it requires 810 gigabytes, which won't even fit into one GPU. You've already probably heard that these GPUs are pretty expensive. In order to use this model, you would need eight of the most expensive GPUs, and then you would need two machines like that, just to run this large Llama 3.1 model.
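As a rough back-of-the-envelope check, you can estimate the weights-only memory from the parameter count and the bytes per parameter. The helper below is just illustrative arithmetic, and the real footprint also includes the KV cache and runtime overhead, so treat it as a lower bound:

```python
# Rough lower bound on GPU memory needed just to hold the weights.
# Real inference also needs memory for the KV cache and runtime overhead.
def weights_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9  # gigabytes

for size in (8, 70, 405):
    print(f"Llama 3.1 {size}B @ FP16: ~{weights_memory_gb(size):.0f} GB")
# 8B   -> ~16 GB   (fits a single 40/80 GB GPU)
# 70B  -> ~140 GB  (needs at least two 80 GB GPUs)
# 405B -> ~810 GB  (more than a full 8x80 GB node)
```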
The situation with fine-tuning is a little more complex. When you are running inference, you are only using memory for the forward pass, basically for generating predictions. When you are doing fine-tuning, you also need memory for backpropagation and for storing the entire batch, because you are actually training in batches. Some memory also goes to the optimizer state and other utilitarian purposes.
Basically, all of that simply means that you need a lot more memory for fine-tuning. For example, if we look at the amount of memory needed for full fine-tuning of the largest Llama model, it comes to roughly six nodes like that, which brings us to the famous meme about how anyone can ever afford this. On the left-hand side of the meme you have the non-experts, on the right-hand side the experts, and in the middle you have the majority. The majority of companies and teams do all kinds of optimizations in order to reduce this memory, and we'll talk about that. If you look at the experts and the non-experts, you're going to see that their answer is that you simply have to buy more GPUs. That's the hard truth that you need to know.
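For intuition, here is a similarly rough sketch of why full fine-tuning needs so much more memory: besides the weights you also hold gradients, optimizer states, and activations. The per-parameter byte counts below are common rules of thumb, not exact figures; real requirements vary a lot with the optimizer, precision choices, and tricks like activation checkpointing:

```python
# Very rough rule of thumb for full fine-tuning memory, ignoring activations:
# you hold weights + gradients + optimizer states, so the per-parameter cost multiplies.
LOW_BYTES_PER_PARAM = 8    # e.g. fp16 weights + fp16 grads + compact optimizer states
HIGH_BYTES_PER_PARAM = 16  # e.g. fp16 weights/grads + fp32 master copy + Adam moments

params = 405e9
low_gb = params * LOW_BYTES_PER_PARAM / 1e9
high_gb = params * HIGH_BYTES_PER_PARAM / 1e9
print(f"Llama 3.1 405B full fine-tune: roughly {low_gb:,.0f}-{high_gb:,.0f} GB "
      "before activations, i.e. several multi-GPU nodes")
```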
Optimization Techniques - Quantization
However, you don't have to be on that end of the spectrum. There are a lot of optimizations, but we're going to talk about the two most important ones. The first one is quantization. A model basically consists of layers, and you can think of each layer as a tensor, a multi-dimensional matrix. Those are numbers stored in this FP16 floating-point, half-precision format.
You can understand that this takes memory, and this is exactly what the memory is used for. In order to run inference or fine-tuning, you have to load your model into GPU memory; that's how it works. The bigger the model is, the more memory it takes. There are certain tricks that can significantly reduce the amount of memory you need, and one of them is quantization. Instead of storing the full precision, you convert the floating-point values to integers, and by doing that you lower the precision. With this lower precision, it takes less memory for inference and for fine-tuning as well. This is a research topic, because when you lower the precision, there is some loss in the quality of the predictions.
However, this loss is usually not significant. In most cases, you can just dismiss it. Of course, there are cases when you cannot do that, but if you look at the Llama 3.1 release, for example, you'll see that they actually recommend using FP8, a quantized version, which doesn't have much loss at all. Now, if you apply that to these numbers, you can see that if you cut the precision in half and go from FP16 to FP8, you linearly reduce the amount of memory needed. If you go further and, for example, switch the model to INT4 precision, the required memory drops significantly. Then, if you look at the 70-billion-parameter model, you would need just one GPU, and it will just fit.
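As a concrete illustration, here is a minimal sketch of loading a model with 4-bit weight quantization using Hugging Face transformers and bitsandbytes. The model name is just an example (the Llama 3.1 weights are gated and require access), and the quantization settings shown are typical defaults rather than the only option:

```python
# Minimal sketch: load a causal LM with 4-bit weight quantization.
# Requires the transformers, accelerate and bitsandbytes packages and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example; gated, requires access

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for quality
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across the available GPUs
)

inputs = tokenizer("Open-source LLMs are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```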
Optimization Techniques - Low Rank Adaptation (LoRA)
What about fine-tuning? There's another technique which is pretty useful, for both inference and fine-tuning, called Low-Rank Adaptation. This is how it works. Think of the model weights as a set of tensors. You keep the pre-trained weights frozen, saying: we're not going to update those weights during training. Instead, you take a much smaller set of weights and use only this much smaller set for fine-tuning, so it is only this smaller set that you train. This is how you reduce the amount of memory required for fine-tuning.
Instead of training the whole set of weights, you train only what are called adapter weights. Once you've trained the model, you can merge them together, and this is how you get a fine-tuned version of a model without updating all the weights. For short, it's called LoRA. This technique is pretty notable, not only because you can use it for fine-tuning, but also for inference. Imagine that you'd like to use multiple fine-tuned models on the same GPU. If you are not using LoRA, you would have to load each model's full weights every time.
However, when you are using LoRA, you can load the pre-trained weights once and then switch between adapters from different models. This way you can actually run inference for multiple fine-tuned models. If we look at how LoRA is applied for fine-tuning, we see that it significantly reduces the amount of memory needed for tuning. There's also a big body of research on how this affects quality. Now, if we combine both techniques, quantization and LoRA, we can go even further. This is basically how you train and run inference without having a lot of GPUs.
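To make the combination concrete, here is a minimal, hedged sketch of attaching LoRA adapters to a 4-bit quantized base model with the peft library (QLoRA-style). The model name, rank, and target module names are typical example values for Llama-style architectures, not universal settings:

```python
# Minimal sketch: LoRA adapters on top of a 4-bit base model (QLoRA-style).
# Requires transformers, peft and bitsandbytes.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example base model

base = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

lora = LoraConfig(
    r=16,                     # adapter rank: small matrices instead of full weight updates
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of the base model's parameters
```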
Development Process: Pre-Training and Post-Training
Let's go into the actual development process, from start to end, just to get a rough sense of what it takes to actually get a model. We can split the whole process into several parts. The first one is pre-processing. This is where you collect all the data, process it, and prepare it for training. Then there is pre-training. This is where you take this bulk data and just train your model without any specific assistance. Once the model has learned the basic knowledge from this data, you can go to the post-training phase, where you educate your model, or, another way to say it, align your model with specific tasks. This is how you make the model work for very specific tasks and make it follow instructions at all. This is a very complex area, and there are many different approaches; if you want to learn more about it, you would probably read some technical reports.
Whenever somebody releases a new model, there is typically a technical report which goes into the specific process: how the data was prepared, how the model was pre-trained, and then how it was aligned. One example that might be interesting to know is supervised fine-tuning. When you train a large model, you give it bulk data, internet-scale data like Reddit, and then you want to switch from plain pre-training to supervised training, where you give it very specific, curated datasets, so it starts to learn from high-quality data. This is called supervised fine-tuning, where you prepare these additional datasets. It is typically done after the base model is trained.
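For a rough idea of what that curated data looks like, a supervised fine-tuning example is often just a prompt/response pair in a chat-style format. The record below is purely illustrative; exact schemas vary between datasets and training frameworks:

```python
# Illustrative supervised fine-tuning record: a curated prompt/response pair.
# Schemas vary between datasets and frameworks; this is just one common shape.
sft_example = {
    "messages": [
        {"role": "user",
         "content": "Summarize the difference between FP16 and INT4 quantization."},
        {"role": "assistant",
         "content": "FP16 stores weights in 16-bit floating point; INT4 compresses them "
                    "to 4-bit integers, using roughly 4x less memory at the cost of some precision."},
    ]
}
```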
Another interesting thing, which can sometimes be used either in addition to or instead of supervised fine-tuning, is that now that we have these large models, we can use very high-quality models to generate training data; this is called synthetic data. For example, there are proprietary models like Claude which allow you to generate datasets that you can later use for training your model. If you look at the technical report of Llama 3.1 or some other models, you're going to see that the team actually generated a lot of synthetic data to train on. It's not only low-quality internet data; it's also high-quality data generated by the most expensive LLMs.
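As an illustration of the synthetic-data idea, here is a minimal sketch that asks a proprietary model (Claude, via the anthropic Python SDK) to produce a training pair. The model name and prompt are assumptions for the example, and real pipelines add filtering and deduplication on top:

```python
# Minimal sketch: generate a synthetic instruction/response pair with a stronger model.
# Requires the anthropic package and an API key; the model name is an example.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompt = (
    "Write one question a developer might ask about GPU memory for LLM inference, "
    "followed by a concise, correct answer. Format as 'Q: ...' and 'A: ...'."
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # example model name
    max_tokens=300,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)  # one synthetic Q/A pair to add to a training set
```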
Of course, it would be strange not to mention RLHF, Reinforcement Learning from Human Feedback. This is the main technique for making the model not only generate text, but actually follow instructions. The main trick is that instead of only giving the model text it can learn to generate, it also learns whether the generated text is good or bad. You come up with labels like good generation, bad generation, or a number from 0 to 10 for how good the generated text is, so the model can learn from this feedback to avoid bad results and generate good ones. In general, this works through reinforcement learning, where you first train a reward model which learns how to score what is a good and what is a bad result. Once this reward model is trained, it is used when you actually post-train the main model.
This is a complicated process, because you first have to train this reward model and only then use it in the actual training. Because this process is a bit complex, there is an alternative known as DPO, Direct Preference Optimization, where instead of training an intermediate reward model, you directly provide the labeled preference data and the trainer uses it without an intermediate model. Of course, I'm just giving you an overview, and if you are interested, you can go and read in more detail about how this works.
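To ground that, a DPO training example is just a prompt with a preferred and a rejected answer. A record like the one below (the schema is illustrative) is the kind of input preference-optimization trainers consume:

```python
# Illustrative DPO preference record: a prompt plus a chosen and a rejected answer.
# No separate reward model is trained; the trainer optimizes directly on such pairs.
dpo_example = {
    "prompt": "Explain what a context window is in one sentence.",
    "chosen": "The context window is the maximum number of tokens the model can "
              "consider at once, covering both the input and the generated output.",
    "rejected": "It's just how long the chatbot remembers your whole account history.",
}
```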
Development Process: Frameworks and Tools
I would also like to mention frameworks and tools, which are very important. As I said before, I'm not here to provide all the hacks for leveraging open-source models in production, but rather to give you an intuition of what it feels like and what tools you can use. Typically, when you get into open-source models, there are different approaches. One approach is when you actually want to go deeper, and that's when you need researchers. Those researchers go into the architecture of the model and into very specific processes. To understand how that is done, you need to go and read how other models were trained, which is best done by reading those technical reports.
However, in most cases we don't have the resources to be heavily involved in research and pre-training. We would probably decide to focus on the less expensive parts of the development process and rely on base pre-trained models and on all kinds of tools. If we go back a little bit, we'll see that after post-training there's a stage called optimization. You've already trained the model, and now you want to use it in production, for inference or further fine-tuning, for example. Today, there are enough tools to help you with that.
First of all, let me mention CUDA, ROCm, and XLA. When you want to use open-source models, you use accelerators, basically GPUs. However, there are different kinds of them. There is NVIDIA. There is also AMD, which has started to offer some very good accelerators that compete with NVIDIA. Then, of course, there are other alternative accelerators; Google, for example, offers the TPU, which can also be an alternative. It's good to know about this choice. Even though there are a lot of NVIDIA GPUs, sometimes, for example when you run on-prem, there might be cases where you can consider using AMD. Or if you're using Google, there are enough cases where it's very good to use TPUs. CUDA, ROCm, and XLA are the corresponding drivers. For NVIDIA, you would use CUDA.
For AMD, you would use ROCm. For TPUs, you would use XLA. Simply because there's a team behind each of those drivers, they've done an enormous amount of optimization which you don't have to care about. For example, if you just take XLA on TPU, there's a dedicated team that tries to optimize inference and fine-tuning with PyTorch. You don't even have to think about it. You basically stand on the shoulders of these giants without worrying about low-level optimizations, things like optimizing kernels, even though, of course, you can do that if you want.
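As a small practical note, from PyTorch you mostly see these backends through a uniform interface. This is a minimal, hedged sketch of checking which accelerator a given build can see; the torch_xla import only exists on TPU/XLA setups:

```python
# Minimal sketch: check which accelerator backend this PyTorch build can see.
import torch

if torch.cuda.is_available():
    # Covers both NVIDIA (CUDA) and AMD (ROCm) builds of PyTorch.
    backend = "ROCm" if torch.version.hip is not None else "CUDA"
    print(f"{backend} device: {torch.cuda.get_device_name(0)}")
else:
    try:
        import torch_xla.core.xla_model as xm  # only present on TPU/XLA setups
        print(f"XLA device: {xm.xla_device()}")
    except ImportError:
        print("No GPU/TPU backend visible; falling back to CPU.")
```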
Then there are frameworks for inference. The best known are vLLM, TGI, and NIM. What you really need to know is that they are slightly different, but they have a lot in common and offer pretty much everything you would need for inference. You might have heard of many different optimizations, such as speculative decoding and batching; these are already built in, so you don't have to worry about them at all. vLLM and TGI are cross-platform; NIM is NVIDIA only. When it comes to training, TRL is the best-known framework. It's by Hugging Face, and it helps you do RLHF, Reinforcement Learning from Human Feedback, supervised fine-tuning, and DPO.
It has all sorts of optimizations for fine-tuning which you also don't have to worry about. It's a library with a very good developer experience, so I totally recommend it. If you want to fine-tune one of the most recent LLMs, you would just go to TRL. There are a lot of tutorials and examples of how to use it, and it's pretty easy to use. Finally, there's Axolotl. It's a wrapper around tools like TRL, which makes fine-tuning much simpler; it's basically a framework for fine-tuning. Most of the time, if you want to do very classical or typical fine-tuning, you will just go with Axolotl.
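For flavor, here is a minimal, hedged sketch of supervised fine-tuning with TRL plus a LoRA config. The model and dataset names are examples, and argument names have shifted between TRL versions, so check the documentation for the release you install:

```python
# Minimal sketch: supervised fine-tuning with TRL + LoRA.
# Requires trl, peft, transformers and datasets; API details vary between TRL versions.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # example chat dataset

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",   # example base model (gated, requires access)
    train_dataset=dataset,
    args=SFTConfig(output_dir="llama-3.1-8b-sft", per_device_train_batch_size=1),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
trainer.save_model()  # saves the trained LoRA adapter and config to output_dir
```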
dstack is the project my team is working on, so let me just share a few insights about it. dstack is a container orchestrator which is vendor agnostic, meaning it can run on any accelerator, on any cloud, or on-prem. Think of it as Docker, except that its interface is designed for AI researchers to simplify development, training, and deployment. You define what you want as a simple YAML file, and then you don't have to worry about which cloud provider you use, or whether you are on-prem; you can run any AI workload without managing the containers yourself.
Questions and Answers
Participant 1: Regarding the earlier slide about the necessity of fine-tuned models. If we're talking about the big LLMs, and the cost of consuming the models isn't prohibitive, does it make sense to go down this rabbit hole of fine-tuning, rather than just selecting a pricier model, spending a little more on tokens, and not doing that at all? What, in your opinion, are the criteria for doing fine-tuning rather than using a bigger model, maybe proprietary, maybe open source, hosted somewhere like Groq, and paying for it?
Cheptsov: So, when do we need to fine-tune a model, when should we simply use a bigger model, and when should we use a proprietary model instead of an open-source one?
If you ever face that situation in reality, you quickly understand that it highly depends on the resources you have at hand, and on the necessity to reduce costs. Most of the time, a team starts with something that gives you a quality baseline, and you see: ok, this model actually does exactly what I want. Now let me think about how to optimize it. This is also a chance to bring up the premature optimization topic.
Basically, the idea is that once you know a model works for you, and sometimes it might even be a proprietary model (you take OpenAI, you use it, and you see that it works), you start thinking: how do I make it work given my resources? Then you might realize that the only way is to fine-tune. This is going to be experimentation anyway. You would probably try several approaches and compare the two options: you try to fine-tune, see where you get better performance, and then compare and choose between what you have. If a smaller model which you fine-tune yourself gets you where you want, then of course you would use it, because it will reduce the cost. The fine-tuning costs are a lot less than the inference costs.
Especially since fine-tuning is done from time to time, while inference is done every second. It depends on the scale, of course, but you would always optimize for inference. That means that sometimes you actually have to fine-tune, and that's the best way. The better the fine-tuning step is, the lower the inference cost is going to be.
How do I choose between a proprietary model and an open-source model? It's not even about the quality of the model. It's always about whether you are allowed to use the proprietary one or not. That's one consideration; there are probably other concerns as well. If you can get where you want by using the proprietary one, you should go there.
Participant 2: In which fields, and for which use cases, is AI most used today, according to your observations? And what observability tools can we enable to measure how performant a model is? Are there any tools known on the market that measure the performance of the responses, for example, so that we can also think about autoscaling?
Cheptsov: What are the use cases? Everybody is now trying to figure this out. Based on what I've seen, there's no single cluster of use cases; it's basically everywhere. There are companies that use LLMs to generate clothing designs. There are companies that use them for food design. If we made a list, at the top we would of course have the chatbots everybody is talking about, and then all kinds of copilots.
Then, if you take any industry, financial, healthcare, whatnot, there will always be those chatbots and copilots. What I can personally speculate is that it's going to be a rabbit hole: we're going to see more use cases, and wherever we look, we're going to see them basically everywhere. That's why, when I was talking about the impact of AI, I mentioned that companies should really consider investing more into making GenAI part of their competitive advantage: it's going to affect all the use cases, in a way. That's why it's much easier to answer which use cases are not going to be affected. We could brainstorm and come up with 10 use cases which GenAI is not going to affect, and that would be much easier than listing the ones it will.
The question, of course, is not which use case is going to be affected or not. You probably would like to know: which use cases can I already use LLMs for now? Which is a totally different topic. This is why we need R&D, research and development. You need to take your use case, take those LLMs, and do some research and experiments. Without actually doing that, you never know whether a particular use case is going to work.
Getting back to evaluations. Everybody asks about observability, and how can I solve observability? They think that somebody will finally tell them, then they'll know, and then they'll tell everybody else: nobody knew, but now this guy will tell us, and everything will be clear. Everybody keeps thinking about evaluations because it's a hard problem. It is a hard problem, and we have certain tools for evaluation, but wherever you look, whether at AI researchers or AI developers, you're going to hear the same story: we don't have enough good evaluation tools, and that's why we need to improve them. Benchmarks are one way. It depends on the use case, but sometimes you can also leverage an LLM as a judge for evaluating your LLM.
For example, whenever you have an expensive LLM and a less expensive LLM, you can always ask the more expensive LLM to judge the other. In the ideal situation, you would involve a human, but you can save cost: as it turns out, LLMs are cheaper than humans, so you can use LLMs for this as well. Finally, of course, there are many observability tools on the market now that help you track metrics. In the end, those are just metrics.
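As a tiny illustration of the LLM-as-a-judge idea, this sketch asks a stronger model to score another model's answer on a 0-10 scale. The judge model name, rubric, and example answer are assumptions for the example; real evaluation setups are usually more elaborate:

```python
# Minimal sketch: use a stronger model as a judge to score another model's answer.
# Requires the openai package and an API key; model name and rubric are examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "What does the context window of an LLM limit?"
candidate_answer = "It limits how many tokens the model can read and write in one go."

judge_prompt = (
    f"Question: {question}\n"
    f"Answer: {candidate_answer}\n\n"
    "Rate the answer's correctness and completeness from 0 to 10. "
    "Reply with only the number."
)

response = client.chat.completions.create(
    model="gpt-4o",  # example judge model
    messages=[{"role": "user", "content": judge_prompt}],
)
print("Judge score:", response.choices[0].message.content.strip())
```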