
Llama 3 in Action: Deployment Strategies and Advanced Functionality for Real-World Applications

Key Takeaways

  • Llama 3 comes in pre-trained and instruction-tuned variants in 8B and 70B sizes, with a 400B+ version on the way. Within one month of release, HuggingFace hosted more than 3,000 variants.
  • You can easily deploy Llama 3 on AWS for production workloads on GPU-based EC2 instances or through SageMaker JumpStart, or access it through a fully managed API on Amazon Bedrock.
  • Llama 3 democratizes fine-tuning because the barrier to entry has been significantly lowered.
  • The enhanced capabilities brought by Llama 3 will significantly drive the productionization of enterprise-level LLM-based applications; for example, you can build a RAG application on the 8B version that runs entirely on your local machine, without any internet connection.
  • Llama 3 variants fine-tuned for function calling have demonstrated excellent tool-calling capabilities, unlocking the potential to use Llama 3 in agentic workflows.

This article is part of the Practical Applications of Generative AI article series, where we present real-world solutions and hands-on practices from leading GenAI practitioners.

Meta released LLaMA, the first version of their open-source, open-weight large language model (LLM), in early 2023. That first model had performance comparable to larger models such as GPT-3 and PaLM, and unlike those models, Meta made LLaMA’s weights available for download. LLaMA was soon followed by Llama 2 in July of 2023. This model was also openly sourced and featured better accuracy and a longer context length than the first generation.

On April 19th, 2024, the open-source celebration in the LLM space continued: nine months after Llama 2, Meta released the official versions of Llama 3. The 8 billion parameter (Llama3-8B) and 70 billion parameter (Llama3-70B) versions, each with a base and an instruction-tuned variant, are now open-sourced. They are also free for commercial use, provided monthly active users stay under 700 million.

Differences between Llama2-7B and Llama3-8B

In general, there’s minimal difference between the model architectures of Llama 2 and Llama 3. This is good news for current Llama 2 users because applications that incorporate the Llama 2 model could seamlessly be ported to Llama 3.

The best way to understand the details is to dive deep into the source code. In Meta’s official code bases for Llama 3 and Llama 2, taking 8B and 7B as examples, the model.py files are almost the same.

Llama 3 introduces more complex embeddings using VocabParallelEmbedding, which optimizes vocabulary distribution across model parallel shards.

Both use a decoder-only backbone with SwiGLU activation functions, rotary positional embeddings (RoPE) to improve sequence positional awareness, an essential part of a model’s capability for natural language understanding, and a grouped query attention (GQA) mechanism. There are also some minor differences in tokenization and model initialization.

By comparing the config.json files in the HuggingFace repos, we can see that both models are of the LlamaForCausalLM class, meaning the model structure remains unchanged. The main differences lie in dimension configurations. Together, these changes in settings and dimensions grow the model from 7B to 8B parameters, adding roughly 2,464 MB of weights at 16-bit precision.

Figure 1: Comparison of config.json files for Llama 2 (left) and Llama 3 (right)
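If you want to reproduce this comparison yourself, a short script along the following lines prints the differing fields. It assumes you have accepted both model licenses and logged in with huggingface-cli login (the repos are gated); the field list is illustrative.

from transformers import AutoConfig

# Load both configs from HuggingFace (gated repos: license acceptance and login required)
cfg2 = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
cfg3 = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Fields where the two generations typically differ
keys = ["vocab_size", "hidden_size", "intermediate_size", "num_attention_heads",
        "num_key_value_heads", "max_position_embeddings", "rope_theta"]
for k in keys:
    print(f"{k:>24}: {getattr(cfg2, k, None)} -> {getattr(cfg3, k, None)}")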

Data engineering is the main catalyst for the performance boost

  • Training data size: During the pre-training phase, Llama 3 used more than 15T tokens from publicly available sources, more than seven times the total amount used to train Llama 2; coding-related data is more than four times what was used for Llama 2. During the fine-tuning phase, besides public instruction datasets, Meta produced a dataset of more than 10 million manually annotated examples.
  • Data quality: According to Meta, "To ensure Llama 3 is trained on data of the highest quality, we developed a series of data-filtering pipelines. These pipelines include using heuristic filters, NSFW filters, semantic deduplication approaches, and text classifiers to predict data quality". Regarding instruction fine-tuning, "Some of our biggest improvements in model quality came from carefully curating this data and performing multiple rounds of quality assurance on annotations provided by human annotators".
  • Data mixing ratio: "We also performed extensive experiments to evaluate the best ways of mixing data from different sources in our final pre-training dataset. These experiments enabled us to select a data mix that ensures that Llama 3 performs well across use cases, including trivia questions, STEM, coding, historical knowledge, etc".

Also, check out this blog post for details on Meta’s GenAI infrastructure for training Llama.

According to Meta, Llama 3 has been assessed using various benchmarks, including MMLU (undergraduate-level knowledge), GSM-8K (grade-school math), HumanEval (coding), GPQA (graduate-level questions), and MATH (math word problems). These benchmarks demonstrate that the 8B model outperforms open-weight models such as Google’s Gemma 7B and Mistral 7B Instruct, and the 70B model is competitive against Gemini Pro 1.5 and Claude 3 Sonnet.

The release of Llama 3 further demonstrates the importance of data engineering: with the model architecture unchanged, more high-quality data can significantly improve model performance. While capabilities such as longer context windows and multimodality are not yet supported, Meta claims that the 400B+ version is on its way, adding multimodality, the ability to converse in multiple languages, a much longer context window, and stronger overall capabilities.

Llama 3 Deployment in Production

Llama 3 Playground

You can now chat with Llama 3 in multiple playgrounds, including but not limited to the official meta.ai (not available in Europe), HuggingFace’s HuggingChat, Perplexity Labs, Groq, and more.

Llama 3 Hosting for Production Workloads

To deploy Llama 3 in production, you must allocate enough computational resources, RAM/VRAM, and disk space, taking into account your requirements for inference speed, cost, and so on. First, it is possible to deploy and run Llama 3 without a GPU: I managed to run the full FP16 Llama3-8B on my M1 MacBook Pro with CPU only and approximately 60GB of RAM available. But the latency is huge, around 30 seconds per token, which is not viable for production.

To deploy it for production usage, you need to allocate GPU instances equipped with sufficient VRAM capacity to support the execution of the models. You will need adequate disk space to save them and sufficient VRAM to load them. Table 1 shows the requirements for the 8B and 70B models; you can verify these numbers by deploying a model to an EC2 instance described below.

Table 1: Computational resources needed to deploy Llama 3 models

With the highest-precision weight representation, float16 or bfloat16, each model weight uses 16 bits, or 2 bytes. So, if you want to run the model at its full original precision for the highest-quality output and the full capabilities of the model, you need 2 bytes per weight parameter, plus additional VRAM headroom for inference dependencies, the KV cache, and other runtime overhead during model loading and inference; a rough estimate is sketched below.
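As a back-of-the-envelope illustration, the following calculation estimates VRAM needs from parameter count and precision. The 20% overhead factor is an assumption for illustration, not a measured value.

# Rough VRAM estimate: bytes per parameter at a given precision, plus ~20%
# headroom for the KV cache, activations, and runtime buffers (assumed factor).
def estimate_vram_gb(params_billion: float, bits_per_param: int = 16, overhead: float = 1.2) -> float:
    return params_billion * 1e9 * (bits_per_param / 8) * overhead / 1024**3

for name, params in [("Llama3-8B", 8.03), ("Llama3-70B", 70.6)]:
    print(f"{name}: ~{estimate_vram_gb(params):.0f} GB of VRAM at FP16")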

If the available VRAM is insufficient, the model will either be loaded into RAM and limited to the CPU or will swap layers back and forth between RAM and VRAM; either way, you cannot leverage all of the compute resources on your machine. You can consider quantized model versions to shrink the model size, but this is a trade-off between performance and cost for commercial use. While quantization down to around Q5 still preserves most language-understanding skills, coding skills in particular have been observed to decay significantly under quantization. One way to load a quantized model is sketched below.
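For example, here is a minimal sketch of loading the 8B weights in 4-bit with bitsandbytes through transformers; it assumes a CUDA GPU, the bitsandbytes package, and the local model path used later in this article.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantized loading: weights are stored in 4-bit, compute runs in bfloat16
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(
    "./Meta-Llama-3-8B",            # Local model path (assumed; see the download step below)
    quantization_config=bnb_config,
    device_map="auto",              # Place layers on the available GPU(s)
)
tokenizer = AutoTokenizer.from_pretrained("./Meta-Llama-3-8B")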

Here is a good explanation of computational overhead and a good read on calculating LLM memory.

Deployment on AWS EC2 instances

Let’s take AWS, the most commonly used cloud platform in production, as an example deployment platform. There are several ways for you to deploy Llama 3 on AWS.

First, you can choose from a diverse set of purpose-built EC2 instances equipped with GPUs and other accelerators, available in different versions and sizes optimized for different use cases. Under the accelerated computing instance types, you can access GPU-based instances such as the AWS P-family and G-family instance types, or Inf2 instances equipped with AWS Inferentia2, AWS-designed silicon that delivers up to 40% better price performance for generative AI inference workloads than comparable GPUs.

Table 2: Examples of EC2 instances for deploying Llama 3 in AWS

You need at least a g5.2xlarge instance to run Llama3-8B FP16 using the following instructions.

Figure 2: Launching an EC2 instance for deploying Llama 3 in AWS

Once your instance is running, connect to it, and then you can download Llama3-8B via the Meta website, HuggingFace, Ollama, etc.

I recommend downloading from HuggingFace here, following the git clone instructions with git-lfs on the model page. Cloning the HuggingFace repo gives you both the HuggingFace-format models and the original Meta versions, and you are less likely to encounter errors this way.
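If you prefer a scripted download, the huggingface_hub library offers an equivalent. This sketch assumes you have accepted the Llama 3 license on HuggingFace and have an access token available.

from huggingface_hub import snapshot_download

# Downloads both the HuggingFace-format weights and the original Meta checkpoints
snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    local_dir="./Meta-Llama-3-8B",
    token="hf_...",  # Your HuggingFace access token (the repo is gated)
)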

After downloading, make sure you run the command below to verify that the checksum of the model file you downloaded aligns with the SHA256 key from HuggingFace. Otherwise, you might encounter problems during inference.

sha256sum ./original/consolidated.00.pth

Configure the environment using the following commands:

# Install virtual environment
conda create -n llama3 python=3.11
conda activate llama3

# Clone Meta Llama3 official repo to your path
git clone https://github.com/meta-llama/llama3.git

cd llama3

# Install dependencies required
pip install -e .

Then, you can run the inference script below based on transformers. Ensure you have at least 24GB VRAM to load the checkpoints successfully.

import transformers
import torch

# Your model path
model = "./Meta-Llama-3-8B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
    max_length=128 # Set this up to limit the length of completion
)
print(pipeline("Hey how are you doing today?"))

Inference results are as follows (completion mode):

Figure 3: Console output with Llama 3 inference script results

Launch Inference Server using vLLM

Alternatively, you can use vLLM to deploy your model inference as a service. vLLM is a library designed for rapid and easy LLM inference and deployment. Its efficiency comes from several techniques, including PagedAttention for efficient management of attention key and value memory, continuous batching of incoming requests, and optimized CUDA kernels.

First of all, run the following command to install vLLM.

pip install vllm

There are two inference modes when using vLLM.

1. Completion mode

Deploy the model inference service:

python -m vllm.entrypoints.openai.api_server --model ./Meta-Llama-3-8B --dtype auto --api-key "your_string"

Run the inference using the following script:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"  # Same as --api-key in the deployment command
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

print("Connection Success!")

# Completion API
completion = client.completions.create(
    model="./Meta-Llama-3-8B", # Same as --model in the deployment command
    prompt="A robot may not injure a human being",
    max_tokens=128,
)

print("Completion results:", completion)

Set max_tokens to limit the length of the generated output, given that the context window is large. The inference results are as follows:

Figure 4: Llama 3 inference results using vLLM

2. Chat mode

Similarly, deploy the inference service for the instruction-tuned version:

python3 -m vllm.entrypoints.openai.api_server --model ./Meta-Llama-3-8B-Instruct --dtype auto --api-key 123456

Then run the chat inference using the following script:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "123456" # Same as --api-key in the deployment command
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # Defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

print("Connection Success!")

chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant."
    }, {
        "role": "user",
        "content": "Who won the world series in 2020?"
    }, {
        "role":
        "assistant",
        "content":
        "The Los Angeles Dodgers won the World Series in 2020."
    }, {
        "role": "user",
        "content": "Where was it played?"
    }],
    model="./Meta-Llama-3-8B-Instruct", # Same as --model in the deployment command
    max_tokens=128,
)

print("Chat completion results:", chat_completion.choices[0].message)

Deployment with Amazon SageMaker JumpStart

You can also deploy Llama 3 through a managed AWS service like Amazon SageMaker.

With SageMaker JumpStart, you can choose from a broad selection of publicly available foundation models. ML practitioners can deploy foundation models to dedicated SageMaker instances in a network-isolated environment and customize models using SageMaker for model training and deployment.

Currently, eight variants of Llama 3 are available on SageMaker JumpStart, as shown below, where you can easily configure and deploy Llama 3 models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK. There are also Neuron versions, which you can deploy on AWS Inferentia-based instances. You can also fine-tune models on an AWS Trainium instance using AWS Neuron, the SDK used to run deep learning workloads on AWS Trainium and AWS Inferentia-based instances.

Figure 5: Deploying Llama 3 using Amazon SageMaker

You can now combine the performance of Llama 3 with the MLOps controls of Amazon SageMaker features such as SageMaker Pipelines, SageMaker Debugger, and container logs. In addition, the model is deployed in a secure AWS environment under your VPC controls, helping provide data security. For the programmatic path, a minimal sketch using the SageMaker Python SDK follows.
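The sketch below deploys the 8B instruct model from JumpStart to a real-time endpoint and queries it; the model ID and payload format follow AWS's published examples at the time of writing and may change, so verify them in SageMaker Studio.

from sagemaker.jumpstart.model import JumpStartModel

# Deploy Llama3-8B-Instruct from SageMaker JumpStart to a real-time endpoint
model = JumpStartModel(model_id="meta-textgeneration-llama-3-8b-instruct")
predictor = model.deploy(accept_eula=True)  # You must explicitly accept the Llama 3 EULA

# Query the endpoint
response = predictor.predict({
    "inputs": "Hey how are you doing today?",
    "parameters": {"max_new_tokens": 128},
})
print(response)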

Access through Amazon Bedrock

Finally, you can now access Llama3-8B-Instruct and Llama3-70B-Instruct in Amazon Bedrock via the chat playground or a web-service API, which you can integrate into your production applications in a fully managed manner; a minimal invocation sketch follows.
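Here is a minimal invocation sketch with boto3. The model ID and request fields follow Bedrock's Llama schema at the time of writing, so verify them against the current Bedrock documentation and make sure model access is enabled in your account and region.

import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Llama-style request body for Bedrock
body = {
    "prompt": "Explain retrieval augmented generation in one sentence.",
    "max_gen_len": 128,
    "temperature": 0.5,
}
response = client.invoke_model(
    modelId="meta.llama3-8b-instruct-v1:0",
    body=json.dumps(body),
)
print(json.loads(response["body"].read())["generation"])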

Llama 3 in Action

In less than a month since Meta released the original four versions of Llama 3, over 3,000 model variants have flooded HuggingFace. These variants extend the capabilities of the initial models, ranging from expanded context window lengths to quantization to support for diverse languages and highly specialized domains.

A few exciting applications are:

1. Expanded context window

As mentioned previously, Llama3-8B-Instruct-Gradient-1048k on HuggingFace extends the context window length from 8k to 1048k tokens by adjusting rope_theta to 3580165449.0. This showcases how to extend context length effectively while minimizing additional training.

Figure 6: Llama3-8B-Instruct-Gradient-1048k on Huggingface

2. Offline Retrieval Augmented Generation (RAG)

The enhanced capabilities brought by Llama 3 will significantly drive the productionization of enterprise-level LLM-based applications. You can now easily construct a RAG application that runs entirely on your local machine, without any internet connection. Requirements such as 100% local RAG, common for companies with regulations around sensitive data, are now much more feasible to meet.

Check out code example1 using LangChain and code example2 using LlamaIndex to develop a RAG application with Llama 3.
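To get a feel for the moving parts, here is a minimal fully local sketch using LangChain with Ollama and Chroma. It assumes you have run ollama pull llama3 and installed langchain, langchain-community, and chromadb; exact import paths vary between LangChain versions.

from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# Toy corpus standing in for your private documents
docs = [
    "Llama 3 is released in 8B and 70B parameter sizes.",
    "Llama 3 is free for commercial use below 700 million monthly active users.",
]
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).create_documents(docs)

# Embed and index locally, then answer questions with the local Llama 3 model
vectorstore = Chroma.from_documents(chunks, OllamaEmbeddings(model="llama3"))
qa = RetrievalQA.from_chain_type(llm=Ollama(model="llama3"), retriever=vectorstore.as_retriever())
print(qa.invoke({"query": "What sizes does Llama 3 come in?"}))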

3. Fine-tuning for verticalized domains

There are many fine-tuned versions of Llama 3, and individuals can easily fine-tune the 8B version on their own machines (a minimal sketch follows below). A good example is Nvidia's Llama3-ChatQA-1.5-70B, a competitive Llama3-70B model fine-tuned for conversational question answering and retrieval-augmented generation, available on HuggingFace.

Figure 7: Llama3-ChatQA-1.5-70B on Huggingface
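As an illustration of how low the barrier has become, here is a minimal QLoRA fine-tuning sketch for the 8B model using peft and trl. The dataset name is hypothetical, and SFTTrainer arguments vary between trl versions, so treat this as a sketch rather than a drop-in script.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load the base model in 4-bit so it fits on a single consumer-grade GPU
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical instruction dataset with a "text" column
dataset = load_dataset("your-org/your-domain-dataset", split="train")

# Train small LoRA adapters instead of updating all 8B weights
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    dataset_text_field="text",
)
trainer.train()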

4. Function calling and tool using

The original versions of Llama 3 do not support function calling.

Figure 8: Llama 3 does not support function calling "out of the box"

However, there are a few versions of Llama 3 fine-tuned on function-calling data, such as Meta-Llama-3-8B-Instruct-function-calling and Hermes-2-Pro-Llama-3-8B, which produce well-structured outputs and tool-call parsing tags.

An example prompt template is below:

<|begin_of_text|><|start_header_id|>function_metadata<|end_header_id|>
You have access to the following functions. Use them if required:
[
       {
            "type": "function",
            "function": {
                "name": "search_merchant",
                "description": "Search for merchants in the catalog based on the term",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "name": {
                            "type": "string",
                            "description": "name to be searched for finding merchants.",
                        }
                    },
                    "required": ["name"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "search_item",
                "description": "Search for items in the catalog based on various criteria.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "term": {
                            "type": "string",
                            "description": "Term to be searched for finding items, with removed accents.",
                        },
                        "item_price_to": {
                            "type": "integer",
                            "description": "Maximum price the user is willing to pay for an item, if specified.",
                        },
                        "merchant_delivery_fee_to": {
                            "type": "integer",
                            "description": "Maximum delivery fee the user is willing to pay, if specified.",
                        },
                        "merchant_payment_types": {
                            "type": "string",
                            "description": "Type of payment the user prefers, if specified.",
                            "enum": [
                                "Credit Card",
                                "Debit Card",
                                "Other",
                            ],
                        },
                    },
                    "required": ["term"],
                },
            },
        }
]<|eot_id|><|start_header_id|>user<|end_header_id|>

Get the list of the five most preferred payment types<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Generated Response:

{
    "name": "search_item",
    "arguments": {
        "number": 5,
        "region": "US"
    }
}<|eot_id|>

You can use the following script to run inference with function calling by switching to the right model versions.

from openai import OpenAI

openai_api_key = "EMPTY" # Same as --api-key used when deploying the model with vLLM
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # Defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

print("Connection Success!")

response = client.chat.completions.create(
    model="./Hermes-2-Pro-Llama-3-8B", # Your model path
    messages=[
        {
          "role": "system",
          "content": "You are a digital waiter. Pay attention to the user requests and use the tools to help you."
        },
         {
          "role": "user",
          "content": "search for burger and a maximum price of 10 dollars. at the same time, look for merchant named 'Burger King'"
        },
    ],
    tools = [
        {
            "type": "function",
            "function": {
                "name": "search_merchant",
                "description": "Search for merchants in the catalog based on the term",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "name": {
                            "type": "string",
                            "description": "name to be searched for finding merchants.",
                        }
                    },
                    "required": ["name"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "search_item",
                "description": "Search for items in the catalog based on various criteria.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "term": {
                            "type": "string",
                            "description": "Term to be searched for finding items, with removed accents.",
                        },
                        "item_price_to": {
                            "type": "integer",
                            "description": "Maximum price the user is willing to pay for an item, if specified.",
                        },
                        "merchant_delivery_fee_to": {
                            "type": "integer",
                            "description": "Maximum delivery fee the user is willing to pay, if specified.",
                        },
                        "merchant_payment_types": {
                            "type": "string",
                            "description": "Type of payment the user prefers, if specified.",
                            "enum": [
                                "Credit Card",
                                "Debit Card",
                                "Other",
                            ],
                        },
                    },
                    "required": ["term"],
                },
            },
        }
    ],
    tool_choice="auto"
)

print(response)

Hermes-2-Pro-Llama-3-8B has demonstrated excellent tool-calling capabilities on Groq, providing efficient and cost-effective AI processing. Check out the code samples here.

Conclusion

The release of the open-source Llama 3 LLM with enhanced capabilities and the rapid proliferation of its derivatives underscore the true power and significance of open-source Generative-AI development. It enables the global community to freely build upon, refine, and tailor these foundational language models to address a vast array of challenges and use cases. I can’t wait for the 400B+ version!

