
Legare Kerrison and Cedric Clyburn on LLM Performance and Evaluations


Effectively measuring the performance of applications that leverage Large Language Models (LLMs) is critical to the adoption of AI technologies in organizations. Legare Kerrison and Cedric Clyburn from the Red Hat team recently spoke at the Arc of AI 2026 Conference about practical methods for evaluating and optimizing LLM inference. They discussed the resource requirements and cost implications of different workloads in AI applications, like Retrieval Augmented Generation (RAG) and Agentic AI. Kerrison and Clyburn also discussed the importance of metrics such as Requests Per Second (RPS), Time to First Token (TTFT), and Inter-Token Latency (ITL) when evaluating applications.

The speakers began the presentation by highlighting that 2023 was the year of LLMs, with Hugging Face and other models, 2024 was the year of RAG, 2025 was the year of model fine-tuning and AI agents, and they predicted that 2026 will be about LLM evaluations. On the challenges of AI deployments and LLM evaluation and performance, they noted that public leaderboards are helpful, but they tend to be generic. Some websites use criteria such as hard prompts, coding, math, and creative writing to evaluate the models. Your unique business problems and data are not represented in these benchmarks, so they should be used with their limitations in mind. Software development teams should understand the overall AI technology landscape to choose the best model and provider for their specific use cases.

The speakers highlighted the common pain points they experienced in real-world projects deploying LLMs, where delivering production-ready models meant navigating the "tradeoff triangle" among model quality (accuracy), responsiveness (latency), and overall cost. Optimizing for any two of these factors impacts the third. For example, focusing on high accuracy and low latency would lead to higher deployment costs. Applications built with a focus on low cost and high accuracy would typically incur high latency. And too much focus on low cost and low latency would result in low model accuracy. When choosing the right model, performance targets, and hardware infrastructure for your workloads, clear measurements and evaluations help make informed decisions.

Teams need to shift from just model choices to the actual application requirements and priorities of their systems to provide the right solutions to their customers. Service level objectives (SLOs) with clearly defined key performance and quality metrics ensure applications stay fast, useful, and trustworthy for end users, and can guide structured comparisons across models and hardware, enabling cost optimizations.

The Requests Per Second (RPS) metric measures how many inference requests a system can handle per second. It can be used to measure the overall throughput and how well the serving stack scales under load. Time to First Token (TTFT) is the time between sending a request and receiving the first generated token. It shows us the perceived latency for the user. And Inter-Token Latency (ITL) is the time between successive tokens after the first one. It highlights how fast streaming output feels to the user and indicates the decoder's efficiency.
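As a rough illustration of these three definitions, the following Python sketch derives TTFT, mean ITL, and RPS from per-token arrival timestamps. The `RequestTrace` structure, the function names, and the sample timestamps are hypothetical and were not shown in the talk; a real benchmark would record the timestamps from a streaming client.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Timestamps, in seconds, recorded for one streamed request (hypothetical structure)."""
    sent_at: float
    token_times: list[float]  # arrival time of each generated token


def ttft(trace: RequestTrace) -> float:
    """Time to First Token: first token arrival minus request send time."""
    return trace.token_times[0] - trace.sent_at


def mean_itl(trace: RequestTrace) -> float:
    """Mean Inter-Token Latency: average gap between successive tokens after the first."""
    gaps = [later - earlier for earlier, later in zip(trace.token_times, trace.token_times[1:])]
    return mean(gaps) if gaps else 0.0


def requests_per_second(traces: list[RequestTrace]) -> float:
    """Completed requests divided by the wall-clock window they span."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return len(traces) / (end - start)


# Made-up sample data for two requests.
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.25, 0.5, 0.75, 1.0]),
    RequestTrace(sent_at=0.5, token_times=[1.0, 1.5, 2.0]),
]

for t in traces:
    print(f"TTFT={ttft(t):.2f}s  mean ITL={mean_itl(t):.2f}s")
print(f"RPS={requests_per_second(traces):.2f}")
```

Averaging ITL per request is only one possible choice; benchmarking tools usually report percentiles across all requests as well.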

They showed example SLOs and benchmarking metrics for several workloads and use cases. An e-commerce chatbot needs fast, conversational responses, so TTFT for this use case would typically be ≤200ms and ITL ≤50ms for 99% of requests (P99). A RAG-based application, on the other hand, prioritizes accuracy and completeness over raw speed. RAG use cases tend to use more input tokens and fewer output tokens. The targets for TTFT, ITL, and request latency would be ≤300ms, ≤100ms (if streamed), and ≤3000ms, respectively, for 99% of requests.
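A check against such targets could look like the minimal sketch below. The thresholds are the ones quoted above; the `SLOS` dictionary layout, the nearest-rank percentile helper, and the sample latencies are assumptions made for illustration only.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# P99 targets in milliseconds, as quoted for the two example workloads.
SLOS = {
    "chatbot": {"ttft_ms": 200, "itl_ms": 50},
    "rag": {"ttft_ms": 300, "itl_ms": 100, "request_latency_ms": 3000},
}


def meets_slo(workload: str, samples: dict[str, list[float]], pct: int = 99) -> dict[str, bool]:
    """Return pass/fail per metric for the workload's percentile targets."""
    return {
        metric: percentile(samples[metric], pct) <= limit
        for metric, limit in SLOS[workload].items()
    }


# Made-up measurements: 100 requests, with a slow tail on ITL.
chatbot_samples = {
    "ttft_ms": [150.0] * 99 + [400.0],
    "itl_ms": [40.0] * 98 + [80.0, 90.0],
}

print(meets_slo("chatbot", chatbot_samples))
```

In this made-up data a single slow TTFT outlier still passes at P99, while two slow ITL samples out of 100 push the P99 ITL above the 50ms target.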

After deciding on the application priorities, teams should focus on hardware requirements. LLM inference has two stages: prefill (compute-bound) and decode (memory-bound). The prefill stage processes the input prompt and produces the first token, while the decode stage generates each subsequent token from the tokens before it, so prefill is the easier of the two to load. Techniques such as structured generation, speculative decoding, prefix caching, and session caching can help with efficient LLM serving. The speakers mentioned that running LLMs locally, where it makes sense, avoids a dependency on the cloud and can be more efficient for specific use cases.

They defined the term Model Evaluation as the process of assessing a model's overall performance and suitability for its intended purpose across various criteria, i.e., how a specific model performs under a workload on specific hardware. Model benchmarking was defined as a standardized comparison of a model's performance against predefined datasets, tasks, and other models.

They talked about what their teams typically measure for LLMs across different request patterns. In the standard (non-streaming) request flow, where the response to each request is returned as a whole, end-to-end request latency is the important metric. In the streaming request flow, where LLM requests are heterogeneous, metrics such as TTFT and ITL also need to be tracked formally.

LLM performance metrics are affected by factors like model architecture and size, quantization (compressing models by reducing the precision of the weights), serving engine (e.g., Ollama, vLLM, TGI, Triton), hardware (GPU memory), and batching and concurrency choices.
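To make the link between quantization and GPU memory concrete, the back-of-the-envelope sketch below estimates the memory needed for model weights alone at different precisions. The 8-billion parameter count is a rounded, nominal figure and the function name is hypothetical. Real quantized checkpoints often keep some layers at higher precision and store extra metadata such as scales, so actual savings differ from this idealized arithmetic.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_8B = 8e9  # nominal parameter count, rounded (assumption)

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(PARAMS_8B, bits):.1f} GB")
```

The estimate ignores activations and the KV cache, which also consume GPU memory during serving.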

Assessing model inference performance is time-consuming and fragmented, which makes LLM deployments difficult to measure. Kerrison and Clyburn showed examples of LLM workloads that teams need to plan for, along with questions an evaluation should answer, like "With NVIDIA H200, should I use a Llama 3.1 8B or Llama 3.1 70B Instruct to create a customer service chatbot?" or "How many servers do I need to keep my service running under maximum load?"
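The second question reduces to simple arithmetic once a benchmark has established how many requests per second one server sustains while still meeting its SLOs. The sketch below is one way to do that calculation; the function name, the 75% utilization headroom, and all numbers are hypothetical and not from the talk.

```python
import math


def servers_needed(peak_rps: float, per_server_rps: float, target_utilization: float = 0.75) -> int:
    """Servers required to absorb peak load while leaving headroom on each server.

    per_server_rps is the throughput one server sustains within its SLOs,
    as measured by a benchmark; target_utilization caps how much of it we plan to use.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = per_server_rps * target_utilization
    return math.ceil(peak_rps / usable_rps)


# Made-up inputs: 120 requests/s at peak, 8 requests/s per server within SLO.
print(servers_needed(peak_rps=120, per_server_rps=8.0))
```

The per-server figure depends on the model, hardware, and workload mix, which is why it has to come from a benchmark rather than from a leaderboard.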

The speakers recommended open source toolkits like GuideLLM for SLO-aware benchmarking of LLM deployments. GuideLLM, a part of the vLLM project, simulates real-world traffic and measures metrics such as throughput and latency. Its process flow includes steps like model selection and customization, dataset selection with real or synthetic data, configuring the workload, and running the benchmark tests. If the model meets the desired SLO goals, it can be deployed in production on the vLLM engine.

Clyburn showed GuideLLM test results with simulated workloads, such as synchronous (a single stream of requests run one at a time) and concurrent (a fixed number of synchronous streams run in parallel), using data sources such as Hugging Face datasets (ShareGPT), file-based datasets, and in-memory datasets. He shared the benchmark statistics for different workloads like Chat, RAG, Summarization, and Code Generation, for P99 (99th percentile) and P90 (90th percentile) latency metrics.
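The difference between the synchronous and concurrent workload shapes can be sketched in plain Python with `asyncio`. This is not GuideLLM code: `fake_request` is a hypothetical stand-in that sleeps instead of calling a model server, and the request counts and latency are made up.

```python
import asyncio
import time


async def fake_request(latency_s: float) -> None:
    """Stand-in for one inference call; a real client would await an HTTP response here."""
    await asyncio.sleep(latency_s)


async def run_stream(num_requests: int, latency_s: float) -> int:
    """One synchronous stream: each request starts only after the previous one finishes."""
    for _ in range(num_requests):
        await fake_request(latency_s)
    return num_requests


async def run_concurrent(streams: int, requests_per_stream: int, latency_s: float) -> int:
    """A fixed number of synchronous streams running in parallel."""
    counts = await asyncio.gather(
        *(run_stream(requests_per_stream, latency_s) for _ in range(streams))
    )
    return sum(counts)


def timed(coro) -> tuple[int, float]:
    """Run a coroutine to completion and return (completed requests, elapsed seconds)."""
    start = time.perf_counter()
    completed = asyncio.run(coro)
    return completed, time.perf_counter() - start


# Same total of 8 requests, issued as 1 stream of 8 versus 4 streams of 2.
sync_done, sync_elapsed = timed(run_stream(8, 0.02))
conc_done, conc_elapsed = timed(run_concurrent(4, 2, 0.02))

print(f"synchronous: {sync_done} requests, {sync_done / sync_elapsed:.1f} RPS")
print(f"concurrent:  {conc_done} requests, {conc_done / conc_elapsed:.1f} RPS")
```

With a real server, per-request latency typically rises as concurrency grows, which is the effect these benchmark modes are meant to expose; the fixed sleep here does not model that.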

In addition to LLM inference performance, teams also need to evaluate model accuracy. LLM accuracy evaluation should cover categories such as model accuracy, pipeline accuracy (for RAG and AI agents), and application accuracy. Several open source tools are available for these evaluations.

The speakers concluded the talk by emphasizing that application teams should look into LLM optimization techniques like quantization, noting that compressing models is more effective than niche optimization techniques. In one instance, quantization using GPTQModifier reduced the model size by 45%. Another technique is the KV cache, which reduces redundant computation and accelerates decoding, at the cost of more memory. For additional learning on AI topics, they recommended the Hugging Face website, which offers Red Hat AI-validated language models, and the deeplearning.ai website for training courses on AI in general.
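The memory cost of the KV cache mentioned above can be estimated with a short calculation: keys and values are stored for every layer, every key-value head, and every token position. The sketch below assumes shape values typical of an 8B-class model with grouped-query attention (32 layers, 8 key-value heads, head dimension 128) and 16-bit cache entries; these numbers and the function name are illustrative assumptions, not figures from the talk.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,
) -> int:
    """Bytes held by the KV cache: the factor of 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed model shape (illustrative only).
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **SHAPE)
full_context = kv_cache_bytes(seq_len=8192, **SHAPE)

print(f"per token:   {per_token / 1024:.0f} KiB")
print(f"8192 tokens: {full_context / 1024**3:.1f} GiB per sequence")
```

Because the cache grows linearly with both sequence length and the number of concurrent sequences, it competes with model weights for GPU memory, which ties it to the batching and concurrency choices discussed earlier.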
