QCon SF 2024 - Scaling Large Language Model Serving Infrastructure at Meta

At QCon San Francisco 2024, Ye (Charlotte) Qi from Meta spoke about scaling large language model (LLM) serving infrastructure. Her talk explored the complexities of deploying LLMs, underscoring the unique challenges posed by their size, computational demands, and integration into production systems.

Qi framed the current landscape as an "AI Gold Rush," where organizations are grappling with unprecedented compute demands and resource constraints. Deploying LLMs at scale requires not only fitting models onto hardware but also optimizing their performance and cost. She emphasized that the work involves not just infrastructure techniques but also close collaboration with model developers to achieve end-to-end optimization.

One of the first challenges addressed was the need to fit models onto hardware efficiently. LLMs, especially those with billions of parameters, often exceed the capacity of a single GPU. Meta employs tensor parallelism and pipeline parallelism to partition models across GPUs and nodes. She explained that understanding hardware constraints and runtime requirements is critical, as mismatches between model architecture and hardware can drastically limit performance.
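To make the partitioning idea concrete, the sketch below shows column-wise tensor parallelism for a single linear layer, with two "device" shards represented as plain NumPy arrays. It is an illustrative example under simple assumptions, not Meta's implementation; real systems shard across physical GPUs and add communication collectives to gather the partial results.

```python
import numpy as np

# Minimal sketch of tensor parallelism via column-wise weight sharding.
# Each shard holds a slice of W's output columns; in a real deployment each
# shard lives on a different GPU and the concatenation is an all-gather.

def column_parallel_linear(x, weight_shards):
    partial_outputs = [x @ w for w in weight_shards]  # one matmul per "device"
    return np.concatenate(partial_outputs, axis=-1)   # gather the full output

# Toy example: hidden size 8, output size 8, sharded over 2 "devices".
x = np.random.randn(4, 8)            # batch of 4 token embeddings
W = np.random.randn(8, 8)
shards = np.split(W, 2, axis=1)      # column shards, one per device
assert np.allclose(column_parallel_linear(x, shards), x @ W)
```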

"Don't just grab your training runtime or your favorite framework. Find a runtime specialized for inference serving and understand your AI problem deeply to pick the right optimizations." – Qi

Performance optimization emerged as another focal point. Qi discussed how first token latency and overall generation throughput are key metrics for real-time applications. Techniques like continuous batching help improve responsiveness and throughput. Quantization, the practice of reducing model precision to unlock hardware efficiency, was highlighted as a major lever for performance gains, often achieving 2–4x improvements.
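As an illustration of the quantization lever Qi described, the following sketch reduces weights to int8 with a single per-tensor scale. The scheme and numbers are illustrative assumptions; production inference stacks typically use per-channel or per-group scales and rely on hardware int8/fp8 kernels, which is where the efficiency gains actually come from.

```python
import numpy as np

# Minimal sketch of symmetric int8 weight quantization with a per-tensor scale.

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                        # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale                    # approximate original weights

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(dequantize(q, scale) - w).max())
```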

The transition from prototype to production revealed a new layer of challenges. Real-world applications experience fluctuating workloads, latency requirements, and fault tolerance needs. Qi emphasized that scaling LLMs is not just about deploying larger clusters of GPUs but also managing the intricate trade-offs between latency, reliability, and cost. Disaggregated deployments, hierarchical caching, and request scheduling all play crucial roles in maintaining performance under production conditions.
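On the request-scheduling side, the toy scheduler below sketches the continuous (in-flight) batching idea mentioned earlier: new requests join the running batch as soon as a slot frees up, rather than waiting for the whole batch to finish. It is a simplified, token-counting illustration of the technique, not any particular runtime's scheduler.

```python
from collections import deque

# Minimal sketch of continuous (in-flight) batching at the request level.
class ContinuousBatcher:
    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.running = []              # (request_id, tokens_remaining)

    def submit(self, request_id, tokens_to_generate):
        self.waiting.append((request_id, tokens_to_generate))

    def step(self):
        # Admit waiting requests into free slots before this decode step.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # One decode step: every running request produces one token.
        self.running = [(rid, left - 1) for rid, left in self.running]
        finished = [rid for rid, left in self.running if left == 0]
        self.running = [(rid, left) for rid, left in self.running if left > 0]
        return finished

batcher = ContinuousBatcher(max_batch_size=2)
for rid, n in [("a", 2), ("b", 1), ("c", 1)]:
    batcher.submit(rid, n)
for step in range(3):
    print("step", step, "finished:", batcher.step())
```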

Qi shared Meta's approach to handling production-specific issues, such as caching strategies tailored to LLM workloads. Hierarchical caching systems, where common data is stored in high-speed memory tiers and less-used data in slower tiers, significantly reduce latency and resource consumption. She also detailed how consistent hashing ensures related requests are routed to the same host, maximizing cache hit rates.
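The routing idea can be illustrated with a small consistent-hash ring: requests keyed by a prompt or session prefix always map to the same host, so repeated requests hit that host's local cache. The host names and key scheme below are hypothetical, and the sketch omits the cache tiers themselves.

```python
import bisect
import hashlib

# Minimal sketch of consistent hashing for cache-affinity request routing.
class ConsistentHashRing:
    def __init__(self, hosts, replicas=100):
        # Place several virtual points per host on the ring for even spread.
        self.ring = sorted(
            (self._hash(f"{host}#{i}"), host)
            for host in hosts for i in range(replicas)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, key):
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["host-a", "host-b", "host-c"])
print(ring.route("session-123"))   # the same key always maps to the same host
```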

Qi underscored the importance of automation and observability, highlighting Meta’s investment in tools that benchmark performance, optimize resource allocation, and monitor system behavior. She described Meta's custom deployment solver, which integrates auto-scaling and placement logic to meet demand while minimizing costs.
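Meta's deployment solver is internal, but the kind of capacity math such a system automates can be sketched as below; the function, parameters, and numbers are purely illustrative assumptions about sizing a fleet against peak token throughput.

```python
import math

# Illustrative capacity calculation: how many replicas are needed so that
# peak load consumes only a fraction (`headroom`) of total serving capacity.

def replicas_needed(peak_qps, tokens_per_request, tokens_per_sec_per_replica,
                    headroom=0.7):
    demand = peak_qps * tokens_per_request            # tokens/sec at peak
    usable = tokens_per_sec_per_replica * headroom    # usable tokens/sec per replica
    return math.ceil(demand / usable)

# Example: 50 req/sec, ~400 generated tokens each, 2,500 tokens/sec per replica.
print(replicas_needed(50, 400, 2500))   # -> 12 replicas
```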

Qi emphasized the importance of stepping back to see the bigger picture when scaling AI infrastructure. By adopting this broader perspective, businesses can identify more effective approaches that deliver real value and focus their resources on these priorities. This mindset also clarifies which efforts yield meaningful results during continuous evaluation, allowing organizations to refine their systems at every stage for sustained performance and reliability.

Developers interested in learning more about Qi's talk can visit the InfoQ website, where a video of her presentation will be available in the coming weeks.
