Meta's Research SuperCluster for Real-Time Voice Translation AI Systems

A recent article from Engineering at Meta reveals how the company is building its Research SuperCluster (RSC) infrastructure, which powers advances in real-time voice translation, language processing, computer vision, and augmented reality (AR). Large-scale model training faces significant challenges as the number of GPUs in a job increases. Meta emphasizes the need to "minimize the chances of a hardware failure interrupting a training job" through rigorous testing, quality control measures, and automated issue detection and remediation. To recover quickly from such incidents, Meta focuses on "reducing re-scheduling overhead and fast training re-initialization". Meta also notes that "a slow data exchange between a subset of GPUs can compound and slow down the whole job", which calls for "a robust and high-speed network infrastructure as well as efficient data transfer protocols and algorithms".
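Meta does not detail its recovery mechanics beyond these goals. As a minimal sketch of the fast re-initialization idea (file layout and function names here are illustrative, not Meta's code), a training loop can write checkpoints atomically so that a restarted job always resumes from the last completed step rather than from scratch:

```python
import json
import os

def save_checkpoint(path: str, step: int, state: dict) -> None:
    """Write the checkpoint to a temp file, then atomically rename it,
    so a crash mid-write can never leave a truncated checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

def resume_or_start(path: str) -> tuple[int, dict]:
    """Resume from the last checkpoint if one exists; otherwise start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {}
```

In a real multi-thousand-GPU job the state would be sharded model and optimizer tensors rather than JSON, but the same resume-from-last-good-step pattern applies.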

Meta highlights the need for powerful computing systems capable of performing quintillions of operations per second to drive the development of advanced AI technologies. To achieve this, Meta expanded its AI infrastructure by building two 24K-GPU clusters. Meta identified RoCE and InfiniBand fabrics as the two primary options that meet its requirements, while acknowledging that each presents its own tradeoffs.

Meta's decision stems from its experience with both technologies. While Meta has built RoCE clusters for the past four years, their largest cluster only supported 4K GPUs, falling short of their current needs. Conversely, Meta had previously constructed research clusters with InfiniBand supporting up to 16K GPUs, but these were not fully integrated into their production environment or optimized for the latest GPU and networking technologies.

To address these challenges, Meta decided to build two 24K-GPU clusters, one using RoCE and the other InfiniBand, aiming to gain operational experience with both implementations for its GenAI fabrics.

Meta reported that it successfully tuned both clusters to deliver equivalent performance for GenAI workloads, despite the underlying differences in network technology.

Meta reiterates its commitment to open compute and open-source principles. Meta built these clusters using Grand Teton, OpenRack, and PyTorch, reinforcing its dedication to promoting open innovation across the industry.

Looking ahead, Meta plans to expand its infrastructure significantly, targeting 350,000 NVIDIA H100 GPUs by the end of 2024 and a total compute power equivalent to nearly 600,000 H100s.

Source: Network Compute Storage Under The Hood

This architecture diagram depicts the RSC team’s AI infrastructure, highlighting key components in network, compute, storage, and performance optimization.

In the network layer, Meta has implemented two distinct solutions. One cluster utilizes a remote direct memory access (RDMA) over converged Ethernet (RoCE) network fabric solution based on the Arista 7800 with Wedge400 and Minipack2 OCP rack switches. The other cluster features an NVIDIA Quantum2 InfiniBand fabric.

For compute, Meta utilizes its in-house developed Grand Teton platform. They explain:

Grand Teton builds on the many generations of AI systems that integrate power, control, compute, and fabric interfaces into a single chassis for better overall performance, signal integrity, and thermal performance.

Storage requirements have grown with the increasing multimodality of GenAI training. Meta addresses this with "a home-grown Linux Filesystem in Userspace (FUSE) API backed by a version of Meta's Tectonic distributed storage solution optimized for Flash media".

In terms of performance optimization, Meta states, "We also optimized our network routing strategy in combination with NVIDIA Collective Communications Library (NCCL) changes to achieve optimal network utilization".
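The article does not describe how NCCL's collectives work internally. As an illustrative pure-Python sketch (not Meta's or NVIDIA's code), the ring all-reduce pattern commonly used by NCCL shows why a single degraded link hurts the whole job: every rank exchanges one chunk with its neighbor at every step, so each step runs at the speed of the slowest link.

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce over n ranks, each holding a vector of
    length n (element i plays the role of 'chunk i').

    Phase 1 (reduce-scatter): after n-1 steps, rank r holds the fully
    reduced sum for chunk (r+1) % n.
    Phase 2 (all-gather): n-1 more steps circulate the finished chunks
    until every rank has the complete sum."""
    n = len(vectors)
    data = [list(v) for v in vectors]

    # Phase 1: reduce-scatter. At step s, rank r sends chunk (r-s) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, idx, val in sends:   # each neighbor receives and accumulates
            data[(r + 1) % n][idx] += val

    # Phase 2: all-gather. At step s, rank r sends chunk (r+1-s) % n.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, idx, val in sends:   # each neighbor overwrites with the final sum
            data[(r + 1) % n][idx] = val

    return data
```

Each of the 2*(n-1) steps is a synchronous neighbor exchange, so one slow GPU-to-GPU link delays every rank at every step, which is exactly the compounding slowdown Meta describes.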

LLaMA, No Language Left Behind (NLLB), the Universal Speech Translator, and theorem proving are a few of the many projects that run on RSC.

As companies unveil ambitious AI infrastructure expansions, questions about the environmental impact of large-scale AI training have come to the forefront. Recent research has shed light on the significant energy consumption associated with training LLMs. Paul Walsh, analytics and AI innovation director at Accenture, wrote last year:

Research studies have highlighted the growing amount of energy consumed by training large language models, which are estimated to have grown by a factor of 300,000 in just six years, with AI model size doubling every 3.4 months, thereby creating a potentially significant carbon footprint.
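As a back-of-the-envelope cross-check of those figures (my arithmetic, not the article's, and assuming energy roughly tracks model size): a 300,000x growth takes about 18 doublings, which at one doubling every 3.4 months works out to roughly five years, broadly consistent with the six-year window quoted.

```python
import math

growth_factor = 300_000   # energy growth cited over ~6 years
doubling_months = 3.4     # model-size doubling period cited

doublings = math.log2(growth_factor)    # ~18.2 doublings
months = doublings * doubling_months    # ~61.9 months
years = months / 12                     # ~5.2 years
```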

Diogo Ribeiro, a machine-learning engineer, highlights the growing concerns in the industry: "As each successive iteration of these models grows in size, businesses will confront ever-increasing energy expenses, alongside the detrimental environmental impacts".

Ribeiro advises, "At a hardware level, companies can opt to invest slightly more in energy-efficient GPUs". This approach provides a tangible way for businesses to balance their AI ambitions with environmental responsibility.

For more on scaling LLM workloads, check out this presentation, infrastructure cost optimization, ML training infrastructure, and LLM training on distributed infrastructure.
