Meta has unveiled its AI Research SuperCluster (RSC) supercomputer, aimed at accelerating AI research and helping the company build the metaverse. The RSC will help Meta build new and better AI models that can work across hundreds of different languages and power new augmented-reality tools.
Developing the next generation of advanced AI will require powerful new computers capable of quintillions of operations per second. Meta’s researchers have already started using RSC to train large models in natural-language processing (NLP) and computer vision, with the aim of one day training models with trillions of parameters across Meta’s businesses, from the content-moderation algorithms used to detect hate speech on Facebook and Instagram to augmented-reality features that will one day be available in the metaverse. RSC can train models that use multimodal signals to determine whether an action, sound, or image is harmful or benign. Meta claims this will help keep people safe not only on its services today, but also in the metaverse.
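To make the multimodal idea concrete, here is a minimal, illustrative sketch in PyTorch of a classifier that fuses image and text embeddings into a single harmful-or-benign score. The module names, embedding sizes, and architecture are assumptions chosen for illustration; they are not Meta's actual model.

```python
# Illustrative sketch (not Meta's actual model): fuse image and text
# embeddings and classify the content as harmful or benign.
# All dimensions below are placeholder assumptions.
import torch
import torch.nn as nn

class MultimodalClassifier(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, hidden_dim=512):
        super().__init__()
        # Project each modality's embedding into a shared space.
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Fuse the concatenated projections and emit two logits:
        # harmful vs. benign.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, image_emb, text_emb):
        fused = torch.cat(
            [self.image_proj(image_emb), self.text_proj(text_emb)], dim=-1
        )
        return self.classifier(fused)

# Example: score a batch of 4 (image, text) embedding pairs.
model = MultimodalClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```

In practice the image and text embeddings would come from pretrained encoders; the point of the sketch is only that signals from several modalities are combined before the final classification.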
Meta is also defining the power of its computer differently from how conventional supercomputers are measured, because RSC relies on the performance of graphics-processing units (GPUs), which are well suited to running the deep-learning algorithms that can understand what’s in an image, analyze text and translate between languages.
AI supercomputers are built by combining multiple GPUs into compute nodes, which are then connected by a high-performance network fabric to allow fast communication between those GPUs. RSC today comprises 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs. Each DGX communicates via an NVIDIA Quantum 1600 Gb/s InfiniBand two-level Clos fabric that has no oversubscription. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
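As a concrete illustration of this node-and-fabric pattern, the sketch below shows multi-node data-parallel training in PyTorch using the NCCL backend, which runs collective operations over interconnects such as InfiniBand. The model, training loop, and launch command are placeholder assumptions, not RSC's actual training stack.

```python
# Minimal sketch of multi-node data-parallel training: each process
# drives one GPU, and gradients are synchronized across all nodes via
# NCCL over the cluster's network fabric. Launch with torchrun, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; wrapping it in DDP makes every backward pass
    # all-reduce gradients across every GPU in the job.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):  # placeholder training loop with random data
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()   # gradient all-reduce happens here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The design point is that the heavy lifting happens in the gradient all-reduce, which is why a non-oversubscribed fabric between nodes matters: the GPUs are only as fast as their slowest synchronization step.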
Before the end of 2022, RSC will contain some 16,000 GPUs in total and will be able to train AI systems with more than a trillion parameters on data sets as large as an exabyte. The raw number of GPUs is only a narrow metric for a system’s overall performance; for comparison, the AI supercomputer Microsoft built with research lab OpenAI comprises 10,000 GPUs.
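Some back-of-the-envelope arithmetic shows why GPU counts at this scale matter for trillion-parameter models. The per-parameter byte count and GPU memory figure below are common rule-of-thumb assumptions for mixed-precision training with Adam-style optimizer state, not numbers from Meta.

```python
# Rough sizing (assumptions, not Meta's figures): a trillion-parameter
# model's training state alone dwarfs any single accelerator's memory.
params = 1e12           # one trillion parameters
bytes_per_param = 16    # fp16 weights + grads + fp32 master weights
                        # + Adam moments, a common mixed-precision estimate
total_tb = params * bytes_per_param / 1e12
gpu_memory_gb = 80      # e.g., an 80 GB A100
min_gpus = params * bytes_per_param / (gpu_memory_gb * 1e9)
print(f"~{total_tb:.0f} TB of state, >= {min_gpus:.0f} GPUs just to hold it")
# ~16 TB of state, >= 200 GPUs just to hold it
```

That lower bound covers only storing the model; feeding it data at useful speed, holding activations, and finishing training in reasonable time is what pushes clusters into the thousands of GPUs.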
Building the next generation of AI infrastructure with RSC will help create the foundational technologies that will power the metaverse and advance the broader field of AI.