Microsoft recently announced ZeRO-Infinity, an addition to their open-source DeepSpeed AI training library that optimizes memory use for training very large deep-learning models. Using ZeRO-Infinity, Microsoft trained a model with 32 trillion parameters on a cluster of 32 NVIDIA DGX-2 nodes (512 GPUs), and demonstrated fine-tuning of a 1-trillion-parameter model on a single GPU.
The DeepSpeed team described the new features in a recent blog post. ZeRO-Infinity is the latest iteration of the Zero Redundancy Optimizer (ZeRO) family of memory optimization techniques. ZeRO-Infinity introduces several new strategies for addressing memory and bandwidth constraints when training large deep-learning models, including: a new offload engine for exploiting CPU and Non-Volatile Memory Express (NVMe) memory, memory-centric tiling to handle large operators without model-parallelism, bandwidth-centric partitioning for reducing bandwidth costs, and an overlap-centric design for scheduling data communication. According to the DeepSpeed team:
The improved ZeRO-Infinity offers the system capability to go beyond the GPU memory wall and train models with tens of trillions of parameters, an order of magnitude bigger than state-of-the-art systems can support. It also offers a promising path toward training 100-trillion-parameter models.
A recent trend in deep learning research is to train larger models on more data, with some of the largest models achieving superhuman performance on certain tasks. However, training these models requires large and expensive clusters of GPUs. In many cases, model developers can use transfer learning to fine-tune a large pre-trained model, using only a fraction of the compute resources that were required for pre-training; still, very large models such as GPT-3 are too big to fine-tune on a single machine. Both scenarios often require code refactoring to exploit distributed training frameworks.
To address these problems, Microsoft first released the DeepSpeed library and the Zero Redundancy Optimizer (ZeRO) in early 2020 as part of their AI at Scale program. ZeRO was improved in three stages, with each stage adding additional partitioning of model state, along with the ability to "offload" data and compute from the GPU to the CPU of a training machine. Stage 3 was released at the beginning of this year, with the ability to train models up to 40 billion parameters on a single machine and over 2 trillion parameters on a cluster of 512 GPUs.
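To make the stage progression concrete, the sketch below shows how a ZeRO stage and CPU offload are typically selected through a DeepSpeed configuration dictionary and passed to deepspeed.initialize. This is a minimal illustration, not guidance from the DeepSpeed team: the placeholder model and the specific batch-size and offload values are assumptions chosen only for readability.

```python
import torch
import deepspeed

# Illustrative DeepSpeed configuration: stage 1 partitions optimizer states,
# stage 2 additionally partitions gradients, and stage 3 also partitions the
# model parameters themselves. The offload_optimizer block moves optimizer
# state (and its update computation) from the GPU to the CPU.
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        }
    }
}

model = torch.nn.Linear(4096, 4096)  # stand-in for a real network

# deepspeed.initialize wraps the model in an engine that applies the
# partitioning and offloading described by the config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```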
The latest iteration of ZeRO, ZeRO-Infinity, brings new schemes for addressing two bottlenecks in training large models: memory size and memory bandwidth. The infinity offload engine increases the amount of memory available for storing model parameters and activations by using CPU and NVMe memory; unlike previous generations of ZeRO, the infinity engine can offload the entire model to these locations. Another new technique, memory-centric tiling, reduces the memory footprint of large model layers by breaking them into smaller "tiles" that are executed sequentially, allowing large models to be trained without model-parallelism. To handle bandwidth concerns, ZeRO-Infinity introduces bandwidth-centric partitioning, which partitions model parameters across multiple data-parallel processes, and an overlap engine that executes NVMe-to-CPU, CPU-to-GPU, and GPU-to-GPU communication simultaneously.
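As a rough sketch of what the infinity offload engine looks like from a user's perspective, the configuration below extends the earlier example to route both parameters and optimizer state to NVMe storage. The /local_nvme mount point is a hypothetical path and the values are placeholders, not tuned settings.

```python
# Sketch of a ZeRO-Infinity style configuration. Setting "device" to "nvme"
# asks the offload engine to stage tensors on NVMe storage instead of keeping
# them resident in GPU memory; "/local_nvme" is a hypothetical mount point.
zero_infinity_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # parameter partitioning is required for parameter offload
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True
        }
    }
}
```

In practice, the same dictionary would be passed to deepspeed.initialize as in the earlier sketch, or written to a JSON file and referenced from a launcher script.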
The team performed several experiments to validate ZeRO-Infinity's ability to scale, training "GPT-like" Transformer models of different sizes. Compared to a state-of-the-art 3D parallelism framework, ZeRO-Infinity handled models 40x larger using the same compute hardware. Compared to the previous version of ZeRO, the new version achieved a 2x speedup on a 64-GPU cluster. When training a 1T-parameter model, ZeRO-Infinity scaled super-linearly across cluster sizes from 64 GPUs to 512 GPUs.
The DeepSpeed library, which includes the ZeRO family of memory optimizations, is written for the PyTorch deep learning framework and has been adopted by several other PyTorch-based projects. HuggingFace, a popular source of pre-trained AI models, has integrated with the new ZeRO-Infinity release, and PyTorch Lightning, a distributed-training wrapper for PyTorch, has also adopted DeepSpeed and the first three stages of ZeRO. Facebook's FairScale library for training large PyTorch models also includes several ZeRO technologies.
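For example, the Hugging Face Trainer accepts a DeepSpeed configuration through its TrainingArguments. The sketch below assumes a hypothetical ds_config_zero3.json file containing a ZeRO stage 3 or ZeRO-Infinity section like the ones above.

```python
from transformers import TrainingArguments

# Hand memory management over to DeepSpeed by pointing the Trainer at a
# DeepSpeed config file; "ds_config_zero3.json" is a hypothetical file name.
training_args = TrainingArguments(
    output_dir="finetune-output",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed="ds_config_zero3.json",
)

# The arguments are then used with a Trainer as usual, e.g.:
# trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
# trainer.train()
```

PyTorch Lightning exposes similar behavior through its DeepSpeed integration in the Trainer.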
In a discussion on Reddit, one commenter described the DeepSpeed library as "incredibly valuable." Another pointed out:
However, these techniques (zero-offload, zero-infinity) generally aren't so helpful for helping you train larger models. For training really large models from scratch, memory usually isn't the bottleneck - compute is. However, these techniques could be quite helpful for helping you finetune.
The DeepSpeed library code is available on GitHub.