Microsoft open-sourced Zero Redundancy Optimizer version 2 (ZeRO-2), a distributed deep-learning optimization algorithm that scales super-linearly with cluster size. Using ZeRO-2, Microsoft trained a 100-billion-parameter natural-language processing (NLP) model 10x faster than with previous distributed learning techniques.
Writing in a blog post, program manager Rangan Majumder and distinguished engineer Junhua Wang described the algorithm and their experiments. ZeRO-2 is part of Microsoft's open-source DeepSpeed library for deep-learning training optimization. ZeRO-2 optimizes memory consumption during training, allowing for distributed training of models as large as 170 billion parameters. The algorithm also reduces communication between worker nodes in the distributed cluster, achieving super-linear parallel speedup and reducing training times by up to 10x. Using ZeRO-2 on a cluster of 1,024 GPUs, the DeepSpeed team achieved a record-setting time of 44 minutes to train a BERT natural-language model, an improvement of over 30% compared to NVIDIA's results.
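For readers who want a concrete picture of how ZeRO-2 is switched on, the sketch below shows the general shape of a DeepSpeed setup: a configuration dictionary whose `zero_optimization` section is set to stage 2, passed to `deepspeed.initialize`. The model, batch size, and optimizer settings are illustrative placeholders, and parameter names can differ between DeepSpeed versions, so treat this as a sketch rather than a canonical recipe.

```python
import torch
import deepspeed

# Illustrative DeepSpeed configuration (placeholder values).
# "stage": 2 requests ZeRO-2, which partitions optimizer states and
# gradients across the data-parallel workers.
ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,   # use pre-allocated contiguous gradient buffers
        "overlap_comm": True            # overlap gradient communication with backprop
    }
}

# Stand-in for a large Transformer model.
model = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.Linear(4096, 1024))

# deepspeed.initialize returns a wrapped "engine" that handles ZeRO partitioning,
# mixed precision, and distributed communication during forward/backward/step.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

Scripts written this way are typically launched with DeepSpeed's command-line launcher, which spawns one training process per GPU.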
Recent trends in NLP research have seen improved accuracy from larger models trained on larger datasets. OpenAI has proposed a set of "scaling laws" showing that model accuracy has a power-law relation to model size, and recently tested this idea by creating the GPT-3 model, which has 175 billion parameters. Because these models are too large to fit in the memory of a single GPU, training them requires a cluster of machines and model-parallel training techniques that distribute the parameters across the cluster. Several open-source frameworks implement efficient model parallelism, including GPipe and NVIDIA's Megatron, but these achieve only sub-linear speedup due to the overhead of communication between cluster nodes, and using them often requires refactoring the model.
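The power-law relation referenced above can be written compactly. The following is a sketch of the form reported in OpenAI's scaling-laws work, where N is the number of model parameters and the exponent and constant are empirically fitted; the numeric values shown are approximate.

```latex
% Approximate parameter-count scaling law:
% test loss L falls as a power law in model size N when data and compute are not limiting.
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N},
\qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}
```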
ZeRO-2 reduces the memory needed for training using three strategies: reducing model-state memory requirements, offloading layer activations to the CPU, and reducing memory fragmentation. ZeRO-2 can reduce model-state memory requirements by up to 8x by partitioning gradients and optimizer states across parallel processes. Layer activation values computed during the forward training pass must be retained for use in the backward pass; ZeRO-2 temporarily moves them from GPU memory to host CPU memory. Finally, memory allocation can fail even when enough memory is free, if that memory is not contiguous; ZeRO-2 reduces fragmentation by pre-allocating contiguous memory chunks for temporary values such as activations and gradients.
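As an illustration of the gradient-partitioning idea (a simplified sketch, not DeepSpeed's actual implementation), each data-parallel rank can end up holding only its own 1/world_size shard of the averaged gradients by using a reduce-scatter collective in place of the usual all-reduce:

```python
import torch
import torch.distributed as dist

def partition_gradients(flat_grad: torch.Tensor) -> torch.Tensor:
    """Simplified ZeRO-2-style gradient partitioning (illustrative only).

    Rather than all-reducing the full gradient onto every rank, reduce-scatter
    sums gradients across ranks and leaves each rank with only its own shard,
    cutting per-rank gradient memory roughly by the data-parallel degree.
    """
    world_size = dist.get_world_size()

    # Pad so the flattened gradient divides evenly across ranks.
    shard_size = (flat_grad.numel() + world_size - 1) // world_size
    padded = torch.zeros(shard_size * world_size,
                         dtype=flat_grad.dtype, device=flat_grad.device)
    padded[: flat_grad.numel()] = flat_grad

    # Each rank receives the summed values for its shard only.
    my_shard = torch.empty(shard_size, dtype=flat_grad.dtype, device=flat_grad.device)
    dist.reduce_scatter(my_shard, list(padded.chunk(world_size)))
    my_shard /= world_size  # average over data-parallel workers
    return my_shard  # this rank then updates only the parameters covered by its shard
```

The rank that owns a shard also keeps the corresponding slice of optimizer state, which is where most of the memory saving comes from.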
These memory optimizations can reduce the required degree of model parallelism, and thus the inter-node communication overhead, providing super-linear speedup when combined with data-parallel training: the DeepSpeed team found that increasing the number of GPUs used in training improved the overall throughput measured in teraflops per GPU. In experiments on large NLP models, the team observed an average throughput of 37 teraflops per V100 GPU "for model sizes ranging from 2 billion to 13 billion parameters." For a given model size and cluster, the team noted that ZeRO-2 trained the models up to 10x faster than the baseline Megatron approach. Using 1,024 V100 GPUs, the team trained a BERT model in 44 minutes, beating NVIDIA's previous record of 47 minutes set using 1,472 V100 GPUs.
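A quick back-of-the-envelope check on the BERT figures quoted above, using only the numbers in this article, shows that the gain is in total hardware time, not just wall-clock time:

```python
# GPU-minutes for the two BERT training runs cited above.
zero2_gpu_minutes = 1024 * 44    # DeepSpeed ZeRO-2: 1,024 V100s for 44 minutes -> 45,056
nvidia_gpu_minutes = 1472 * 47   # NVIDIA's prior record: 1,472 V100s for 47 minutes -> 69,184

# NVIDIA's run consumed roughly 1.54x the GPU-minutes of the ZeRO-2 run.
print(nvidia_gpu_minutes / zero2_gpu_minutes)
```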
DeepSpeed team member Jeff Rasley joined a discussion on Hacker News, answering questions from the community. Rasley noted that DeepSpeed has "hundreds of internal users" at Microsoft, who have used it to train models that are live in production. When asked about support for TPUs, Rasley replied:
The ZeRO technology is compatible with TPU or any accelerator in a cluster setting, but we have not tested it with the TPUs. It likely would require some small refactoring to get DeepSpeed to work with TPUs. We do not have any internal plans to support them yet, but of course completely open to contribution from the community.
The DeepSpeed library is available on GitHub.