Researchers from Amazon Web Services and Rice University have introduced Gemini, a distributed training system that rethinks failure recovery for large-scale deep learning models. According to the research paper, Gemini checkpoints model states to CPU memory, whose high bandwidth enables much faster failure recovery than prior approaches while overcoming the high recovery costs and constrained bandwidth of remote checkpoint storage.
The increasing scale of models, exemplified by PaLM with 540 billion parameters, means training runs span more machines for longer, so failures become both more frequent and more costly, as documented during OPT-175B training. Gemini addresses this by capitalizing on the high bandwidth of CPU memory to deliver fast failure recovery in large-model training.
One of Gemini's notable contributions is checkpointing to CPU memory so that training can recover from faults quickly. It also provides a near-optimal checkpoint placement strategy that maximizes the probability of recovering from CPU memory, and a communication scheduling technique that interleaves checkpoint traffic with training traffic to lessen interference (a group-style placement is sketched below).
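To make the placement idea concrete, here is a minimal Python sketch of a group-style placement. It assumes the number of machines is divisible by the replica count and simply replicates each machine's checkpoint across every member of its group; the function name and structure are illustrative, not the paper's implementation.

```python
def group_placement(num_machines: int, group_size: int) -> dict[int, list[int]]:
    """Illustrative group-style checkpoint placement (not Gemini's actual code).

    Machines are partitioned into groups of `group_size`; each machine's
    checkpoint is replicated to the CPU memory of every machine in its own
    group, so a single-machine failure leaves at least one replica reachable
    on a surviving group member.
    """
    assert num_machines % group_size == 0, "sketch assumes group size divides machine count"
    placement = {}
    for machine in range(num_machines):
        start = (machine // group_size) * group_size
        placement[machine] = list(range(start, start + group_size))  # replica holders
    return placement

# Example: 8 machines, 2 replicas per checkpoint.
# Machines 0 and 1 hold each other's checkpoints, 2 and 3 hold each other's, and so on.
print(group_placement(8, 2))
```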
Gemini's system architecture features a Checkpoint Creation Module and a Failure Recovery Module, designed together to keep checkpointing overhead during training low and to minimize wasted computation after a failure. Although frameworks such as PyTorch and TensorFlow provide checkpointing interfaces, earlier solutions write checkpoints to storage whose limited bandwidth and capacity cap the checkpointing frequency, and therefore how much training progress survives a failure; Gemini sidesteps this bottleneck by checkpointing to CPU memory.
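The checkpoint-creation side can be pictured as an asynchronous copy of GPU model states into pinned CPU buffers on a side CUDA stream, so the transfer overlaps with training computation. The sketch below uses plain PyTorch and illustrative names; it is not Gemini's actual interface.

```python
import torch

def snapshot_to_cpu(module: torch.nn.Module, stream: torch.cuda.Stream) -> dict:
    """Copy each parameter into a pinned CPU buffer on a side CUDA stream
    so the GPU-to-CPU transfer can overlap with training computation."""
    cpu_state = {}
    with torch.cuda.stream(stream):
        for name, param in module.named_parameters():
            buf = torch.empty(param.shape, dtype=param.dtype,
                              device="cpu", pin_memory=True)
            buf.copy_(param.detach(), non_blocking=True)  # asynchronous DMA over PCIe
            cpu_state[name] = buf
    return cpu_state

# Usage sketch: take a snapshot after each optimizer step, then wait for the
# copy to finish before the buffers are read or overwritten.
# side_stream = torch.cuda.Stream()
# cpu_state = snapshot_to_cpu(model, side_stream)
# side_stream.synchronize()
```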
Gemini also checkpoints to CPU memory at much higher frequencies than solutions such as DeepFreeze and CheckFreq, since CPU memory offers far more aggregate bandwidth than remote storage. To survive machine failures, it keeps redundant checkpoint copies in the CPU memory of peer machines, distinguishing it from approaches such as diskless checkpointing and FTC-Charm++.
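The corresponding recovery decision can be summarized in a few lines: when a machine fails, its replacement first tries to pull the latest checkpoint from a surviving replica holder's CPU memory and only falls back to remote persistent storage if no replica survived. All names below are assumptions for illustration, not Gemini's API.

```python
def recover_checkpoint(failed, placement, alive, fetch_from_peer, fetch_from_storage):
    """Prefer a surviving in-memory replica; fall back to remote storage."""
    for peer in placement[failed]:
        if peer != failed and peer in alive:
            return fetch_from_peer(peer)    # fast path: peer CPU memory over the network
    return fetch_from_storage(failed)       # slow path: remote persistent storage
```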
Experimental results underscore Gemini's performance: it recovers from failures more than 13× faster than existing checkpoint-based solutions, and the authors plan to extend the system to additional parallelism strategies and accelerators.
Gemini is built on DeepSpeed and uses the ZeRO-3 setting for distributed training, in which GPU model states are partitioned across data-parallel workers, and it relies on Amazon EC2 Auto Scaling Groups to replace failed instances, marking a meaningful step forward for fault tolerance in large-scale deep learning.
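For context, a ZeRO-3 setup of the kind Gemini builds on can be configured in a few lines of DeepSpeed; the configuration values and the stand-in model below are illustrative, not the paper's exact settings.

```python
import torch
import deepspeed

# Illustrative ZeRO-3 configuration: stage 3 partitions optimizer states,
# gradients, and parameters across data-parallel GPUs.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}

model = torch.nn.Linear(4096, 4096)  # stand-in for a large model
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```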