Researchers from Sony announced that they trained a ResNet-50 deep learning architecture on ImageNet, an image database organized according to the WordNet hierarchy, in only 224 seconds. The resulting network has a top-1 accuracy of 75% on the ImageNet validation set. The researchers achieved this record using 2,100 NVIDIA Tesla V100 Tensor Core GPUs. Besides this new timing record, they also reached a 90% GPU scaling efficiency using 1,088 Tesla V100 Tensor Core GPUs.
There are two major challenges in scaling neural network training to multiple machines: choosing the batch size to train with, and synchronizing the gradients.
To reduce the time it takes to train a neural network, one can use a small mini-batch size to quickly push the network weights "in the right direction". When updating the weights, the error yields a gradient that indicates the "direction" in which the weights need to be updated. With small mini-batches, this direction can be determined very quickly and often. However, small mini-batches also make a neural network prone to getting stuck in a local minimum. Sony addressed this problem with batch-size control, a known technique that has seen increasing use recently.
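To make the trade-off concrete, the toy example below (a simple quadratic loss and synthetic dataset, not anything from the paper) shows how the gradient computed over a small mini-batch is a fast but noisy estimate of the direction obtained from the full dataset:

```python
import numpy as np

# Illustration only: toy quadratic loss 0.5 * mean((w - x)^2) over synthetic data.
rng = np.random.default_rng(42)
data = rng.normal(loc=3.0, scale=1.0, size=10_000)   # synthetic samples
w = 0.0                                              # single weight to fit

def grad(batch, w):
    """Gradient of 0.5 * mean((w - x)^2) with respect to w."""
    return np.mean(w - batch)

full_grad = grad(data, w)                     # "true" direction over all samples
small_grad = grad(rng.choice(data, 8), w)     # fast but noisy estimate, 8 samples
large_grad = grad(rng.choice(data, 1024), w)  # slower but less noisy, 1024 samples
print(full_grad, small_grad, large_grad)
```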
With batch-size control, the mini-batch size is increased gradually during training to help the optimization avoid local minima in the loss landscape. By looking at more images, the neural network gets a better estimate of the average error direction it needs to move in, instead of deriving that direction from only a handful of samples. Sony treated the first five epochs as warm-up epochs with a low batch size, while later epochs used a larger batch size. Sony also used mixed-precision training, in which the forward and backward computations are carried out in half precision (FP16).
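A minimal sketch of such a schedule looks as follows; the concrete numbers (warm-up length, batch sizes, epoch count) are placeholders for illustration, not Sony's actual configuration:

```python
# Batch-size control sketch: small batches during warm-up, larger batches afterwards.
WARMUP_EPOCHS = 5        # the first five epochs use a small batch
WARMUP_BATCH_SIZE = 32   # hypothetical per-GPU batch size during warm-up
LARGE_BATCH_SIZE = 128   # hypothetical per-GPU batch size afterwards
TOTAL_EPOCHS = 90        # hypothetical total training length

def batch_size_for_epoch(epoch: int) -> int:
    """Return the mini-batch size to use in the given (0-indexed) epoch."""
    if epoch < WARMUP_EPOCHS:
        return WARMUP_BATCH_SIZE
    return LARGE_BATCH_SIZE

schedule = [batch_size_for_epoch(e) for e in range(TOTAL_EPOCHS)]
print(schedule[:8])  # -> [32, 32, 32, 32, 32, 128, 128, 128]
```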
The second challenge is that synchronizing the gradients among machines can be slow, since communication among the different GPUs requires a lot of bandwidth. Sony's researchers chose a 2D-Torus all-reduce algorithm to reduce this communication overhead. In this communication algorithm, the GPUs are arranged in a virtual grid. The gradients are first reduce-scattered horizontally along each row, then all-reduced vertically along each column, and finally all-gathered horizontally in a last pass. This means that, if X is the number of GPUs in the horizontal direction, 2(X-1) GPU-to-GPU operations are necessary.
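The NumPy simulation below illustrates this data flow on a small grid; the grid size, gradient length, and chunking scheme are hypothetical demo values, and a real implementation would perform these steps with NCCL collectives on actual GPUs:

```python
import numpy as np

# Simulate the data flow of a 2D-torus all-reduce on a Y-by-X grid of "GPUs",
# each holding a gradient vector of length N.
Y, X, N = 2, 4, 8                        # grid rows, grid columns, gradient length
rng = np.random.default_rng(0)
grads = rng.normal(size=(Y, X, N))       # grads[y, x] = gradient on GPU (y, x)
expected = grads.sum(axis=(0, 1))        # what every GPU should end up with

chunks = grads.reshape(Y, X, X, N // X)  # split each gradient into X chunks

# Step 1: reduce-scatter along each row -- GPU (y, x) keeps the sum of chunk x
# over its row, so each row's partial result is spread across its X GPUs.
row_scattered = chunks.sum(axis=1)       # shape (Y, X, N // X)

# Step 2: all-reduce along each column -- every GPU now holds the global sum
# of its assigned chunk.
col_reduced = row_scattered.sum(axis=0)  # shape (X, N // X)

# Step 3: all-gather along each row -- every GPU collects all X chunks and
# reassembles the full, globally summed gradient.
result_on_every_gpu = col_reduced.reshape(N)

assert np.allclose(result_on_every_gpu, expected)
print("2D-torus all-reduce result matches the global gradient sum")
```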
The Sony researchers used Sony's Neural Network Libraries (NNL) and its CUDA extension as the DNN training framework. For communication between GPUs they used version 2 of the NVIDIA Collective Communications Library (NCCL).
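For illustration only, the sketch below shows what an NCCL-backed gradient all-reduce looks like at the framework level, using PyTorch's torch.distributed rather than Sony's Neural Network Libraries; the tensor size and the averaging step are arbitrary demo choices:

```python
import torch
import torch.distributed as dist

# Launch with a distributed launcher, e.g. `torchrun --nproc_per_node=4 demo.py`.
def main():
    dist.init_process_group(backend="nccl")        # NCCL handles GPU-to-GPU comms
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    grad = torch.ones(1024, device="cuda") * rank  # stand-in for a local gradient
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)    # sum gradients across all GPUs
    grad /= dist.get_world_size()                  # average them

    if rank == 0:
        print("averaged gradient[0]:", grad[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```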
Last year, multiple parties attempted to train a ResNet-50 architecture in the least amount of time possible. In September 2017 InfoQ reported that IBM trained the same neural network architecture in 50 minutes. Back then, IBM achieved a higher scaling efficiency, but with only 256 GPUs. In a Q&A, Hillery Hunter stated that the batch size was one of the most challenging aspects, but that they expected their approach to scale to many more GPUs. In their paper, the Sony researchers report the GPU scaling efficiency for several GPU counts; when training with 3,264 GPUs, the efficiency drops to 52.47%.
Sony's researchers published their findings in the paper 'ImageNet/ResNet-50 Training in 224 Seconds', which can be found on arXiv.