Intel, Stanford, and the National Energy Research Scientific Computing Center (NERSC) recently announced the first supercomputing cluster to reach 15 petaflops on a Deep Learning workload. The work, published in the paper 'Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data', describes using a cluster of 9,622 Intel Xeon Phi processors running at 1.4 GHz to achieve an average sustained performance of 11.41 to 13.47 petaflops while training Deep Learning models on physics and climate data sets. The 15.07 petaflops figure is reported as peak single-precision performance. The experiments used NERSC's Cori Phase-II supercomputing cluster, which features 9,668 nodes with 68 cores per node, each core supporting 4 hardware threads (272 threads per node), for a total of 2,629,696 threads across the cluster.
The implementation used a combination of Intel Caffe, Intel Math Kernel Library (Intel MKL), and Intel Machine Learning Scaling Library (Intel MLSL).
The most important characteristic of this implementation is its scaling: a 7,205-times speedup on a 9,600-node cluster, which corresponds to a scaling factor of roughly 75%. Perfect, linear (100%) scaling would be a 9,600-times speedup.
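As a quick check of that arithmetic, scaling efficiency is simply the measured speedup divided by the ideal linear speedup; the sketch below just recomputes the figure quoted above.

```python
def scaling_efficiency(measured_speedup, num_nodes):
    """Ratio of the measured speedup to the ideal (linear) speedup on num_nodes."""
    return measured_speedup / num_nodes

# Figures quoted above: a 7,205x speedup on 9,600 nodes.
print(scaling_efficiency(7205, 9600))  # ~0.75, i.e. roughly 75% of linear scaling
```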
This was made possible in part by the work of Christopher Ré's group at the Department of Computer Science at Stanford University, which allows for both asynchronous and synchronous updates of the Artificial Neural Network (ANN) parameters.
Synchronisation barriers are a major obstacle to the parallelisation of algorithms in general, including Machine Learning algorithms. When multiple nodes compute a task synchronously, any transient slowdown of a single node delays and blocks the computation on all the other nodes; this is known as the straggler effect in distributed systems. Synchronised systems also suffer from decreased performance as the per-node batch size decreases, which can be a serious problem in massively parallelised clusters like the one described here. For example, Baidu's DeepBench benchmarking framework shows performance dropping to 25-30% of peak flops as the batch size decreases. The overall reduction operation, which combines the updates from all workers, has a complexity of O(log(M)), where M is the number of nodes.
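A minimal sketch (an illustration, not code from the paper) of why stragglers hurt synchronous training: each synchronous step can only finish when the slowest worker does, so with thousands of workers almost every step contains at least one straggler and the whole cluster runs at the straggler's pace.

```python
import random

def synchronous_step_time(worker_times):
    # A synchronous update cannot be applied until every worker has reported,
    # so the step takes as long as the slowest worker.
    return max(worker_times)

def simulate(num_workers, num_steps, straggler_prob=0.01, slowdown=5.0):
    total = 0.0
    for _ in range(num_steps):
        # Each worker normally takes 1 time unit, but occasionally stalls.
        times = [slowdown if random.random() < straggler_prob else 1.0
                 for _ in range(num_workers)]
        total += synchronous_step_time(times)
    return total

# With 9,600 workers and a 1% chance of a transient stall, nearly every step
# includes a straggler, so the total time approaches the worst case.
print(simulate(num_workers=9600, num_steps=100))   # close to 500, vs the ideal 100
```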
Asynchronous Deep Learning systems, on the other hand, can require more iterations (and thus more computation) to converge to a solution, because the staleness of parameter updates worsens statistical efficiency. There is also the risk of failing to converge to a solution at all; Ioannis Mitliagkas, a member of Christopher Ré's group, notes that even with the right objectives, the system will not converge if it is mistuned.
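A toy illustration of staleness (an assumption of this write-up, not the paper's code): a slow worker computes its gradient against an old snapshot of the parameters, and by the time that update is applied the parameter server has already moved on.

```python
import numpy as np

def grad(w):
    # Gradient of a simple quadratic objective f(w) = 0.5 * ||w||^2.
    return w

lr = 0.1
w = np.array([4.0, -2.0])
stale_w = w.copy()            # snapshot pulled by a slow worker

# Other workers push several updates while the slow worker is still computing.
for _ in range(5):
    w = w - lr * grad(w)

# The slow worker's update finally arrives, but it was computed on stale
# parameters: it reflects where the model was, not where it is now.
w = w - lr * grad(stale_w)

print(w)  # the stale update degrades statistical efficiency; with large
          # staleness or a mistuned learning rate it can prevent convergence
```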
The drawbacks and advantages of each model led the researchers to introduce a hybrid approach. Nodes form smaller compute groups in which they work synchronously, with the goal of producing a single update to the model. The different compute groups interact asynchronously with a centralised Parameter Server, getting the best of both the synchronous and asynchronous worlds.
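A schematic sketch of that hybrid scheme (the class and function names here are illustrative assumptions, not the paper's implementation): workers inside a group average their gradients synchronously, and each group then pushes its single combined update to the parameter server without waiting for the other groups.

```python
import numpy as np

class ParameterServer:
    """Central store of the model parameters; groups push updates asynchronously."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()

    def push(self, group_gradient):
        # Applied whenever a group finishes; there is no barrier across groups.
        self.w -= self.lr * group_gradient

def group_step(server, worker_grad_fns):
    """One synchronous step inside a single compute group.

    Every worker in the group computes its gradient against the same pulled
    parameters, and the group's average becomes one update for the server.
    """
    w = server.pull()
    grads = [g(w) for g in worker_grad_fns]    # synchronous within the group
    server.push(np.mean(grads, axis=0))        # asynchronous across groups

# Illustrative use: two groups optimising the toy objective f(w) = 0.5 * ||w - 1||^2.
server = ParameterServer(dim=2)
workers = [lambda w: w - 1.0] * 4              # four workers with the same toy gradient
for _ in range(50):
    group_step(server, workers[:2])            # group A
    group_step(server, workers[2:])            # group B (concurrent in practice)
print(server.w)                                # approaches [1., 1.]
```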
The hybrid approach reduced the straggler effect and provided a speed increase of 1.66x to 10x compared with the synchronous configuration's best and worst cases respectively. In addition, the system exhibits strong scaling up to 1,024 nodes, whereas the synchronous approach stops scaling at 512 nodes. Strong scaling refers to keeping the total problem size constant while increasing the number of processors, which Mitliagkas considers a common use case for Machine Learning problems.
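For completeness, a small sketch of what strong scaling measures (the timings below are hypothetical, not from the paper): the total problem size is held fixed, so the ideal runtime drops as 1/N and efficiency is the measured speedup over that ideal.

```python
def strong_scaling_efficiency(t_one_node, t_n_nodes, n_nodes):
    """Efficiency for a fixed total problem size run on one node vs. n nodes."""
    measured_speedup = t_one_node / t_n_nodes
    ideal_speedup = n_nodes           # runtime would ideally shrink as 1/N
    return measured_speedup / ideal_speedup

# Hypothetical timings: a fixed-size training run taking 10,240 s on one node
# and 13 s on 1,024 nodes.
print(strong_scaling_efficiency(10240.0, 13.0, 1024))   # ~0.77, i.e. ~77%
```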
Moreover, the algorithm was used to solve real scientific problems. One application was learning to separate the rare signals of new particles from background events, with the aim of understanding the fundamental nature of our universe. The other was identifying features in climate data, enabling researchers to characterise changes in the frequency and intensity of extreme weather under climate change.