Researchers at North Carolina State University recently presented a paper at the 35th IEEE International Conference on Data Engineering (ICDE 2019) describing a new technique that can reduce training time for deep neural networks by up to 69%.
Dr. Xipeng Shen, along with Ph.D. students Lin Ning and Hui Guan, developed a technique called Adaptive Deep Reuse (ADR) that takes advantage of similarities among the data values fed into a neural network layer. One fundamental operation in training a neural network is multiplying a vector of input data by a matrix of weights; it is this multiplication that consumes the bulk of the processing power during training.
The key insight of ADR is that instead of re-calculating the vector-matrix product for every unique input vector, the training process can cluster the inputs and reuse a single approximate product for similar input vectors. This does reduce the accuracy of the computations; however, the early stages of training are "less sensitive to approximation errors than in later stages." In later stages, the process adapts the clustering parameters to reduce the approximation errors. Furthermore, the clustering and approximation can be applied not only to the input of the neural network, but also to the activation maps in the hidden layers of the network.
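To make the reuse idea concrete, below is a minimal NumPy sketch of the core step, written as an illustration rather than the authors' implementation: the deep_reuse_matmul function, the random sign-projection hash that stands in for the paper's clustering of input vectors, and the n_bits parameter are all assumptions introduced here. Coarser clustering (fewer bits) gives larger savings with larger approximation error; tightening it over the course of training mirrors the adaptive part of ADR.

```python
import numpy as np

def deep_reuse_matmul(X, W, n_bits=6, rng=None):
    """Approximate X @ W by grouping similar rows of X, computing one
    centroid-weight product per group, and reusing it for every row in
    that group. The grouping here is a simple random sign-projection
    hash -- an illustrative stand-in for the clustering used in the paper.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape

    # Hash each input row to a bucket; rows with the same code count as "similar".
    proj = rng.standard_normal((d, n_bits))
    codes = (X @ proj > 0) @ (1 << np.arange(n_bits))

    out = np.empty((n, W.shape[1]))
    for code in np.unique(codes):
        members = codes == code
        centroid = X[members].mean(axis=0)
        out[members] = centroid @ W   # one product, reused by all members
    return out

# Toy usage: rows of X stand in for flattened input patches (or slices of an
# activation map); W holds the layer's weights.
X = np.random.rand(1024, 64)
W = np.random.rand(64, 32)
approx = deep_reuse_matmul(X, W, n_bits=6)
exact = X @ W
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

Raising n_bits in later stages shrinks each cluster and hence the approximation error, at the cost of fewer reuse opportunities; this is the sense in which the adaptive scheme trades speed for accuracy over time.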
The team evaluated ADR by using it during the training of three popular convolutional neural network (CNN) architectures for image classification: CifarNet, AlexNet, and VGG-19. ADR reduced training time by 63% for CifarNet, 68% for VGG-19, and 69% for AlexNet, all with no loss in accuracy.
InfoQ spoke with Dr. Shen about his work.
InfoQ: What was the inspiration or "aha moment" of this new technique? Were there some other approaches that you tried before you developed adaptive deep reuse?
Xipeng Shen: This idea came from a long-term vision we had. In the world, there are lots of computations going on at any moment that are not necessary; that's the general vision. But by focusing on deep neural networks, we have been thinking about whether we could find these redundant computations and avoid them. We tried a number of ways before we got to this point. We tried to look at the filter weights, which can be sparse; that is, many of the weights are zero. We explored that and tried to avoid multiplying by zero, but it was difficult. Then we realized that the activation maps or inputs could be another angle. Our insight started from the inputs: lots of images have lots of pixels that are similar or identical. A graduate student investigated real-world datasets and indeed found that was true. Then we wanted to know if this was also true in the middle layers, and it was. With that, the only missing piece of the puzzle was how to efficiently find the reuse opportunities and turn them into actual time savings. Adaptive deep reuse was born after we figured that out.
InfoQ: You implemented your work on TensorFlow; why TensorFlow instead of some other framework? Could adaptive deep reuse be applied in other frameworks?
Shen: TensorFlow was chosen because we were more familiar with it. It is also a very popular framework. The technique itself is very fundamental: it accelerates the low-level matrix-vector multiplication operations, so it will work across platforms.
InfoQ: This technique can be applied at inference time as well as during training; do you see it as a valuable enhancement for inference in compute-constrained environments such as mobile or IoT?
Shen: We already have an upcoming paper that will be presented at the International Conference on Supercomputing (ICS 2019) in June, which is about using the technique during inference. In that paper we showed that deep reuse can speed up inference by a factor of two. We did that on both servers and embedded devices, so the technique has already shown its potential for embedded systems such as IoT devices.
InfoQ: Would this technique work for other deep neural network architectures, for example RNNs or transformers/attention networks?
Shen: Yes, it would. We discussed using it on an LSTM, but have not done that experiment to quantitatively determine the benefits.
InfoQ: Can you talk about any related future work?
Shen: There are a number of possible directions. We explored reuse in the activation maps and inputs, but there could be other reuse opportunities still hiding in deep neural networks.
There are also other types of networks besides CNNs: we need to quantify the benefits in networks such as LSTMs. Other researchers have asked about the potential of applying this idea to their systems. The patterns could be different, so there could be new challenges and new opportunities for innovation. But that is our goal: to generalize this technique for other networks.
We could also apply this technique to distributed training, where there are interesting challenges: for example, reusing data across nodes for model parallelism.
The paper, “Adaptive Deep Reuse: Accelerating CNN Training on the Fly,” can be read at the ICDE 2019 website.