Microsoft's Azure Machine Learning team recently open-sourced their contribution to the ONNX Runtime library for improving the performance of the natural language processing (NLP) model BERT. With the optimizations, the model's inference on the SQuAD benchmark sped up 17x.
Senior program manager Emma Ning gave an overview of the results in a blog post. In collaboration with engineers from Bing, the Azure researchers developed a condensed BERT model for understanding web-search queries. To improve response time, the team re-implemented the model in C++. Microsoft is now open-sourcing those optimizations by contributing them to ONNX Runtime, an open-source library for accelerating neural-network inference. According to Ning,
With ONNX Runtime, AI developers can now easily productionize large transformer models with high performance across both CPU and GPU hardware, using the same technology Microsoft uses to serve their customers.
BERT is an NLP model developed by Google AI, and Google announced last year that the model was being used by their search engine to help process about one in ten search queries. BERT is useful for handling longer queries, or queries where short words (like "for" and "to", which are often ignored by standard search engines) are especially relevant to the meaning of the query. Bing also began using deep-learning NLP models in their search engine last year. However, BERT is a complex model, which means that processing a search query through it (that is, running inference) is computationally expensive and relatively slow. Bing found that even a condensed three-layer version required twenty CPU cores to achieve 77ms latency, which is already close to the threshold at which users notice a delay. To handle the volume of queries at Bing's scale, using this model would require thousands of servers.
BERT inference does benefit from the parallelism of GPUs, and Bing found that inference latency on Azure GPU VMs dropped to 20ms. For further improvements, the team partnered with NVIDIA to re-implement the model using TensorRT C++ APIs and CUDA libraries. This optimized model achieved 9ms latency. By using mixed precision and Tensor Cores, the latency improved further to 6ms.
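For illustration only, the sketch below shows what enabling mixed precision can look like when building a TensorRT engine from an ONNX export of a model. Microsoft's team worked with the TensorRT C++ APIs; this sketch uses the Python bindings instead, follows TensorRT 7/8-era interfaces, and the file name is a placeholder rather than a detail from the blog post.

```python
# Sketch: build a TensorRT engine from an ONNX model with FP16 (Tensor Core)
# mode enabled. API details vary across TensorRT versions; this follows the
# TensorRT 7/8-era Python bindings and is illustrative only.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# "bert.onnx" is a placeholder for an ONNX export of the model; this assumes
# fixed input shapes (dynamic shapes also need an optimization profile).
with open("bert.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30       # 1 GB scratch space for tactics
config.set_flag(trt.BuilderFlag.FP16)     # mixed precision on Tensor Cores
engine = builder.build_engine(network, config)
```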
Deep-learning inference performance is a major concern for web search as well as for mobile and edge devices, but re-implementing models by hand is not an attractive option for most practitioners seeking to improve it. Inference acceleration tools, such as TensorFlow Lite and PyTorch Mobile, are now standard components of deep-learning frameworks. These tools improve performance by automatically rewriting the model code to take advantage of device-specific hardware acceleration and optimized libraries. The process is similar to that used by an optimizing compiler for a high-level programming language, and similarly requires an abstract representation of the model being optimized. ONNX is an open standard for such a representation, and ONNX Runtime is an implementation of that standard.
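As a rough sketch of that workflow, the example below exports a Hugging Face BERT model from PyTorch to the ONNX format and runs it through an ONNX Runtime inference session. The model name, dummy input, and opset version are illustrative assumptions rather than details from Microsoft's post, and the exact export arguments vary with library versions.

```python
# Sketch: export a BERT model to ONNX and run it with ONNX Runtime.
# Assumes the transformers, torch, and onnxruntime packages are installed;
# the model name and opset version are illustrative choices.
import torch
from transformers import BertModel, BertTokenizer
import onnxruntime as ort

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()
model.config.return_dict = False  # export plain tuples, not a ModelOutput

# Build a dummy input and export the PyTorch graph to the ONNX format.
inputs = tokenizer("a web search query", return_tensors="pt")
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "bert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=11,
)

# Run inference on the exported model with ONNX Runtime.
session = ort.InferenceSession("bert.onnx")
outputs = session.run(
    None,
    {"input_ids": inputs["input_ids"].numpy(),
     "attention_mask": inputs["attention_mask"].numpy()},
)
print(outputs[0].shape)  # (batch, sequence length, hidden size)
```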
Taking the lessons learned from re-implementing BERT, the Bing and Azure developers updated the ONNX Runtime code to automatically optimize any BERT model for inference on CPU as well as GPU. When applied to the three-layer BERT model, the optimizations improved CPU performance 17x and GPU performance 3x. Bing developers also found the ONNX Runtime easier to use, and it reduced the time needed to optimize new models.
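ONNX Runtime also ships a transformer optimization tool that applies BERT-specific graph rewrites (such as fusing attention and layer-normalization subgraphs) ahead of time. The sketch below shows one way to invoke it from Python; the module path and arguments reflect recent onnxruntime packaging and may differ across releases.

```python
# Sketch: apply ONNX Runtime's BERT graph optimizations to an exported model.
# The module path and arguments are assumptions based on recent onnxruntime
# releases; consult the ONNX Runtime documentation for the exact interface.
import onnxruntime as ort
from onnxruntime.transformers import optimizer

# Fuse attention, layer-normalization, and GELU subgraphs in the BERT graph.
optimized = optimizer.optimize_model(
    "bert.onnx",          # model exported to ONNX (see the earlier sketch)
    model_type="bert",
    num_heads=12,         # bert-base configuration
    hidden_size=768,
)
optimized.save_model_to_file("bert_optimized.onnx")

# The optimized model runs through the same InferenceSession API; on a GPU
# build of ONNX Runtime, the CUDA execution provider can be requested.
session = ort.InferenceSession(
    "bert_optimized.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```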
BERT model optimizations are available in the latest release of ONNX Runtime on GitHub.