Google's Apollo AI for Chip Design Improves Deep Learning Performance by 25%

Scientists at Google Research have announced APOLLO, a framework for optimizing AI accelerator chip designs. APOLLO uses evolutionary algorithms to select chip parameters that minimize deep-learning inference latency while also minimizing chip area. Using APOLLO, researchers found designs that achieved a 24.6% speedup over those chosen by a baseline algorithm.

Research Scientist Amir Yazdanbakhsh gave a high-level overview of the system in a recent blog post. APOLLO searches for a set of hardware parameters, such as memory size, I/O bandwidth, and processor units, that provides the best inference performance for a given deep-learning model. By using evolutionary algorithms and transfer learning, APOLLO can efficiently explore the space of parameters, reducing the overall time and cost of producing the design. According to Yazdanbakhsh,

We believe that this research is an exciting path forward to further explore ML-driven techniques for architecture design and co-optimization (e.g., compiler, mapping, and scheduling) across the computing stack to invent efficient accelerators with new capabilities for the next generation of applications.

Deep-learning models have been developed for a wide variety of problems, from computer vision (CV) to natural language processing (NLP). However, these models often require large amounts of compute and memory resources at inference time, straining the hardware constraints of edge and mobile devices. Custom accelerator hardware, such as Edge TPUs, can improve model inference latency, but often requires modifications to the model, such as parameter quantization or model pruning. Some researchers, including a team at Google, have proposed using AutoML to design high-performance models targeted for specific accelerator hardware.

The APOLLO team's strategy, by contrast, is to customize the accelerator hardware to optimize performance for a given deep-learning model. The accelerator is based on a 2D array of processing elements (PEs), each of which contains a number of single instruction multiple data (SIMD) cores. This basic pattern can be customized by choosing values for several different parameters, including the size of the PE array, the number of cores per PE, and the amount of memory per core. Overall, there are nearly 500M parameter combinations in the design space. Because a proposed accelerator design must be simulated in software, evaluating its performance on a deep-learning model is time- and compute-intensive.
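To give a feel for how quickly such a design space grows, here is a minimal Python sketch of a discrete parameter grid. The parameter names and value lists are hypothetical, chosen only for illustration; they are not the actual APOLLO design space, and this toy grid is far smaller than the roughly 500M combinations reported.

```python
from itertools import product
from math import prod

# Hypothetical parameter grid: names and values are illustrative only,
# not the parameters or ranges used by the APOLLO team.
DESIGN_SPACE = {
    "pe_rows": [1, 2, 4, 8, 16],
    "pe_cols": [1, 2, 4, 8, 16],
    "simd_cores_per_pe": [1, 2, 4, 8],
    "memory_kb_per_core": [8, 16, 32, 64, 128],
    "io_bandwidth_gbps": [5, 10, 20, 40],
}


def design_space_size(space):
    """Number of distinct designs: the product of the option counts."""
    return prod(len(values) for values in space.values())


def iter_designs(space):
    """Lazily yield every design as a {parameter: value} dictionary."""
    names = list(space)
    for combo in product(*space.values()):
        yield dict(zip(names, combo))


if __name__ == "__main__":
    print(design_space_size(DESIGN_SPACE))
    print(next(iter_designs(DESIGN_SPACE)))
```

Because the size is a product of the per-parameter option counts, adding one more parameter or a few more values per parameter multiplies the number of candidates, which is why exhaustively simulating every design quickly becomes impractical.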

APOLLO builds on Google's internal Vizier "black box" optimization tool, and Vizier's Bayesian optimization method is used as a baseline for evaluating APOLLO's performance. The APOLLO framework supports several optimization strategies, including random search, model-based optimization, evolutionary search, and an ensemble method called population-based black-box optimization (P3BO). The Google team performed several experiments, searching for optimal accelerator parameters for a set of CV models, including MobileNetV2 and MobileNetEdge, for three different chip-area constraints. They found that the P3BO algorithm produced the best designs, and that its improvement over Vizier increased as available chip area decreased. Compared to a manually guided exhaustive or "brute-force" search, P3BO found a better configuration while performing 36% fewer search evaluations.
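The general idea of evolutionary search under a chip-area constraint can be sketched in a few lines of Python. This is a plain evolutionary loop, not P3BO and not the APOLLO implementation. The parameter grid, `toy_area`, and `toy_latency` are all hypothetical stand-ins for the real design space and the software simulator, and seeding the population with the smallest design is an assumption made here so that at least one feasible candidate always exists.

```python
import random

# Hypothetical parameter grid (illustrative values, not APOLLO's).
SPACE = {
    "pe_rows": [1, 2, 4, 8, 16],
    "pe_cols": [1, 2, 4, 8, 16],
    "simd_cores_per_pe": [1, 2, 4, 8],
    "memory_kb_per_core": [8, 16, 32, 64, 128],
}


def toy_area(d):
    """Made-up area model: grows with core count and memory per core."""
    cores = d["pe_rows"] * d["pe_cols"] * d["simd_cores_per_pe"]
    return cores * (1.0 + d["memory_kb_per_core"] / 32.0)


def toy_latency(d):
    """Made-up latency model standing in for a slow simulator run."""
    cores = d["pe_rows"] * d["pe_cols"] * d["simd_cores_per_pe"]
    return 4096.0 / cores + 64.0 / d["memory_kb_per_core"]


def fitness(d, area_budget):
    """Latency to minimize; designs over the area budget are rejected."""
    return toy_latency(d) if toy_area(d) <= area_budget else float("inf")


def mutate(d, rng):
    """Copy a design and resample one randomly chosen parameter."""
    child = dict(d)
    key = rng.choice(list(SPACE))
    child[key] = rng.choice(SPACE[key])
    return child


def evolutionary_search(area_budget, generations=30, population_size=20, seed=0):
    rng = random.Random(seed)
    # Start from the smallest design plus random candidates.
    smallest = {name: min(values) for name, values in SPACE.items()}
    population = [smallest] + [
        {name: rng.choice(values) for name, values in SPACE.items()}
        for _ in range(population_size - 1)
    ]
    for _ in range(generations):
        # Keep the best quarter as parents and refill with mutated copies.
        population.sort(key=lambda d: fitness(d, area_budget))
        parents = population[: population_size // 4]
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return min(population, key=lambda d: fitness(d, area_budget))


if __name__ == "__main__":
    best = evolutionary_search(area_budget=64.0)
    print(best, toy_area(best), toy_latency(best))
```

Each call to `fitness` corresponds to one design evaluation, which in the real setting would be an expensive simulation; this is why the number of search evaluations is the cost the researchers report.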

The design of accelerator hardware for improving AI inference is an active research area. Apple's new M1 processor includes a neural engine designed to speed up AI computations. Stanford researchers recently published an article in Nature describing a system called Illusion that uses a network of smaller chips to emulate a single larger accelerator. At Google, scientists have also published work on optimizing chip floorplanning, to find the best placement of integrated-circuit components on the physical chip.
