PrefixRL: Nvidia's Deep-Reinforcement-Learning Approach to Design Better Circuits

Nvidia has developed PrefixRL, a reinforcement-learning (RL) approach to designing parallel-prefix circuits that are smaller and faster than those designed by state-of-the-art electronic-design-automation (EDA) tools.

Various important circuits in the GPU, such as adders, incrementors, and encoders, are parallel-prefix circuits. These circuits are fundamental to high-performance digital design and can be defined at a higher level as prefix graphs. PrefixRL focuses on this class of arithmetic circuits, and the main goal of the approach is to understand whether an AI agent can design a good prefix graph, given that the state space of the problem is O(2^(n^n)) and cannot be explored with brute-force methods.
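
As a minimal illustration of what a prefix graph computes, the Python sketch below produces every prefix result of an associative operator with a naive serial scan; a prefix graph realizes the same outputs through a tree of intermediate range nodes, trading node count (circuit area) against tree depth (delay). The helper name is illustrative only, not from PrefixRL.

def serial_prefix(xs, op):
    """Naive reference: out[i] = x[0] op x[1] op ... op x[i].
    A prefix graph computes the same outputs with a different tree of
    intermediate range nodes, which is what the RL agent rearranges."""
    out, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

# Prefix sums of 1..8 (an adder's carry network computes an analogous prefix)
print(serial_prefix(list(range(1, 9)), lambda a, b: a + b))
# -> [1, 3, 6, 10, 15, 21, 28, 36]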

A desirable circuit is small, fast, and consumes little power. Nvidia found that power consumption correlates well with area for the circuits of interest, but circuit area and delay are often competing properties. The goal of PrefixRL is therefore to find the Pareto frontier of designs that effectively trades off area against delay: fitting more circuits into less area, decreasing the delay of the chip to improve performance, and consuming less power.
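
The small, hypothetical helper below illustrates what the Pareto frontier means here: given candidate designs scored by (area, delay), it keeps only those that no other design beats on both metrics at once.

def pareto_frontier(designs):
    """designs: list of (area, delay) pairs; returns the non-dominated set."""
    frontier = []
    for a, d in designs:
        dominated = any(a2 <= a and d2 <= d and (a2, d2) != (a, d)
                        for a2, d2 in designs)
        if not dominated:
            frontier.append((a, d))
    return sorted(frontier)

print(pareto_frontier([(10, 5), (8, 6), (12, 4), (9, 7), (11, 5)]))
# -> [(8, 6), (10, 5), (12, 4)]; (9, 7) and (11, 5) are dominated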

The Hopper GPU, Nvidia's latest architecture, has nearly 13,000 instances of AI-designed circuits.

The PrefixRL agent is trained with a fully convolutional neural network (a Q-learning agent). The input and output of the Q network are grid representations of prefix graphs, where each element in the grid uniquely maps to a prefix node. Each element in the input grid indicates whether a node is present or absent; on the output side, each element holds the Q values for adding or removing a node. The agent predicts the values for area and delay separately, because these properties are observed separately during training.

Representations of prefix graphs (left) and fully convolutional Q-learning agent architecture (right).
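
The sketch below is a minimal PyTorch approximation of such an architecture, not Nvidia's actual network: a fully convolutional body over the node-presence grid, with separate 1x1-convolution heads that output per-cell Q values for the add and remove actions, one head for area and one for delay.

import torch
import torch.nn as nn

class PrefixQNet(nn.Module):
    """Illustrative fully convolutional Q network over the prefix-graph grid."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # two output channels per head: Q(add node), Q(remove node)
        self.area_head = nn.Conv2d(channels, 2, kernel_size=1)
        self.delay_head = nn.Conv2d(channels, 2, kernel_size=1)

    def forward(self, grid):                    # grid: (batch, 1, n, n) 0/1 node map
        h = self.body(grid)
        return self.area_head(h), self.delay_head(h)

area_q, delay_q = PrefixQNet()(torch.zeros(1, 1, 32, 32))
print(area_q.shape, delay_q.shape)              # both (1, 2, 32, 32)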

At each episode step of the reinforcement-learning task, the agent adds or removes a node from the prefix graph and receives the improvement in the corresponding circuit area and delay as rewards. The remaining steps of the design process legalize the prefix graph so that it always maintains a correct prefix-sum computation, then generate a circuit from the legalized graph. Finally, a physical-synthesis tool optimizes the circuit, and the last design-process step measures its area and delay properties.
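
The Python sketch below mirrors that step structure under simplifying assumptions: the legalize, synthesize, and measure callables stand in for the real EDA flow and are stubbed trivially here so the example runs.

class PrefixRLEnv:
    """Illustrative environment: one step = edit graph, legalize, synthesize,
    measure, and reward the improvement in area and delay."""
    def __init__(self, graph, legalize, synthesize, measure):
        self.graph = set(graph)
        self.legalize, self.synthesize, self.measure = legalize, synthesize, measure
        self.prev = self._evaluate()            # baseline (area, delay)

    def _evaluate(self):
        circuit = self.synthesize(self.legalize(self.graph))
        return self.measure(circuit)

    def step(self, node, add):
        (self.graph.add if add else self.graph.discard)(node)
        area, delay = self._evaluate()
        reward = (self.prev[0] - area, self.prev[1] - delay)   # improvements
        self.prev = (area, delay)
        return self.graph, reward

# Trivial stubs: "area" = node count, "delay" = widest node range.
env = PrefixRLEnv(graph={(3, 0), (1, 0)},
                  legalize=lambda g: g,
                  synthesize=lambda g: g,
                  measure=lambda c: (len(c), max(hi - lo for hi, lo in c)))
print(env.step((2, 0), add=True))               # adding a node costs area here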

The best tradeoff between area and delay, the Pareto frontier of designs, is obtained by training many agents with different tradeoff weights ranging from 0 to 1, so that the physical-synthesis optimizations in the RL environment generate a variety of solutions. This synthesis process is slow (about 35 seconds for 64-bit adders) and computationally demanding: physical simulation required 256 CPUs for each GPU, and training the 64-bit case took over 32,000 GPU hours.
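
One way to picture the weighting scheme is the scalarized reward below: each hypothetical agent trains with its own weight w in [0, 1], and sweeping w across agents traces out different points of the area/delay tradeoff.

import numpy as np

def scalarized_reward(area_improvement, delay_improvement, w):
    """Blend the two improvements; w = 1 optimizes area only, w = 0 delay only."""
    return w * area_improvement + (1.0 - w) * delay_improvement

for w in np.linspace(0.0, 1.0, 5):              # one agent per tradeoff weight
    print(f"w={w:.2f}  reward={scalarized_reward(2.0, -1.0, w):+.2f}")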

For this kind of RL task, Nvidia developed Raptor, an in-house distributed reinforcement-learning platform that takes special advantage of Nvidia hardware. The core features that enhance scalability and training speed for such tasks are job scheduling, GPU-aware data structures, and custom networking. To improve networking performance, Raptor can switch between NCCL (point-to-point transfers that move model parameters directly from the learner GPU to an inference GPU), Redis (asynchronous, smaller messages such as rewards or statistics), and a JIT-compiled RPC (high-volume, low-latency requests such as uploading experience data).
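
The routing table below is a hedged sketch of that idea, not Raptor's actual API: each message category is mapped to the transport the article attributes to it.

from enum import Enum, auto

class Transport(Enum):
    NCCL = auto()      # GPU point-to-point: model parameters, learner -> actors
    REDIS = auto()     # asynchronous small messages: rewards, statistics
    JIT_RPC = auto()   # high-volume, low-latency: experience-data uploads

def pick_transport(message_kind: str) -> Transport:
    routing = {
        "model_parameters": Transport.NCCL,
        "reward": Transport.REDIS,
        "statistics": Transport.REDIS,
        "experience": Transport.JIT_RPC,
    }
    return routing[message_kind]

print(pick_transport("experience"))             # Transport.JIT_RPC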

NVIDIA framework powers concurrent training and data collection.

Raptor increases training speed by letting the actor agents step through the environment without waiting, thanks to a pool of CPU workers that perform the physical synthesis in parallel. When the CPU workers return the rewards, the transition is inserted into the replay buffer and the rewards are cached, so redundant computation is avoided when the same state is encountered again.
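
The sketch below illustrates that caching idea with illustrative helper names: synthesis results are memoized by a hashable encoding of the prefix-graph state, so a repeated state never triggers another physical-synthesis run.

replay_buffer = []
reward_cache = {}

def get_reward(state, run_synthesis):
    key = frozenset(state)                      # hashable encoding of the graph
    if key not in reward_cache:
        reward_cache[key] = run_synthesis(state)    # slow EDA call, done once
    return reward_cache[key]

def record_transition(state, action, next_state, run_synthesis):
    reward = get_reward(next_state, run_synthesis)
    replay_buffer.append((state, action, reward, next_state))
    return reward

call_count = 0
def fake_synthesis(state):                      # stand-in for physical synthesis
    global call_count
    call_count += 1
    return float(len(state))

record_transition({(1, 0)}, "add (2, 0)", {(1, 0), (2, 0)}, fake_synthesis)
record_transition({(3, 0)}, "add (2, 0)", {(1, 0), (2, 0)}, fake_synthesis)
print(call_count)                               # 1: the repeated state hit the cache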

The adders designed by the RL agents have 25% lower area than those produced by EDA tools at the same delay, and they have irregular structures. The RL agents achieve this by learning to design circuits from scratch, with feedback from the synthesized circuit properties.
