A recent report published by Google's TPU group highlights ten takeaways from developing three generations of tensor processing units, from TPUv1 (2016) to the latest inference unit, TPUv4i (2020). The authors also discuss how their experience so far will shape the development of upcoming TPU versions. The paper extends previous work on TPUs and offers additional insight into the future of architecture development specific to deep learning (DL).
Tensor Processing Units are Google's custom ASICs designed for deep neural network inference (TPUv1 and TPUv4i) and training (TPUv2 to v4). The TPUs' design decisions, and the tradeoffs behind them, are aimed at parallelizing common deep network functions and activations. TPUs therefore offer a cost-effective alternative for production workloads in which the deep network solution has become relatively stable.
Considering the pace at which deep networks have improved, the development of TPUs has also required several iterations over the last decade. The key lessons stated in the progress report can be summarized as follows:
- Hardware components constituting processing units evolve at different rates (e.g. logic, wires, SRAM, DRAM).
- The success and popularity of an architecture heavily depend on the related compiler optimizations (XLA for TPUs), but designers should be aware that compilers, like the architectures themselves, are optimized iteratively (see the XLA sketch after this list).
- When developing domain-specific architectures, it is better to optimize performance per total cost of ownership (TCO) rather than per initial hardware cost (see the perf/TCO sketch after this list).
- Backward compatibility with existing deep learning workloads is an important factor for future DL systems and should be taken into account.
- Inference units may require air cooling, as they are globally distributed to reduce request latencies and providing the necessary liquid cooling may not be feasible at every location.
- Integer-only (quantized) inference may not be optimal for some applications; DL architectures should therefore support floating-point operations as well (see the quantization sketch after this list).
- Inference workloads should allow multitenancy. Since inference does not require gradient accumulation and has a relatively low resource footprint, sharing may reduce both cost and overall latency.
- Future DL architectures should take the growing size of deep learning models into account (deep networks grow ~1.5x per year in memory and compute requirements).
- Future DL architectures should take the increasing diversity of deep network functions and activations into account (i.e. deep network architectures evolve rapidly).
- The service level objective (SLO) limit should depend on the API's p99 latency rather than on the inference batch size, so future architectures can use larger batches to their advantage (see the latency sketch after this list).
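
To make the compiler lesson concrete, here is a minimal JAX sketch (the function, shapes, and values are illustrative and not taken from the report) of how a computation is handed to the XLA compiler, the same compiler the TPU software stack relies on:

```python
import jax
import jax.numpy as jnp

# A toy "layer": a matrix multiplication followed by an activation.
def layer(w, x):
    return jax.nn.relu(jnp.dot(x, w))

# jax.jit hands the traced computation to XLA, which fuses and optimizes
# the operations for whatever backend is available (CPU, GPU, or TPU).
layer_jit = jax.jit(layer)

x = jnp.ones((8, 128))
w = jnp.ones((128, 256))
print(layer_jit(w, x).shape)  # (8, 256)
```

How much such a compiled function actually gains depends on the XLA optimizations available for the target backend, which is exactly the co-evolution of compiler and architecture the lesson points to.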
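
The perf/TCO point can be illustrated with made-up numbers (none of the figures below come from the report): an accelerator that looks better per dollar of purchase price can still lose once lifetime operating costs are included.

```python
# Hypothetical accelerator options: performance, CapEx, and lifetime OpEx.
# All numbers are invented for illustration only.
options = {
    "chip_a": {"inferences_per_s": 1.0e6, "capex_usd": 8_000, "opex_usd": 4_000},
    "chip_b": {"inferences_per_s": 1.2e6, "capex_usd": 9_000, "opex_usd": 9_000},
}

for name, o in options.items():
    tco = o["capex_usd"] + o["opex_usd"]   # TCO = purchase cost + operating cost
    perf = o["inferences_per_s"]
    print(f"{name}: perf/CapEx = {perf / o['capex_usd']:.1f}  perf/TCO = {perf / tco:.1f}")
```

Here `chip_b` wins on performance per purchase price but loses on performance per TCO, the metric the report argues matters for production deployments.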
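
The quantization lesson can be sketched as follows. The snippet (illustrative values, not from the report) contrasts int8 fake quantization, which needs a scale calibrated from the data, with a bfloat16 round trip, which needs no calibration because it keeps float32's exponent range:

```python
import numpy as np
import jax.numpy as jnp

rng = np.random.default_rng(0)
activations = rng.normal(size=1000).astype(np.float32)

# Symmetric int8 quantization: the scale must be calibrated from the data.
scale = np.abs(activations).max() / 127.0
int8_vals = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)
int8_roundtrip = int8_vals.astype(np.float32) * scale

# bfloat16 simply truncates the float32 mantissa; no per-tensor scale is needed.
bf16_roundtrip = np.asarray(
    jnp.asarray(activations, dtype=jnp.bfloat16).astype(jnp.float32))

print("int8     max abs error:", float(np.abs(activations - int8_roundtrip).max()))
print("bfloat16 max abs error:", float(np.abs(activations - bf16_roundtrip).max()))
```

The extra calibration (and often retraining) step is part of what makes integer-only deployment impractical for some applications, hence the recommendation to keep floating-point support in inference hardware.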
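
Finally, the latency sketch: whether a larger batch is acceptable is decided by measuring tail latency against the objective, not by fixing the batch size up front. The SLO value and latency distributions below are invented for illustration.

```python
import numpy as np

SLO_P99_MS = 10.0  # assumed SLO on 99th-percentile request latency
rng = np.random.default_rng(0)

# Synthetic per-request latencies (ms) for two batch sizes; larger batches
# raise latency somewhat but improve accelerator utilization and throughput.
latencies_ms = {
    8:  rng.gamma(shape=9.0, scale=0.35, size=10_000),  # ~3.2 ms mean
    32: rng.gamma(shape=9.0, scale=0.50, size=10_000),  # ~4.5 ms mean
}

for batch_size, samples in latencies_ms.items():
    p99 = np.percentile(samples, 99)
    verdict = "meets" if p99 <= SLO_P99_MS else "violates"
    print(f"batch={batch_size:>2}  p99={p99:5.2f} ms  -> {verdict} the {SLO_P99_MS:.0f} ms SLO")
```

Since the batch size of 32 still meets the p99 objective in this toy setup, the larger batch can be used to improve hardware utilization without violating the SLO.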
Details of TPUv2 and TPUv3 development can be found in the following paper. Last summer, Google announced v4 pods for training; more information on TPUv4 is available in the official GCP blog post. In addition to TensorFlow and JAX-based Flax, the PyTorch API has also started to support the XLA compiler for running models on TPUs; the details can be found in the related documentation. Some pretrained deep networks natively supported by TPUs are listed on this page.
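
For readers who want to try the PyTorch route, the lines below are a minimal sketch of the PyTorch/XLA workflow (it assumes the torch_xla package and a TPU runtime are available; the model and shapes are arbitrary examples, not from the documentation):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                      # the TPU device exposed via XLA
model = torch.nn.Linear(128, 10).to(device)   # an arbitrary toy model
x = torch.randn(8, 128, device=device)

y = model(x)
xm.mark_step()   # cut the lazily built graph and let XLA compile and execute it
print(y.shape)   # torch.Size([8, 10])
```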