Google's latest TPU can now handle both training and serving, a departure from the first-generation TPU, which could only serve machine learning models. InfoQ covered the first-generation TPU white-paper in detail earlier this year.
The timing of the second-generation TPU announcement coincides closely with NVIDIA's announcement a week earlier of Volta, a general-purpose GPU with a new tensor-core feature optimized for deep-learning frameworks such as TensorFlow. Google's announcement does not yet have a public white-paper associated with it, as the first-generation TPU does. The first TPU was announced at a high level some months before the details covered in its white-paper were published, so it's reasonable to speculate that a white-paper detailing second-generation TPU (TPU-2) benchmarks is forthcoming. Ideally it would include test permutations of both TPU and competitor chipset configurations, their performance bounds, and the machine-learning workload types run on them, providing a level of detail on TPU-2 comparable to what was published for the first-generation TPU.
Google provided some high-level performance metrics based, presumably, on the TPU physical infrastructure configurations used in its TPU-as-a-service offering via GCP's Compute Engine. A select group of researchers and scientists will have free access to a cluster of 1,000 Cloud TPUs. Both the free TPU infrastructure and the GCP offering for everyone else may be abstracted to a degree that prevents researchers or journalists from gleaning much hardware insight in the absence of a white-paper. On performance improvements, Google noted:
... our new large-scale translation model takes a full day to train on 32 of the world's best commercially available GPUs, while one-eighth of a TPU pod can do the job in an afternoon...
TPU-2 pods contain TPU-2 boards, each comprising multiple TPU-2 processors. Based on the sparse technical information in Google's announcement, along with several photos, it's speculated that each chip may have connectivity to flash storage, and that flash state might be shared between individual TPU-2s.
The second-generation TPU infrastructure provides up to 256 chips that can be joined together to deliver 11.5 petaflops of machine-learning compute. Google is accepting applications for its alpha release, although the application form is the same one directed at researchers for the free tier. It's unclear at this time whether the next-generation TPUs will make their way into services like CloudML, which is used for executing model training on GPUs. The offering isn't limited to TPUs, however; the GCP features will:
allow users to start building their models on competing chips like Intel's Skylake or GPUs like Nvidia's Volta and then move the project to Google's TPU cloud for final processing.
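As a rough illustration of what that portability might look like, the sketch below defines a simple TensorFlow layer once and retargets it by changing only a device string. This is a minimal sketch under stated assumptions: the CPU and GPU device strings are standard TensorFlow, but Google had not published a Cloud TPU programming interface at the time of the announcement, so any TPU targeting shown in comments is purely hypothetical.

```python
import tensorflow as tf

# Hypothetical sketch of hardware-portable model code (TensorFlow 1.x
# era). The same graph-building function is reused; only the device
# string changes. No public Cloud TPU device string existed at
# announcement time, so the TPU step is an assumption.
def hidden_layer(x, units, device):
    with tf.device(device):
        return tf.layers.dense(x, units, activation=tf.nn.relu)

inputs = tf.placeholder(tf.float32, shape=[None, 784])

# Prototype on a CPU (e.g. Intel Skylake) or a GPU (e.g. NVIDIA Volta);
# the idea is that the same code would later be pointed at TPU hardware
# for final processing.
prototype = hidden_layer(inputs, 128, "/cpu:0")
accelerated = hidden_layer(inputs, 128, "/gpu:0")
```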
It's difficult at this time to adequately compare the performance improvements of TPU-2 over the first-generation TPU because the two have different feature sets and different underlying mathematical-operation primitives. First-generation TPUs don't use floating-point operations; instead they use 8-bit integer approximations of floating point. It's not known whether Google will provide a method for converting 8-bit integer operation throughput into a flops-equivalent estimate that would allow a quantitative comparison.
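To make the measurement problem concrete, here is a naive back-of-the-envelope comparison. It assumes the 92 teraops/second (8-bit integer) peak published in the first-generation TPU white-paper and the 45-teraflops peak quoted for a single TPU-2 chip; the resulting ratio is only a raw-throughput guess, precisely because the operation primitives aren't equivalent.

```python
# Naive raw-throughput comparison, not an official conversion method.
# First-gen figure: 92 teraops/s (8-bit integer) peak from the TPU
# white-paper; TPU-2 figure: the announced 45 teraflops per chip.
TPU1_PEAK_INT8_OPS = 92e12   # 8-bit integer operations per second
TPU2_PEAK_FLOPS = 45e12      # floating-point operations per second

# This ratio treats an int8 op and a flop as interchangeable, which
# they are not -- it mainly illustrates why a published conversion
# method would be needed for any real quantitative comparison.
raw_ratio = TPU2_PEAK_FLOPS / TPU1_PEAK_INT8_OPS
print("TPU-2 chip vs first-gen TPU raw op rate: %.2fx" % raw_ratio)
```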
The 32 "best commercially available GPUs" referenced in Google's comparison can be assumed to be Pascal-based. Each TPU-2 chip is capable of a maximum peak throughput of 45 teraflops, each system board of an aggregate 180 teraflops, and a full pod of the 11.5 petaflops of peak performance noted above.
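Those announced figures can be checked for internal consistency with a few lines of arithmetic. The chips-per-board and boards-per-pod counts below are derived from the published peaks, not officially confirmed by Google.

```python
# Deriving the pod topology from the announced peak figures.
CHIP_TFLOPS = 45.0    # announced peak per TPU-2 chip
BOARD_TFLOPS = 180.0  # announced aggregate per board
POD_TFLOPS = 11.5e3   # announced 11.5 petaflops per pod

chips_per_board = BOARD_TFLOPS / CHIP_TFLOPS       # -> 4.0
boards_per_pod = POD_TFLOPS / BOARD_TFLOPS         # -> ~64
chips_per_pod = chips_per_board * boards_per_pod   # -> ~256 chips

# The translation-model claim used one-eighth of a pod, i.e. ~32
# chips: the same count as the 32 GPUs in Google's comparison.
eighth_pod_chips = chips_per_pod / 8               # -> ~32
eighth_pod_tflops = POD_TFLOPS / 8                 # -> ~1,437 teraflops

print(chips_per_board, boards_per_pod, chips_per_pod,
      eighth_pod_chips, eighth_pod_tflops)
```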
Access to flash storage, coupled with the ability to run training and serving on the same hardware, could affect Google's competitive position against other chipset manufacturers, since AMD's Vega-based Radeon Instinct GPU accelerators have direct access to flash storage and can also conduct both ML training and serving.