Amazon Releases DL1 Instances Powered by Gaudi Accelerators

Amazon recently announced the general availability of EC2 DL1 instances powered by Gaudi accelerators from Habana Labs. The new instances promise better price performance when training deep learning models for use cases such as computer vision, natural language processing, autonomous vehicle perception, and recommendation engines.

The DL1 instances are available only in the dl1.24xlarge size, which provides 8 Gaudi accelerators with 32 GB of high bandwidth memory (HBM) each, Intel Xeon Scalable processors, 768 GB of memory, 400 Gbps of networking throughput, and 4 TB of local storage.

Jeff Barr, vice president and chief evangelist at AWS, explains the benefits of the new instances:

There are more applications today for deep learning than ever before. Natural language processing, recommendation systems, image recognition, video recognition, and more can all benefit from high-quality, well-trained models. (...) The training process is math and processor intensive, and places demands on just about every part of the systems used for training including the GPU or other training accelerator, the network, and local or network storage.

The new instances include the Habana SynapseAI SDK, which is integrated with the TensorFlow and PyTorch machine learning frameworks. They were originally announced at re:Invent 2020 by Andy Jassy, then CEO of AWS. Dylan Martin, senior associate editor at CRN, comments:

This is clearly much later than Intel was originally planning. Last December, Intel said these instances would be available in the first half of 2021.

Source: https://aws.amazon.com/blogs/aws/new-ec2-instances-powered-by-gaudi-accelerators-for-training-deep-learning-models/

Intel-owned Habana has published an article explaining the "up to 40% better price performance" claim and how developers can evaluate the new instances themselves:

As AWS has published the DL1 on-demand hourly pricing for DL1 alongside the p4d, p3dn and p3 GPU-based instances, there’s a simple way for end-users to assess the price-performance themselves. Take the latest TensorFlow Docker containers provided by both Nvidia on NGC and Habana (...) and run them on the respective instances to compare the training throughput vs. the hourly pricing.
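The comparison Habana describes comes down to simple arithmetic: divide the measured training throughput by the hourly price. The following Python sketch illustrates one way to do that. It is not Habana's or AWS's tooling; the `InstanceRun` class, the function names, the throughput figures, and the baseline price are all hypothetical placeholders that would be replaced with numbers measured on real instances. Only the DL1 hourly price is taken from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceRun:
    """One training benchmark run on one instance type (hypothetical helper)."""
    name: str
    hourly_price_usd: float    # on-demand price per hour
    throughput_per_sec: float  # measured training throughput, e.g. images/sec

    def samples_per_dollar(self) -> float:
        # Samples processed in one hour, divided by what that hour costs.
        return self.throughput_per_sec * 3600 / self.hourly_price_usd


def price_performance_gain(candidate: InstanceRun, baseline: InstanceRun) -> float:
    """Relative gain of candidate over baseline, e.g. 0.25 means 25% better."""
    return candidate.samples_per_dollar() / baseline.samples_per_dollar() - 1.0


# Throughput values and the baseline price below are made-up placeholders,
# NOT benchmark results. Only the DL1 price comes from the article.
dl1 = InstanceRun("dl1.24xlarge", hourly_price_usd=13.10, throughput_per_sec=1000.0)
gpu = InstanceRun("gpu-baseline", hourly_price_usd=20.00, throughput_per_sec=1200.0)

gain = price_performance_gain(dl1, gpu)
print(f"{dl1.name} vs {gpu.name}: {gain:+.1%} samples per dollar")
```

Both runs should use the same model, dataset, batch configuration, and precision, since throughput per dollar is only comparable when the workloads match.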

Habana has released a TensorFlow User Guide, a PyTorch User Guide, and a Gaudi Model Migration Guide to support running models on Gaudi and migrating to the new instances. The HabanaAI repo contains setup instructions, reference models, and academic papers.

Although the DL1 instances are the first to use Habana Gaudi accelerators, Amazon is not the only cloud provider offering instances for machine learning workloads: Google Cloud recently released its fourth-generation tensor processing units, and Azure offers NCas_T4_v3 virtual machines powered by Nvidia Tesla T4 GPUs, a recent addition to the Azure GPU family designed for AI and machine learning workloads.

The new Amazon EC2 DL1 instances are currently available only in the US East (N. Virginia) and US West (Oregon) regions and cost $13.10 per hour on-demand.

 
