Microsoft recently announced the general availability of the Azure ND A100 v4 Cloud GPU instances, powered by NVIDIA A100 Tensor Core GPUs. These Virtual Machines (VMs) are targeted at customers with demanding, high-performance workloads such as Artificial Intelligence (AI) and Machine Learning (ML).
The public cloud vendor earlier released the Azure ND A100 v4 Cloud GPU in public preview as a High-Performance Computing (HPC)-enabled virtual machine for AI workloads. The goal is to provide enough raw computing power to compete with the industry's largest AI supercomputers in both scale and advanced technology – and the ND A100 v4 VM series is now generally available.
Other public cloud providers such as AWS and Google Cloud also offer a wide selection of instance types with varying combinations of storage, CPU, memory, and networking capacity, allowing customers to scale resources to the requirements of their target workload. For instance, Google Cloud introduced the Accelerator-Optimized VM (A2) family, also based on the NVIDIA Ampere A100 Tensor Core GPU, earlier in March.
According to an Azure Compute blog post by Ian Finder, senior program manager for Accelerated HPC Infrastructure, benchmarking with 164 ND A100 v4 virtual machines on a pre-release public supercomputing cluster yielded a High-Performance Linpack (HPL) result of 16.59 petaflops. For a result delivered on public cloud infrastructure, this would fall within the top 20 of the November 2020 Top500 list of the fastest supercomputers globally, or the top 10 in Europe, based on the region where the job was run.
Finder also stated in the blog post:
Built to take advantage of de-facto industry standard HPC and AI tools and libraries, customers can leverage ND A100 v4’s GPUs and unique interconnect capabilities without any special software or frameworks, using the same NVIDIA NCCL2 libraries that most scalable GPU-accelerated AI and HPC workloads support out-of-box, without any concern for underlying network topology or placement. Provisioning VMs within the same VM Scale Set automatically configures the interconnect fabric.
In addition, Ian Buck, general manager and vice president of Accelerated Computing at NVIDIA, wrote in an NVIDIA blog post:
NVIDIA collaborated with Azure to architect this new scale-up and scale-out AI platform, which brings together groundbreaking NVIDIA Ampere architecture GPUs, NVIDIA networking technology, and the power of Azure’s high-performance interconnect and virtual machine fabric to make AI supercomputing accessible to everyone.
The ND A100 v4 VM series starts with a single virtual machine (VM) and eight NVIDIA Ampere architecture-based A100 Tensor Core GPUs. However, it can scale up to thousands of GPUs in a single cluster, with 1.6 Tb/s of interconnect bandwidth per VM delivered via NVIDIA HDR 200 Gb/s InfiniBand links, one for each GPU. Pricing starts at $27.20 per hour – more details are available on the pricing page.
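The per-VM bandwidth figure follows directly from the per-GPU links: eight GPUs, each with a dedicated HDR 200 Gb/s InfiniBand link, aggregate to 1.6 Tb/s per VM. A quick sanity check of that arithmetic (variable names are illustrative, not part of any Azure API):

```python
# Aggregate InfiniBand bandwidth per ND A100 v4 VM,
# using the figures quoted in the article.
GPUS_PER_VM = 8
LINK_GBPS = 200  # one NVIDIA HDR InfiniBand link per GPU

total_gbps = GPUS_PER_VM * LINK_GBPS
total_tbps = total_gbps / 1000

print(f"{total_gbps} Gb/s = {total_tbps} Tb/s per VM")  # 1600 Gb/s = 1.6 Tb/s per VM
```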
Furthermore, ND A100 v4 is available with the Azure Machine Learning (AML) service for interactive AI development, distributed training, batch inferencing, and automation with MLOps. The company also intends to let customers use Azure Kubernetes Service, a fully-managed Kubernetes service, to deploy and manage containerized applications on the ND A100 v4 VMs with NVIDIA A100 GPUs.
The ND A100 v4 VMs are currently available in the East US, West US 2, West Europe, and South Central US Azure regions.