Hugging Face has announced the launch of an inference-as-a-service capability powered by NVIDIA NIM, giving developers easy access to NVIDIA-accelerated inference for popular AI models.
The new service allows developers to rapidly deploy leading large language models, such as the Llama 3 family and Mistral AI models, optimized by NVIDIA NIM microservices running on NVIDIA DGX Cloud. This will help developers quickly prototype with open-source AI models hosted on the Hugging Face Hub and deploy them in production.
The Hugging Face inference-as-a-service on NVIDIA DGX Cloud powered by NIM microservices offers easy access to compute resources that are optimized for AI deployment. The NVIDIA DGX Cloud platform is purpose-built for generative AI and provides scalable GPU resources that support every step of AI development, from prototype to production.
To use the service, users must have access to an Enterprise Hub organization and a fine-grained token for authentication. The NVIDIA NIM endpoints for supported generative AI models can be found on the model pages of the Hugging Face Hub.
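As an illustration, the following is a minimal sketch of connecting to the service with the OpenAI Python SDK (the service is OpenAI API-compatible, as noted below). The base URL and token placeholder are assumptions based on the announcement rather than verified values:

```python
from openai import OpenAI

# Assumed OpenAI-compatible endpoint for the DGX Cloud integration;
# authenticate with a fine-grained Hugging Face token from an
# Enterprise Hub organization.
client = OpenAI(
    base_url="https://huggingface.co/api/integrations/dgx/v1",
    api_key="hf_...",  # your fine-grained Hugging Face token
)
```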
Currently, the service supports only the chat.completions.create and models.list APIs, but Hugging Face is working on extending it while adding more models. Usage of Hugging Face inference-as-a-service on DGX Cloud is billed based on the compute time spent per request, using NVIDIA H100 Tensor Core GPUs.
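Continuing the sketch above, both supported APIs can be exercised through the same client; the model ID here is a hypothetical example and would need to match one listed by the service:

```python
# models.list: enumerate the models currently served by the NIM endpoints.
for model in client.models.list():
    print(model.id)

# chat.completions.create: run a chat completion against one of them.
completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # hypothetical model ID
    messages=[{"role": "user", "content": "Briefly explain what NVIDIA NIM is."}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```

Since requests are billed by compute time per request, trimming max_tokens and prompt length directly affects cost.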
Hugging Face is also working with NVIDIA to integrate the NVIDIA TensorRT-LLM library into Hugging Face's Text Generation Inference (TGI) framework to improve AI inference performance and accessibility. In addition to the new inference-as-a-service, Hugging Face also offers Train on DGX Cloud, an AI training service.
Clem Delangue, CEO at Hugging Face, posted on his X account:
Very excited to see that Hugging Face is becoming the gateway for AI compute!
And Kaggle Master Rohan Paul shared a post on X saying:
So, we can use open models with the accelerated compute platform of NVIDIA DGX Cloud for inference serving. Code is fully compatible with the OpenAI API, allowing you to use the OpenAI SDK for inference.
At SIGGRAPH, NVIDIA also introduced generative AI models and NIM microservices for the OpenUSD framework to accelerate developers’ abilities to build highly accurate virtual worlds for the next evolution of AI.