Google Cloud announced that Ops Agent, the agent for collecting telemetry from Compute Engine instances, can now collect and aggregate metrics from NVIDIA GPUs on VMs.
The use of AI and ML within organizations has grown, particularly in domains such as product recommendations, scientific computing, and gaming. These applications have demanding computational requirements that are typically met with GPUs, and running them efficiently requires a clear understanding of GPU performance metrics. To address this need, Google Cloud has extended Ops Agent with the ability to collect metrics from NVIDIA GPUs.
Ops Agent empowers users to:
- Visualize GPU Fleet Health: Gain insights into GPU fleet health through GPU metrics and pre-built dashboards.
- Optimize Costs and Workloads: Identify underutilized GPUs and optimize workload distribution to streamline costs and maximize efficiency.
- Plan Scaling Efficiently: Analyze trends and patterns to make informed decisions on GPU capacity expansion or upgrading existing GPUs.
- Identify Workload Consumption: Pinpoint which GPU processes, notably ML models, account for GPU utilization and memory usage.
- Utilize DCGM Profiling Metrics: Leverage DCGM profiling metrics to detect bottlenecks and performance issues within the GPU.
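As a rough illustration of the cost-optimization point in the list above, the following sketch flags GPUs whose average utilization falls below a threshold. The sample data, the `find_underutilized` helper, and the 20% threshold are all hypothetical; in practice the values would come from the metrics the agent collects.

```python
from statistics import mean


def find_underutilized(samples, threshold_pct=20.0):
    """Return the GPU ids whose mean utilization is below threshold_pct.

    `samples` maps a GPU id to a list of utilization readings in percent.
    GPUs with no readings are skipped rather than reported.
    """
    return sorted(
        gpu_id
        for gpu_id, values in samples.items()
        if values and mean(values) < threshold_pct
    )


# Hypothetical utilization readings (percent), keyed by "instance/gpu-index".
samples = {
    "vm-a/0": [85.0, 90.0, 78.0],
    "vm-a/1": [5.0, 10.0, 3.0],
    "vm-b/0": [15.0, 40.0, 2.0],
}

print(find_underutilized(samples))
```

A mean over a fixed window is the simplest possible criterion; a real analysis would likely also consider peaks and time of day before consolidating workloads.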
Ops Agent uses the NVIDIA Management Library (NVML) to collect essential GPU metrics without additional configuration. These metrics include GPU utilization, GPU memory usage, maximum GPU memory usage per process, and lifetime GPU utilization per process.
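To give a sense of the kind of data NVML exposes, here is a minimal sketch that reads utilization and memory usage through the `pynvml` Python bindings. This is not how Ops Agent itself is implemented; the `GpuSample` class and `read_gpu_samples` function are illustrative names, and the code assumes `pynvml` and an NVIDIA driver are installed on the machine.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One point-in-time reading for a single GPU (illustrative type)."""
    index: int
    utilization_pct: int
    memory_used_bytes: int
    memory_total_bytes: int

    @property
    def memory_used_pct(self) -> float:
        # Guard against a zero total so the property never divides by zero.
        if self.memory_total_bytes == 0:
            return 0.0
        return 100.0 * self.memory_used_bytes / self.memory_total_bytes


def read_gpu_samples():
    """Read one sample per visible GPU via NVML.

    pynvml is imported lazily so the module can be loaded on machines
    without NVIDIA drivers; calling this function there will fail.
    """
    import pynvml

    pynvml.nvmlInit()
    try:
        samples = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            samples.append(GpuSample(i, util.gpu, mem.used, mem.total))
        return samples
    finally:
        # Always release the NVML handle, even if a query raised.
        pynvml.nvmlShutdown()
```

On a GPU VM, iterating over `read_gpu_samples()` and printing `utilization_pct` and `memory_used_pct` for each sample gives a one-off snapshot; an agent would take such readings on an interval and export them.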
In addition, Ops Agent can collect advanced GPU metrics using NVIDIA’s Data Center GPU Manager (DCGM). DCGM offers an API for profiling-level metrics from various hardware components, providing deeper insight into GPU performance.
Ops Agent simplifies GPU metric visualization alongside other offerings in Google Cloud's operations suite. Users can effortlessly query and visualize the collected GPU metrics, construct custom charts, and create dashboards. A dedicated NVIDIA GPU Monitoring dashboard offers a consolidated view of GPU fleet health.
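Querying the collected metrics programmatically usually starts with a Cloud Monitoring filter expression. The sketch below only builds such a filter string. The `gpu_metric_filter` helper is hypothetical, and the `agent.googleapis.com/gpu/...` metric type is an assumption about how the agent names its GPU metrics that should be checked against the current metrics list.

```python
def gpu_metric_filter(metric="utilization", instance_id=None):
    """Build a Cloud Monitoring filter for an Ops Agent GPU metric.

    The metric type prefix used here is an assumption; verify it against
    the published list of agent metrics before relying on it.
    """
    parts = [
        f'metric.type = "agent.googleapis.com/gpu/{metric}"',
        'resource.type = "gce_instance"',
    ]
    if instance_id is not None:
        # Narrow the query to a single Compute Engine instance.
        parts.append(f'resource.labels.instance_id = "{instance_id}"')
    return " AND ".join(parts)


print(gpu_metric_filter())
print(gpu_metric_filter("memory/bytes_used", instance_id="1234567890"))
```

The resulting string can be passed as the filter of a time series query in the Monitoring API or adapted for a custom chart.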
Ops Agent stands out as a unified telemetry agent, automating the collection of host metrics, system logs, and other metrics. It simplifies the management of telemetry processes, allowing users to focus on maximizing the potential of their GPU VMs.
To ease adoption, Google Cloud has introduced a one-click option to install Ops Agent when creating a new VM in the Google Cloud console. This lets users try Ops Agent with its default configuration and start monitoring right away.
For detailed instructions on installing and configuring Ops Agent for GPU instance monitoring, refer to the Ops Agent documentation.
Other cloud vendors also offer monitoring solutions for GPUs. In particular, CloudWatch, the AWS monitoring tool, supports GPU monitoring, and Azure supports it through Container insights.