
Dynamic Resource Allocation (DRA) with NVIDIA virtualized GPU (vGPU) on AKS

· 6 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Suraj Deshmukh
Software Engineer at Microsoft

Recently, dynamic resource allocation (DRA) has emerged as the standard mechanism to consume GPU resources in Kubernetes. With DRA, accelerators like GPUs are no longer exposed as static extended resources (for example, nvidia.com/gpu) but are dynamically allocated through DeviceClasses and ResourceClaims. This unlocks richer scheduling semantics and better integration with virtualization technologies like NVIDIA vGPU.
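To make the new model concrete, here is a minimal sketch of a DRA-style request (names like `single-gpu`, `a10pool-app`, and the `ubuntu` image are illustrative; the `gpu.nvidia.com` device class is what the NVIDIA DRA driver is expected to register, so verify it against your installed driver version and Kubernetes API version):

```yaml
# A ResourceClaim asks for a device from a DeviceClass rather than
# requesting a static extended resource such as nvidia.com/gpu.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
---
# The pod references the claim by name instead of setting
# resources.limits["nvidia.com/gpu"].
apiVersion: v1
kind: Pod
metadata:
  name: a10pool-app
spec:
  containers:
  - name: app
    image: ubuntu   # illustrative placeholder image
    command: ["sleep", "infinity"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu
```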

Virtual accelerators such as NVIDIA vGPU are commonly used for smaller workloads because they allow a single physical GPU to be securely partitioned across multiple tenants or apps. This is especially valuable for enterprise AI/ML development environments, fine-tuning, and audio/visual processing. vGPU enables predictable performance profiles while still exposing CUDA capabilities to containerized workloads.

On Azure, the NVadsA10_v5 virtual machine (VM) series is backed by physical NVIDIA A10 GPUs on the host and offers this resource model. Instead of assigning an entire GPU to a single VM, NVIDIA vGPU technology partitions the GPU into multiple fixed-size slices at the hypervisor layer.

In this post, we’ll walk through enabling the NVIDIA DRA driver on a node pool backed by an NVadsA10_v5 series vGPU on Azure Kubernetes Service (AKS).
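As a starting point, an A10-backed node pool can be added to an existing cluster with the Azure CLI (the resource group, cluster, and pool names below are placeholders, and `Standard_NV36ads_A10_v5` is just one of the sizes in the series):

```shell
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name a10pool \
  --node-vm-size Standard_NV36ads_A10_v5 \
  --node-count 1
```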

DRA with fractional A10 vGPU node on AKS

Simplifying InfiniBand on AKS

· 5 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Suraj Deshmukh
Software Engineer at Microsoft
Ernest Wong
Software Engineer at Microsoft

High performance computing (HPC) workloads, like large-scale distributed AI training and inference, often require fast, reliable data transfer and synchronization across the underlying compute. Model training, for example, must keep parameters and gradients synchronized across GPUs, and that state is exchanged constantly. For models with billions of parameters, the memory of a single GPU node may not be enough, so "pooling" memory across multiple nodes is required, and the sheer volume of data involved demands high memory bandwidth. A common way to achieve this at scale is with a high-speed, low-latency network interconnect technology called InfiniBand (IB).
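The gradient exchange described above is, at its core, an all-reduce: every worker ends up with the elementwise sum of all workers' gradients. A minimal, library-free sketch of that collective (the function name and data are illustrative; real stacks delegate this to NCCL over InfiniBand) looks like this:

```python
def all_reduce_sum(worker_grads):
    """Naive all-reduce: every worker receives the elementwise sum of all
    workers' gradient vectors. Interconnects like InfiniBand exist to make
    this constant exchange fast at multi-node scale."""
    total = [sum(vals) for vals in zip(*worker_grads)]
    return [list(total) for _ in worker_grads]

# Three workers, each holding a local gradient for the same parameters.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(all_reduce_sum(grads))  # every worker now holds [9.0, 12.0]
```

The naive version moves every vector to every worker; ring all-reduce implementations cut the per-link traffic, but the bandwidth demand still grows with model size, which is why the interconnect matters.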