Skip to main content

5 posts tagged with "GPU"

Configuration and optimization of GPU-accelerated workloads on AKS.

View All Tags

Optimizing RDMA performance for AI workloads on AKS with DRANET

· 10 min read
Anson Qian
Software Engineer at Azure Kubernetes Service
Michael Zappa
Software Engineer for Azure Container Networking

RDMA (Remote Direct Memory Access) is critical for unlocking the full potential of GPU infrastructure, enabling the high-throughput, low-latency GPU-to-GPU communication that large-scale AI workloads demand. In distributed training, collective operations like all-reduce and all-gather synchronize gradients and activations across GPUs — any communication bottleneck stalls the entire training pipeline. In disaggregated inference, RDMA provides the fast inter-node transfers needed to move KV-cache data between prefill and decode phases running on separate GPU pools.

rdma-for-ai-workload-on-gpu-infra

DRANET is an open-source Dynamic Resource Allocation (DRA) network driver that discovers RDMA-capable devices, advertises them as ResourceSlices, and injects the allocated devices into each pod and container. Combined with the NVIDIA GPU DRA driver, it enables topology-aware co-scheduling of GPUs and NICs for high-performance AI networking on Kubernetes.

Dynamic Resource Allocation (DRA) with NVIDIA virtualized GPU (vGPU) on AKS

· 6 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Suraj Deshmukh
Software Engineer at Microsoft

Recently, dynamic resource allocation (DRA) has emerged as the standard mechanism to consume GPU resources in Kubernetes. With DRA, accelerators like GPUs are no longer exposed as static extended resources (for example, nvidia.com/gpu) but are dynamically allocated through DeviceClasses and ResourceClaims. This unlocks richer scheduling semantics and better integration with virtualization technologies like NVIDIA vGPU.

Virtual accelerators such as NVIDIA vGPU are commonly used for smaller workloads because they allow a single physical GPU to be securely partitioned across multiple tenants or apps. This is especially valuable for enterprise AI/ML development environments, fine-tuning, and audio/visual processing. vGPU enables predictable performance profiles while still exposing CUDA capabilities to containerized workloads.

On Azure, the NVadsA10_v5 virtual machine (VM) series is backed by the physical NVIDIA A10 GPU in the host and offers this resource model. Instead of assigning the entire GPU to a single VM, the vGPU technology is used to partition the GPU into multiple fixed-size slices at the hypervisor layer.

In this post, we’ll walk through enabling the NVIDIA DRA driver on a node pool backed by an NVadsA10_v5 series vGPU on Azure Kubernetes Service (AKS).

DRA with fractional A10 vGPU node on AKS

Running more with less: Multi-instance GPU (MIG) with Dynamic Resource Allocation (DRA) on AKS

· 8 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Product Manager at Microsoft

GPUs power a wide range of production Kubernetes workloads across industries. For example, media platforms rely on them for video encoding/transcoding, financial services firms run quantitative risk simulations, and research groups process and visualize large datasets. In each of these scenarios, GPUs significantly improve job throughput, yet individual workloads often consume only a portion of the available device.

By default, Kubernetes schedules GPUs as entire units; when a workload requires only a fraction of a GPU, the remaining capacity can remain unused. Over time, this leads to lower hardware utilization and higher infrastructure costs within a cluster.

Multi-instance GPU (MIG) combined with dynamic resource allocation (DRA) helps address this challenge. MIG partitions a physical GPU into isolated instances with dedicated compute and memory resources, while DRA enables those instances to be provisioned and bound dynamically through Kubernetes resource claims. Rather than treating a GPU as an indivisible resource, the cluster can allocate right-sized GPU partitions to multiple workloads at the same time!

Fully Managed GPU workloads with Azure Linux on Azure Kubernetes Service (AKS)

· 7 min read
Flora Taagen
Product Manager 2 at Microsoft
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Introduction

Running GPU workloads on AKS enables scalable, automated data processing and AI applications across Windows, Ubuntu, or Azure Linux nodes. Azure Linux, Microsoft’s minimal and secure OS, simplifies GPU setup with validated drivers and seamless integration, reducing operational efforts. This blog covers how AKS supports GPU nodes on various OS platforms and highlights the security and performance benefits of Azure Linux for GPU workloads.

Scaling multi-node LLM inference with NVIDIA Dynamo and ND GB200 NVL72 GPUs on AKS

· 11 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Sertaç Özercan
Principal Engineering Manager for Azure Upstream
Rita Zhang
Partner Software Engineering at Microsoft

This blog post is co-authored with Rohan Varma, Saurabh Aggarwal, Anish Maddipoti, and Amr Elmeleegy from NVIDIA to showcase solutions that help customers run AI inference at scale using Azure Kubernetes Service (AKS) and NVIDIA’s advanced hardware and distributed inference frameworks.

Modern language models now routinely exceed the compute and memory capacity of a single GPU or even a whole node with multiple GPUs on Kubernetes. Consequently, inference at the scale of billions of model parameters demands multi-node, distributed deployment. Frameworks like the open-source NVIDIA Dynamo platform play a crucial role by coordinating execution across nodes, managing memory resources efficiently, and accelerating data transfers between GPUs to keep latency low.