Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Scaling multi-node LLM inference with NVIDIA Dynamo and NVIDIA GPUs on AKS (Part 3)

· 8 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Devi Vasudevan
Principal Engineering Manager for Azure Upstream

This blog post is co-authored with Nikhar Maheshwari, Anish Maddipoti, Rohan Varma, Clement Pakkam Isaac, and Stephen Mccoulough from NVIDIA.

We kicked things off in Part 1 by introducing NVIDIA Dynamo on AKS and demonstrating 1.2 million tokens per second across 10 GPU nodes of GB200 NVL72. In Part 2, we explored the Dynamo Planner and Profiler for SLO-driven scaling.

In this blog, we explore how Dynamo’s KV Router makes multi-worker LLM deployments significantly more efficient, demonstrating over 20x faster Time To First Token (TTFT) and over 4x faster end-to-end latency on real-world production traces. These latency reductions not only improve the end-user experience but also maximize GPU utilization and lower the Total Cost of Ownership (TCO).
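The excerpt above does not describe how the KV Router chooses a worker, so the following is only a toy sketch of the general idea behind KV-cache-aware routing: send a request to the worker that already holds the longest matching prompt prefix, discounted by how busy that worker is. It is not Dynamo's implementation. The names `Worker`, `route`, `BLOCK_SIZE`, and `load_weight` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per cache block (hypothetical value for illustration)


def to_blocks(tokens, block_size=BLOCK_SIZE):
    """Split a token list into full fixed-size blocks, dropping any partial tail."""
    full = len(tokens) - len(tokens) % block_size
    return [tuple(tokens[i:i + block_size]) for i in range(0, full, block_size)]


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    # Every block-level prefix this worker has processed (a stand-in for its KV cache).
    cached_prefixes: set = field(default_factory=set)

    def remember(self, tokens):
        """Record all block-level prefixes of a prompt as cached."""
        blocks = to_blocks(tokens)
        for i in range(1, len(blocks) + 1):
            self.cached_prefixes.add(tuple(blocks[:i]))

    def overlap(self, tokens):
        """Return how many leading blocks of the prompt are already cached."""
        blocks = to_blocks(tokens)
        matched = 0
        for i in range(1, len(blocks) + 1):
            if tuple(blocks[:i]) not in self.cached_prefixes:
                break
            matched = i
        return matched


def route(workers, tokens, load_weight=1.0):
    """Pick the worker with the best (cache overlap - load penalty) score."""
    best = max(
        workers,
        key=lambda w: w.overlap(tokens) - load_weight * w.active_requests,
    )
    best.remember(tokens)
    best.active_requests += 1
    return best
```

In this sketch, two requests that share an eight-token system prompt land on the same worker because the second one reuses two cached blocks, while an unrelated prompt goes to the idle worker. A production router would also need cache eviction, request completion, and real load signals, all of which are omitted here.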

Dynamic Resource Allocation (DRA) with NVIDIA virtualized GPU (vGPU) on AKS

· 6 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Suraj Deshmukh
Software Engineer at Microsoft

Recently, dynamic resource allocation (DRA) has emerged as the standard mechanism to consume GPU resources in Kubernetes. With DRA, accelerators like GPUs are no longer exposed as static extended resources (for example, nvidia.com/gpu) but are dynamically allocated through DeviceClasses and ResourceClaims. This unlocks richer scheduling semantics and better integration with virtualization technologies like NVIDIA vGPU.
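To make the contrast concrete, here is a minimal sketch that builds both styles of manifest as plain Python dictionaries: a pod that requests the static extended resource, and a pod that references a ResourceClaim bound to a DeviceClass. The helper names, the claim name, the `gpu.nvidia.com` DeviceClass name, and the `resource.k8s.io/v1` API version are assumptions for illustration; check which DRA API version and DeviceClasses your cluster actually serves. In the older `v1beta1` API, `deviceClassName` sits directly on the request instead of under `exactly`.

```python
import json


def extended_resource_pod(name, image):
    """Pod that asks for a GPU through the static extended resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "main",
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }],
        },
    }


def dra_claim_and_pod(name, image,
                      device_class="gpu.nvidia.com",      # assumed DeviceClass name
                      api_version="resource.k8s.io/v1"):  # assumed API version
    """ResourceClaim plus a pod that consumes it by reference."""
    claim_name = f"{name}-gpu"
    claim = {
        "apiVersion": api_version,
        "kind": "ResourceClaim",
        "metadata": {"name": claim_name},
        "spec": {"devices": {"requests": [{
            "name": "gpu",
            "exactly": {"deviceClassName": device_class},
        }]}},
    }
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pod-level entry binds a local name to the ResourceClaim object.
            "resourceClaims": [{"name": "gpu", "resourceClaimName": claim_name}],
            "containers": [{
                "name": "main",
                "image": image,
                # Container-level entry refers to the pod-level name above.
                "resources": {"claims": [{"name": "gpu"}]},
            }],
        },
    }
    return claim, pod


if __name__ == "__main__":
    for manifest in dra_claim_and_pod("cuda-demo", "example.invalid/cuda:latest"):
        print(json.dumps(manifest, indent=2))
```

The JSON printed by the script is valid YAML, so it can be written to a file and passed to `kubectl apply -f` on a cluster where a DRA driver and a matching DeviceClass are installed.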

Virtual accelerators such as NVIDIA vGPU are commonly used for smaller workloads because they allow a single physical GPU to be securely partitioned across multiple tenants or apps. This is especially valuable for enterprise AI/ML development environments, fine-tuning, and audio/visual processing. vGPU enables predictable performance profiles while still exposing CUDA capabilities to containerized workloads.

On Azure, the NVadsA10_v5 virtual machine (VM) series is backed by the physical NVIDIA A10 GPU in the host and offers this resource model. Instead of assigning the entire GPU to a single VM, the vGPU technology is used to partition the GPU into multiple fixed-size slices at the hypervisor layer.

In this post, we’ll walk through enabling the NVIDIA DRA driver on a node pool backed by an NVadsA10_v5 series vGPU on Azure Kubernetes Service (AKS).

DRA with fractional A10 vGPU node on AKS

Running more with less: Multi-instance GPU (MIG) with Dynamic Resource Allocation (DRA) on AKS

· 8 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Product Manager at Microsoft

GPUs power a wide range of production Kubernetes workloads across industries. For example, media platforms rely on them for video encoding/transcoding, financial services firms run quantitative risk simulations, and research groups process and visualize large datasets. In each of these scenarios, GPUs significantly improve job throughput, yet individual workloads often consume only a portion of the available device.

By default, Kubernetes schedules GPUs as entire units; when a workload requires only a fraction of a GPU, the remaining capacity can remain unused. Over time, this leads to lower hardware utilization and higher infrastructure costs within a cluster.

Multi-instance GPU (MIG) combined with dynamic resource allocation (DRA) helps address this challenge. MIG partitions a physical GPU into isolated instances with dedicated compute and memory resources, while DRA enables those instances to be provisioned and bound dynamically through Kubernetes resource claims. Rather than treating a GPU as an indivisible resource, the cluster can allocate right-sized GPU partitions to multiple workloads at the same time!
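As a rough illustration of the utilization argument, the sketch below compares how many physical GPUs a set of small workloads occupies when each one takes a whole GPU versus when they are packed into partitions. The seven-slices-per-GPU figure and the workload sizes are hypothetical, and the first-fit-decreasing packing ignores the placement rules that real MIG profiles impose, so treat the result as an optimistic estimate rather than a scheduler model.

```python
def gpus_needed_whole(slice_requests):
    """Whole-GPU scheduling: every workload occupies one full GPU."""
    return len(slice_requests)


def gpus_needed_partitioned(slice_requests, slices_per_gpu):
    """Pack workloads into GPU slices using first-fit decreasing."""
    free = []  # free slices remaining on each GPU already in use
    for need in sorted(slice_requests, reverse=True):
        if need > slices_per_gpu:
            raise ValueError(f"request of {need} slices exceeds one GPU")
        for i, remaining in enumerate(free):
            if remaining >= need:
                free[i] -= need
                break
        else:
            # No GPU in use has room, so bring another GPU into service.
            free.append(slices_per_gpu - need)
    return len(free)


if __name__ == "__main__":
    requests = [3, 2, 1, 1, 2, 3, 1]  # slices each workload needs (hypothetical)
    print("whole GPUs:", gpus_needed_whole(requests))
    print("partitioned GPUs:", gpus_needed_partitioned(requests, slices_per_gpu=7))
```

With these example numbers, seven workloads that together need 13 slices fit on two partitioned GPUs instead of occupying seven whole ones.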

Autoscale KAITO inference workloads on AKS using KEDA

· 9 min read
Andy Zhang
Principal Software Engineer for the Azure Kubernetes Service
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Kubernetes AI Toolchain Operator (KAITO) is an operator that simplifies and automates AI/ML model inference, tuning, and RAG in a Kubernetes cluster. With the recent v0.8.0 release, KAITO has introduced intelligent autoscaling for inference workloads as an alpha feature! In this blog, we'll guide you through setting up event-driven autoscaling for vLLM inference workloads.

Scaling multi-node LLM inference with NVIDIA Dynamo and NVIDIA GPUs on AKS (Part 2)

· 8 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Sertaç Özercan
Principal Engineering Manager for Azure Upstream

This blog post is co-authored with Saurabh Aggarwal, Anish Maddipoti, Amr Elmeleegy, and Rohan Varma from NVIDIA.

In our previous post, we demonstrated the power of the Azure ND GB200-v6 VMs accelerated by NVIDIA GB200 NVL72, achieving a staggering 1.2M tokens per second across 10 nodes using NVIDIA Dynamo. Today, we're shifting focus from raw throughput to developer velocity and operational efficiency.

We will explore how the Dynamo Planner and Dynamo Profiler remove the guesswork from performance tuning on AKS.

Fully Managed GPU workloads with Azure Linux on Azure Kubernetes Service (AKS)

· 7 min read
Flora Taagen
Product Manager 2 at Microsoft
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Introduction

Running GPU workloads on AKS enables scalable, automated data processing and AI applications across Windows, Ubuntu, or Azure Linux nodes. Azure Linux, Microsoft’s minimal and secure OS, simplifies GPU setup with validated drivers and seamless integration, reducing operational effort. This blog covers how AKS supports GPU nodes on various OS platforms and highlights the security and performance benefits of Azure Linux for GPU workloads.

Scaling multi-node LLM inference with NVIDIA Dynamo and ND GB200 NVL72 GPUs on AKS

· 11 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Sertaç Özercan
Principal Engineering Manager for Azure Upstream
Rita Zhang
Partner Software Engineering at Microsoft

This blog post is co-authored with Rohan Varma, Saurabh Aggarwal, Anish Maddipoti, and Amr Elmeleegy from NVIDIA to showcase solutions that help customers run AI inference at scale using Azure Kubernetes Service (AKS) and NVIDIA’s advanced hardware and distributed inference frameworks.

Modern language models now routinely exceed the compute and memory capacity of a single GPU or even a whole node with multiple GPUs on Kubernetes. Consequently, inference at the scale of billions of model parameters demands multi-node, distributed deployment. Frameworks like the open-source NVIDIA Dynamo platform play a crucial role by coordinating execution across nodes, managing memory resources efficiently, and accelerating data transfers between GPUs to keep latency low.

Pair llm-d Inference with KAITO RAG Advanced Search to Enhance your AI Workflows

· 10 min read
Ernest Wong
Software Engineer at Microsoft
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Overview

In this blog, we'll guide you through setting up an OpenAI API-compatible inference endpoint with llm-d and integrating it with retrieval-augmented generation (RAG) on AKS. This blog will showcase its value in a key finance use case: indexing the latest SEC 10-K filings for two S&P 500 companies and querying them. We’ll also highlight the benefits of llm-d based on its architecture and its synergy with RAG.

From 7B to 70B+: Serving giant LLMs efficiently with KAITO and ACStor v2

· 6 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Francis Yu
Product Manager focusing on storage orchestration for Kubernetes workloads

XL-size large language models (LLMs) are quickly evolving from experimental tools to essential infrastructure. Their flexibility, ease of integration, and growing range of capabilities are positioning them as core components of modern software systems.

Massive LLMs power virtual assistants and recommendations across social media, UI/UX design tooling and self-learning platforms. But how do they differ from your average language model? And how do you get the best bang for your buck running them at scale?

Let’s unpack why large models matter and how Kubernetes, paired with NVMe local storage, accelerates intelligent app development.

Simplifying InfiniBand on AKS

· 5 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Suraj Deshmukh
Software Engineer at Microsoft
Ernest Wong
Software Engineer at Microsoft

High performance computing (HPC) workloads, like large-scale distributed AI training and inferencing, often require fast, reliable data transfer and synchronization across the underlying compute. Model training, for example, requires shared memory across GPUs because the parameters and gradients need to be constantly shared. For models with billions of parameters, the available memory in a single GPU node may not be enough, so "pooling" the memory across multiple nodes also requires high memory bandwidth due to the sheer volume of data involved. A common way to achieve this at scale is with a high-speed, low-latency network interconnect technology called InfiniBand (IB).

Deploy and take Flyte with an end-to-end ML orchestration solution on AKS

· 7 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Data is often at the heart of application design and development - it fuels user-centric design, provides insights for feature enhancements, and represents the value of an application as a whole. In that case, shouldn’t we use data science tools and workflows that are flexible and scalable on a platform like Kubernetes, for a range of application types?

In collaboration with David Espejo and Shalabh Chaudhri from Union.ai, we’ll dive into an example using Flyte, a platform built on Kubernetes itself. Flyte can help you manage and scale out data processing and machine learning pipelines through a simple user interface.

Fine tune language models with KAITO on AKS

· 8 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

You may have heard of the Kubernetes AI Toolchain Operator (KAITO) announced at Ignite 2023 and KubeCon Europe this year. The open source project has gained popularity in recent months by introducing a streamlined approach to AI model deployment and flexible infrastructure provisioning on Kubernetes.

With the v0.3.0 release, KAITO has expanded the supported model library to include the Phi-3 model, but the biggest (and most exciting) addition is the ability to fine-tune open-source models. Why should you be excited about fine-tuning? Well, it’s because fine-tuning is one way of giving your foundation model additional training using a specific dataset to enhance accuracy, which ultimately improves the interaction with end-users. (Another way to increase model accuracy is Retrieval-Augmented Generation (RAG), which we touch on briefly in this section and which is coming soon to KAITO.)