
Fully Managed GPU workloads with Azure Linux on Azure Kubernetes Service (AKS)

· 7 min read
Flora Taagen
Product Manager 2 at Microsoft
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Introduction

Running GPU workloads on AKS enables scalable, automated data processing and AI applications across Windows, Ubuntu, or Azure Linux nodes. Azure Linux, Microsoft’s minimal and secure OS, simplifies GPU setup with validated drivers and seamless integration, reducing operational effort. This blog covers how AKS supports GPU nodes across OS platforms and highlights the security and performance benefits of Azure Linux for GPU workloads.

Unique challenges of GPU nodes

Deploying a GPU workload isn’t just about picking the right VM size. There is also significant operational overhead that developers and platform engineers need to manage.

We’ve found that many of our customers struggled to manage GPU device discoverability/scheduling and observability, especially across different OS images. Platform teams spent cycles maintaining custom node images and post-deployment scripts to ensure CUDA compatibility, while developers had to debug “GPU not found” errors or stalled workloads that consumed GPU capacity with limited visibility into utilization.

The inconsistent experience across OS options on AKS was a major challenge that we sought to improve. We want customers to use the OS that best fits their needs, rather than being blocked by feature parity gaps.

For example, Azure Linux support for GPU-enabled VM sizes on AKS was historically limited to NVIDIA V100 and T4, creating a gap for Azure Linux customers requiring higher-performance options. Platform teams looking to run compute-intensive workloads, such as general-purpose AI/ML workloads or large-scale simulations, were unable to do so with Azure Linux and NVIDIA NC A100 GPU node pools -- until now.

AKS expands Azure Linux GPU support

NC A100 GPU support

The introduction of Azure Linux 3.0 support for NC A100 GPU node pools in AKS starts to close many of these gaps. For platform engineers, the new OS image standardizes the underlying kernel, container runtime, and driver stack while enabling GPU provisioning in a single declarative step. Instead of layering custom extensions or maintaining golden images, engineers can now define a node pool with --os-sku AzureLinux and get a consistent, secure, and AKS-managed runtime that includes NVIDIA drivers/plugin setup and GPU telemetry out of the box. The Azure Linux 3.0 image also aligns with the AKS release cadence, which means fewer compatibility issues when upgrading clusters or deploying existing workloads onto GPU nodes.

AKS fully managed GPU nodes (preview)

Using NVIDIA GPUs with Azure Linux on AKS requires installing several components for GPU-enabled AKS nodes to function properly, including the GPU drivers, the NVIDIA Kubernetes device plugin, and a GPU metrics exporter for telemetry. Previously, these components were installed either manually or via the open-source NVIDIA GPU Operator, creating operational overhead for platform engineers. To ease this complexity and overhead, AKS has released support for fully managed GPU nodes (preview), which installs the NVIDIA GPU driver, device plugin, and Data Center GPU Manager (DCGM) metrics exporter by default.
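Once a managed GPU node pool is up, a quick sanity check can confirm that the driver, device plugin, and telemetry components are in place. This is a minimal sketch; the node pool name, label selector, and namespace are illustrative and may differ in your cluster or in the preview:

```bash
# List nodes in the GPU node pool (pool name "gpunp" is illustrative) and
# confirm that each node advertises allocatable nvidia.com/gpu resources.
kubectl get nodes -l kubernetes.azure.com/agentpool=gpunp
kubectl describe nodes -l kubernetes.azure.com/agentpool=gpunp | grep -A 1 "nvidia.com/gpu"

# The driver, device plugin, and DCGM exporter are AKS-managed components; pod
# names and namespaces may differ in the preview, so adjust the grep as needed.
kubectl get pods -n kube-system -o wide | grep -iE "nvidia|dcgm"
```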

Deploying GPU workloads on AKS with Azure Linux 3.0

Customers choose to run their GPU workloads on Azure Linux for many reasons, such as the security posture, support model, resiliency, and/or performance optimizations that the OS provides. Some of the benefits that Azure Linux provides for your GPU workloads include:

Here is how Azure Linux compares with other distributions in each of these areas:

Security and Compliance
• Azure Linux: A minimal, hardened OS built from source in Microsoft’s trusted pipeline. It includes only the essential packages for Kubernetes and GPU workloads, reducing CVEs and patching overhead. All kernel modules installed on Azure Linux AKS nodes must be signed with a trusted Microsoft secure key. FIPS-compliant images and CIS benchmarks further strengthen the security posture of your GPU node pools with out-of-the-box compliance.
• Other distributions: Often include broader package sets and dependencies, which can increase the attack surface and CVE exposure. They also allow kernel modules that are not signed by Microsoft to be installed on nodes. Further, FIPS-compliant images or CIS benchmarks may require additional configuration or customization.

Operational Efficiency
• Azure Linux: Images are lightweight and optimized for AKS, enabling quick node provisioning and upgrade times. GPU drivers also come pre-installed for Azure Linux NVIDIA GPU node pools, ensuring smooth GPU enablement without manual intervention.
• Other distributions: Larger image footprints can lead to slower node provisioning and upgrade times. Like Azure Linux, other distributions also come with GPU drivers preinstalled in NVIDIA GPU node pools.

Resiliency and Reliability
• Azure Linux: Each image undergoes rigorous validation by the Azure Linux team, including GPU-specific scenarios, to prevent regressions and ensure stability before it is released to AKS.
• Other distributions: Cannot run AKS end-to-end tests prior to releasing their images to the AKS team.

Deploy a GPU workload with Azure Linux on AKS

Deploying your GPU workloads on AKS with Azure Linux 3.0 is simple. Let’s use the newly supported NVIDIA NC A100 GPU as our example.

  1. To add an NVIDIA NC A100 node pool running Azure Linux to your AKS cluster using the fully managed GPU node experience, follow these instructions. Note that the following parameters must be specified in your az aks nodepool add command (see the example command after this list):

    • --os-sku AzureLinux: provisions a node pool with the Azure Linux container host as the node OS.
    • --node-vm-size Standard_nc24ads_A100_v4: provisions a node pool using the Standard_nc24ads_A100_v4 VM size. Note that any of the sizes in the Azure NC_A100_v4 series are supported.
  2. With the DCGM exporter installed by default, you can observe detailed GPU metrics such as utilization, memory consumption, and error states.
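Putting the parameters from step 1 together, a node pool creation command might look like the following sketch. The resource group, cluster, and node pool names are placeholders, and the fully managed GPU node preview may additionally require the aks-preview CLI extension:

```bash
# Illustrative sketch: add an Azure Linux NC A100 node pool to an existing AKS
# cluster. Resource group, cluster, and node pool names are placeholders.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name a100np \
  --node-count 1 \
  --node-vm-size Standard_nc24ads_A100_v4 \
  --os-sku AzureLinux
```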

If you prefer not to use a preview feature, you can follow these instructions to create an NVIDIA NC A100 node pool with Azure Linux on AKS by manually installing the NVIDIA device plugin via a DaemonSet. You’ll also need to manually install the DCGM exporter to consume GPU metrics; a rough sketch of this manual path follows.
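The sketch below shows one way to do this. The device plugin version, manifest path, and Helm chart location are illustrative, so check the NVIDIA k8s-device-plugin releases and the dcgm-exporter documentation for the current values:

```bash
# Install the NVIDIA Kubernetes device plugin as a DaemonSet. The version and
# manifest path are illustrative; use the latest release from
# github.com/NVIDIA/k8s-device-plugin.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

# Install the DCGM exporter for GPU telemetry via its Helm chart.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter
```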

Observability & monitoring

Monitoring GPU performance is critical for optimizing utilization, troubleshooting workloads, and enabling cost-efficient scaling in AKS clusters. Traditionally, NVIDIA GPU node pools were treated as opaque resources - jobs would succeed or fail without visibility into whether GPUs were fully utilized or misallocated.

With the DCGM exporter now managed on AKS, cluster operators can collect detailed GPU metrics such as utilization, memory consumption, and error states for analysis. These metrics can integrate naturally with existing observability pipelines, providing a foundation for intelligent automation and alerting.

As an example, a platform team can configure scaling logic in the Cluster Autoscaler (CAS) or Kubernetes Event-Driven Autoscaling (KEDA) to add A100 nodes when GPU utilization exceeds 70%, or scale down when utilization remains low for a defined interval. This enables GPU infrastructure to operate as a dynamic, demand-driven resource rather than a static, high-cost allocation.
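As a rough sketch of the KEDA variant of that scaling logic (assuming KEDA is installed and the DCGM metrics are scraped by a Prometheus endpoint; the Prometheus address, namespace, Deployment name, and threshold are placeholders):

```bash
# Illustrative KEDA ScaledObject that scales a GPU inference Deployment when
# average DCGM-reported GPU utilization exceeds 70%. Addresses and names are
# placeholders for your environment.
kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-inference-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: gpu-inference            # your inference Deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: avg(DCGM_FI_DEV_GPU_UTIL)
        threshold: "70"
EOF
```

KEDA scales the inference pods; the Cluster Autoscaler then adds or removes A100 nodes to satisfy the resulting pending or idle GPU requests.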

For more conceptual guidance on GPU metrics in AKS, visit these docs.

What's next?

The Azure Linux and AKS teams are actively working on expanding support for additional GPU VM sizes and managed GPU features on AKS. You can expect to see Azure Linux support for the NVIDIA ND A100, NC H100, and ND H200 families landing in the near future, as well as Azure Linux support for managed AKS GPU features like multi-instance GPU (MIG), built-in GPU metrics in Azure Managed Prometheus and Grafana, and KAITO.

Scaling multi-node LLM inference with NVIDIA Dynamo and ND GB200 NVL72 GPUs on AKS

· 11 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Sertaç Özercan
Principal Engineering Manager for Azure Upstream
Rita Zhang
Partner Software Engineering at Microsoft

This blog post is co-authored with Rohan Varma, Saurabh Aggarwal, Anish Maddipoti, and Amr Elmeleegy from NVIDIA to showcase solutions that help customers run AI inference at scale using Azure Kubernetes Service (AKS) and NVIDIA’s advanced hardware and distributed inference frameworks.

Modern language models now routinely exceed the compute and memory capacity of a single GPU or even a whole node with multiple GPUs on Kubernetes. Consequently, inference at the scale of billions of model parameters demands multi-node, distributed deployment. Frameworks like the open-source NVIDIA Dynamo platform play a crucial role by coordinating execution across nodes, managing memory resources efficiently, and accelerating data transfers between GPUs to keep latency low.

However, software alone cannot solve these challenges. The underlying hardware must also support this level of scale and throughput. Rack-scale systems like Azure ND GB200-v6 VMs, accelerated by NVIDIA GB200 NVL72, meet this need by integrating 72 NVIDIA Blackwell GPUs in a distributed GPU setup connected via high-bandwidth, low-latency interconnect. This architecture uses the rack as a unified compute engine and enables fast, efficient communication and scaling that traditional multi-node setups struggle to achieve.

For more demanding or unpredictable workloads, even the combination of advanced hardware and a distributed inference framework is not sufficient on its own. Inference traffic spikes unpredictably, and static inference configurations with predetermined resource allocation can lead to GPU underutilization or overprovisioning. Instead, inference infrastructure must dynamically adjust in real time, scaling resources up or down to align with current demand without wasting GPU capacity or risking performance degradation.

A holistic solution: ND GB200-v6 VMs and Dynamo on AKS

To effectively address the variability in inference traffic in distributed deployments, our approach combines three key components: ND GB200-v6 VMs, the NVIDIA Dynamo inference framework, and an Azure Kubernetes Service (AKS) cluster. Together, these technologies provide the scale, flexibility, and responsiveness necessary to meet the demands of modern, large-scale inference workloads.

ND GB200-v6: Rack-Scale Accelerated Hardware

At the core of Azure’s ND GB200-v6 VM series is the liquid-cooled NVIDIA GB200 NVL72 system, a rack-scale architecture that integrates 72 NVIDIA Blackwell GPUs and 36 NVIDIA Grace™ CPUs into a single, tightly coupled domain.

The rack-scale design of ND GB200-v6 unlocks model serving patterns that were previously infeasible due to interconnect and memory bandwidth constraints.

NVIDIA GB200 NVL72 system

NVIDIA Dynamo: a distributed inference framework

NVIDIA Dynamo is an open source distributed inference serving framework that supports multiple engine backends, including vLLM, TensorRT-LLM, and SGLang. It disaggregates the prefill (compute-bound) and decode (memory-bound) phases across separate GPUs, enabling independent scaling and phase-specific parallelism strategies. For example, the memory-bound decode phase can leverage wide expert parallelism (EP) without constraining the compute-heavy prefill phase, improving overall resource utilization and performance.

Dynamo includes an SLA-based Planner that proactively manages GPU scaling for prefill/decode (PD) disaggregated inference. Using pre-deployment profiling, it evaluates how model parallelism and batching affect performance, recommending configurations that meet latency targets like Time to First Token (TTFT) and Inter-Token Latency (ITL) within a given GPU budget. At runtime, the Planner forecasts traffic with time-series models, dynamically adjusting PD worker counts based on predicted demand and real-time metrics.

The Dynamo LLM-aware Router manages the key-value (KV) cache across large GPU clusters by hashing requests and tracking cache locations. It calculates overlap scores between incoming requests and cached KV blocks, routing requests to GPUs that maximize cache reuse while balancing workload. This cache-aware routing reduces costly KV recomputation and avoids bottlenecks, which in turn improves performance, especially for large models with long context windows.

To reduce GPU memory overhead, the Dynamo KV Block Manager offloads infrequently accessed KV blocks to CPU RAM, SSDs, or object storage. It supports hierarchical caching and intelligent eviction policies across nodes, scaling cache storage to petabyte levels while preserving reuse efficiency.

Dynamo’s disaggregated execution model is especially effective for large, dynamic inference workloads where compute and memory demands shift across phases. The Azure Research paper "Splitwise: Efficient generative LLM inference using phase splitting" demonstrated the benefits of separating the compute-intensive prefill and memory-bound decode phases of LLM inference onto different hardware. We will explore this disaggregated model in detail in an upcoming blog post.

Dynamo project key features

How Dynamo can optimize AI product recommendations in e-commerce apps

Let’s put Dynamo’s features in context by walking through a realistic app scenario and exploring how the framework addresses common inference challenges on AKS.

Imagine you operate a large e-commerce platform (or provide infrastructure for one), where customers browse thousands of products in real time. The app runs on AKS and experiences traffic surges during sales, launches, and seasonal events. The app also leverages LLMs to generate natural language outputs, such as:

  • Context-aware product recommendations
  • Dynamic product descriptions
  • AI-generated upsells based on behavior, reviews, or search queries

This architecture powers user experiences like: “Customers who viewed this camera also looked at these accessories, chosen for outdoor use and battery compatibility.” Personalized product copy is dynamically rewritten for different segments, such as “For photographers” vs. “For frequent travelers.”

Behind the scenes, it requires a multi-stage LLM pipeline: retrieving product/user context, running prompted inference, and generating natural language outputs per session.

Common pain points and how Dynamo tackles them

  1. Heavy Prefill + Lightweight Decode = GPU Waste

    Generating personalized recommendations requires a heavy prefill stage (processing more than 8,000 tokens of context) but results in short outputs (~50 tokens). Running both on a single GPU can be inefficient.

    Dynamo Solution: The pipeline is split into two distinct stages, each deployed on separate GPUs. This allows independent configuration of GPU count and model parallelism for each phase. It also enables the use of different GPU types—for example, GPUs with high compute capability but lower memory for the prefill stage, and GPUs with both high compute and large memory capacity for the decode stage. (A deliberately generic Kubernetes sketch of this split appears after this list.)

    In our e-commerce example, when a user lands on a product page:

    • Prefill runs uninterrupted on dedicated GPUs using model parallelism degrees optimized for accelerating math-intensive attention GEMM operations. This enables fast processing of 8,000 tokens of user context and product metadata.

    • Decode runs on a GPU pool with different counts and parallelism degrees designed and tuned to maximize memory bandwidth and capacity for generating the short product blurb.

    Result: This approach maximizes GPU utilization and reduces per-request cost.

  2. Meeting SLOs and handling traffic spikes without overprovisioning

    Your SLO might define time-to-first-token < 300ms and 99th percentile latency < 500ms, but maintaining this across dynamic workloads is tough. Static GPU allocation leads to bottlenecks during traffic spikes, causing either SLO violations or wasted capacity.

    Dynamo Solution: Dynamo continuously monitors metrics and auto-scales GPU replicas or reallocates GPUs between the prefill and decode stages based on real-time traffic patterns, queue depth, and latency targets.

    In our e-commerce example:

    • During Black Friday, Dynamo observes latency climbing due to a surge in prefill demand. It responds by increasing prefill GPU replicas by 50%, shifting GPUs from decode or spinning up additional ones.
    • At night, when email generation jobs dominate, Dynamo reallocates GPUs back to decode to optimize throughput.
    • When load drops, resources scale back down.

    Result: SLOs are met consistently without over- or under-provisioning, controlling costs while maintaining performance.

  3. Recomputing shared context is wasteful

    Many requests within the same session reuse the same product or user context but unnecessarily recompute the KV cache each time, wasting valuable GPU resources that could be spent serving other user requests.

    Dynamo Solution: LLM-aware routing maintains a map of KV cache across large GPU clusters and directs requests to the GPUs that already hold the relevant KV cache, avoiding redundant computation.

    In our e-commerce example:

    • A user browses five similar items in one session.
    • Dynamo routes all requests to the same GPU that already has the user’s or product’s context cached.

    Result: Faster response times, lower latency, reduced GPU usage.

  4. KV cache growth blows past GPU memory

    With many concurrent sessions and large input sequence lengths, the KV cache (product data + user history) can exceed available GPU memory. This can trigger evictions, leading to costly re-computations or inference errors.

    Dynamo Solution: The KV Block Manager (KVBM) offloads cold/unused KV cache data to CPU RAM, NVMe, or networked storage, freeing valuable GPU memory for active requests.

    In our e-commerce example:

    • Without cache offloading: increasing number of concurrent sessions per GPU increases latency due to KV cache evictions and recomputations
    • With Dynamo: GPUs can support higher concurrencies while maintaining low latency

    Result: Higher concurrency at lower cost, without degrading user experience.
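To make the prefill/decode split from pain point 1 concrete, here is a deliberately generic Kubernetes sketch rather than Dynamo's own deployment format: two worker Deployments that can be sized, scaled, and scheduled independently. Image names, replica counts, and GPU requests are placeholders.

```bash
# Generic illustration of disaggregated prefill and decode worker pools (not
# Dynamo's own manifest format). Each pool gets its own replica count and GPU
# request so the two phases can scale independently.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefill-workers
spec:
  replicas: 4                      # scaled for the compute-bound prefill phase
  selector: { matchLabels: { app: prefill } }
  template:
    metadata: { labels: { app: prefill } }
    spec:
      containers:
        - name: worker
          image: example.azurecr.io/inference-worker:latest   # placeholder image
          resources:
            limits: { nvidia.com/gpu: 2 }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: decode-workers
spec:
  replicas: 2                      # scaled for the memory-bound decode phase
  selector: { matchLabels: { app: decode } }
  template:
    metadata: { labels: { app: decode } }
    spec:
      containers:
        - name: worker
          image: example.azurecr.io/inference-worker:latest   # placeholder image
          resources:
            limits: { nvidia.com/gpu: 4 }
EOF
```

In a real Dynamo deployment the framework manages these worker pools and their parallelism settings for you; the point of the sketch is only that each phase gets its own GPU budget.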

Enterprise-scale inference experiments: Dynamo with GB200, running on AKS

We set out to deploy the popular open-source GPT-OSS 120B reasoning model using Dynamo on AKS with GB200 NVL72 nodes, adapting the SemiAnalysis InferenceMAX recipe for a large-scale, production-grade environment.

Our approach: leverage Dynamo as the inference server and swap GB200 NVL72 nodes in place of NVIDIA HGX™ B200, scaling the deployment across multiple nodes.

Our goal was to replicate the performance results reported by SemiAnalysis, but at a larger scale within an AKS environment, proving that enterprise-scale inference with cutting-edge hardware and open-source models is not only possible, but practical.

AKS Deployment Overview

Ready to build the same setup? Our comprehensive guide walks you through each stage of the deployment:

  1. Set up your foundation: Configure GPU node pools and prepare your inference setup with the prerequisites you will need (a hedged sketch of these stages follows this list).
  2. Deploy Dynamo via Helm: Get the inference server running with the right configurations for GB200 NVL72.
  3. Benchmark performance with your serving engine: Test and optimize latency/throughput under production conditions.
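At a very high level, those stages map to commands like the following sketch. The ND GB200-v6 VM size, Helm chart reference, and values file are placeholders, so take the exact names and configuration from the deployment guide linked below:

```bash
# 1. Add a GB200 node pool (VM size placeholder; see the guide for the exact
#    ND GB200-v6 size and any required quota or networking settings).
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gb200np \
  --node-count 2 \
  --node-vm-size <ND_GB200_v6_VM_size>

# 2. Deploy Dynamo via Helm (chart reference and values file are placeholders;
#    the guide provides the published chart and GB200 NVL72-specific values).
helm install dynamo <dynamo-chart> -f gb200-values.yaml

# 3. Confirm the GPU nodes are Ready before running benchmarks.
kubectl get nodes -l kubernetes.azure.com/agentpool=gb200np
```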

Find the complete recipe for GPT-OSS 120B at aka.ms/dynamo-recipe-gpt-oss-120b and get hands-on with the deployment guide at aka.ms/aks-dynamo.

The results

By following this approach, we achieved 1.2 million tokens per second, meeting our goal of replicating SemiAnalysis InferenceMAX results at enterprise scale. This demonstrates that Dynamo on AKS running on ND GB200-v6 instances can deliver the performance needed for production inference workloads.

Looking ahead

This work reflects a deep collaboration between Azure and NVIDIA to reimagine how large-scale inference is built and operated, from the hardware up through the software stack. By combining GB200 NVL72 nodes and the open-source Dynamo project on AKS, we’ve taken a step toward making distributed inference faster, more efficient, and more responsive to real-world demands.

This post focused on the foundational serving stack. In upcoming blogs, we will build on this foundation and explore more of Dynamo's advanced features, such as Disaggregated Serving and the SLA-based Planner. We'll demonstrate how these features allow for even greater efficiency, moving from a static, holistic deployment to a flexible, phase-split architecture. Moving forward, we also plan to extend our testing to include larger mixture-of-experts (MoE) reasoning models such as DeepSeek R1. We encourage you to try out the Dynamo recipe in this blog on AKS and share your feedback!

Pair llm-d Inference with KAITO RAG Advanced Search to Enhance your AI Workflows

· 10 min read
Ernest Wong
Software Engineer at Microsoft
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Overview

In this blog, we'll guide you through setting up an OpenAI API-compatible inference endpoint with llm-d and integrating it with retrieval-augmented generation (RAG) on AKS. This blog will showcase its value in a key finance use case: indexing the latest SEC 10-K filings for two S&P 500 companies and querying them. We'll also highlight the benefits of llm-d based on its architecture and its synergy with RAG.

From 7B to 70B+: Serving giant LLMs efficiently with KAITO and ACStor v2

· 6 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Francis Yu
Product Manager focusing on storage orchestration for Kubernetes workloads

XL-size large language models (LLMs) are quickly evolving from experimental tools to essential infrastructure. Their flexibility, ease of integration, and growing range of capabilities are positioning them as core components of modern software systems.

Massive LLMs power virtual assistants and recommendations across social media, UI/UX design tooling and self-learning platforms. But how do they differ from your average language model? And how do you get the best bang for your buck running them at scale?

Let’s unpack why large models matter and how Kubernetes, paired with NVMe local storage, accelerates intelligent app development.

Simplifying InfiniBand on AKS

· 5 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Suraj Deshmukh
Software Engineer at Microsoft
Ernest Wong
Software Engineer at Microsoft

High performance computing (HPC) workloads, like large-scale distributed AI training and inferencing, often require fast, reliable data transfer and synchronization across the underlying compute. Model training, for example, requires shared memory across GPUs because the parameters and gradients need to be constantly shared. For models with billions of parameters, the available memory in a single GPU node may not be enough, so "pooling" the memory across multiple nodes also requires high memory bandwidth due to the sheer volume of data involved. A common way to achieve this at scale is with a high-speed, low-latency network interconnect technology called InfiniBand (IB).

Deploy and take Flyte with an end-to-end ML orchestration solution on AKS

· 7 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Data is often at the heart of application design and development - it fuels user-centric design, provides insights for feature enhancements, and represents the value of an application as a whole. In that case, shouldn’t we use data science tools and workflows that are flexible and scalable on a platform like Kubernetes, for a range of application types?

In collaboration with David Espejo and Shalabh Chaudhri from Union.ai, we’ll dive into an example using Flyte, a platform built on Kubernetes itself. Flyte can help you manage and scale out data processing and machine learning pipelines through a simple user interface.

Fine tune language models with KAITO on AKS

· 8 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

You may have heard of the Kubernetes AI Toolchain Operator (KAITO) announced at Ignite 2023 and KubeCon Europe this year. The open source project has gained popularity in recent months by introducing a streamlined approach to AI model deployment and flexible infrastructure provisioning on Kubernetes.

With the v0.3.0 release, KAITO has expanded the supported model library to include the Phi-3 model, but the biggest (and most exciting) addition is the ability to fine-tune open-source models. Why should you be excited about fine-tuning? Well, it's because fine-tuning is one way of giving your foundation model additional training with a specific dataset to enhance accuracy, which ultimately improves the interaction with end users. (Another way to increase model accuracy is retrieval-augmented generation (RAG), which is coming soon to KAITO and which we touch on briefly in this post.)
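For context on what a KAITO deployment looks like, here is a minimal inference Workspace sketch. The field layout follows the KAITO project around v0.3.x, and the instance type and preset name are examples, so verify the schema and supported presets against the KAITO documentation:

```bash
# Minimal KAITO inference Workspace sketch (schema per KAITO ~v0.3.x; verify
# field names and preset names against the project docs before use).
kubectl apply -f - <<'EOF'
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-phi-3-mini
resource:
  instanceType: Standard_NC24ads_A100_v4   # example GPU VM size
  labelSelector:
    matchLabels:
      apps: phi-3-mini
inference:
  preset:
    name: phi-3-mini-4k-instruct           # example preset from the model library
EOF
```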