26 posts tagged with "AI"

Artificial intelligence workloads, patterns, model deployment, and orchestration on AKS.

Control AI spend with per-application token rate limiting using Application Network and agentgateway

· 5 min read
Mitch Connors
Principal Product Manager for the Azure Kubernetes Service
John Howard
Senior Architect at Solo.io
Zhewei Hu
Senior Software Engineer at Microsoft

As organizations scale AI adoption, platform teams must balance two competing goals:

  • Enable broad, low-friction access to AI services
  • Prevent a single application from exhausting shared quotas

This article describes a platform-oriented approach to controlling AI spend using Azure Kubernetes Application Network (AppNet) and agentgateway. By leveraging workload identity already present in the network, you can enforce per-application, token-based rate limiting without issuing API keys to every application.

By adopting this platform-oriented approach, you gain centralized control over AI spending, eliminate secrets distribution, and improve operational efficiency.
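
To make the pattern concrete, here is a hedged sketch of what a per-application token budget could look like as a gateway policy attached to the route fronting the AI backend, keyed off the caller's workload identity rather than an API key. The CRD group, kind, and rate-limit fields are assumptions for illustration, not the published AppNet or agentgateway schema; the post walks through the real configuration.

```yaml
# Hypothetical sketch only: the published AppNet/agentgateway policy API may differ.
apiVersion: gateway.kgateway.dev/v1alpha1   # assumed group/version
kind: TrafficPolicy                         # assumed kind
metadata:
  name: team-a-token-budget
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: openai-route                    # assumed route to the AI service
  rateLimit:                                # assumed schema
    local:
      tokenBucket:
        maxTokens: 100000                   # token budget per window, per app
        tokensPerFill: 100000
        fillInterval: 1m
```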

AI Inference on AKS enabled by Azure Arc: Generative AI using Triton and TensorRT‑LLM

· 14 min read
Datta Rajpure
Principal Group Engineering Manager at Microsoft Azure Core

In this post, you’ll deploy NVIDIA Triton Inference Server on your Azure Kubernetes Service (AKS) enabled by Azure Arc cluster to serve a Qwen‑based generative model using the TensorRT‑LLM backend. By the end, you’ll have a working generative AI inference pipeline running locally on your on‑premises GPU hardware.
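
As a preview of the deployment shape, here's a minimal sketch of running Triton with the TensorRT-LLM backend as a Kubernetes Deployment. The image tag and the PVC holding the compiled engines are placeholders; the post covers the exact setup on the Arc-enabled cluster.

```yaml
# A minimal sketch, assuming a GPU node pool is already in place.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-trtllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-trtllm
  template:
    metadata:
      labels:
        app: triton-trtllm
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3  # assumed tag
          args: ["tritonserver", "--model-repository=/models"]
          ports:
            - containerPort: 8000  # HTTP
            - containerPort: 8001  # gRPC
            - containerPort: 8002  # Prometheus metrics
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: triton-models  # assumed PVC with the model repository
```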

AI Inference on AKS enabled by Azure Arc: Predictive AI using Triton and ResNet-50

· 11 min read
Datta Rajpure
Principal Group Engineering Manager at Microsoft Azure Core

In this post, you'll deploy NVIDIA Triton Inference Server on your Azure Kubernetes Service (AKS) enabled by Azure Arc cluster to serve a ResNet-50 image classification model in ONNX format. By the end, you'll have a working predictive AI inference pipeline running on your on-premises GPU hardware.
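
For a flavor of the target state, here's a hedged sketch of a Triton pod serving an ONNX model repository, with readiness wired to Triton's standard KServe-style health endpoint. The image tag and PVC name are placeholders.

```yaml
# A hedged sketch; /v2/health/ready is Triton's standard readiness endpoint.
apiVersion: v1
kind: Pod
metadata:
  name: triton-resnet50
spec:
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.08-py3  # assumed tag
      args: ["tritonserver", "--model-repository=/models"]
      ports:
        - containerPort: 8000
      readinessProbe:
        httpGet:
          path: /v2/health/ready
          port: 8000
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: models
          mountPath: /models
  volumes:
    - name: models
      persistentVolumeClaim:
        claimName: resnet50-model-repo  # assumed PVC with model.onnx + config.pbtxt
```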

AI Inference on AKS enabled by Azure Arc: Generative AI with Open‑Source LLM Server

· 11 min read
Datta Rajpure
Principal Group Engineering Manager at Microsoft Azure Core

In this post, you'll explore how to deploy and run generative AI inference workloads using open-source large language model servers on Azure Kubernetes Service (AKS) enabled by Azure Arc. You'll focus on running these workloads locally, on-premises or at the edge, using GPU acceleration with centralized management.

AI Inference on AKS enabled by Azure Arc: Series Introduction and Scope

· 6 min read
Datta Rajpure
Principal Group Engineering Manager at Microsoft Azure Core

This series gives you practical, step-by-step guidance for experimentation with generative and predictive AI inference workloads on Azure Kubernetes Service (AKS) enabled by Azure Arc clusters, using CPUs, GPUs, and neural processing units (NPUs). The scenarios target on‑premises and edge environments, specifically Azure Local, with a focus on repeatable, hands-on experimentation rather than abstract examples.

AI Inference on AKS enabled by Azure Arc: Bringing AI to the Edge and On‑Premises

· 3 min read
Datta Rajpure
Principal Group Engineering Manager at Microsoft Azure Core

For many edge and on-premises environments, sending data to the cloud for AI inferencing isn't an option, as latency, data residency, and compliance make it a non-starter. With Azure Kubernetes Service (AKS) enabled by Azure Arc managing your Kubernetes clusters, you can run AI inferencing locally on the hardware you already have. This blog series shows you how, with hands-on tutorials for experimenting with generative and predictive AI workloads using CPUs, GPUs, and NPUs.

Optimizing RDMA performance for AI workloads on AKS with DRANET

· 10 min read
Anson Qian
Software Engineer at Azure Kubernetes Service
Michael Zappa
Software Engineer for Azure Container Networking

RDMA (Remote Direct Memory Access) is critical for unlocking the full potential of GPU infrastructure, enabling the high-throughput, low-latency GPU-to-GPU communication that large-scale AI workloads demand. In distributed training, collective operations like all-reduce and all-gather synchronize gradients and activations across GPUs — any communication bottleneck stalls the entire training pipeline. In disaggregated inference, RDMA provides the fast inter-node transfers needed to move KV-cache data between prefill and decode phases running on separate GPU pools.

DRANET is an open-source Dynamic Resource Allocation (DRA) network driver that discovers RDMA-capable devices, advertises them as ResourceSlices, and injects the allocated devices into each pod and container. Combined with the NVIDIA GPU DRA driver, it enables topology-aware co-scheduling of GPUs and NICs for high-performance AI networking on Kubernetes.
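
As a hedged illustration of the DRA flow, the sketch below claims an RDMA-capable NIC through a ResourceClaimTemplate and attaches it to a pod alongside a GPU. The DeviceClass name and trainer image are assumptions, not DRANET's actual names.

```yaml
# A hedged sketch of DRA structured parameters (resource.k8s.io/v1beta1).
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: rdma-nic
spec:
  spec:
    devices:
      requests:
        - name: nic
          deviceClassName: rdma.dranet.example  # assumed DeviceClass
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:
    - name: nic
      resourceClaimTemplateName: rdma-nic
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.08-py3   # assumed image
      command: ["sleep", "infinity"]
      resources:
        claims:
          - name: nic                           # injects the allocated NIC
        limits:
          nvidia.com/gpu: 1
```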

Scaling multi-node LLM inference with NVIDIA Dynamo and NVIDIA GPUs on AKS (Part 3)

· 8 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Devi Vasudevan
Principal Engineering Manager for Azure Upstream

This blog post is co-authored with Nikhar Maheshwari, Anish Maddipoti, Rohan Varma, Clement Pakkam Isaac, and Stephen Mccoulough from NVIDIA.

We kicked things off in Part 1 by introducing NVIDIA Dynamo on AKS and demonstrating 1.2 million tokens per second across 10 GPU nodes of GB200 NVL72. In Part 2, we explored the Dynamo Planner and Profiler for SLO-driven scaling.

In this blog, we explore how Dynamo’s KV Router makes multi-worker LLM deployments significantly more efficient, demonstrating over 20x faster Time To First Token (TTFT) and over 4x faster end-to-end latency on real-world production traces. These latency reductions not only improve the end-user experience but also maximize GPU utilization and lower the Total Cost of Ownership (TCO).

Scaling Anyscale Ray Workloads on AKS

· 7 min read
Anson Qian
Software Engineer at Azure Kubernetes Service
Bob Mital
Principal Product Manager at Microsoft Azure
Kenneth Kilty
Technical Program Manager for Cloud Native Platforms

This post focuses on running Anyscale's managed Ray service on AKS, using the Anyscale Runtime (formerly RayTurbo) for an optimized Ray experience. For open-source Ray on AKS, see our Ray on AKS overview.

Ray is an open-source distributed compute framework for scaling Python and AI workloads from a laptop to clusters with thousands of nodes. Anyscale provides a managed ML/AI platform and an optimized Ray runtime with better scalability, observability, and operability than running open-source Ray with KubeRay—including intelligent autoscaling, enhanced monitoring, and fault-tolerant training.

As part of Microsoft and Anyscale's strategic collaboration to deliver Azure-native distributed AI/ML computing at scale, we've been working closely with Anyscale to enhance the production-readiness of Ray workloads on Azure Kubernetes Service (AKS) in three critical areas:

  • Elastic scalability through multi-cluster multi-region capacity aggregation
  • Data persistence with unified storage across ML/AI development and operation lifecycle
  • Operational simplicity through automated credential management with service principal

Whether you're fine-tuning models with DeepSpeed or LLaMA-Factory, or deploying inference endpoints for LLMs ranging from small to large-scale reasoning models, Anyscale on AKS delivers a production-grade ML/AI platform that scales with your needs.
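
For the open-source path mentioned above, a minimal KubeRay RayCluster gives a sense of the moving parts; Anyscale's managed resources differ, and the Ray version and sizing here are placeholders.

```yaml
# A minimal open-source KubeRay RayCluster sketch for orientation.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-demo
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.34.0   # assumed version
            resources:
              limits:
                cpu: "2"
                memory: 4Gi
  workerGroupSpecs:
    - groupName: workers
      replicas: 2
      minReplicas: 0
      maxReplicas: 10
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.34.0
              resources:
                limits:
                  cpu: "4"
                  memory: 8Gi
```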

Autoscale KAITO inference workloads on AKS using KEDA

· 9 min read
Andy Zhang
Principal Software Engineer for the Azure Kubernetes Service
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Kubernetes AI Toolchain Operator (KAITO) is an operator that simplifies and automates AI/ML model inference, tuning, and RAG in a Kubernetes cluster. With the recent v0.8.0 release, KAITO has introduced intelligent autoscaling for inference workloads as an alpha feature! In this blog, we'll guide you through setting up event-driven autoscaling for vLLM inference workloads.
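
As a taste of what the post builds, here is a hedged KEDA ScaledObject that scales a vLLM inference Deployment on queue depth. The Prometheus address, metric query, and target Deployment name are assumptions; KAITO's alpha autoscaling wires these up for you.

```yaml
# A hedged sketch of event-driven autoscaling for vLLM with KEDA.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-autoscale
spec:
  scaleTargetRef:
    name: workspace-phi-3   # assumed KAITO inference Deployment
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # assumed address
        query: sum(vllm:num_requests_waiting)   # vLLM queue-depth metric
        threshold: "10"
```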

Scaling multi-node LLM inference with NVIDIA Dynamo and NVIDIA GPUs on AKS (Part 2)

· 8 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Sertaç Özercan
Principal Engineering Manager for Azure Upstream

This blog post is co-authored with Saurabh Aggarwal, Anish Maddipoti, Amr Elmeleegy, and Rohan Varma from NVIDIA.

In our previous post, we demonstrated the power of the Azure ND GB200-v6 VMs accelerated by NVIDIA GB200 NVL72, achieving a staggering 1.2M tokens per second across 10 nodes using NVIDIA Dynamo. Today, we're shifting focus from raw throughput to developer velocity and operational efficiency.

We will explore how the Dynamo Planner and Dynamo Profiler remove the guesswork from performance tuning on AKS.

AI Conformant Azure Kubernetes Service (AKS) clusters

· 9 min read
Ahmed Sabbour
Principal PM Lead for the Azure Kubernetes Service
Rita Zhang
Partner Software Engineer at Microsoft

As organizations increasingly move AI workloads into production, they need consistent and interoperable infrastructure they can rely on. The Cloud Native Computing Foundation (CNCF) launched the Kubernetes AI Conformance Program to address this need by creating open, community-defined standards for running AI workloads on Kubernetes. See the CNCF Kubernetes AI Conformance announcement from KubeCon North America 2025.

Azure Kubernetes Service (AKS) is proud to be among the first platforms certified for Kubernetes AI Conformance, demonstrating our commitment to providing customers with a verified, standardized platform for running AI workloads.

Scaling multi-node LLM inference with NVIDIA Dynamo and ND GB200 NVL72 GPUs on AKS

· 11 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Sertaç Özercan
Principal Engineering Manager for Azure Upstream
Rita Zhang
Partner Software Engineer at Microsoft

This blog post is co-authored with Rohan Varma, Saurabh Aggarwal, Anish Maddipoti, and Amr Elmeleegy from NVIDIA to showcase solutions that help customers run AI inference at scale using Azure Kubernetes Service (AKS) and NVIDIA’s advanced hardware and distributed inference frameworks.

Modern language models now routinely exceed the compute and memory capacity of a single GPU or even a whole node with multiple GPUs on Kubernetes. Consequently, inference at the scale of billions of model parameters demands multi-node, distributed deployment. Frameworks like the open-source NVIDIA Dynamo platform play a crucial role by coordinating execution across nodes, managing memory resources efficiently, and accelerating data transfers between GPUs to keep latency low.

How to Deploy AKS MCP Server on AKS with Workload Identity

· 15 min read
Paul Yu
Cloud Native Developer Advocate

It's been a few months since the AKS-MCP server was announced. Since then, there have been several updates and improvements. The MCP server can be installed on a local machine using the AKS Extension for VS Code, via the GitHub MCP registry, or through the Docker MCP hub.

In this blog post, I'll show you one approach to running the AKS MCP server: deploying it inside an AKS cluster as a Streamable HTTP service. This pattern demonstrates how MCP servers can be centrally managed and made accessible to multiple clients—including AI assistants, automation tools, and even autonomous agents.
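
In outline, the pattern looks like the sketch below: a Deployment opted into workload identity plus a Service exposing the MCP endpoint. The image, transport flag, port, and service account are assumptions; the post has the exact manifests.

```yaml
# A hedged outline; the azure.workload.identity/use label opts the pod into
# workload identity, and the SA is assumed to be federated with a managed identity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aks-mcp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aks-mcp
  template:
    metadata:
      labels:
        app: aks-mcp
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: aks-mcp                # assumed service account
      containers:
        - name: aks-mcp
          image: ghcr.io/azure/aks-mcp:latest      # assumed image
          args: ["--transport", "streamable-http"] # assumed flag
          ports:
            - containerPort: 8000                  # assumed port
---
apiVersion: v1
kind: Service
metadata:
  name: aks-mcp
spec:
  selector:
    app: aks-mcp
  ports:
    - port: 8000
      targetPort: 8000
```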

Pair llm-d Inference with KAITO RAG Advanced Search to Enhance your AI Workflows

· 10 min read
Ernest Wong
Software Engineer at Microsoft
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Overview

In this blog, we'll guide you through setting up an OpenAI API compatible inference endpoint with llm-d and integrating it with retrieval-augmented generation (RAG) on AKS. This blog will showcase its value in a key finance use case: indexing the latest SEC 10-K filings for two S&P 500 companies and querying them. We'll also highlight the benefits of llm-d based on its architecture and its synergy with RAG.
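
As a rough sketch of the integration, a KAITO RAGEngine can point at an existing OpenAI API compatible endpoint such as the one llm-d serves. The field names follow KAITO's v1alpha1 RAGEngine as we understand it, and every value is a placeholder; the post walks through the real resources.

```yaml
# A rough, hedged sketch; treat the exact RAGEngine schema as an assumption.
apiVersion: kaito.sh/v1alpha1
kind: RAGEngine
metadata:
  name: ragengine-demo
spec:
  compute:
    instanceType: Standard_NC24ads_A100_v4   # assumed GPU SKU
  embedding:
    local:
      modelID: BAAI/bge-small-en-v1.5        # assumed local embedding model
  inferenceService:
    url: http://llm-d-gateway.default.svc/v1/completions  # assumed llm-d endpoint
```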

Observe Smarter: Leveraging Real-Time insights via the AKS-MCP Server

· 9 min read
Qasim Sarfraz
Software Engineer at Microsoft

Introduction

Recently, we released the AKS-MCP server, which enables AKS customers to automate diagnostics, troubleshooting, and cluster management using natural language. One of its key capabilities is real-time observability using the inspektor_gadget_observability MCP tool, which leverages a technology called eBPF to help customers quickly inspect and debug applications running in AKS clusters.

Announcing the CLI Agent for AKS: Agentic AI-powered operations and diagnostics at your fingertips

· 9 min read
Pavneet Ahluwalia
Principal PM Lead for the Azure Kubernetes Service
Julia Yin
Product Manager at Microsoft
Aritra Ghosh
Senior Product Manager at Microsoft

At KubeCon India earlier this month, the AKS team shared our newest Agentic AI-powered feature with the broader Kubernetes community: the CLI Agent for AKS. CLI Agent for AKS is a new AI-powered command-line experience designed to help Azure Kubernetes Service (AKS) users troubleshoot, optimize, and operate their clusters with unprecedented ease and intelligence.

Announcing the AKS-MCP Server: Unlock Intelligent Kubernetes Operations

· 9 min read
Pavneet Ahluwalia
Principal PM Lead for the Azure Kubernetes Service

We're excited to announce the launch of the AKS-MCP Server, an open source Model Context Protocol (MCP) server designed to make your Azure Kubernetes Service (AKS) clusters AI-native and more accessible to developers, SREs, and platform engineers through Agentic AI workflows.

AKS-MCP isn't just another integration layer. It empowers cutting-edge AI assistants (such as Claude, Cursor, and GitHub Copilot) to interact with AKS through a secure, standards-based protocol—opening new possibilities for automation, observability, and collaborative cloud operations.

Streamlining Temporal Worker Deployments on AKS

· 6 min read
Steve Womack
Solutions Architect at Temporal
Brian Redmond
AKS and Azure Cloud Native Platforms

Temporal is an open source platform that helps developers build and scale resilient Enterprise and AI applications. Complex and long-running processes are easily orchestrated with durable execution, ensuring they never fail or lose state. Every step is tracked in an Event History that lets developers easily observe and debug applications. In this guide, we will help you understand how to run and scale your workers on Azure Kubernetes Service (AKS).
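
Because Temporal workers are stateless pollers, the Kubernetes shape is pleasantly simple: a plain Deployment you scale horizontally. In the minimal sketch below, the image, endpoint, and task queue are placeholders for your own worker build.

```yaml
# A minimal sketch of a Temporal worker Deployment; values are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-worker
spec:
  replicas: 3   # scale horizontally; Temporal balances task polling across workers
  selector:
    matchLabels:
      app: order-worker
  template:
    metadata:
      labels:
        app: order-worker
    spec:
      containers:
        - name: worker
          image: myregistry.azurecr.io/order-worker:1.0.0  # hypothetical image
          env:
            - name: TEMPORAL_ADDRESS
              value: my-namespace.tmprl.cloud:7233         # assumed Temporal endpoint
            - name: TEMPORAL_TASK_QUEUE
              value: orders
```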

From 7B to 70B+: Serving giant LLMs efficiently with KAITO and ACStor v2

· 6 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Francis Yu
Product Manager focusing on storage orchestration for Kubernetes workloads

XL-size large language models (LLMs) are quickly evolving from experimental tools to essential infrastructure. Their flexibility, ease of integration, and growing range of capabilities are positioning them as core components of modern software systems.

Massive LLMs power virtual assistants and recommendations across social media, UI/UX design tooling, and self-learning platforms. But how do they differ from your average language model? And how do you get the best bang for your buck running them at scale?

Let’s unpack why large models matter and how Kubernetes, paired with NVMe local storage, accelerates intelligent app development.
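
On the storage side, a minimal sketch: serving 70B+ weights from local NVMe can be as simple as a PVC against the class Azure Container Storage publishes. The storage class name below is an assumption; use the one your ACStor v2 installation creates.

```yaml
# A hedged sketch, assuming ACStor v2 publishes a local-NVMe storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-weights
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local   # assumed ACStor v2 local-NVMe class
  resources:
    requests:
      storage: 200Gi        # 70B+ weights run to hundreds of GiB
```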

Simplifying InfiniBand on AKS

· 5 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Suraj Deshmukh
Software Engineer at Microsoft
Ernest Wong
Software Engineer at Microsoft

High performance computing (HPC) workloads, like large-scale distributed AI training and inferencing, often require fast, reliable data transfer and synchronization across the underlying compute. Model training, for example, requires shared memory across GPUs because the parameters and gradients need to be constantly shared. For models with billions of parameters, the available memory in a single GPU node may not be enough, so "pooling" the memory across multiple nodes also requires high memory bandwidth due to the sheer volume of data involved. A common way to achieve this at scale is with a high-speed, low-latency network interconnect technology called InfiniBand (IB).
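
For a sense of what consuming IB looks like from a pod spec, here is a hedged sketch that requests GPUs plus an RDMA resource along with the IPC_LOCK capability that RDMA memory registration requires. The RDMA resource name depends on the device plugin installed on the cluster, so treat it as an assumption.

```yaml
# A hedged sketch of a training pod consuming InfiniBand alongside GPUs.
apiVersion: v1
kind: Pod
metadata:
  name: nccl-worker
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.08-py3  # assumed image
      command: ["sleep", "infinity"]
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]  # lets the RDMA stack pin (register) memory
      resources:
        limits:
          nvidia.com/gpu: 8
          rdma/ib: 1         # assumed resource name from the RDMA device plugin
```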

Limitless Kubernetes Scaling for AI and Data-intensive Workloads: The AKS Fleet Strategy

· 7 min read
Pavneet Ahluwalia
Principal PM Lead for the Azure Kubernetes Service

With the fast-paced advancement of AI workloads, the building and fine-tuning of multi-modal models, and extensive batch data processing jobs, more and more enterprises are leaning on Kubernetes platforms for their ability to scale and optimize compute resources. With AKS, you can manage up to 5,000 nodes (the upstream Kubernetes limit) in a single cluster under optimal conditions, but for some large enterprises, that might not be enough.
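
The mechanism behind that multi-cluster scale-out is Azure Kubernetes Fleet Manager's ClusterResourcePlacement, which propagates resources across member clusters. A minimal sketch, with placeholder names:

```yaml
# A minimal sketch of spreading a namespace across Fleet member clusters.
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: training-jobs
spec:
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: training       # placeholder namespace to propagate
  policy:
    placementType: PickN
    numberOfClusters: 3    # schedule onto three member clusters
```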

Deploy and take Flyte with an end-to-end ML orchestration solution on AKS

· 7 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Data is often at the heart of application design and development - it fuels user-centric design, provides insights for feature enhancements, and represents the value of an application as a whole. Shouldn't we therefore use data science tools and workflows that are flexible and scalable on a platform like Kubernetes, across a range of application types?

In collaboration with David Espejo and Shalabh Chaudhri from Union.ai, we’ll dive into an example using Flyte, a platform built on Kubernetes itself. Flyte can help you manage and scale out data processing and machine learning pipelines through a simple user interface.

Fine tune language models with KAITO on AKS

· 8 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

You may have heard of the Kubernetes AI Toolchain Operator (KAITO) announced at Ignite 2023 and KubeCon Europe this year. The open source project has gained popularity in recent months by introducing a streamlined approach to AI model deployment and flexible infrastructure provisioning on Kubernetes.

With the v0.3.0 release, KAITO has expanded the supported model library to include the Phi-3 model, but the biggest (and most exciting) addition is the ability to fine-tune open-source models. Why should you be excited about fine-tuning? Well, fine-tuning is one way of giving your foundation model additional training using a specific dataset to enhance accuracy, which ultimately improves interactions with end users. (Another way to increase model accuracy is retrieval-augmented generation (RAG), coming soon to KAITO, which we touch on briefly in this post.)
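
For a feel of the API, a KAITO tuning Workspace looks roughly like the sketch below: pick a preset, a tuning method such as QLoRA, an input dataset, and an output image for the resulting adapter. The instance type, dataset URL, and registry are placeholders.

```yaml
# A sketch of a KAITO tuning Workspace; values are placeholders.
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-tuning-phi-3
resource:
  instanceType: Standard_NC24ads_A100_v4   # assumed GPU SKU
  labelSelector:
    matchLabels:
      app: tuning-phi-3
tuning:
  preset:
    name: phi-3-mini-4k-instruct
  method: qlora
  input:
    urls:
      - https://example.com/data/train.parquet          # placeholder dataset
  output:
    image: myregistry.azurecr.io/adapters/phi-3:latest  # placeholder registry
    imagePushSecret: acr-push-secret
```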