This blog post is co-authored with
Rohan Varma,
Saurabh Aggarwal,
Anish Maddipoti, and
Amr Elmeleegy from NVIDIA
to showcase solutions
that help customers run AI inference at scale using Azure Kubernetes Service
(AKS) and NVIDIA’s advanced hardware and distributed inference frameworks.
Modern language models now routinely exceed the compute and memory capacity of
a single GPU, or even an entire multi-GPU node, on Kubernetes.
Consequently, inference at the
scale of billions of parameters demands a multi-node, distributed
deployment. Frameworks like the open-source NVIDIA Dynamo platform
play a crucial role by coordinating execution across nodes, managing
memory resources efficiently, and accelerating data transfers between GPUs
to keep latency low.
However, software alone cannot solve these challenges. The underlying hardware
must also support this level of scale and throughput. Rack-scale systems like
Azure ND GB200-v6
VMs, accelerated by NVIDIA GB200 NVL72, meet this need by integrating
72 NVIDIA Blackwell
GPUs into a single system connected by a high-bandwidth, low-latency
NVLink interconnect. This architecture treats the rack as a unified compute engine,
enabling the fast, efficient communication and scaling that traditional
multi-node setups struggle to achieve.
For the most demanding or unpredictable workloads, even advanced hardware combined with a distributed inference framework
is not sufficient on its own. Inference traffic spikes unpredictably, and
fixed, static inference configurations with predetermined resource
allocation lead to GPU underutilization or overprovisioning. Instead,
inference infrastructure must adjust dynamically in real time, scaling
resources up or down to match current demand without wasting GPU capacity
or risking performance degradation.
A holistic solution: ND GB200-v6 VMs and Dynamo on AKS
To effectively address the variability in inference traffic in distributed
deployments, our approach combines three key components: ND GB200-v6
VMs, the NVIDIA Dynamo inference framework, and an Azure Kubernetes
Service (AKS) cluster. Together, these technologies provide the scale,
flexibility, and responsiveness necessary to meet the demands of modern,
large-scale inference workloads.
ND GB200-v6: Rack-Scale Accelerated Hardware
At the core of Azure’s ND GB200-v6 VM series
is the liquid-cooled NVIDIA GB200 NVL72 system, a rack-scale architecture
that integrates 72 NVIDIA Blackwell GPUs and 36 NVIDIA Grace™ CPUs into a
single, tightly coupled domain.
The rack-scale design of ND GB200-v6 unlocks model serving patterns that were
previously infeasible due to interconnect and memory bandwidth constraints.

NVIDIA Dynamo: a distributed inference framework
NVIDIA Dynamo is an open-source
distributed inference serving framework that supports multiple engine
backends, including vLLM,
TensorRT-LLM, and
SGLang. It disaggregates the
prefill (compute-bound) and decode (memory-bound) phases across separate GPUs,
enabling independent scaling and phase-specific parallelism strategies.
For example, the memory-bound decode phase can leverage wide
expert parallelism (EP)
without constraining the compute-heavy prefill phase, improving overall
resource utilization and performance.
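To make the idea concrete, here is a minimal Python sketch of what prefill/decode disaggregation looks like conceptually. The pool sizes, parallelism settings, and in-process KV handoff below are illustrative assumptions rather than Dynamo's actual APIs; in a real deployment the KV cache moves between workers over the GPU interconnect.

```python
# Illustrative sketch only: plain Python objects mimicking the idea of
# prefill/decode disaggregation, not Dynamo's actual classes or interfaces.
from dataclasses import dataclass, field

@dataclass
class WorkerPool:
    """A pool of workers for one phase, each with its own parallelism strategy."""
    phase: str                  # "prefill" or "decode"
    replicas: int               # number of workers serving this phase
    parallelism: dict = field(default_factory=dict)  # e.g. {"tp": 4} or {"ep": 16}

# Hypothetical configuration: compute-heavy prefill uses tensor parallelism,
# memory-bound decode uses wide expert parallelism and more replicas.
prefill_pool = WorkerPool("prefill", replicas=2, parallelism={"tp": 4})
decode_pool = WorkerPool("decode", replicas=6, parallelism={"ep": 16})

def serve(prompt_tokens: list) -> str:
    # 1) Prefill: process the full prompt once and produce the KV cache.
    kv_cache = {"blocks": len(prompt_tokens) // 16, "owner": prefill_pool.phase}
    # 2) Hand off the KV cache to a decode worker (over the interconnect in a real system).
    kv_cache["owner"] = decode_pool.phase
    # 3) Decode: generate output tokens against the transferred cache.
    return f"generated with {kv_cache['blocks']} cached KV blocks"

print(serve(list(range(8000))))
```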
Dynamo includes an
SLA-based Planner
that proactively manages GPU scaling for prefill/decode (PD) disaggregated
inference. Using pre-deployment profiling, it evaluates how model parallelism
and batching affect performance, recommending configurations that meet
latency targets like Time to First Token (TTFT) and Inter-Token Latency (ITL)
within a given GPU budget. At runtime, the Planner forecasts traffic with
time-series models, dynamically adjusting PD worker counts based on predicted
demand and real-time metrics.
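The sketch below illustrates the kind of arithmetic the Planner automates: combine profiled per-GPU throughput at the SLA point with a traffic forecast to size the prefill and decode pools. Every number and function name here is a hypothetical placeholder, not a real profiling result or the Planner's interface.

```python
import math

# Hypothetical pre-deployment profiling results (not real GB200 numbers):
# the highest per-GPU throughput observed while still meeting the SLA.
PREFILL_TOKENS_PER_GPU_PER_S = 40_000  # prompt tokens/s per prefill GPU at the TTFT target
DECODE_TOKENS_PER_GPU_PER_S = 10_000   # output tokens/s per decode GPU at the ITL target

def plan_replicas(predicted_rps: float, avg_prompt_tokens: int, avg_output_tokens: int):
    """Return (prefill_gpus, decode_gpus) needed for the forecast load."""
    prefill_load = predicted_rps * avg_prompt_tokens   # prompt tokens/s to ingest
    decode_load = predicted_rps * avg_output_tokens    # output tokens/s to generate
    prefill_gpus = math.ceil(prefill_load / PREFILL_TOKENS_PER_GPU_PER_S)
    decode_gpus = math.ceil(decode_load / DECODE_TOKENS_PER_GPU_PER_S)
    return prefill_gpus, decode_gpus

# A traffic spike: 200 requests/s, ~8,000 prompt tokens and ~50 output tokens each.
print(plan_replicas(200, 8_000, 50))   # -> (40, 1) with these made-up numbers
```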
The Dynamo
LLM-aware Router
manages the key-value (KV) cache across large GPU clusters by hashing requests
and tracking cache locations. It calculates overlap scores between incoming
requests and cached KV blocks, routing requests to GPUs that maximize cache
reuse while balancing workload. This cache-aware routing reduces costly KV
recomputation and avoids bottlenecks, which in turn improves performance, especially for
large models with long context windows.
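Conceptually, cache-aware routing boils down to scoring workers by KV-block overlap and penalizing the busy ones. The toy Python below illustrates that scoring; the block size, hashing scheme, and load penalty are simplifying assumptions, not the Router's actual implementation.

```python
from collections import defaultdict

BLOCK = 16  # tokens per KV block (hypothetical block size)

def block_hashes(tokens: list) -> set:
    """Hash the prompt prefix block by block, the way a KV-cache index might."""
    return {hash(tuple(tokens[i:i + BLOCK]))
            for i in range(0, len(tokens) - BLOCK + 1, BLOCK)}

kv_index = defaultdict(set)         # worker_id -> KV block hashes resident on that worker
active_requests = defaultdict(int)  # worker_id -> rough load signal

def route(tokens: list, workers: list) -> str:
    """Pick the worker that maximizes cache overlap, penalized by current load."""
    req_blocks = block_hashes(tokens)

    def score(worker: str) -> float:
        overlap = len(req_blocks & kv_index[worker])    # KV blocks that can be reused
        return overlap - 0.5 * active_requests[worker]  # hypothetical load penalty

    best = max(workers, key=score)
    kv_index[best] |= req_blocks                        # those blocks now live there
    active_requests[best] += 1
    return best

workers = ["gpu-0", "gpu-1"]
session_context = list(range(800))                      # shared product/user context
print(route(session_context, workers))                  # first request lands on gpu-0
print(route(session_context + [1, 2, 3], workers))      # follow-up reuses gpu-0's cache
```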
To reduce GPU memory overhead, the Dynamo
KV Block Manager
offloads infrequently accessed KV blocks to CPU RAM, SSDs, or object storage.
It supports hierarchical caching and intelligent eviction policies across
nodes, scaling cache storage to petabyte levels while preserving reuse
efficiency.
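Here is a minimal sketch of the offloading idea, assuming a simple two-tier cache with LRU eviction: cold blocks leave the "GPU" tier for a larger, slower tier instead of being dropped, so a later hit avoids recomputation. Dynamo's KVBM is far more sophisticated (multi-node, multiple storage backends), but the principle is the same.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: a small 'GPU' tier evicts its least recently
    used blocks to a larger 'CPU/SSD' tier instead of discarding them."""

    def __init__(self, gpu_capacity_blocks: int):
        self.gpu = OrderedDict()   # block_id -> payload, kept in LRU order
        self.offloaded = {}        # block_id -> payload (CPU RAM / SSD / object store)
        self.capacity = gpu_capacity_blocks

    def put(self, block_id: str, payload: bytes) -> None:
        self.gpu[block_id] = payload
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.capacity:
            cold_id, cold = self.gpu.popitem(last=False)  # evict the coldest block...
            self.offloaded[cold_id] = cold                # ...to the next tier, not away

    def get(self, block_id: str):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.offloaded:                    # cache hit without recompute,
            payload = self.offloaded.pop(block_id)        # just a slower fetch
            self.put(block_id, payload)
            return payload
        return None                                       # true miss: recompute needed

cache = TieredKVCache(gpu_capacity_blocks=2)
for i in range(4):
    cache.put(f"session-{i}", b"kv-block")
print(cache.get("session-0") is not None)   # True: offloaded, not lost
```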
Dynamo’s disaggregated execution model is especially effective for large,
dynamic inference workloads where compute and memory demands shift across
phases. The Azure Research paper "Splitwise: Efficient generative LLM
inference using phase splitting"
demonstrated the benefits of separating the compute-intensive prefill and
memory-bound decode phases of LLM inference onto different hardware. We will
explore this disaggregated model in detail in an upcoming blog post.

How Dynamo can optimize AI product recommendations in e-commerce apps
Let’s put Dynamo’s features in context by walking through a realistic app
scenario and exploring how the framework addresses common inference challenges
on AKS.
Imagine you operate a large e-commerce platform (or provide infrastructure for
one), where customers browse thousands of products in real time. The app runs
on AKS and experiences traffic surges during sales, launches, and seasonal
events. The app also leverages LLMs to generate natural language outputs,
such as:
- Context-aware product recommendations
- Dynamic product descriptions
- AI-generated upsells based on behavior, reviews, or search queries
This architecture powers user experiences like: “Customers who viewed this
camera also looked at these accessories, chosen for outdoor use and battery
compatibility.” Personalized product copy is dynamically rewritten for
different segments, such as “For photographers” vs. “For frequent travelers.”
Behind the scenes, it requires a multi-stage LLM pipeline: retrieving
product/user context, running prompted inference, and generating natural
language outputs per session.
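As a rough sketch of that pipeline, the snippet below stubs out context retrieval and sends a prompted request to the serving endpoint, assuming an OpenAI-compatible HTTP frontend. The service URL, model name, and payload shape are illustrative assumptions, not the production app.

```python
# Hypothetical sketch of the three-stage pipeline described above; the endpoint,
# model name, and payload fields are assumptions made for illustration.
import json
import urllib.request

SERVING_URL = "http://dynamo-frontend:8000/v1/chat/completions"  # assumed service/port

def fetch_context(user_id: str, product_id: str) -> str:
    # Stage 1: retrieve user history and product metadata (stubbed here);
    # in the real app this is a catalog/profile lookup, often ~8,000 tokens.
    return f"user={user_id} recently viewed cameras; product={product_id} is a mirrorless body"

def recommend(user_id: str, product_id: str) -> str:
    context = fetch_context(user_id, product_id)
    # Stage 2: prompted inference against the serving endpoint.
    payload = {
        "model": "gpt-oss-120b",
        "messages": [
            {"role": "system", "content": "Recommend accessories for outdoor photographers."},
            {"role": "user", "content": context},
        ],
        "max_tokens": 50,   # Stage 3: a short natural-language blurb per session
    }
    req = urllib.request.Request(
        SERVING_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# print(recommend("u-123", "cam-456"))   # requires a running endpoint
```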
Common pain points and how Dynamo tackles them
-
Heavy Prefill + Lightweight Decode = GPU Waste
Generating personalized recommendations requires a heavy prefill stage
(processing more than 8,000 tokens of context) but results in short outputs
(~50 tokens). Running both on a single GPU can be inefficient.
Dynamo Solution: The pipeline is split into two distinct stages, each
deployed on separate GPUs. This allows independent configuration of GPU
count and model parallelism for each phase. It also enables the use of
different GPU types, for example GPUs with high compute capability but
less memory for the prefill stage, and GPUs with high memory bandwidth and
capacity for the decode stage.
In our e-commerce example, when a user lands on a product page:
-
Prefill runs uninterrupted on dedicated GPUs using model parallelism
degrees optimized for accelerating math-intensive attention GEMM
operations. This enables fast processing of 8,000 tokens of user context
and product metadata.
-
Decode runs on a separate GPU pool whose size and parallelism degrees
are tuned to maximize memory bandwidth and capacity for
generating the short product blurb.
Result: This approach maximizes GPU utilization and reduces per-request
cost.
-
Meeting SLOs and handling traffic spikes without overprovisioning
Your SLO might define time-to-first-token < 300ms and 99th percentile
latency < 500ms, but maintaining this across dynamic workloads is tough.
Static GPU allocation leads to bottlenecks during traffic spikes,
causing either SLO violations or wasted capacity.
Dynamo Solution: Dynamo continuously monitors metrics and auto-scales GPU
replicas or reallocates GPUs between the prefill and decode stages based on
real-time traffic patterns, queue depth, and latency targets (a toy version
of this reallocation logic is sketched after this list).
In our e-commerce example:
- During Black Friday, Dynamo observes latency climbing due to a surge in
prefill demand. It responds by increasing prefill GPU replicas by 50%,
shifting GPUs from decode or spinning up additional ones.
- At night, when email generation jobs dominate, Dynamo reallocates GPUs
back to decode to optimize throughput.
- When load drops, resources scale back down.
Result: SLOs are met consistently without over- or under-provisioning,
controlling costs while maintaining performance.
-
Recomputing shared context is wasteful
Many requests within the same session reuse the same product or user
context but unnecessarily recompute the KV cache each time, wasting
valuable GPU resources that could be spent serving other user requests.
Dynamo Solution: LLM-aware routing maintains a map of KV cache across
large GPU clusters and directs requests to the GPUs that already hold the
relevant KV cache, avoiding redundant computation.
In our e-commerce example:
- A user browses five similar items in one session.
- Dynamo routes all requests to the same GPU that already has the user’s
or product’s context cached.
Result: Faster response times, lower latency, reduced GPU usage.
-
KV cache growth blows past GPU memory
With many concurrent sessions and large input sequence lengths, the
KV cache (product data + user history) can exceed available GPU memory.
This can trigger evictions, leading to costly re-computations or inference
errors.
Dynamo Solution: The KV Block Manager (KVBM) offloads cold or unused KV
cache data to CPU RAM, NVMe, or networked storage, freeing valuable GPU
memory for active requests.
In our e-commerce example:
- Without cache offloading: an increasing number of concurrent sessions
per GPU drives up latency due to KV cache evictions and recomputations.
- With Dynamo: GPUs can support higher concurrency while maintaining
low latency.
Result: Higher concurrency at lower cost, without degrading user
experience.
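To illustrate pain point 2, here is a toy reallocation loop in that spirit: shift a GPU toward whichever phase is breaching its latency target. The thresholds, the interpretation of p99 latency as a decode-side signal, and the one-GPU-at-a-time policy are all simplifying assumptions; Dynamo's Planner makes these decisions from richer metrics and forecasts.

```python
def rebalance(prefill_gpus: int, decode_gpus: int,
              ttft_ms: float, p99_latency_ms: float,
              ttft_slo_ms: float = 300.0, p99_slo_ms: float = 500.0):
    """Shift one GPU toward whichever phase is violating its latency target.
    TTFT pressure points at prefill; p99 latency is treated here (as an
    assumption) as a decode-side pressure signal."""
    if ttft_ms > ttft_slo_ms and decode_gpus > 1:
        return prefill_gpus + 1, decode_gpus - 1   # e.g. a Black Friday prefill surge
    if p99_latency_ms > p99_slo_ms and prefill_gpus > 1:
        return prefill_gpus - 1, decode_gpus + 1   # e.g. nightly decode-heavy jobs
    return prefill_gpus, decode_gpus               # within SLO: leave the split alone

print(rebalance(4, 8, ttft_ms=420.0, p99_latency_ms=180.0))   # -> (5, 7)
```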
Enterprise-scale inference experiments: Dynamo with GB200, running on AKS
We set out to deploy the popular open-source
GPT-OSS 120B reasoning model
using Dynamo on AKS on GB200 NVL72, adapting the
SemiAnalysis InferenceMAX recipe
for a large-scale, production-grade environment.
Our approach: leverage Dynamo as the inference server and swap GB200 NVL72
nodes in place of NVIDIA HGX™ B200, scaling the deployment
across multiple nodes.
Our goal was to replicate the performance results reported by SemiAnalysis,
but at a larger scale within an AKS environment, proving that enterprise-scale
inference with cutting-edge hardware and open-source models is not only
possible, but practical.
AKS Deployment Overview
Ready to build the same setup? Our comprehensive guide walks you through
each stage of the deployment:
- Set up your foundation: Configure GPU node pools and prepare your
inference setup with the prerequisites you will need.
- Deploy Dynamo via Helm: Get the inference server running with the right
configurations for GB200 NVL72.
- Benchmark performance with your serving engine: Test and optimize latency and throughput under production conditions (a minimal load probe is sketched below).
Find the complete recipe for GPT-OSS 120B at
aka.ms/dynamo-recipe-gpt-oss-120b
and get hands-on with the deployment guide at
aka.ms/aks-dynamo.
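Before running the full benchmarks, a quick sanity probe like the one below can confirm the endpoint is serving and give a rough single-client throughput number. The service URL, port, and model name are assumptions; real measurements should come from the benchmarking tooling referenced in the recipe.

```python
# Minimal, hypothetical load probe against an OpenAI-compatible endpoint.
import json
import time
import urllib.request

URL = "http://dynamo-frontend:8000/v1/chat/completions"    # assumed service/port
PROMPT = "Summarize the key specs of a mirrorless camera."  # toy prompt

def one_request() -> int:
    payload = {"model": "gpt-oss-120b",
               "messages": [{"role": "user", "content": PROMPT}],
               "max_tokens": 128}
    req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body.get("usage", {}).get("completion_tokens", 0)

start = time.time()
total_tokens = sum(one_request() for _ in range(32))        # sequential, for simplicity
elapsed = time.time() - start
print(f"{total_tokens / elapsed:.1f} output tokens/s (single client, unoptimized)")
```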
The results
By following this approach, we achieved 1.2 million tokens per second,
meeting our goal of replicating SemiAnalysis InferenceMAX results at enterprise
scale. This demonstrates that Dynamo on AKS running on ND GB200-v6 instances
can deliver the performance needed for production inference workloads.
Looking ahead
This work reflects a deep collaboration between Azure and NVIDIA to reimagine
how large-scale inference is built and operated, from the hardware up through
the software stack. By combining GB200 NVL72 nodes and the open-source Dynamo
project on AKS, we’ve taken a step toward making distributed inference faster,
more efficient, and more responsive to real-world demands.
This post focused on the foundational serving stack. In upcoming
blogs, we will build on this foundation and explore more of Dynamo's
advanced features, such as
Disaggregated Serving
and the SLA-based Planner.
We'll demonstrate how these features allow for even greater efficiency, moving
from a static, aggregated deployment to a flexible, phase-split architecture.
Moving forward, we also plan to extend our testing to include larger
mixture-of-experts (MoE) reasoning models such as DeepSeek R1.
We encourage you to try out the
Dynamo recipe
in this blog on AKS and share your feedback!