Skip to main content

Azure Monitor dashboards with Grafana in Azure Portal

· 7 min read
Aritra Ghosh
Senior Product Manager at Microsoft
Kayode Prince
Senior Program Manager at Microsoft

Introduction

As Kubernetes adoption accelerates, engineers need streamlined, cost-effective tools for cluster observability. Until now, this often meant deploying and managing separate monitoring stacks. Azure Monitor's latest integration with Grafana changes this: cluster insights are now just a click away in the Azure portal.

Announcing Azure Container Storage v2.0.0: Transforming Performance for Stateful Workloads on AKS

· 9 min read
Saurabh Sharma
Product Manager for Cloud Native Storage initiatives

Introduction

Last year we announced the general availability of Azure Container Storage, the industry’s first platform-managed container native storage service in the public cloud. This solution delivers high performance and scalable storage that can effectively meet the demands of containerized environments. Today we are announcing a new v2.0.0 release of Azure Container Storage for Azure Kubernetes Service (AKS). It builds on the foundation of previous release and takes it further by focusing on higher performance, lower latency, efficient resource management and a Kubernetes native user experience for managing stateful workloads on AKS.

Pair llm-d Inference with KAITO RAG Advanced Search to Enhance your AI Workflows

· 10 min read
Ernest Wong
Software Engineer at Microsoft
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Overview

In this blog, we'll guide you through setting up an OpenAI API compatible inference endpoint with llm-d and integrating with retrieval- augmented generation (RAG) on AKS. This blog will showcase its value in a key finance use case: indexing the latest SEC 10-K filings for the two S&P 500 companies and querying them. We’ll also highlight the benefits of llm-d based on its architecture and its synergy with RAG.

Observe Smarter: Leveraging Real-Time insights via the AKS-MCP Server

· 9 min read
Qasim Sarfraz
Software Engineer at Microsoft

Introduction

Recently, we released the AKS-MCP server, which enables AKS customers to automate diagnostics, troubleshooting, and cluster management using natural language. One of its key capabilities is real-time observability using inspektor_gadget_observability MCP tool, which leverages a technology called eBPF to help customers quickly inspect and debug applications running in AKS clusters.

Announcing the CLI Agent for AKS: Agentic AI-powered operations and diagnostics at your fingertips

· 9 min read
Pavneet Ahluwalia
Principal PM Lead for the Azure Kubernetes Service
Julia Yin
Product Manager at Microsoft
Aritra Ghosh
Senior Product Manager at Microsoft

At KubeCon India earlier this month, the AKS team shared our newest Agentic AI-powered feature with the broader Kubernetes community: the CLI Agent for AKS. CLI Agent for AKS is a new AI-powered command-line experience designed to help Azure Kubernetes Service (AKS) users troubleshoot, optimize, and operate their clusters with unprecedented ease and intelligence.

Announcing the AKS-MCP Server: Unlock Intelligent Kubernetes Operations

· 9 min read
Pavneet Ahluwalia
Principal PM Lead for the Azure Kubernetes Service

We're excited to announce the launch of the AKS-MCP Server. An open source Model Context Protocol (MCP) server designed to make your Azure Kubernetes Service (AKS) clusters AI-native and more accessible for developers, SREs, and platform engineers through Agentic AI workflows.

AKS-MCP isn't just another integration layer. It empowers cutting-edge AI assistants (such as Claude, Cursor, and GitHub Copilot) to interact with AKS through a secure, standards-based protocol—opening new possibilities for automation, observability, and collaborative cloud operations.

Accelerate DNS Performance with LocalDNS

· 6 min read
Vaibhav Arora
Product Manager for Azure Kubernetes Service

DNS performance issues can cripple production Kubernetes clusters, causing application timeouts and service outages. LocalDNS in AKS solves this by moving DNS resolution directly to each node, delivering 10x faster queries and improved reliability. In this post, we share the results from our internal tests showing exactly how much of an improvement LocalDNS can make and how it can benefit your cluster.

Streamlining Temporal Worker Deployments on AKS

· 6 min read
Steve Womack
Solutions Architect at Temporal
Brian Redmond
AKS and Azure Cloud Native Platforms

Temporal is an open source platform that helps developers build and scale resilient Enterprise and AI applications. Complex and long-running processes are easily orchestrated with durable execution, ensuring they never fail or lose state. Every step is tracked in an Event History that lets developers easily observe and debug applications. In this guide, we will help you understand how to run and scale your workers on Azure Kubernetes Service (AKS).

Debugging DNS in AKS with Inspektor Gadget

· 7 min read
Jose Blanquicet
Senior Software Engineer at Microsoft

If you're reading this, you likely have heard the phrase "It's always DNS." This is a common joke amongst developers that the root of many issues is related to DNS.

In this blog we aim to empower you to identify the root cause of DNS issues and get back to green. You can also watch the video walkthrough from Microsoft Build Breakout Session #181 starting at the 5-minute mark.

Scaling Safely with Azure AKS Spot Node Pools Using Cluster Autoscaler Priority Expander

· 4 min read
Pavneet Ahluwalia
Principal PM Lead for the Azure Kubernetes Service

As engineering teams seek to optimize costs and maintain scalability in the cloud, leveraging Azure Spot Virtual Machines (VMs) in Azure Kubernetes Service (AKS) can help dramatically reduce compute costs for workloads tolerant of interruption.

However, operationalizing spot nodes safely—especially for production or critical workloads—requires deliberate strategies around cluster autoscaling and workload placement.

Performance Tuning AKS for Network Intensive Workloads

· 6 min read
Anson Qian
Software Engineer at Microsoft
Alyssa Vu
Software Engineer at Microsoft

As more intelligent applications are deployed and hosted on Azure Kubernetes Service (AKS), network performance becomes increasingly critical to ensuring a seamless user experience. For example, a chatbot server running in an AKS cluster needs to handle high volumes of network traffic with low latency, while retrieving contextual data — such as conversation history and user feedback from a database or cache, and interacting with a LLM (Large Language Model) endpoint through prompt requests and streamed inference responses.

In this blog post, we share how we conducted simple benchmarks to evaluate and compare network performance across various VM (Virtual Machine) SKUs and series. We also provide recommendations on key kernel settings to help you explore the trade-offs between network performance and resource usage.

From 7B to 70B+: Serving giant LLMs efficiently with KAITO and ACStor v2

· 6 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Francis Yu
Product Manager focusing on storage orchestration for Kubernetes workloads

XL-size large language models (LLMs) are quickly evolving from experimental tools to essential infrastructure. Their flexibility, ease of integration, and growing range of capabilities are positioning them as core components of modern software systems.

Massive LLMs power virtual assistants and recommendations across social media, UI/UX design tooling and self-learning platforms. But how do they differ from your average language model? And how do you get the best bang for your buck running them at scale?

Let’s unpack why large models matter and how Kubernetes, paired with NVMe local storage, accelerates intelligent app development.

What's New?! Guidance Updates for Stateful Workloads on AKS

· 5 min read
Colin Mixon
Product Manager 2 focusing on HPC and stateful workloads on Azure Kubernetes Service

Helping you deploy on AKS

Building on our initial announcement for Deploying Open Source Software on Azure Azure is excited to announce we have expanded our library of technical best practice deployment guides for stateful workloads on AKS. We have developed a comprehensive guide for deploying Kafka on AKS, and updated our Postgres guidance with additional storage considerations for data resiliency, performance and cost. We have also added Terraform templates to our Mongo DB and Valkey guides for automated deployments.

Using Stream Analytics to Filter AKS Control Plane Logs

· 11 min read
Steve Griffith
Microsoft App Innovation Global Blackbelt team

While AKS does not provide access to the cluster's managed control plane, it does provide access to the control plane component logs via diagnostic settings. The easiest option to persist and search this data is to send it directly to Azure Log Analytics, however there is a large amount of data in those logs, which makes it cost prohibitive in Log Analytics. Alternatively, you can send all the data to an Azure Storage Account, but then searching and alerting can be challenging.

To address the above challenge, one option is to stream the data to Azure Event Hub, which then gives you the option to use Azure Stream Analytics to filter out events that you deem important and then just store the rest in cheaper storage (ex. Azure Storage) for potential future diagnostic needs.

In this walkthrough we'll create an AKS cluster, enable diagnostic logging to Azure Stream Analytics and then demonstrate how to filter out some key records.

Azure VM Generations and AKS

· 6 min read
Jack Jiang
Product Manager at Microsoft
Ally Ford
Product Manager 2 at Microsoft
Sarah Zhou
Product Manager at Microsoft

What are Virtual Machine Generations?

If you are a user of Azure, you may be familiar with virtual machines. What you may not have known is the fact that Azure now offers two generations of virtual machines!

Before going further, let's first break down virtual machines. Azure virtual machines are offered in various "sizes," which are broken down by the amount and type of each resource allocated, such as CPU, memory, storage, and network bandwidth. These resources are tied to a portion of a physical server's hardware capabilities. Physical servers may be broken down into many different VM size series or configurations available utilizing its resources.

As the physical hardware ages and newer components become available, older hardware and VMs get retired, while newer generation hardware and VM products are made available.

In this blog, we will go over Generation 1 and newer Generation 2 virtual machines. Both have their own use cases, and picking the right one to suit your workloads is critical in ensuring you get the best possible experience, capabilities, and cost.

Enhancing Your Operating System's Security with OS Security Patches in AKS

· 6 min read
Kaarthikeyan Subramanian
Senior Product Manager for the Azure Kubernetes Service

Traditional patching and the need for Managed patching

Operating System (OS) security patches are critical for safeguarding systems against vulnerabilities that malicious actors could exploit. These patches help ensure your system remains protected against emerging threats. Traditionally, customers have relied on nightly updates, such as unattended upgrades in Ubuntu or Automatic Guest OS Patching at the virtual machine (VM) level. However, when kernel security packages were updated, a host machine reboot was often required, typically managed using tools like kured.

Simplifying InfiniBand on AKS

· 5 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Suraj Deshmukh
Software Engineer at Microsoft
Ernest Wong
Software Engineer at Microsoft

High performance computing (HPC) workloads, like large-scale distributed AI training and inferencing, often require fast, reliable data transfer and synchronization across the underlying compute. Model training, for example, requires shared memory across GPUs because the parameters and gradients need to be constantly shared. For models with billions of parameters, the available memory in a single GPU node may not be enough, so "pooling" the memory across multiple nodes also requires high memory bandwidth due to the sheer volume of data involved. A common way to achieve this at scale is with a high-speed, low-latency network interconnect technology called InfiniBand (IB).

Optimize AKS Traffic with externalTrafficPolicy Local

· 10 min read
Mitch Shao
Senior Software Engineer for Azure Kubernetes Service
Vaibhav Arora
Product Manager for Azure Kubernetes Service

Managing external traffic in Kubernetes clusters can be a complex task, especially when striving to maintain service reliability, optimize performance, and ensure seamless user experiences. With the increasing adoption of Kubernetes in production environments, understanding and implementing best practices for external traffic management when using the Azure Load Balancer has become essential.

Limitless Kubernetes Scaling for AI and Data-intensive Workloads: The AKS Fleet Strategy

· 7 min read
Pavneet Ahluwalia
Principal PM Lead for the Azure Kubernetes Service

With the fast-paced advancement of AI workloads, building and fine-tuning of multi-modal models, and extensive batch data processing jobs, more and more enterprises are leaning into Kubernetes platforms to take advantage of its ability to scale and optimize compute resources. With AKS, you can manage up to 5,000 nodes (upstream K8s limit) in a single cluster under optimal conditions, but for some large enterprises, that might not be enough.

Enhancing Observability in Azure Kubernetes Service (AKS): What's New?

· 9 min read
Pavneet Ahluwalia
Principal PM Lead for the Azure Kubernetes Service

At Azure Kubernetes Service (AKS), we deeply recognize how crucial observability is for running stable Kubernetes environments. Given our extensive reliance on Kubernetes internally, we're continually innovating to ensure you have robust, clear, and actionable insights into your cluster health and performance. Observability—the ability to monitor, understand, and manage your systems effectively—is a foundational pillar for AKS product vision to enable our users to achieve more.

Accelerating Open-Source Innovation with AKS and Bitnami on Azure Marketplace

· 5 min read
Bob Mital
Principal Product Manager at Microsoft Azure

Azure Kubernetes Service (AKS) is a highly managed platform that simplifies deploying, managing, and scaling containerized applications using Kubernetes on Azure. When paired with Bitnami's open-source solutions available on Azure Marketplace, AKS becomes an even more powerful platform for accelerating the deployment of Kubernetes workloads that rely on popular open-source projects.

End-to-End TLS Encryption with AKS App Routing and AFD

· 14 min read
Steve Griffith
Microsoft App Innovation Global Blackbelt team

When running globally distributed public applications in Kubernetes, having access to a global traffic management solution is critical to ensuring high availability and security at the edge. Fortunately, Azure Front Door provides an easy-to-use global traffic routing capability, with integrated Content Delivery Network and Web Application Firewall.

Building Community with CRDs: Kube Resource Orchestrator

· 3 min read
Bridget Kromhout
Principal Product Manager at Microsoft Azure

Kube Resource Orchestrator introduces a Kubernetes-native, cloud-agnostic way to define groupings of Kubernetes resources. With kro, you can group your applications and their dependencies as a single resource that can be easily consumed by end users.

Just as we collaborate in upstream Kubernetes, Azure is partnering with AWS and Google Cloud on kro (pronounced “crow”) to make Kubernetes APIs simpler for all Kubernetes users. We’re centering the needs of customers and the cloud native community to offer tooling that works seamlessly no matter where you run your K8s clusters.

AKS - Community Calls

· 2 min read
Sanket Bakshi
Technical Program Manager for Cloud Native Platforms

As we start 2025, we are thrilled to announce a new initiative from the Azure Kubernetes Service (AKS) Product Group: The AKS Community Calls. These monthly sessions are designed to foster a closer connection with our community, providing a platform to discuss the product roadmap and address your questions directly

Mastering the Move: EKS to AKS by Example - Part 2

· 2 min read
Kenneth Kilty
Technical Program Manager for Cloud Native Platforms

Welcome back to our series on migrating Amazon Elastic Kubernetes Service (EKS) workloads to Azure Kubernetes Service (AKS). In Part 1 we explored migrating and Event Driven Workload using Karpenter and KEDA from EKS to AKS. Next, we look into a more complex migration scenario with a common Kubernetes workload the n-tier web application.

Local Development on AKS with mirrord

· 11 min read
Gemma Tipper
Software Engineer at MetalBear
Quentin Petraroia
Product Manager for Azure Kubernetes Service

Developing applications for Kubernetes can mean a lot of time spent waiting and relatively little time spent writing code. Whenever you want to test your code changes in the cluster, you usually have to build your application, deploy it to the cluster, and attach a remote debugger (or add a bunch of logs). These iterations can be incredibly time-consuming. Thankfully, there is a way to bridge the gap between your local environment and a remote cluster, making them feel seamlessly connected. mirrord, which can be used as a plugin for VSCode or IntelliJ or directly in the CLI, is an open-source tool that does exactly that (and much more).

Deploy and take Flyte with an end-to-end ML orchestration solution on AKS

· 7 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Data is often at the heart of application design and development - it fuels user-centric design, provides insights for feature enhancements, and represents the value of an application as a whole. In that case, shouldn’t we use data science tools and workflows that are flexible and scalable on a platform like Kubernetes, for a range of application types?

In collaboration with David Espejo and Shalabh Chaudhri from Union.ai, we’ll dive into an example using Flyte, a platform built on Kubernetes itself. Flyte can help you manage and scale out data processing and machine learning pipelines through a simple user interface.

Fine tune language models with KAITO on AKS

· 8 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

You may have heard of the Kubernetes AI Toolchain Operator (KAITO) announced at Ignite 2023 and KubeCon Europe this year. The open source project has gained popularity in recent months by introducing a streamlined approach to AI model deployment and flexible infrastructure provisioning on Kubernetes.

With the v0.3.0 release, KAITO has expanded the supported model library to include the Phi-3 model, but the biggest (and most exciting) addition is the ability to fine-tune open-source models. Why should you be excited about fine-tuning? Well, it’s because fine-tuning is one way of giving your foundation model additional training using a specific dataset to enhance accuracy, which ultimately improves the interaction with end-users. (Another way to increase model accuracy is Retrieval-Augmented Generation (RAG), which we touch on briefly in this section, coming soon to KAITO).

Mastering the Move: EKS to AKS by Example

· 3 min read
Kenneth Kilty
Technical Program Manager for Cloud Native Platforms

Many companies use multiple clouds for their workloads. Some of these companies need to accommodate the cloud preferences of their customers. Kubernetes plays a central role in multi-cloud workloads due to its ability to provide a consistent and portable environment across different cloud providers.

We would like to share the first in a new documentation series designed specifically for customers already using Amazon EKS, to help them replicate or migrate their workloads to AKS: Replicate an AWS event-driven workflow (EDW) workload with KEDA and Karpenter in Azure Kubernetes Service (AKS)

Even with Kubernetes’ portable API, moving between clouds can be challenging. Each cloud has its own unique concepts, behaviors, and characteristics that will seem unfamiliar when you’re accustomed to another cloud’s approach. This is not unlike the experience of learning a new language or visiting a new country for the first time. This series will be your local guide to the world of Azure. The samples in this series begin with infrastructure and code on EKS and end with equivalently functional infrastructure and code on AKS, while explaining the conceptual differences between AWS and Azure throughout.

Introducing Core Kubernetes Extensions for AKS

· 5 min read
Jane Guo
Product Manager at Microsoft Azure

What are Kubernetes Extensions?

Kubernetes extensions (or cluster extensions) are pre-packaged applications that simplify the installation and lifecycle management of Azure capabilities on Kubernetes clusters. Examples include Azure Backup, GitOps (Flux), and Azure Machine Learning. Third-party extensions (or Kubernetes apps), such as Datadog AKS Cluster Extension and Isovalent Cilium Enterprise, are also available in the Azure Marketplace.

Using AKS-managed Istio External Ingress Gateway with Gateway API

· 11 min read
Paul Yu
Cloud Native Developer Advocate

Kubernetes is great at orchestrating containers, but it can be a bit tricky to manage traffic routing. There are many options and implementations that you, as a cluster operator have probably had to deal with. We have the default Service resource that can be used to expose applications, but it is limited to routing based on layer 4 (TCP/UDP) and does not support more advanced traffic routing use cases. There's also the Ingress controller, which enabled layer 7 (HTTP) routing, and securing the North-South traffic using TLS, but it was not standardized and each vendor implementation required learning a new set of resource annotations. When it comes to managing and securing East-West traffic between services, there's Service Mesh which is yet another layer of infrastructure to manage on top of Kubernetes. And we're in the same boat when it comes to resource management with each vendor having their own ways of doing things.

Azure Container Storage - Generally Available

· 8 min read
Saurabh Sharma
Product Manager for Cloud Native Storage initiatives

Last May, we announced the preview of Azure Container Storage with backing storage options including Ephemeral Disks, Azure Disks, and Azure Elastic SAN. Earlier today we announced the general availability of Azure Container Storage, the industry’s first platform-managed container native storage service in the public cloud providing highly scalable storage that can keep up with the demands of a containerized environment. With this announcement, Azure Disks and Ephemeral Disks are now generally available, while Azure Elastic SAN remains in preview and is expected to reach general availability soon. In this post, we'll explore the benefits of Azure Container Storage, the inspirations behind its development, and the new features in the GA release.

AKS Automatic

· 3 min read
Jorge Palma
Principal PM Lead for the Azure Kubernetes Service

You may have heard about AKS Automatic in the Build keynote today. We thought we'd share a bit of the thinking that went into it and why we think it can be a game changer for you.

Automatic is a new experience for Azure Kubernetes Service (AKS) that lets you create and manage production-ready clusters with minimal effort and added confidence. This means you can focus on developing and running your applications, while AKS handles the rest for you.

AKS - Past, Present and Future

· 11 min read
Jorge Palma
Principal PM Lead for the Azure Kubernetes Service

Hi! My name is Jorge Palma, I’m a PM Lead for AKS and I’m excited to be inaugurating our new AKS Engineering Blog. In this new blog, we will complement and extend some of our existing channels, providing extra context to announcements, sharing product tips & tricks that may not fit in our core documentation, and giving you a peak behind the curtain of how we build the product.

In this initial post, taking inspiration from Mark Russinovich who has named so many similar series and talks like this, I hope to take you on a (shortened) journey through the history of Azure Kubernetes Service. We will talk about the past, how we started and how we got here, where we are today and also some thoughts about what the future holds for AKS.