AKS Engineering Blog

Azure Monitor dashboards with Grafana in Azure Portal

Thu, 18 Sep 2025 00:00:00 GMT

Introduction

As Kubernetes adoption accelerates, engineers need streamlined, cost-effective tools for cluster observability. Until now, this often meant deploying and managing separate monitoring stacks. Azure Monitor's latest integration with Grafana changes this: cluster insights are now just a click away in the Azure portal.

We are thrilled to announce that Azure Kubernetes Service (AKS) now offers native Grafana dashboards within the Azure portal at no additional cost. This integration eliminates the complexity of maintaining separate visualization tools while delivering Grafana's powerful capabilities directly where you need them. Metrics from Container Insights, the Kubernetes metrics server, and any configured Azure Managed Prometheus endpoints are available out-of-the-box, providing comprehensive cluster observability.

To get started, navigate to your AKS cluster in the Azure portal and select Monitoring > Dashboards with Grafana (preview). You will be presented with prebuilt dashboards for cluster health, node utilization, and pod performance. From there, you can edit and add panels, configure template variables scoped to namespaces or node pools, and save custom dashboards, all within the familiar AKS management experience. Because no Grafana server needs to be provisioned or maintained, teams can quickly adopt and customize dashboards within the AKS portal, which reduces setup time and operational complexity and accelerates access to actionable insights for SRE and DevOps workflows.

Figure 1: Comprehensive AKS cluster dashboard showing resource utilization, performance metrics, and health status directly within the Azure portal.

Why Grafana in Azure Portal?

Grafana is celebrated for its rich panel types, templating engine, and client-side data transformations. Embedding it natively in Azure Portal offers:

Unified experience: No extra authentication or network configuration—just use your Azure login.
Single-pane observability: Combine Azure Metrics, Logs, and Application Insights data alongside other Azure data sources supported by Grafana.
Rapid onboarding: Spin up dashboards in minutes using familiar Azure workflows and templates. Many prebuilt dashboards are available out of the box and any community dashboard using Prometheus and/or Azure Monitor can be imported.

These capabilities mean faster troubleshooting, deeper insights, and a more consistent observability platform for Azure-centric workloads.

When to upgrade to Azure Managed Grafana?

While Dashboards with Grafana in the Azure portal cover most common visualization scenarios, Azure Managed Grafana remains the right choice for advanced use cases, including:

Extended plugin support (including open-source and community plugins).
Advanced authentication, provisioning APIs, and fine-grained access control.
Multi-cloud and hybrid data source connectivity.

When to choose native dashboards: Quick visibility into your Azure telemetry, minimal setup, with no additional costs.
When to choose Azure Managed Grafana: Large teams with complex governance, or workloads that need open-source, Enterprise, or multi-cloud data sources.

Feature Comparison

Feature	Azure Monitor dashboards with Grafana (preview)	Azure Managed Grafana
Access	Azure portal	Grafana Web Interface
Pricing	No cost	Per user pricing plus compute costs for Standard SKU
Data Sources	Azure Monitor and Azure Prometheus	Azure Monitor, Azure Prometheus, Azure Data Explorer, OSS data sources, Enterprise data sources available with license
Data source Authentication	Current-user only	User-configurable: Current-user, Managed Identity, App registration
Customization	Basic - Azure-managed plugins only	Advanced - Custom plugins, dashboards, and settings
Alerting	Azure Monitor alerts only	Full Grafana alerting capabilities
Enterprise Features	Not supported	Reporting, team collaboration, enterprise plugins
Deployment Model	Managed SaaS	Dedicated VM scale sets with private networking options

A detailed breakdown of the differences is available on the solution comparison page.

Customization and Advanced Features

The native Grafana experience in Azure Monitor includes many of the customization features you expect:

Variables & Templating: Define dashboard variables for dynamic filtering across subscriptions, resource groups, or services.
Themes & Layouts: The dashboard theme switches automatically to match the Azure portal's theme, and you can resize panels and arrange layouts with drag-and-drop.
Cross-workspace & cross-source queries: Query data from multiple Log Analytics workspaces and metrics namespaces.
Alerts: View Azure alert state and history in Grafana dashboards.
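
To illustrate how a dashboard variable drives dynamic filtering, here is a minimal sketch in Python. It only mimics the substitution step: Grafana expands variables itself, and the variable names namespace and interval, as well as the choice of panel query, are assumptions made for this example rather than anything shipped with the prebuilt dashboards.

```python
from string import Template

# Hypothetical panel query. `container_cpu_usage_seconds_total` is a standard
# cAdvisor metric; the variable names `namespace` and `interval` are
# illustrative and not taken from any prebuilt dashboard.
PANEL_QUERY = Template(
    'sum by (pod) (rate(container_cpu_usage_seconds_total'
    '{namespace="$namespace"}[$interval]))'
)


def render_query(namespace: str, interval: str = "5m") -> str:
    """Substitute dashboard variable values into the panel query."""
    return PANEL_QUERY.substitute(namespace=namespace, interval=interval)


if __name__ == "__main__":
    # Changing the variable re-targets the same panel at another namespace.
    for ns in ("payments", "kube-system"):
        print(render_query(ns))
```

The point of the sketch is that one panel definition serves every namespace: only the variable value changes, not the query.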

Getting Started

note

To view telemetry in Grafana dashboards, ensure you have at least Reader access to the relevant resource (Azure Monitor or Log Analytics workspace).

Currently, the AKS cluster entry point is gated by the enablement of Managed Prometheus. This will likely change in the future as we incorporate dashboards that leverage platform metrics or Container Insights-based logs.

The default dashboards are available on the AKS cluster page under Dashboards with Grafana.

Figure 2: Navigation view showing the collection of pre-built Grafana dashboards available in the AKS monitoring section, providing one-click access to various visualization templates.

To create a new dashboard

In the Azure portal, go to Azure Monitor > Dashboards with Grafana.
Click + New and select New Dashboard.
Provide a title for the dashboard, then select the subscription and resource group in which to create it. Click Create. Once the dashboard is created, select + Add and choose Add Visualization.
Choose data sources (Log Analytics, Metrics, Prometheus, Azure Resource Graph) and start adding panels.

This feature is available out of the box for all customers. For more information on customization, refer to the Learn documentation.

Real-world Use Cases

Troubleshooting node issues: A platform SRE spots CPU saturation on the prod-nodepool; by filtering the dashboard by node pool, they view CPU and memory trends and identify the problematic pod in under two minutes.
Analyzing API latency: A DevOps team correlates Application Insights request durations with pod-level metrics for the payments service, isolating slow endpoints and impacted pods to optimize performance.
Unified multi-cluster monitoring: A cloud architect overseeing three AKS clusters uses a single Grafana dashboard to compare node utilization and ingress traffic across regions, enabling data-driven scaling and cost decisions.
Investigating API server list bottlenecks: When the API server experiences high latency, an SRE opens the API Server Grafana dashboard to view list request counts and durations, identifying a misbehaving DaemonSet issuing excessive list calls and restoring control plane performance swiftly.
Monitoring Azure Container Networking Services: A network administrator overlays Container Networking Metrics in Grafana to track pod-to-pod latency, dropped packet rates, and network policy enforcement events—quickly isolating a misconfigured CNI plugin and ensuring secure cluster communications.

Roadmap

Building on the public preview launch of Grafana dashboards in AKS, we have many exciting features on our roadmap. Some of these are listed here:

In-portal Grafana Explore integration: Embed Grafana’s Explore mode directly in the Azure portal, providing query builders, a logs/metrics toggle, and inline documentation to help users investigate time series and log data without leaving Azure Monitor.
Expanded Azure resource support: Add native integrations for additional resources so that metrics and logs from these resources appear in Grafana dashboards.
Seamless migration tooling: Provide easy migration of dashboards and data source configurations between native Grafana dashboards and Azure Managed Grafana instances, simplifying hybrid and migration scenarios.

Frequently Asked Questions

Q: Is there any additional cost for using Grafana dashboards in Azure Monitor?
A: No, the feature is included at no additional cost with your Azure Monitor usage.

Q: Can I use custom Grafana plugins with the native dashboards?
A: Currently, only the built-in plugins are supported. For custom plugins, consider Azure Managed Grafana.

Q: Can I export dashboards created in the portal to standalone Grafana?
A: Yes, dashboards that use Prometheus and Azure data sources can be imported into standalone Grafana.

Q: Will my existing Grafana dashboards work with this feature?
A: Most dashboards using Prometheus or Azure data sources can be imported, though some adjustments may be needed.

Conclusion and Next Steps

Azure Monitor dashboards with Grafana simplify observability by bringing Grafana's power into the Azure portal. Get started today to build rich, interactive dashboards without extra infrastructure. For deeper customization or hybrid scenarios, explore Azure Managed Grafana.

Ready to dive in? Try creating your first dashboard using the default AKS Node Overview template and customize one panel to match your team's monitoring needs. The public preview is available now.

Announcing Azure Container Storage v2.0.0: Transforming Performance for Stateful Workloads on AKS

Mon, 15 Sep 2025 00:00:00 GMT

Introduction

Last year we announced the general availability of Azure Container Storage, the industry’s first platform-managed container native storage service in the public cloud. This solution delivers high performance and scalable storage that can effectively meet the demands of containerized environments. Today we are announcing a new v2.0.0 release of Azure Container Storage for Azure Kubernetes Service (AKS). It builds on the foundation of the previous release and takes it further by focusing on higher performance, lower latency, efficient resource management, and a Kubernetes-native user experience for managing stateful workloads on AKS.

Furthermore, it’s now also completely free to use, and available as an open-source version for installation on non-AKS clusters. Whether you’re running databases, AI inferencing or training or any I/O-intensive application, v2.0.0 aims to provide cloud-native storage that performs like local hardware while being easier to use than ever. In this post, we’ll dive into what’s new in Azure Container Storage v2.0.0, how it achieves its performance gains, and how you can get started using it on your AKS or self-hosted clusters.

Improved Performance with Local NVMe

Azure Container Storage v2.0.0 offers blazing fast performance, enabled by deep integration with local NVMe disks on Azure VM instances. By utilizing ephemeral NVMe disks attached to AKS nodes, Azure Container Storage can offer much higher IOPS and throughput and much lower latency than network-attached disks, while also reducing infrastructure costs. But v2.0.0 goes even further. Through improvements in the data plane and a leaner architecture (discussed later), v2.0.0 can squeeze even more performance out of local disks. In fio benchmarking tests, this version demonstrated lower, sub-millisecond latency at even higher transaction rates than the previous version. In practice, this means your databases can commit faster and your AI models can load data more quickly, which translates directly to faster application response times.

Let's look at the benchmarks. On fio, the industry standard for storage testing, Azure Container Storage with NVMe striping delivers approximately 7x higher IOPS and 4x lower latency compared to the previous version.

But how does this translate to real workloads? We tested our own PostgreSQL for AKS deployment guide and found that PostgreSQL's transactions per second improved by 60% while cutting latency by over 30%. For database-driven applications, this means faster query responses, higher throughput, and better user experiences.
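
To make those percentages concrete, the following sketch applies them to a baseline. The baseline figures used in the example run (1,000 transactions per second and 10 ms latency) are purely hypothetical and are not measurements from the benchmark; only the 60% and 30% factors come from the results above.

```python
def apply_improvement(baseline_tps: float, baseline_latency_ms: float,
                      tps_gain: float = 0.60, latency_cut: float = 0.30):
    """Project throughput and latency from relative improvements.

    tps_gain=0.60 means "+60% transactions per second";
    latency_cut=0.30 means "30% lower latency".
    """
    projected_tps = baseline_tps * (1 + tps_gain)
    projected_latency_ms = baseline_latency_ms * (1 - latency_cut)
    return projected_tps, projected_latency_ms


if __name__ == "__main__":
    # Hypothetical baseline, chosen only to make the arithmetic easy to follow.
    tps, latency = apply_improvement(baseline_tps=1000, baseline_latency_ms=10)
    print(f"projected: {tps:.0f} TPS at {latency:.1f} ms")
```

Because the post reports latency as cut by "over 30%", the projected latency here is best read as an upper bound.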

Integration with KAITO for Fast AI Model Loading

Azure Container Storage v2.0.0 isn’t just for traditional databases; it’s also proving hugely beneficial for AI and machine learning use cases on AKS. A great example is our integration with KAITO, the Kubernetes AI Toolchain Operator. KAITO is a tool that helps deploy and manage AI inference workloads (like large language models) on AKS. These AI workloads often involve huge model files (tens or hundreds of GB) that need to be loaded into GPU memory. Loading such models from remote storage can be painfully slow, becoming a bottleneck.

With this version, KAITO can accelerate model loading by using local NVMe drives. Here’s what happens: when you deploy a model with KAITO on an AKS cluster, KAITO will automatically provision an Azure Container Storage-backed striped NVMe volume on each GPU node and place the model files on that volume. Because the volume stays attached to the node (even if the pod restarts or is replaced), subsequent runs can reuse the local cached model instead of pulling it over the network again. We saw over a 5X improvement in model file loading performance when using Azure Container Storage v2.0.0 with a locally striped NVMe volume, compared to using an ephemeral OS disk.

This improvement means you can scale out AI inference pods much more quickly and respond to traffic spikes without long cold starts. It also means better utilization of expensive GPU nodes – they spend less time waiting on I/O and more time doing actual compute.

Simplified Architecture – No StoragePool CRDs, No Prometheus Hassles

Azure Container Storage v2.0.0 not only runs faster, it’s also simpler to use and manage. We took feedback from users of previous versions to remove complexity and toil from the system. Some major changes in this regard are: (1) eliminating the custom StoragePool object, and (2) removing several built-in components (like a bundled Prometheus) which were not as helpful for some of the users of previous versions.

No more StoragePool custom resource: In Azure Container Storage’s previous versions, you had to create a storage pool object to carve out storage (e.g. define a pool of NVMe devices or an Elastic SAN pool) before creating PVCs. Now you just use Kubernetes StorageClasses to describe what kind of storage you want and then create PVCs as usual. This aligns with standard Kubernetes patterns: if you know how to request a volume in K8s, you already know how to use Azure Container Storage.

Lighter footprint and no reserved resources: The architecture of the previous release included multiple controllers and node daemons (for pooling, replication, etc.), and by default it reserved a portion of each node’s CPU for its own use. Starting with v2.0.0, the architecture has been streamlined: there is a single lightweight operator plus the CSI driver components. There’s also less that can go wrong: fewer moving parts means fewer chances for a bug or crash. Single-node and two-node clusters are now supported as well, unlike previous versions, which needed 3 nodes. This allows its use in smaller development or testing environments.

No built-in Prometheus Operator: One of the pain points for some users of previous versions was that it automatically deployed a Prometheus operator (for metrics) in the cluster. If you already had your own Prometheus, this could lead to conflicts (duplicate operators fighting over Custom Resources). In v2.0.0, we removed the bundled Prometheus stack. Instead, now Azure Container Storage exposes metrics that can be scraped by Azure Monitor or an existing Prometheus, so you get full observability but without Azure Container Storage injecting its own monitoring components. This change makes it much friendlier to integrate into clusters that already have monitoring set up, and it means one less thing to manage (or upgrade). You can still deploy Prometheus if you want, or use the Azure Managed Prometheus service, and pick up Azure Container Storage metrics from the standard endpoints.

User Experience Improvements: This version does not depend on cert-manager for its webhooks (it uses a built-in certificate approach), so you don’t have to worry about certificate CRDs or renewal jobs. This version also runs in the kube-system namespace now just like the built-in CSI drivers, instead of a special namespace, which avoids issues with certain restrictive policies. In general, our goal was to make Azure Container Storage feel like a natural part of AKS as if it were just another CSI driver.
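
The StorageClass-then-PVC flow described above can be sketched as follows. This is a minimal illustration, not the official manifests: the provisioner name is an assumption that should be confirmed against the Azure Container Storage documentation, the resource names (local-nvme, data) are hypothetical, and a real deployment may need parameters or annotations that this sketch omits.

```python
import json

# Assumption: provisioner name for the local NVMe CSI driver. Confirm it
# against the Azure Container Storage documentation before relying on it.
PROVISIONER = "localdisk.csi.acstor.io"


def storage_class(name: str) -> dict:
    """A standard storage.k8s.io/v1 StorageClass describing the storage type."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": PROVISIONER,
        "reclaimPolicy": "Delete",
        # Delay binding until a pod is scheduled, because local disks are
        # tied to a specific node.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


def pvc(name: str, storage_class_name: str, size: str) -> dict:
    """A plain PersistentVolumeClaim that references the StorageClass by name."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class_name,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    manifest = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [storage_class("local-nvme"), pvc("data", "local-nvme", "100Gi")],
    }
    print(json.dumps(manifest, indent=2))
```

Nothing in the sketch is specific to a custom resource: both objects are core Kubernetes kinds, which is the point of removing StoragePool. The printed JSON could be piped to kubectl apply -f - for experimentation.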

Pricing changes

As before, you'll continue to pay for the underlying storage backend you use. But from v2.0.0 onward, Azure Container Storage no longer charges a per-GB monthly fee for storage pools larger than 5 TiB, in either the first-party managed version or the open-source version, making the service completely free to use. Provision as much storage as you need without worrying about additional management fees. This means you get enterprise-grade storage orchestration and breakthrough performance without any additional service costs, just pure value for your Kubernetes workloads.

Open-Source Foundations

Another important aspect of Azure Container Storage v2.0.0 is its open-source core. Open source is the cornerstone of AKS, and we’ve open-sourced the key components of Azure Container Storage on GitHub, including the “local CSI driver” that manages NVMe and ephemeral disks. It brings a few big benefits:

Self-hosted K8s cluster support: Perhaps most significantly, open-sourcing means that the technology behind Azure Container Storage can be used outside of AKS. For example, if you have a self-managed Kubernetes cluster on Azure VMs, you could deploy the open-source local CSI driver to get similar NVMe support.
Transparency: Users and the community can see how Azure Container Storage works under the hood. This builds trust and allows experts to review or suggest improvements.
Community contributions: By open sourcing, we invite the Kubernetes community to contribute features, bug fixes, and enhancements. Over time, this can accelerate development and ensure Azure Container Storage stays aligned with real-world needs.

What’s Next

While local NVMe is a game-changer for performance, not every workload can tolerate ephemeral storage. Some stateful applications need larger capacity and durable storage that persists even if nodes are deallocated. This is where the Azure Elastic SAN storage type comes in. Azure Container Storage is set to expand its capabilities with upcoming support for Azure Elastic SAN, a block storage option that provides optimized price-performance through dynamic resource sharing. With native remote block storage support through iSCSI, it also enables fast attach and detach of volumes. In a future update, you’ll be able to create StorageClasses for Elastic SAN and provision persistent volumes that are backed by Elastic SAN, all orchestrated through Azure Container Storage.

This means you’ll have a spectrum of options under one umbrella: ultra-fast ephemeral NVMe for when you need sheer speed (and handle durability at the app level), and Elastic SAN for when you need large, durable volumes with excellent price-performance. Both will be usable simultaneously in the same cluster through Azure Container Storage’s next release (v2.1.0).

Getting started

Ready to experience the performance boost? Here are your next steps:

• New to Azure Container Storage? Start with our comprehensive documentation

• Deploying specific workloads? Check out our updated deployment guide for PostgreSQL

• Want the open-source version? Visit our GitHub repository for installation instructions

• Have questions or feedback? Reach out to our team at AskContainerStorage@microsoft.com

Pair llm-d Inference with KAITO RAG Advanced Search to Enhance your AI Workflows

Fri, 12 Sep 2025 00:00:00 GMT

Overview

In this blog, we'll guide you through setting up an OpenAI API compatible inference endpoint with llm-d and integrating it with retrieval-augmented generation (RAG) on AKS. This blog will showcase its value in a key finance use case: indexing the latest SEC 10-K filings for two S&P 500 companies and querying them. We’ll also highlight the benefits of llm-d based on its architecture and its synergy with RAG.

Introduction

Deploying large language models (LLMs) efficiently, while leveraging private data for context-aware responses, is critical for modern AI applications. Retrieval-Augmented Generation (RAG) combines an LLM with a retriever that pulls in relevant context from your own data (documents, codebases, Wiki pages). This enables scalable, adaptive applications that can respond with domain-specific knowledge simply by updating the underlying data store.

But there’s a catch:

Setting up a RAG pipeline involves infrastructure: vector databases, LLM inference, embedding models, and orchestration. What do these components do?

RAG component	Purpose	Example
Vector store	Stores text data (documents, FAQs, etc.) in a vector format, or numerical representations of meaning. This allows the system to find relevant pieces of information, even if the user’s question uses different words than the original text (semantic search)	FAISS (Facebook AI Similarity Search) is widely used and is like a memory system that understands meaning and not just keywords
Embedding model	Converts text (like a phrase or sentence) into a vector that captures the meaning of the text. Even if the words in different searches don’t match exactly, an embedding model can produce similar vectors that indicate the system sees that they are semantically close	Sentence-BERT (sBERT) and HuggingFace embedding models
Retriever + LLM	The retriever finds useful information to help the LLM give a more accurate or up-to-date answer. Together, they make the RAG system flexible and grounded in real, relevant data and not just what the model memorized.	LlamaIndex and LangChain offer open-source retrievers that are useful for different types of data
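
To show what "finding relevant text by meaning" looks like mechanically, here is a toy semantic search in plain Python. The three-dimensional vectors are hand-made stand-ins for real embeddings, and the in-memory dictionary stands in for a vector store such as FAISS; neither reflects how RAGEngine is implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings". A real embedding model would produce
# vectors with hundreds of dimensions; these values are illustrative only.
VECTOR_STORE = {
    "Quarterly revenue grew 12% year over year.": [0.9, 0.1, 0.0],
    "The office cafeteria now serves breakfast.": [0.0, 0.2, 0.9],
    "Net sales increased compared with last year.": [0.8, 0.3, 0.1],
}


def top_k(query_vector, k=2):
    """Return the k stored texts whose vectors are closest to the query."""
    ranked = sorted(
        VECTOR_STORE.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Pretend this vector is the embedding of "How did income change?".
    for text in top_k([1.0, 0.2, 0.0]):
        print(text)
```

The two finance sentences share few words, yet both rank above the unrelated one because their vectors point in a similar direction. That is the property the vector store and embedding model provide together.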

This is where Kubernetes AI Toolchain Operator (KAITO) RAGEngine brings cloud-native agility to AI application development. KAITO is a CNCF Sandbox project that makes it easy to deploy, serve, and scale LLMs on Kubernetes, without needing to become a DevOps expert. Using RAGEngine, you can quickly stand up a service that indexes documents and queries them in conjunction with an existing LLM inference endpoint. This enables your large language model to answer questions based on your own private content.

By automatically configuring and orchestrating the RAG pipeline on Kubernetes, KAITO lets developers focus on building high-impact AI apps, while the engine helps cluster operators and platform engineers handle scaling, rapid iteration, and real-time data grounding.

The RAGEngine preset gives you an end-to-end RAG pipeline out of the box, including:

FAISS as the default, configurable vector store
BAAI/bge-small-en-v1.5 as the default, configurable embedding model to index your documents
llama_index as the LLM-based document retrieval framework
Any OpenAI API compatible LLM inference endpoint to process retrieved documents as context and user queries in natural language
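
The last item in that list, passing retrieved documents as context alongside the user query, can be pictured with the sketch below. It is a simplified illustration only: RAGEngine and llama_index construct their own prompts, and the function name and system-prompt wording here are hypothetical.

```python
def build_messages(question, retrieved_chunks):
    """Combine retrieved text and the user question into chat messages.

    The shape (a list of role/content dicts) follows the OpenAI chat
    completions convention; the prompt wording is an illustrative choice.
    """
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    system = (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


if __name__ == "__main__":
    messages = build_messages(
        "How did revenue change?",
        ["Quarterly revenue grew 12% year over year."],
    )
    for message in messages:
        print(message["role"], "->", message["content"])
```

Updating the indexed documents changes what lands in the context, which is how a RAG system picks up new knowledge without retraining the model.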

In this blog, the inference endpoint will be provisioned via the llm-d framework.

note

Both llm-d and RAGEngine are deployed as open-source solutions in the following example, and are not currently managed on AKS.

Quick vocab check

Before diving in, here's a quick breakdown of terms used with regard to llm-d that will clarify the steps ahead:

Prefill Stage: The initial phase of LLM inference where the model processes the complete input prompt, computing attention and embeddings to establish the internal context for generation.
Decode Stage: The autoregressive phase of LLM inference where the model generates output tokens sequentially, one at a time, based on the context from the prefill stage.
Prefill/Decode (P/D) Disaggregation: The optimization technique of distributing the computationally intensive prefill stage and the lighter, iterative decode stage across separate hardware resources to enhance efficiency and inference speed.
KV Cache (Key-Value Cache): Stores key and value tensors from the prefill stage’s attention computations, enabling the decode stage to reuse these results for faster token generation, with reduced computational overhead.
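
To see why the KV cache matters, the sketch below counts how many tokens the model has to process to generate a response, with and without a cache. It is a deliberately simplified cost model and an assumption of this example: it counts tokens fed through the model per step and ignores the cost of attention itself, which still grows with context length.

```python
def tokens_processed_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Each step re-processes the whole context because nothing is reused."""
    total = 0
    context_len = prompt_len
    for _ in range(new_tokens):
        total += context_len   # recompute keys/values for every context token
        context_len += 1       # the generated token joins the context
    return total


def tokens_processed_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Prefill fills the cache once; each later step feeds only one token."""
    if new_tokens == 0:
        return 0
    prefill = prompt_len       # one pass over the prompt yields the first token
    decode = new_tokens - 1    # every further step processes just the newest token
    return prefill + decode


if __name__ == "__main__":
    prompt_len, new_tokens = 1000, 100   # e.g. a long RAG prompt, short answer
    print("without cache:", tokens_processed_without_cache(prompt_len, new_tokens))
    print("with cache:   ", tokens_processed_with_cache(prompt_len, new_tokens))
```

With a cache, nearly all of the work sits in the prefill pass, while decode is many small steps. That asymmetry is what makes it attractive to run the two stages on separate hardware.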

Benefits of llm-d and its intersection with RAG

The llm-d framework, built on open-source technologies like vLLM, Gateway API Inference Extension (GAIE), and NIXL, is a Kubernetes-native distributed inference serving stack for serving LLMs at scale. As detailed in the llm-d documentation, it provides several key benefits, particularly when paired with RAG workflows, which often involve long context to keep LLMs up to date.

llm-d feature	What it does	Benefits to RAG workflows
Prefill/Decode (P/D) Disaggregation	Separates the compute-heavy prefill stage (KV cache building) from the decode stage (autoregressive token generation) on dedicated GPUs, where each GPU pool can be independently scheduled and scaled	Optimizes throughput for long contexts; prevents resource contention for concurrent requests when processing long RAG queries; enables flexible scaling and lowers time-to-first-token (TTFT) and improves overall token output time (TPOT)
Intelligent routing (via Gateway API Inference Extension's Endpoint Picker)	Schedules requests based not only on server queue length and available KV cache size, but also prefix cache hit probability. Maximize the reuse of KV cache for queries with overlapping system prompts and common context retrieved from the vector store	Minimizes latency; enables efficient reuse of KV cache for similar or repeated contexts; handles RAG queries of mixed context length and structure robustly
Disaggregated KV cache management	Offloads KV cache across local or remote stores, with orchestration from decode-sidecar and advanced cache eviction policies for efficient memory use	Supports much longer input contexts and multiple concurrent sessions with reduced GPU memory overhead, enabling scalability for large RAG pipelines
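
The intelligent routing row can be illustrated with a toy scorer that trades prefix-cache reuse against queue length. This is not the Endpoint Picker's actual algorithm: the Endpoint class, the scoring formula, and the queue_penalty weight are all hypothetical, chosen only to show the trade-off.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    queue_length: int        # requests currently waiting on this server
    cached_prefixes: tuple   # token-id tuples whose KV cache is resident


def shared_prefix_len(a, b) -> int:
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def score(endpoint: Endpoint, prompt_tokens, queue_penalty: float = 2.0) -> float:
    """Reward reusable cached prefix tokens, penalize waiting requests."""
    best_hit = max(
        (shared_prefix_len(prompt_tokens, p) for p in endpoint.cached_prefixes),
        default=0,
    )
    return best_hit - queue_penalty * endpoint.queue_length


def pick_endpoint(endpoints, prompt_tokens) -> Endpoint:
    return max(endpoints, key=lambda e: score(e, prompt_tokens))


if __name__ == "__main__":
    system_prompt = (1, 2, 3, 4, 5, 6, 7, 8)   # shared across RAG queries
    prompt = system_prompt + (42, 43)          # plus the user-specific part
    endpoints = [
        Endpoint("warm-but-busy", queue_length=3, cached_prefixes=(system_prompt,)),
        Endpoint("cold-but-idle", queue_length=0, cached_prefixes=()),
    ]
    print(pick_endpoint(endpoints, prompt).name)
```

RAG requests tend to share a system prompt and overlapping retrieved context, so a cache-aware router can often skip much of the prefill work. Once a warm server's queue grows long enough, the idle server wins instead.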

Let’s get started: KAITO RAGEngine backed by llm-d with P/D Disaggregation

With the prerequisites covered, this guide will dive into the creation of two distinct but related endpoints:

Inference Endpoint: an OpenAI API compatible inference service in Kubernetes, created by the llm-d stack. Jump to the llm-d inference endpoint section in our GitHub repository to set it up.
RAG Endpoint: a RAG service provisioned by KAITO, with its inference service pointed at the inference endpoint above, so that users can efficiently index and query their documents. Check out the steps in this RAG endpoint GitHub cookbook section to spin up and verify your RAGEngine workspace, which includes a YAML manifest that looks like:

apiVersion: kaito.sh/v1alpha1
kind: RAGEngine
metadata:
  name: ragengine-llm-d
spec:
  compute:
    instanceType: "Standard_D2s_v4"
    labelSelector:
      matchLabels:
        node.kubernetes.io/instance-type: Standard_D2s_v4
  embedding:
    local:
      modelID: "BAAI/bge-small-en-v1.5"
  inferenceService:
    url: "http:///v1/chat/completions"

note

This RAG service only requires general-purpose compute, like the Azure Standard_D2s_v4 SKU shown above, which often provides a more cost-effective and accessible alternative to the GPU-intensive process of continuous fine-tuning.

Practical Example: Indexing and Querying 10-K Filings

Now, we'll pair llm-d inference with KAITO RAGEngine to index the latest SEC 10-K filings of NVIDIA and Berkshire Hathaway in PDFs, allowing us to ask questions and quickly extract key financial and strategic insights.

Investors, analysts, and researchers benefit from this approach by bypassing manual document review and accessing accurate, up-to-date information through natural language queries - all within their Kubernetes cluster!

Stepping through this finance cookbook example, you can port-forward the RAGEngine service to access your endpoint locally, and specify the context of the 10-K filings index as follows:

kubectl port-forward svc/ragengine-llm-d 8000:80

curl -X POST http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
    "index_name": "10-k",
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "What was the revenue of NVIDIA from 2024 to 2025?"
      }
    ], "max_tokens": 100
   }' | jq -r '.choices[0].message.content'

Which results in the output:

According to the provided documents, NVIDIA's revenue for the years
ended January 26, 2025 and January 28, 2024 were:
* Year Ended January 26, 2025: $130,497 million
* Year Ended January 28, 2024: $60,922 million

This looks accurate, when compared to Page 38 of NVIDIA’s 10-K:

Let's take it one step further and request a more complex summary from Berkshire Hathaway context in our request:

curl -X POST http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
    "index_name": "10-k",
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Summarize Warren Buffett’s letter in BRK-B’s latest 10-K."
      }
    ], "max_tokens": 200
   }' | jq -r '.choices[0].message.content'

We get the following response:

In his letter, Warren Buffett discusses Berkshire Hathaway's
performance and strategy. He notes that the company has made
mistakes, but emphasizes the importance of acknowledging and
correcting them. Buffett highlights Berkshire's core business,
property-casualty insurance, and its unique financial model.
He also shares a personal anecdote about Pete Liegl, the founder
of Forest River, and explains Berkshire's approach to investing
in both controlled businesses and marketable equities.

We can also verify this response by looking at the story of Pete Liegl on Page 6 of BRK-B’s 10-K:

Great! We've built a system that makes it easy to search and understand complex financial documents, just by asking questions in plain English. Instead of manually combing through hundreds of pages, analysts and researchers can now get clear, accurate answers in a matter of seconds. Behind the scenes, this llm-d inference and KAITO RAGEngine system is designed to handle large and complex documents quickly and efficiently, giving fast responses even when many people are using it at once.
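
For readers who prefer a script to curl, the same request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl calls above and assumes the port-forward to localhost:8000 is still running; the helper names build_request and ask are hypothetical.

```python
import json
import urllib.request

# Same address the curl examples use, reachable through kubectl port-forward.
RAG_URL = "http://localhost:8000/v1/chat/completions"


def build_request(question, index_name="10-k",
                  model="meta-llama/Llama-3.1-8B-Instruct", max_tokens=100):
    """Build the POST request with the same JSON body as the curl example."""
    payload = {
        "index_name": index_name,
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        RAG_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, **kwargs):
    """Send the question and return the text that jq extracts in the curl example."""
    with urllib.request.urlopen(build_request(question, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (needs the port-forward from the walkthrough to be active):
# print(ask("What was the revenue of NVIDIA from 2024 to 2025?"))
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before it reaches the RAGEngine service.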

Next steps

Now that you've deployed an OpenAI-compatible endpoint using llm-d and integrated it with KAITO RAGEngine on AKS, you're well-positioned to scale this setup for enterprise use cases. Here’s how to continue building on what you’ve learned:

Dynamically scale your llm-d inference by creating a Kubernetes Event-Driven Autoscaling (KEDA) ScaledObject based on key vLLM metrics.
Introduce an automated data processing pipeline to index more extensive data efficiently as your RAG system grows over time.
Stay up-to-date on the latest releases of the llm-d project!
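
As background for the autoscaling step, KEDA drives a Kubernetes Horizontal Pod Autoscaler, which computes the replica count from the ratio of the observed metric to its target. The sketch below applies that rule to a queue-depth style metric; the specific numbers, the replica bounds, and the choice of "waiting requests per replica" as the metric are assumptions for illustration, and the real autoscaler also applies a tolerance and stabilization window that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """HPA rule: ceil(currentReplicas * currentMetric / targetMetric), clamped."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical situation: 2 decode replicas, each averaging 12 waiting
    # requests, against a target of 5 waiting requests per replica.
    print(desired_replicas(current_replicas=2, current_metric=12, target_metric=5))
```

Choosing a metric that reflects queued work rather than CPU usage tends to suit inference workloads, since GPUs can be saturated while CPU utilization stays low.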

Observe Smarter: Leveraging Real-Time insights via the AKS-MCP Server

Wed, 20 Aug 2025 00:00:00 GMT

Introduction

Recently, we released the AKS-MCP server, which enables AKS customers to automate diagnostics, troubleshooting, and cluster management using natural language. One of its key capabilities is real-time observability using the inspektor_gadget_observability MCP tool, which leverages a technology called eBPF to help customers quickly inspect and debug applications running in AKS clusters.

note

If you haven't already seen it, we recently published an announcement video for the AKS-MCP Server!

For example, with the AKS-MCP Server, you can simply ask GitHub Copilot:

Prompt: Can you give me the top 3 pods with highest traffic in the AKS cluster?

and it can capture real-time traffic and analyze the data to provide insights like:

Background

Let’s start with why we built the inspektor_gadget_observability MCP tool and how it fits into the AKS-MCP server. If you are curious and want to skip the background, you can jump directly to the Getting Started section.

While working in the Linux and Kubernetes observability spaces, we’ve come to better understand challenges customers face, including:

Planning troubleshooting: While there are plenty of tools for capturing low-level system and network data, picking the right one and mapping out an effective troubleshooting plan isn’t always easy. It takes experience and a good understanding of what each tool provides and how to combine them effectively.
Capturing real-time data: Getting real-time data adds another layer of complexity. It often requires specialized tooling and careful coordination so that important events aren’t missed.
Data analysis: Collecting data is powerful but analyzing it is hard. Sifting through raw telemetry from multiple sources can quickly become overwhelming, making it difficult to extract meaningful insights in time.

Now imagine facing these challenges during a production outage—it only makes everything harder. Even before the recent wave of AI, LLMs, and MCPs, we were already tackling these issues through an open-source project: Inspektor Gadget. With Inspektor Gadget, it became possible to run, for example, a networking gadget to capture traffic, a syscall gadget to monitor system calls, or a file gadget to track file operations—all with Kubernetes context built in. To help address the above issues more simply, we created the Inspektor Gadget MCP Server, which allowed us to leverage LLMs for planning, capturing, and analyzing real-time data. And to take it a step further, we realized a few more things were needed:

More tools: To be useful, the LLM needed more context than just workload data— it needed to understand the cluster, the cloud it was running on, and be able to check the state of both Kubernetes and cloud infrastructure.
Broader reach: To make this valuable for a wider range of users, we began exploring potential integrations with other projects.

This led us to the AKS-MCP server, which combines the capabilities of the Inspektor Gadget, Kubernetes, and AKS tools for a unified experience for AKS customers.

I’d like to thank the Azure/aks-mcp team for their quick reviews and valuable feedback on integrating the inspektor_gadget_observability MCP tool into the AKS-MCP server.

note

If you’re interested in contributing to the project, check out the contributing guide.

Getting Started

From here, we’ll walk you through how the inspektor_gadget_observability MCP tool, along with other MCP tools in the AKS-MCP server, helps reduce these pain points. Let's review this tool's features and then explore some real-world examples.

note

You can get a complete overview of the inspektor_gadget_observability MCP tool in the aks-mcp repository.

The easiest way to start with the AKS-MCP server is by using the VS Code Extension. After installing the server with the extension, let's configure it to test a few features:

note

AKS-MCP Server is also available via Docker MCP Toolkit.

Here, we added the readwrite access level to deploy Inspektor Gadget, limiting it to the gadget namespace (where Inspektor Gadget will be installed) and the mcp-demo namespace (where we will create pods for testing).

note

AKS-MCP server allows restricting to specific Kubernetes namespaces if needed.
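For readers who want a concrete picture of such a configuration, here is a minimal sketch that generates a VS Code style `mcp.json` entry for the settings described above. The `servers` layout, the `aks-mcp` command name, and the `--access-level` and `--allowed-namespaces` flags are assumptions made for illustration; confirm them against `aks-mcp --help` and the file the extension generated for you.

```python
import json

VALID_ACCESS_LEVELS = {"readonly", "readwrite", "admin"}


def aks_mcp_server_entry(access_level, namespaces):
    """Build one MCP server entry; flag names are assumptions, not verified."""
    if access_level not in VALID_ACCESS_LEVELS:
        raise ValueError(f"unknown access level: {access_level}")
    args = ["--access-level", access_level]
    if namespaces:
        # An empty list would mean no namespace restriction in this sketch.
        args += ["--allowed-namespaces", ",".join(namespaces)]
    return {"type": "stdio", "command": "aks-mcp", "args": args}


# readwrite access, limited to the two namespaces used in this post.
config = {
    "servers": {
        "aks-mcp": aks_mcp_server_entry("readwrite", ["gadget", "mcp-demo"]),
    }
}
print(json.dumps(config, indent=2))
```

Switching back to read-only later would then be a matter of regenerating the entry with `"readonly"`.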

To get started, we gave the following prompt to GitHub Copilot:

Prompt: I am trying to showcase AKS-MCP features so I want you to start network tracing in mcp-demo namespace, then create client and server pods in mcp-demo namespace. Finally, perform some testing and get me a brief overview?

Copilot started by using the AKS-MCP server and the deploy action of the inspektor_gadget_observability tool to deploy Inspektor Gadget. It then started a gadget for tracing TCP connections, created an nginx server pod, and finally created a client pod. See how Copilot comes up with a clear plan, addressing one of the pain points mentioned above:

tip

We used Claude Sonnet 4 for this interaction, but the plan could be different with other models. You can always include more information in the prompt to get a specific plan.

It then continued to perform a few more steps and came back with the following analysis:

Capturing real-time traffic and analyzing the data immediately helps to address key issues mentioned earlier. Notice we needed to deploy Inspektor Gadget as part of the troubleshooting.

note

You can manually deploy Inspektor Gadget using this guide or learn more about it via the Microsoft Learn guide.

After Inspektor Gadget is deployed either by MCP server or manually, you can safely switch to readonly access since readwrite is not required for capturing real-time traffic.

Also, you can use the following prompt (e.g. using slash command in VS Code) to fetch cluster metadata, which is useful for understanding the cluster context:

Now we'll review some practical examples of these features.

Troubleshooting Connectivity Issues

To test the effectiveness of AKS-MCP Server in real-life projects, our initial test focused on checking the health of Inspektor Gadget, a project we're highly familiar with and actively working on.

Prompt: Can you check if there are any connectivity issues with Inspektor Gadget pods?

It started by running multiple tools (inspektor_gadget_observability, kubectl_resources, and kubectl_diagnostics) and came back with the following summary:

tip

Select only the MCP tools needed for your session to help Copilot manage context better.

There weren’t any major issues, but it appears the DaemonSet was using an outdated Prometheus port for metrics. This was a real issue that we somehow missed in the recent migration. The fact that the AKS-MCP server caught a real issue was a positive sign.

Uncovering Hidden Bugs in Workloads

Next, we wanted to conduct an in-depth analysis by examining an Inspektor Gadget pod for any potential performance issues:

Prompt: Can you observe system calls for the pod gadget-k4n8b in the gadget namespace for a few seconds? I want to understand why it might be slow

By default, gadget output larger than 64 KB is truncated, so we targeted a specific pod and namespace to limit the data. Copilot started the tool with:

And came back with the following analysis:

No performance issues were identified (thankfully); however, a permission issue was found with wasm-cache. This issue had been overlooked in certain environments previously due to inaccurate logging. Again, finding a hidden bug in an actual system was really encouraging.

tip

If Copilot doesn’t pinpoint the issue, providing application context or specifying an expected outcome helps.

Identifying Slow DNS Resolution

For the final example, we deployed otel-demo in our AKS cluster. The goal here was to start monitoring DNS over a certain period and see if we can detect any issues over time.

Prompt: Can you continuously monitor DNS queries taking more than 1s in the AKS cluster?

We are using the keyword continuously to guide Copilot to keep the gadget running. It started observing DNS queries with a minimum latency of 1 second, as specified in the prompt:

After waiting for a few minutes, we asked Copilot to get results for the running gadget:

Prompt: Can you get me results for already running gadgets monitoring DNS queries in the AKS cluster?

From here, we can see a list of slow DNS queries, the pods and namespaces associated with the queries, the latency, and the response code. Again, you can ask follow-up questions focusing on a specific DNS request or checking the health of CoreDNS in the context of these requests.
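To make the shape of this analysis concrete, here is a small sketch of the kind of filtering and grouping involved: keep only queries at or above a latency threshold, then summarize them per pod. The event fields and sample values are invented for illustration and do not reflect Inspektor Gadget's actual output schema.

```python
from collections import defaultdict

# Hypothetical DNS events; field names and values are made up for this sketch.
SAMPLE_EVENTS = [
    {"namespace": "otel-demo", "pod": "frontend-1",
     "name": "checkout.otel-demo.svc.cluster.local.", "latency_ms": 1850, "rcode": "Success"},
    {"namespace": "otel-demo", "pod": "frontend-1",
     "name": "example.com.", "latency_ms": 12, "rcode": "Success"},
    {"namespace": "otel-demo", "pod": "cart-7",
     "name": "redis.otel-demo.svc.cluster.local.", "latency_ms": 2400, "rcode": "ServerFailure"},
    {"namespace": "otel-demo", "pod": "frontend-1",
     "name": "ad.otel-demo.svc.cluster.local.", "latency_ms": 1000, "rcode": "Success"},
]


def slow_queries(events, min_latency_ms=1000):
    """Keep events whose latency is at least the threshold (1 second by default)."""
    return [e for e in events if e["latency_ms"] >= min_latency_ms]


def group_by_pod(events):
    """Group events under a (namespace, pod) key."""
    grouped = defaultdict(list)
    for event in events:
        grouped[(event["namespace"], event["pod"])].append(event)
    return dict(grouped)


slow = slow_queries(SAMPLE_EVENTS)
for (namespace, pod), pod_events in sorted(group_by_pod(slow).items()):
    worst = max(e["latency_ms"] for e in pod_events)
    print(f"{namespace}/{pod}: {len(pod_events)} slow queries, worst {worst} ms")
```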

note

We recently published a blog on using Inspektor Gadget for DNS issues. All the troubleshooting steps from that blog can be performed using the AKS-MCP server.

The ability to capture problematic requests over time can help solve intermittent issues that are hard to reproduce. It not only speeds up root cause analysis but also reduces the guesswork involved in resolving complex problems.

What's Next?

This brings us to the end of our blog post. As the MCP ecosystem continues to evolve, we will keep improving the inspektor_gadget_observability MCP tool in the AKS-MCP server. We invite you to give it a try and reach out to the Azure/aks-mcp team with any feedback, questions, or ideas for the project roadmap!

Announcing the CLI Agent for AKS: Agentic AI-powered operations and diagnostics at your fingertips

Fri, 15 Aug 2025 00:00:00 GMT

At KubeCon India earlier this month, the AKS team shared our newest Agentic AI-powered feature with the broader Kubernetes community: the CLI Agent for AKS. CLI Agent for AKS is a new AI-powered command-line experience designed to help Azure Kubernetes Service (AKS) users troubleshoot, optimize, and operate their clusters with unprecedented ease and intelligence.

Built on open-source building blocks — including the CNCF-pending HolmesGPT agent and the AKS Model Context Protocol (MCP) server — the Agentic CLI brings secure, extensible, and intelligent agentic workflows directly to your terminal.

We have been working on this experience for the last few months, starting with a focus on the number one pain point for most Kubernetes users: troubleshooting and diagnosing issues in their environments. We are currently providing early access to a limited set of users to collaborate closely with and gather feedback. If you are interested in participating, please fill out our sign up form.

Why We Built This?

AKS's mission is to "enable developers, SREs, DevOps and platform engineers to do more with AKS." AI is the single biggest force multiplier we have seen in a generation, and we wanted to harness the power of this foundational technology in a responsible, secure, and focused way. We want to put this technology in the hands of our users to solve complex and difficult problems like troubleshooting, optimizing cost, and managing configuration and decision overload.

The first tradeoff we faced was whether we should solve for a breadth of use cases, such as all AKS interactions, or depth of usefulness by targeting focused and specific scenarios. We are striking a balance by going for depth in troubleshooting, the most common and proven problem area and the one where we see agentic AI as most promising (in fact, internally we first called this project the "AKS AI troubleshooter"). We focused on four main types of troubleshooting problems: networking/DNS, pod scheduling, node health, and cluster CRUD issues and failures. In parallel, we are also building a table-stakes experience for general Kubernetes and AKS use cases since we see agentic AI being widely useful across many different everyday scenarios. Rest assured, we have every intention to cover most of the AKS user workflows and use cases, and this is where your feedback is invaluable.

Troubleshooting in Kubernetes is notoriously complex. AKS customers from cloud-native startups to large enterprises face several recurring challenges. One example is the overwhelming signal fragmentation and struggle to correlate metrics, logs, and traces across layers and tools. This is exacerbated without the deep Kubernetes and Azure expertise needed to interpret all of these cluster signals. Troubleshooting is further complicated by the need to manually wrangle multiple tools, leading to high mean-time-to-resolution (MTTR) and avoidable support costs. Existing tools can surface raw data but lack built-in intelligence to guide users through diagnosis and resolution, making AI-powered assistance both timely and essential.

The AKS Agentic CLI is designed to solve these problems to reduce downtime, bridge the knowledge gap, and empower users to troubleshoot and manage their AKS environments with confidence.

Built on Open Source: HolmesGPT + AKS-MCP

Another question we wrestled with was: should we build something proprietary or build/contribute in the Open Source community? This was a simple decision because "Working in the Open Source community" is one of the core product pillars for AKS, so that is what we did.

The agent framework - HolmesGPT: HolmesGPT is an open-source agentic AI framework that performs root cause analysis (RCA), executes diagnostic tools, and synthesizes insights using natural language prompts. Before deciding on an open-source project to work with, we conducted thorough due diligence on several existing open-source solutions and even built internal prototypes. In the end, we decided to work with the team at Robusta.dev on HolmesGPT for several reasons:

Highly extensible architecture with built-in support for modular toolsets, MCP servers, and custom runbooks
Extensible and comprehensive prompts tailored for Kubernetes environments
An active and collaborative open-source community

The Microsoft AKS team is now a co-maintainer of HolmesGPT and Robusta has kindly donated it to CNCF as a Sandbox project. We welcome you all to join this community and contribute: HolmesGPT.

The tools and capabilities - AKS-MCP Server: The AKS Model Context Protocol (MCP) server provides a secure, protocol-first bridge between AI agents and AKS clusters. It exposes Kubernetes and Azure APIs, observability signals, and diagnostic tools to AI agents via a standardized interface. Today, you can use AKS-MCP (or any MCP server of your choosing) in combination with HolmesGPT (learn more in the HolmesGPT remote MCP server docs), and we will add a more seamless integration as we add more functionality and best practice knowledge into the AKS-MCP project.

Together, these components form a lego-block architecture that allows users to plug in their preferred AI providers, observability tools, and cluster configurations all while maintaining full control over data and execution.

Designing for Safety: Why We Started with a CLI Experience

While our long-term vision is to build an autonomous, AI-powered autohealing system (a true “SRE-as-a-service” for AKS), our first step is intentionally cautious.

In production environments, the cost of failure is high. Automated actions without human oversight can lead to unintended disruptions, especially when AI agents misinterpret telemetry or act on incomplete data. Recent public incidents across the industry have reaffirmed this tenet: autonomy without accountability is risky.

That’s why we chose to begin with a human-in-the-loop CLI experience.

The AKS Agentic CLI is designed to assist, not replace, the Kubernetes cluster operator. The agent synthesizes insights, runs diagnostics, and recommends actions but leaves the final decision to the user. This approach ensures:

Transparency: Users see exactly what tools were run and what data was analyzed.
Control: No changes are made to the cluster without explicit user permissions.
Trust: AI outputs are grounded in real telemetry and presented with supporting evidence.

This model allows us to validate the AI’s reasoning, gather feedback, and iterate safely while laying the foundation for more autonomous workflows in the future.

Security and privacy are core to the Agentic CLI experience:

Runs locally: All diagnostics and data collection are performed on the user’s machine. Data is sent only to the user-configured LLM; it is not sent to or stored in AKS systems.
Uses Azure CLI Auth: Inherits Azure identity and RBAC permissions from the user, ensuring access only to authorized resources.
Bring Your Own AI: Users configure their own AI provider (OpenAI, Azure OpenAI, Anthropic, etc.) so no user data is retained by Microsoft. Users can bring their own LLMs approved by their organization - including Azure OpenAI instances deployed in their own subscriptions and virtual network.

🔌 Extensible and Customizable

The Agentic CLI is designed to adapt to your environment:

Custom Toolsets: Easily configure integrations with Prometheus, Datadog, Dynatrace, or proprietary observability platforms.
Runbook Plugins: Add your own troubleshooting workflows or use community-contributed ones.
MCP Server Support: Expand capabilities by connecting to AKS-MCP or other MCP servers for advanced diagnostics, including AppLens detectors, Azure Monitor, and debug pod deployment.

How to Get Started

Once you have signed up for the limited preview, we will reach out to everyone in batches and provide access to the CLI installation guide, documentation and next steps.

Once you have access, get started with the following command to see what commands and capabilities the CLI Agent for AKS has to offer:

az aks agent --help
# or ask a question directly:
az aks agent "how is my cluster [Cluster-name] in resource group [Resource-group-name]"

Here are a few more examples of different ways you can use the CLI Agent:

🧠 Node NotReady

Diagnose kubelet crashes, CNI failures, and resource pressure:

az aks agent "why is one of my nodes in NotReady state?"

🌐 DNS Failures

Identify CoreDNS issues, NSG misconfigurations, and upstream DNS problems:

az aks agent "why are my pods failing DNS lookups?"

🕵️ Pod Scheduling Failures

Detect resource constraints, affinity mismatches, and zone limitations:

az aks agent "why is my pod stuck in Pending state?"

🔄 Upgrade Failures

Pinpoint PDB violations, quota issues, and IP exhaustion:

az aks agent "my AKS cluster is in a failed state, what happened?"

General CloudOps and Optimizations

az aks agent "how can I optimize the cost of my cluster?"

Each scenario is powered by AI-driven reasoning, tool execution, and actionable recommendations to bridge the gap between raw telemetry and real insights.

🌐 Vision: Omnichannel AI Across AKS Interfaces

The CLI Agent for AKS is just the beginning. We want to be where our customers are, and we understand that every user has tooling preferences: some prefer command-line interfaces, others use the AKS VS Code Extension, and others use Azure Copilot. Hence, our long-term vision is to integrate with all of our users' touchpoints so that they get a consistent and comprehensive experience wherever they are. Some of our next focus areas include:

Azure Portal: Integrated agentic capabilities such as diagnostics and operations via Copilot and Diagnose & Solve.
Visual Studio Code: One-click troubleshooting via the AKS VS Code Extension and MCP integration.

This omnichannel strategy ensures that every AKS user, whether developer, operator, or SRE, can access intelligent troubleshooting wherever they work.

📣 Join the Preview

We’re actively gathering feedback and iterating through our Limited Preview (aka.ms/cli-agent/signup) before officially launching the CLI Agent for AKS. Please feel free to share your experience with the CLI Agent or AKS-MCP via GitHub issues or through our feedback form.

💬 Final Thoughts

The AKS Agentic CLI represents a major step forward in making Kubernetes operations more accessible, intelligent, and secure. By combining open-source innovation with Azure-native integrations, we’re empowering every AKS user to troubleshoot faster, reduce downtime, and focus on what matters most: building great applications.

Stay tuned for more updates as we expand capabilities, integrate with managed experiences, and bring AI-powered troubleshooting to every corner of the AKS ecosystem!

Announcing the AKS-MCP Server: Unlock Intelligent Kubernetes Operations

Wed, 06 Aug 2025 00:00:00 GMT

We're excited to announce the launch of the AKS-MCP Server, an open source Model Context Protocol (MCP) server designed to make your Azure Kubernetes Service (AKS) clusters AI-native and more accessible for developers, SREs, and platform engineers through Agentic AI workflows.

AKS-MCP isn't just another integration layer. It empowers cutting-edge AI assistants (such as Claude, Cursor, and GitHub Copilot) to interact with AKS through a secure, standards-based protocol—opening new possibilities for automation, observability, and collaborative cloud operations.

The Problem: Why Do We Need MCP Now?

The biggest pain point facing modern AI assistants is not their reasoning or language abilities, or the difficulty of writing good prompts (although that is still important), but the fragmented, brittle context in which they operate. Organizations today are struggling with:

Siloed information and workflows: AI can only act on what it can see, which is often just a fraction of what's truly relevant. For example, kubectl commands may not be enough for Claude Code to figure out the issues you are facing, and you may not want to paste pictures of several dashboards, log lines, etc. into your coding agent.
Complex, expensive integrations: Every new data source, tool, or AI agent means more custom connectors, patches, and technical debt. Environments and tooling evolve; you might be using ArgoCD, Istio, or other open-source or third-party tooling in your workflows. How can users extend the AI agent's capabilities as their environment evolves, coding one connector at a time?
Stunted automation and insight: When assistants lack rich, real-time access to infrastructure, APIs, and data sources, their utility is massively limited. For example, GitHub Copilot can tell you how to get the health of your AKS clusters from Azure APIs, or it may ask for resource information such as subscription ID and resource group before it can really start helping you. It often misses the connections between resources that it needs to perform truly autonomous diagnostics.

Our thought process

What we felt was needed here is a building block for real context engineering, i.e. "the art of providing all the context for the task to be plausibly solvable by the LLM." - Tobi Lutke. The rise of, and standardization around, the open Model Context Protocol (MCP) opens the door for us to start realizing this vision.

Our team thought hard about what the tenets of an AKS AI experience should be and how to realize them. We had several tradeoffs to navigate, and this is what we narrowed it down to:

Open-source and community driven: We launched aks-mcp as an open-sourced (under MIT license) project not just because we wanted to tap into the rich cloud native open source community, but also since security, trust and transparency are top of mind for our users and us. By opening up the source code, and enabling users to contribute, we believe we are closer to that goal. This approach aligns with AKS's commitment to open development, contribution to the community, and cloud-native flexibility for users of all sizes.
Plug-and-play AI agent support: We want to be where our customers are, so the solutions we build should work out of the box with Claude, Cursor, GitHub Copilot, and more, letting you use the assistants you already love. Hence our investment in the key LEGO building blocks of a best-in-class AI agent experience for Kubernetes. Step one in that process was building an MCP server for AKS that has the depth of tools needed to provide detailed context, and integrating it into the AKS VS Code extension v1.6.12. But we will not stop here; stay tuned for other announcements!
Speed over Perfection: The underlying technology for AI agents is changing fast, as demonstrated by the growth of MCP servers and even by changes within the protocol itself, such as the deprecation of the SSE transport in favor of streamable HTTP. This highlights the need to move fast and to trust users' understanding of the evolving ecosystem. That is why we started by releasing the binaries and docker images on GitHub, so that users can run it locally or in-cluster and unlock its benefits, while the community converges on a secure remote MCP server architecture.
Human-in-the-loop workflows first: AI is delivering massive value and rewards for organizations across the globe; however, its non-deterministic nature introduces risks. We believe users want control over actions performed by AI agents. Whether that is deploying a debug pod or creating or deleting resources, users want AI tools to request explicit write permissions. Hence, aks-mcp defaults to read-only tools with explicit user opt-in required for write tool access.

The Solution: AKS-MCP's Open, Secure Protocol

We believe AKS-MCP is that building block and acts as a universal, protocol-first bridge between AI agents and AKS.

Secure by default: Leverage Azure authentication and granular RBAC to keep operations and data safe. Agents can only access the data and resources that the user has access to.
Extensible and community-driven: Fully open source, so you can adapt, extend, and evolve the platform for new tools and future needs.
In-depth troubleshooting tools: Supports a number of tools to interact with Kubernetes and Azure APIs, monitoring telemetry (activity/audit logs), and diagnostic tooling such as Inspektor Gadget. AKS customers have repeatedly shared that troubleshooting Kubernetes and AKS issues is hard, so we have started with that problem and will expand rapidly over the coming weeks.

For a full list of tools and capabilities please see Available tools.

How does aks-mcp server authenticate and maintain RBAC compliance?

The AKS-MCP server is designed with security at its core, relying on Azure's industry-standard authentication mechanisms through the Azure SDK's DefaultAzureCredential chain (via the azidentity library), which checks for environment variables, managed identities, Azure CLI logins, or even browser-based credentials. This means the server never manages user credentials directly; instead, users must authenticate with Azure CLI (az login) beforehand, and the server simply reuses this context to obtain secure OAuth tokens for every Azure API call. To further protect operations, AKS-MCP enforces a three-tier access control system—readonly, readwrite, and admin—with a built-in security validator, ensuring every command is checked against configured permission levels before execution. This approach provides seamless, secure access for both automation and interactive use cases, and ensures only authorized actions can be performed, all by building on existing, trusted Azure identity patterns.
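The tiered check described above can be pictured with a short sketch. This is not the aks-mcp validator itself: the verb-to-level mapping below is a hypothetical example, and the only detail taken from the text is the three ordered levels (readonly, readwrite, admin) and the idea of checking each command before it runs.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """Ordered access tiers; a higher value includes the lower ones."""
    READONLY = 1
    READWRITE = 2
    ADMIN = 3


# Hypothetical mapping from an operation verb to the minimum tier it needs.
REQUIRED_LEVEL = {
    "get": AccessLevel.READONLY,
    "describe": AccessLevel.READONLY,
    "logs": AccessLevel.READONLY,
    "apply": AccessLevel.READWRITE,
    "delete": AccessLevel.READWRITE,
    "scale": AccessLevel.READWRITE,
    "cordon": AccessLevel.ADMIN,
    "drain": AccessLevel.ADMIN,
}


def is_allowed(verb, configured):
    """Return True if the configured tier is high enough for the verb.

    Unknown verbs fall back to ADMIN, so anything unrecognized is denied
    unless the server was started with the highest tier.
    """
    required = REQUIRED_LEVEL.get(verb, AccessLevel.ADMIN)
    return configured >= required


for verb in ("get", "scale", "drain"):
    print(verb, is_allowed(verb, AccessLevel.READWRITE))
```

Defaulting unknown verbs to the highest tier is a deliberate fail-closed choice in this sketch; a real validator may classify commands quite differently.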

Getting Started with AKS MCP

Option A: VS Code Extension (Recommended)

To make it easy for AKS users to leverage the new AKS-MCP server, we have integrated it with AKS VS Code Extension starting with version 1.6.12.

Install the AKS Extension from the VS Code Marketplace.
Run AKS: Setup AKS MCP Server from the command palette (Ctrl+Shift+P on Windows/Linux or Cmd+Shift+P on macOS).
The extension automatically installs the binary based on the platform, configures the MCP server and updates your VS Code mcp.json file.
Run MCP: List Servers (via Command Palette). From there, you can start the AKS-MCP server or view its status.
Instantly start using GitHub Copilot or your favorite AI agent with first-class access to AKS tools.

This one-click setup brings AI-powered Kubernetes operations to every developer's fingertips.

Option B: Download Binaries/Container Image from GitHub

You can download the correct binaries from the AKS-MCP GitHub releases page, and then manually configure the MCP server in VS Code or the IDE/agent of your choice. Find information on how to install and get started.

Real-World Example Use Cases

AKS-MCP enables intelligent, agent-driven cloud operations. Here are a few hands-on scenarios that you can try in GitHub Copilot, Claude Code or other MCP compatible agents:

Diagnose Resource Health

Prompt: "Why is one of my AKS nodes in NotReady state?"

Understand Network Security

Prompt: "Can you help me understand my AKS cluster's VNet and NSG configuration? I think DNS traffic is being blocked."

Automate Operations

Prompt: "Scale my payments-api deployment to 5 replicas and confirm rollout status."

The result? AI agents that can reason, act, and surface Azure-specific insights—accelerating DevOps and cloud troubleshooting.

Let us see an example below, where I am asking GitHub Copilot for the health of my cluster, and the AI-assisted diagnostic journey that follows:

Prompt: How is the health of my cluster? (actually there is a typo, but GitHub Copilot is able to figure that out)

Now I need help figuring out why there is an outbound connectivity issue in my cluster.

Prompt 2: diagnose the outbound connectivity failure

As you can see, GitHub Copilot is able to use the aks-mcp server and make various tool calls to Azure and Kubernetes APIs to figure out that there is a problem, what the problem is, and how to remediate it.

Get Involved

Visit Azure/aks-mcp on GitHub. We're looking for feedback, contributions, and innovative feature ideas.

Let's take the next step in cloud-native DevOps—where Kubernetes, AI, and open protocols empower every developer.

Accelerate DNS Performance with LocalDNS

Mon, 04 Aug 2025 00:00:00 GMT

DNS performance issues can cripple production Kubernetes clusters, causing application timeouts and service outages. LocalDNS in AKS solves this by moving DNS resolution directly to each node, delivering 10x faster queries and improved reliability. In this post, we share the results from our internal tests showing exactly how much of an improvement LocalDNS can make and how it can benefit your cluster.

Background: The Hidden Cost of DNS in Production Kubernetes

In Kubernetes clusters, DNS is the invisible backbone that enables service discovery and inter-pod communication, but its critical role often goes unnoticed until it becomes a bottleneck. DNS-related issues are among the most challenging operational problems. What begins as minor performance degradation can quickly escalate into customer-impacting incidents and even full-scale outages. As cluster size grows, the complexity of DNS management increases exponentially. A configuration that works for a small development environment may prove completely inadequate at production scale, exposing fundamental architectural limitations that can threaten the reliability and scalability of the entire system.

Why Centralized CoreDNS Becomes a Bottleneck in Kubernetes Clusters

Traditional DNS was built for static, predictable environments with long-lived hosts and low query volumes. Kubernetes, however, is dynamic and high-churn:

Ephemeral workloads: Pods are rapidly created and destroyed, each needing immediate DNS resolution
High query volume: Service meshes, health checks, and inter-service calls generate thousands of DNS queries per second
Dynamic endpoints: Services and pods frequently change IPs, requiring constant DNS updates and cache invalidation
Complex networking: Multiple network layers (pod, service, ingress) add latency and increase DNS infrastructure load

These differences turn DNS from a background service into a critical bottleneck as clusters grow. Relying on a handful of centralized CoreDNS pods exposes architectural weaknesses: all DNS queries are funneled through these pods, creating a single point of contention and introducing network overhead with every lookup. High query volumes can overwhelm conntrack tables, and centralized caching misses the benefits of local cache hits—forcing even repeated queries to traverse the network.

The result? Application timeouts, resource exhaustion, cascading failures, and increased operational burden. Without rethinking DNS architecture, teams face increased latency, reliability issues, and operational headaches at scale. LocalDNS addresses these challenges by decentralizing DNS resolution—moving the cache and resolver directly onto each node, closer to every workload.

Introducing LocalDNS for Faster, More Reliable DNS Resolution

To address these fundamental architectural challenges, AKS introduces LocalDNS, a node-level DNS proxy that transforms how DNS resolution works in Kubernetes clusters. LocalDNS represents a shift from centralized DNS resolution to a distributed, resilient architecture that brings DNS responses closer to the workloads that need them. By deploying a DNS proxy directly on each node as a systemd service, LocalDNS eliminates the network hop to centralized DNS pods, reducing latency while improving overall cluster resilience. This is especially useful in large clusters and high-traffic environments, where the latency and reliability gains hold even under high load.
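The benefit of a node-local cache can be seen in a toy model. The sketch below is purely conceptual and says nothing about how LocalDNS is implemented: the class, the fake upstream resolver, the returned address, and the 30-second TTL are all assumptions chosen for illustration. It simply counts how many lookups still have to leave the node.

```python
import time


class NodeLocalCache:
    """Toy per-node DNS cache: answer from memory until the entry's TTL expires."""

    def __init__(self, upstream, ttl_seconds=30.0, clock=time.monotonic):
        self._upstream = upstream      # callable standing in for a remote resolver
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # name -> (address, expiry time)
        self.upstream_calls = 0

    def resolve(self, name):
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]           # local hit: no network hop
        self.upstream_calls += 1       # miss or expired: go to the upstream
        address = self._upstream(name)
        self._entries[name] = (address, now + self._ttl)
        return address


def fake_upstream(name):
    """Stand-in for a centralized resolver; always returns a fixed address."""
    return "10.0.0.10"


cache = NodeLocalCache(fake_upstream)
for _ in range(1000):
    cache.resolve("payments.default.svc.cluster.local")
print(f"upstream lookups: {cache.upstream_calls} of 1000")
```

In this model only the first lookup for a name reaches the upstream within the TTL window; every repeat is served on the node.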

For more details on LocalDNS and how to enable it in your AKS clusters, check out the official AKS LocalDNS documentation.

How We Tested LocalDNS

To evaluate the impact of LocalDNS, we conducted parallel tests across two AKS clusters: one with LocalDNS enabled on all nodes and another using only centralized CoreDNS. In both environments, we generated a sustained load of 10,000 DNS queries per second (QPS) using industry-standard tools such as dnsperf and resperf. This allowed us to observe query distribution across CoreDNS pods, measure resolution success rates, and compare end-to-end DNS lookup latencies.

The Results

1. Improved DNS Query Resolution Times

The graphs below demonstrate a substantial reduction in DNS query resolution times across all percentiles (P50, P95, P99) when LocalDNS is enabled. LocalDNS consistently delivers faster responses, with more than 10x lower latency and a significant reduction in tail latency at P99. These improvements apply to both internal cluster traffic and external domain resolution.
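
For readers who want to reproduce percentile figures from their own measurements, the following is a minimal sketch of a nearest-rank percentile calculation. It is not the tooling used for the tests above; the function names and the sample latencies are hypothetical and only illustrate how P50, P95, and P99 are derived from raw per-query latencies.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not isinstance(pct, int) or not 0 < pct <= 100:
        raise ValueError("pct must be an integer in 1..100")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the three percentiles discussed in this post."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


# Hypothetical latencies in milliseconds, not measured data.
example = [1.2, 0.8, 15.0, 1.0, 0.9]
print(summarize(example))
```

With only a handful of samples, P95 and P99 collapse onto the slowest query, which is why tail percentiles are only meaningful over a large number of lookups.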

In-Cluster DNS Resolution Time (cluster.local)

External DNS Resolution Time

2. Better Distribution of Requests Across CoreDNS Pods

The pie charts below show the dramatic improvement in traffic distribution across CoreDNS pods when LocalDNS is enabled. In the centralized setup, nearly all DNS traffic (99.9%) is handled by a single CoreDNS pod (because of the use of the UDP protocol), creating a significant bottleneck. With LocalDNS, the split shifts to a much healthier 40%/59.9% distribution, demonstrating balanced load and improved scalability.

3. Additional Operational Improvements

Beyond performance gains, LocalDNS provides critical operational benefits that improve cluster reliability and reduce maintenance overhead:

Stale cache serving during upstream DNS outages: LocalDNS can serve DNS responses from its local cache even if the upstream CoreDNS or external DNS servers become temporarily unavailable. This ensures that workloads continue to resolve frequently used names without interruption, improving resilience during intermittent DNS outages.
Reduced conntrack table entries for DNS connections: With LocalDNS running as a node-level service, DNS queries from pods are resolved locally, reducing the need for each DNS request to traverse the node's network stack and create conntrack entries. This reduces pressure on the node's conntrack table, lowering the risk of resource exhaustion and related networking issues.
Fewer DNS queries reaching CoreDNS: By caching responses at the node level, LocalDNS dramatically reduces the number of queries that need to be forwarded to the centralized CoreDNS pods. This offloads traffic from CoreDNS, decreases overall DNS infrastructure load, and further improves cluster scalability and reliability.
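
The first and third benefits above can be illustrated with a small sketch of a TTL cache that falls back to expired entries when the upstream resolver fails. This is not how LocalDNS is implemented; the class name, the `resolve_upstream` callable, and the 30-second TTL are assumptions chosen only to show the idea.

```python
import time


class ServeStaleCache:
    """Toy node-local DNS cache that serves stale answers during upstream outages."""

    def __init__(self, resolve_upstream, ttl_seconds=30.0, clock=time.monotonic):
        self._resolve = resolve_upstream  # hypothetical callable: name -> answer
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # name -> (answer, expires_at)
        self.upstream_queries = 0  # how many lookups left the node

    def lookup(self, name):
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]  # fresh hit: no upstream traffic at all
        try:
            self.upstream_queries += 1
            answer = self._resolve(name)
        except OSError:
            if cached is not None:
                return cached[0]  # upstream is down: serve the stale answer
            raise  # nothing cached, so the failure is surfaced to the caller
        self._entries[name] = (answer, now + self._ttl)
        return answer
```

Repeated lookups for the same name within the TTL never increment `upstream_queries`, which mirrors the reduction in traffic reaching CoreDNS, and an expired entry is still returned if the upstream call raises.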

Conclusion

LocalDNS transforms DNS delivery in AKS clusters by providing faster resolution, greater reliability, and streamlined operations for production workloads. By decentralizing DNS and placing resolution closer to each node, LocalDNS eliminates common bottlenecks and empowers teams to scale with confidence.

We invite you to enable LocalDNS in your AKS clusters and experience the benefits firsthand. Your feedback helps us evolve this feature—please share your insights, report issues, or suggest enhancements by opening a GitHub issue.

For a deeper dive into LocalDNS architecture and step-by-step guidance on activation, visit our official documentation.

Streamlining Temporal Worker Deployments on AKS

Thu, 31 Jul 2025 00:00:00 GMT

Temporal is an open source platform that helps developers build and scale resilient Enterprise and AI applications. Complex and long-running processes are easily orchestrated with durable execution, ensuring they never fail or lose state. Every step is tracked in an Event History that lets developers easily observe and debug applications. In this guide, we will help you understand how to run and scale your workers on Azure Kubernetes Service (AKS).

Running Temporal Workers efficiently is crucial for scalable and resilient distributed services. Azure Kubernetes Service (AKS) stands out as a leading platform for hosting Temporal workers, offering tight integration with Azure's ecosystem and built-in auto-scaling capabilities with fault tolerance—critical features for enterprise Temporal deployments.

Everything you need can be found at the following repo: https://github.com/temporal-community/temporal-on-aks-starter

Getting Started

This walkthrough will cover writing Temporal Worker code, containerizing and publishing the Worker to Azure Container Registry (ACR), and then deploying it to AKS. We'll leverage Temporal's Python SDK for our examples, along with automation scripts.

Project Structure and Configuration

A well-organized project is key to managing complex deployments. Breaking everything out into dedicated files for activities, workflows, workers, clients, and configuration provides a clear roadmap and simplifies both development and deployment.

Your approach to configuration is just as important. By using a config.env file, you centralize configuration and make environment variables easier to manage. For an AKS deployment, you would then configure variables such as:

ACR Name: Azure Container Registry name
Resource Group: Azure resource group name
ACR Username/Password/Email: Credentials for ACR authentication
Temporal Address: The address of your Temporal server (whether self-hosted or Temporal Cloud)
Temporal Namespace: The Temporal namespace you're using
Temporal Task Queue: The name of your task queue
Resource Limits: CPU and memory requests and limits for your worker pods.

You would typically set up this configuration by copying an example file and then populating it with your specific Azure and Temporal environment values.
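
As a rough illustration of how such a file could be validated before a deployment, here is a minimal sketch that parses KEY=VALUE lines and reports missing settings. The key names in `REQUIRED_KEYS` are hypothetical stand-ins for the variables listed above; the starter repository's actual variable names may differ.

```python
# Hypothetical key names; check the example config file in the repo for the real ones.
REQUIRED_KEYS = (
    "ACR_NAME",
    "RESOURCE_GROUP",
    "TEMPORAL_ADDRESS",
    "TEMPORAL_NAMESPACE",
    "TEMPORAL_TASK_QUEUE",
)


def parse_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=VALUE, got: {raw_line!r}")
        # Drop one layer of surrounding quotes, if present.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not config.get(key)]
```

Running `missing_keys(parse_env_file(open("config.env").read()))` before building images gives an early, readable failure instead of a pod that crashes on startup.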

Crafting Your Temporal Worker Implementation

Temporal applications separate business logic into Workflow definitions, with Worker processes handling the actual execution of Workflows and Activities.

Your activities.py file defines the individual tasks your workflows perform, like data processing or external API calls.

The worker.py file is where your Temporal Worker is initialized. It connects to the Temporal server and registers the workflows and activities it's responsible for. For AKS deployments, your worker would connect to your Temporal server address, handling both local development scenarios (without TLS or API keys) and production environments (such as Temporal Cloud, which require TLS or an API key).
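
One way to handle both scenarios is to derive the connection options from the environment. The sketch below is not taken from the starter repository: the environment variable names are hypothetical, and the returned keys are intended to line up with the keyword parameters of `Client.connect` in recent releases of the Temporal Python SDK, which you should verify against your installed version. It does not cover mutual TLS with client certificates.

```python
def build_connect_options(env):
    """Build connection options for local (plain) or hosted (TLS + API key) use.

    `env` is any mapping of environment variables, for example os.environ.
    All variable names here are assumptions made for illustration.
    """
    options = {
        # localhost:7233 is the default address of a local Temporal dev server.
        "target_host": env.get("TEMPORAL_ADDRESS", "localhost:7233"),
        "namespace": env.get("TEMPORAL_NAMESPACE", "default"),
    }
    api_key = env.get("TEMPORAL_API_KEY")
    if api_key:
        # Hosted endpoints that authenticate with an API key also expect TLS.
        options["api_key"] = api_key
        options["tls"] = True
    return options


# Assumed usage inside worker.py, with the temporalio package installed:
#   client = await Client.connect(**build_connect_options(os.environ))
```

Keeping this logic in a plain function makes it easy to unit test without a running Temporal server.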

Finally, a client.py application is used to trigger your workflows, initiating their execution within the Temporal system. This client would also connect to your Temporal server.

Preparing Your Containers for Kubernetes

Containerization is a fundamental step for deploying applications to Kubernetes. Your project's Dockerfile provides the blueprint for building your worker image. This Dockerfile would typically:

Use a suitable base image (e.g., Python slim).
Set a working directory.
Install necessary system dependencies.
Install the Temporal Python SDK and other project dependencies.
Copy your application code.
Configure Python for unbuffered output.
Ensure your startup scripts are executable.
Define the command to start your worker process.

Automated Deployment Process

A flexible automation system is important for reliable deployments to Kubernetes. This project's deploy.sh script illustrates one, handling everything from building and pushing Docker images to applying Kubernetes manifests. You would likely substitute your own automation processes and tools for your use case.

The automated deployment process would generally involve:

Configuration Setup: Copying and configuring your environment variables in config.env.
Building and Pushing Images: Building your Docker image (potentially for multiple architectures) and pushing it to Azure Container Registry (ACR).
Kubernetes Manifest Generation: Scripts like your generate-k8s-manifests.sh would create Kubernetes deployment files, including:
- An ACR authentication secret.
- A ConfigMap for Temporal configuration.
- A Deployment YAML for your application.
Applying Manifests: Using kubectl to create the necessary Kubernetes namespace, apply Secrets and ConfigMaps, and deploy your application to the AKS cluster.

Local Development

Getting set up locally is the best way to start building and testing. Your project provides instructions that clearly lay out all the steps, which include:

Installing Python dependencies.
Configuring your config.env for local development.
Running your worker and client applications directly.

For a faster setup, the start.sh script launches the local environment with a single command.

Validating Worker Connectivity and Resource Management

After deployment, verify that Temporal Workers have successfully connected to the Temporal server. You can use kubectl commands to check pod statuses and examine worker logs for confirmation messages like "Starting worker... Awaiting tasks."

Resource management is critical for stable and efficient Kubernetes deployments. Your Kubernetes deployment is configured with specific CPU and memory requests and limits. These values are adjustable in your config.env and should be fine-tuned based on your worker's actual resource consumption to prevent performance bottlenecks or unnecessary resource allocation.

Troubleshooting and Configuration Details

Common issues often revolve around configuration errors, ACR authentication, Temporal connection problems, or insufficient resource limits. The project's troubleshooting section provides useful kubectl commands for diagnosing and resolving these issues, including checking configuration, viewing generated manifests, examining pod logs, and restarting deployments.

For a comprehensive understanding of all configuration options, refer to the project documentation. Your configuration system prioritizes environment variables and supports automatic manifest generation. This offers a flexible approach for both local and production deployments.

By following these principles and leveraging robust automation, you can confidently deploy and manage your Temporal Workers on Azure Kubernetes Service (AKS), ensuring your distributed services are scalable, resilient, and efficiently operated.


AKS Long Term Support: 24-Month Support Now Available for Every Kubernetes Version

Fri, 25 Jul 2025 00:00:00 GMT

At KubeCon EU 2025 in London, we announced an expansion of what AKS Long Term Support (LTS) includes. Today, we're sharing more details about this offering, which addresses one of the most critical challenges enterprises face when running Kubernetes at scale.

From our conversations with customers, we consistently hear the same concerns: "How do I balance Kubernetes innovation with the stability my business-critical applications require?" and "Why do I need to upgrade my clusters so frequently when my applications are running perfectly?"

On the flip side, customers also ask: "If I don't upgrade frequently, how do I still ensure that I'm getting security fixes and ecosystem compatibility updates for my Kubernetes infrastructure?" AKS LTS directly addresses these real-world challenges.

Why AKS LTS Matters

Enterprise organizations running mission-critical workloads on Kubernetes have consistently told us they need more predictable, stable platform foundations. While the rapid innovation pace of Kubernetes is fantastic for driving new capabilities, it can create challenges for organizations that require:

Extended support lifecycles that align with enterprise planning cycles
Predictable upgrade windows that minimize operational disruption
Stability guarantees for production workloads that can't tolerate frequent changes
Compliance requirements that demand long-term support commitments

AKS LTS directly addresses these needs by providing a support plan specifically designed for enterprise stability requirements.

What is AKS LTS?

Understanding Kubernetes Community Support: The upstream Kubernetes project follows a regular release cadence of approximately 4 months between versions, supporting the current version plus the two previous versions (n to n-2). AKS typically makes new Kubernetes versions available about a month after the upstream release, following our rigorous testing and validation process. This means each version receives roughly 12-14 months of community support.

While this rapid innovation cycle drives excellent new capabilities, we've heard from many enterprise customers that the standard 12-14 month support lifecycle can present challenges for production environments with stability requirements, compliance needs, or complex upgrade validation processes.

AKS Long Term Support is a support plan that provides 24 months of support for Kubernetes versions from their GA date in AKS, compared to the standard 12-14 month support lifecycle. Here's the key change: Every currently supported AKS Kubernetes version is now also available for long term support (LTS).

This isn't just extended maintenance—it's a support commitment for all managed components of AKS designed for enterprises running mission-critical workloads.

The Real Impact: Instead of planning cluster upgrades every 12-14 months, you can now plan major Kubernetes upgrades every 24 months. This roughly halves upgrade frequency and reduces operational overhead while maintaining full security and support coverage.

AKS LTS provides all the same capabilities and features as community-supported AKS versions, with these key enhancements:

Extended Support Timeline: 24 months of full support from GA date (compared to standard 12-14 months), including security patches and critical bug fixes
Premium Tier Requirement: Available as part of the Premium tier, which includes additional enterprise features and comes with associated costs
Same Kubernetes Experience: Identical functionality to standard AKS with proven, stable Kubernetes versions
Backwards Compatibility: Strong commitment to API stability and workload compatibility within LTS versions
Integrated Azure Services: Compatibility with most existing Azure services and AKS ecosystem features (see LTS limitations for current exceptions as we work towards 100% coverage)

Enterprise Benefits at a Glance

Reduced Operational Overhead

Fewer required upgrades mean less time spent on cluster maintenance
Predictable planning cycles aligned with enterprise budgeting and resource allocation
Reduced testing and validation overhead for workload compatibility

Enhanced Stability

Production-tested Kubernetes versions with proven track record
Minimized risk of introducing breaking changes or regressions
Focus on reliability over cutting-edge features

Compliance and Governance

Extended support timelines that meet enterprise compliance requirements
Clear end-of-life planning with advance notice for migration planning
Integration with Azure Policy and governance frameworks

Technical Implementation

Universal LTS Coverage - A Major Expansion: Previously, AKS LTS was available only for select Kubernetes versions, typically spaced several versions apart. This announcement represents a fundamental shift: AKS LTS now extends to every currently supported Kubernetes version in AKS. This means instead of waiting for specific "LTS-designated" versions, enterprises can choose long-term support for any supported Kubernetes version that meets their application requirements.

Beyond Community Support: When the Kubernetes community support ends (typically 12-14 months after a version's GA), AKS LTS continues providing comprehensive support for 24 months from the original GA date. Crucially, once the community stops supplying patches after the official version support end-of-life, AKS continues to provide patches and CVE fixes for all supported components.

This extended support is comprehensive and includes:

Core Kubernetes components: API server, etcd, kubelet, kube-proxy, and all core Kubernetes functionality
AKS-managed add-ons, extensions, and AKS features: Networking, monitoring, and security components (see AKS integrations for the complete list), with the exception of Istio, which is coming soon as described in the Looking Ahead section. For the current list of unsupported add-ons and features, please refer to our LTS unsupported add-ons documentation.
Node and OS components: Operating system patches, security updates, and node-level components for both Linux and Windows nodes

This holistic approach ensures that every layer of your full Kubernetes stack remains supported, secure, and stable throughout the entire LTS lifecycle.

Support Timeline Example:

AKS LTS 1.28 (GA: Nov 2023): Supported until Feb 2026
AKS LTS 1.29 (GA: Mar 2024): Supported until Apr 2026
AKS LTS 1.30 (GA: Jul 2024): Supported until Jul 2026
Future LTS versions will follow the same 24-month support commitment from their respective GA dates of the AKS community versions.

For the most up-to-date information on LTS version support timelines, please refer to the AKS LTS calendar.
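
The 24-month commitment can be sanity-checked with simple month arithmetic. The sketch below is an illustration only: the helper name is hypothetical, and it uses the 1.30 GA month from the list above. Because the published end dates for some versions in that list do not equal GA plus exactly 24 months, treat the AKS LTS calendar as the authoritative source.

```python
from datetime import date


def add_months(start, months):
    """Return the first day of the month that is `months` after `start`."""
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


ga_1_30 = date(2024, 7, 1)  # AKS 1.30 GA month, taken from the list above

community_end_min = add_months(ga_1_30, 12)  # lower bound of the 12-14 month window
community_end_max = add_months(ga_1_30, 14)  # upper bound of the 12-14 month window
lts_end = add_months(ga_1_30, 24)            # 24 months from GA

print(community_end_min, community_end_max, lts_end)
```

For 1.30 this yields an LTS end of July 2026, matching the timeline entry, and shows the 10 to 12 additional months that LTS adds beyond the community window.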

Add-on Versioning Policy: AKS add-ons, components and extensions are pinned to specific versions that align with each Kubernetes release. This versioning policy remains consistent between community-supported AKS and AKS LTS—add-ons receive the same version pinning approach to ensure stability and compatibility throughout the support lifecycle.

Throughout the support lifecycle, AKS LTS versions receive:

Security patches and vulnerability fixes
Fixes for critical bugs that impact stability or functionality
Compatibility maintenance for Azure service integrations (not new feature additions)
Ecosystem compatibility updates to maintain existing functionality (not new capabilities)

Stay Informed: You can track LTS patch rollouts and new LTS version availability in real time using the AKS Release Tracker under the Kubernetes version tab. Additionally, you can use the az aks get-upgrades CLI command or the GET upgradeProfiles API (which maps to the Microsoft.ContainerService/managedClusters/upgradeProfiles/read permission) to view available LTS versions for your clusters. These tools provide complete visibility into your LTS support lifecycle and upcoming releases.

Creating AKS LTS Clusters

Creating an AKS cluster with LTS support is straightforward and uses the same familiar tools and processes you already know. Here's a simple example using Azure CLI:

# Create an AKS cluster with LTS support
az aks create \
    --resource-group myResourceGroup \
    --name myAKSLTSCluster \
    --tier premium \
    --k8s-support-plan AKSLongTermSupport \
    --kubernetes-version 1.29 \
    --node-count 3

You can also deploy AKS clusters with LTS support using:

Azure Portal: Select "Long Term Support" option during cluster creation
ARM Templates/Bicep: Specify the LTS tier in your infrastructure-as-code deployments
Terraform: Updated AKS provider supports LTS cluster configuration

Important: AKS LTS is available as part of the Premium tier. For detailed pricing information, please refer to the AKS Premium tier pricing.

AKS clusters with LTS support maintain full compatibility with the existing AKS ecosystem, including AKS Features, Add-ons and Extensions.

Industry-Leading Ecosystem Support: Unlike other cloud providers that typically limit LTS guarantees to the base Kubernetes API, AKS LTS provides comprehensive support for popular add-ons and components that enterprises depend on for production workloads. This includes coordinated breaking change management, timely CVE fixes (tracked via AKS security bulletins), and compatibility assurance throughout the LTS lifecycle.

Choosing Between Community and LTS Support

While AKS LTS provides extended stability, it's important to understand when each support option makes the most sense for your organization. Both community and LTS support have distinct advantages depending on your requirements.

When to Choose Community Support:

Cost optimization: Community support comes at no additional cost beyond standard AKS pricing
Latest features: Immediate access to the newest Kubernetes capabilities and AKS innovations
Development environments: Non-critical workloads and test environments where the ability to test newer functionality is a requirement
Frequent upgrade tolerance: Teams comfortable with 12-14 month upgrade cycles and regular maintenance windows, running workloads that can handle restarts gracefully

When to Choose LTS Support:

Less maintenance for workloads and clusters: Production systems where reducing upgrade frequency and operational overhead is prioritized
Compliance requirements: Environments demanding extended support commitments for regulatory purposes
Complex upgrade validation: Organizations requiring extensive testing cycles before adopting new versions
Resource-constrained teams: Limited operational capacity for frequent cluster maintenance

Decision Framework: Consider these factors when choosing your support model:

Workload criticality: How much downtime can your applications tolerate?
Operational resources: What's your team's capacity for regular maintenance?
Innovation requirements: How quickly do you need access to new Kubernetes features?
Budget considerations: Can you justify Premium tier costs for extended support?
Compliance mandates: Do regulations require specific support timelines?

AKS LTS Version Compatibility

Every currently supported AKS Kubernetes version is now also available for long term support (LTS). This means you can immediately access long-term support for your existing clusters without requiring cluster upgrades or migrations.

Whether you're running the latest version or an older supported version, you can transition to LTS support coverage today by upgrading your cluster to the Premium tier.

Immediate Benefits:

No forced migrations: Your existing clusters can adopt LTS support in-place
Complete coverage: Every supported version gets the same comprehensive LTS treatment
Instant value: Start benefiting from extended support timelines immediately

When planning your LTS adoption, consider:

Current cluster assessment: Evaluate which of your existing clusters would benefit most from extended support
Workload criticality: Prioritize mission-critical production workloads for LTS coverage
Compliance requirements: Align LTS adoption with your organization's governance and compliance needs

For detailed migration guidance and best practices, refer to our AKS LTS migration documentation on Microsoft Learn.

Upgrading Between LTS Versions

When it's time to upgrade from one AKS LTS version to another, you can take advantage of all the robust upgrade functionality already available in AKS. Popular upgrade options include:

MaxUnavailable configuration: Control cluster capacity during upgrades by specifying the maximum number of nodes that can be unavailable simultaneously
Undrainable node behavior: Configure how AKS handles nodes that cannot be drained, ensuring predictable upgrade outcomes
OS Security Patch channel: Automate operating system security updates through configurable patch channels
Node Image channel: Keep node images updated with the latest security patches and OS improvements
Planned maintenance windows: Schedule upgrades during off-peak hours to minimize business impact

For comprehensive guidance on all available upgrade strategies, see the AKS cluster upgrade documentation. These production-tested upgrade mechanisms ensure smooth transitions between LTS versions while maintaining workload availability.

Transitioning Between Support Models

Community to LTS Transition: Moving from community to LTS support is straightforward—simply upgrade your cluster to the Premium tier. Your current Kubernetes version immediately receives extended 24-month support coverage without requiring a cluster upgrade or migration.

LTS to Community Transition: Transitioning from LTS to community support requires more planning, especially if you're near the end of your 24-month LTS window. Key considerations include:

Version gap assessment: After 24 months on LTS, you may need to upgrade across multiple Kubernetes versions to reach a currently supported community version
Multi-hop upgrades: While Kubernetes upstream focuses on n-1 to n upgrades, AKS provides support for multi-version upgrades with upgrade path guidance and testing recommendations for larger version jumps
Shared responsibility: You're responsible for workload compatibility testing across version gaps, while AKS ensures the upgrade path is technically viable and provides breaking change documentation
Planning window: Begin transition planning 6-12 months before your LTS support expires to allow adequate time for testing and validation
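
To make the version gap assessment concrete, the following sketch lists every minor version between a cluster's current version and a target version, so that each one can be checked against breaking change documentation. The function names and the example versions are hypothetical, and the output says nothing about which upgrade paths AKS actually offers for a given cluster.

```python
def parse_minor(version):
    """Split a version such as '1.28' or '1.28.5' into (major, minor) integers."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def minor_versions_to_review(current, target):
    """List each minor version after `current`, up to and including `target`."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major:
        raise ValueError("major version changes are out of scope for this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("target version is older than the current version")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


# Hypothetical example: a cluster on 1.28 moving to 1.32.
print(minor_versions_to_review("1.28", "1.32"))
```

The length of the returned list is a quick proxy for how much compatibility testing to budget within the planning window.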

Best Practice: Consider your long-term strategy when initially choosing LTS to avoid complex transitions later.

Looking Ahead

AKS LTS represents our commitment to supporting enterprises at every stage of their Kubernetes journey. While LTS focuses on stability, we continue to innovate rapidly in standard AKS, ensuring you have access to the latest Kubernetes capabilities when you need them.

Over the coming months, we'll be expanding AKS LTS capabilities based on your feedback, including:

Istio support for LTS: Other add-ons have their minor versions pinned to the minor version of AKS. The Istio add-on, whose sidecar runs inside the user's pod, instead lets users explicitly control minor version upgrades today, which complicates the permutations to be considered for LTS. The LTS scope for Istio versions deployed on top of AKS LTS versions is currently being finalized and will be announced in a future update
KMS V2 support for LTS: Enhanced Key Management Service V2 support for AKS LTS, tentatively planned for the first half of calendar year 2026, providing improved encryption key management capabilities for enterprise security requirements

Enhanced AKS Upgrade Capabilities (coming soon for both standard AKS and LTS Support Plan):

Agent pool Blue-Green upgrades: Node pool-level blue-green upgrade strategy that enables workload validation batch by batch, with the ability to rollback newly created green nodes within a configurable soak period
Component Version API: A dedicated API to surface breaking changes in AKS components, including AKS features, add-ons, extensions, and OS components, helping customers understand compatibility impacts before upgrading to the next AKS LTS version
Enhanced Pod Disruption Budget management: Simplified Pod Disruption Budget (PDB) creation and management capabilities to streamline PDB setup and reduce upgrade complexity

Real-World Customer Scenarios

Financial Services: A major bank running regulatory compliance workloads told us: "We need Kubernetes for modern app development, but our compliance team requires extended stability guarantees. AKS LTS gives us both innovation and the predictability our auditors demand with 24 months of support."

Healthcare: A healthcare provider managing patient data systems shared: "Frequent cluster upgrades mean extensive testing and validation cycles. With AKS LTS, we can focus on improving patient outcomes instead of constant infrastructure maintenance."

Manufacturing: An IoT platform managing factory operations explained: "Our edge clusters run critical production line controls. Unexpected upgrades could halt manufacturing. AKS LTS gives us the stability to keep factories running while still benefiting from modern Kubernetes capabilities."

Azure Linux Support for AKS LTS

Azure Linux Container Host for AKS now supports AKS Long Term Support, starting with Kubernetes v1.29. This completes our OS support matrix for LTS—while Ubuntu and Windows Server node pools have been available with LTS since launch, Azure Linux Container Host is now the final piece, providing comprehensive OS choice for enterprise LTS deployments.

Azure Linux Container Host for AKS brings several key advantages for LTS deployments:

Secure supply chain: Built and maintained by Microsoft with full supply chain security
Secure by default: Streamlined attack surface with only essential components and security-hardened configuration
Optimized performance: Purpose-built for Azure infrastructure with container workload optimizations
Consistent updates: Aligned with AKS LTS lifecycle for predictable maintenance windows
Microsoft support: Full integration with Azure support processes and enterprise SLAs

This combination of Azure Linux Container Host and AKS LTS provides enterprises with a fully Microsoft-supported stack from the operating system through the Kubernetes platform, completing our commitment to comprehensive OS support for long-term enterprise deployments.

For detailed information about Azure Linux Container Host support for AKS LTS, see our Azure Linux LTS announcement.

Getting Started with AKS LTS

To reduce operational overhead and gain enterprise-grade stability, create your first AKS LTS cluster using our comprehensive quickstart guide.

Immediate Next Steps:

Quick assessment: Identify 1-2 production clusters that would benefit from 24-month support
Pilot deployment: Create a test AKS LTS cluster to evaluate the experience
Plan transition: Review our migration documentation for production workloads

Questions? Connect with the AKS team and community in our GitHub discussions or share your feedback and suggestions.

AKS LTS provides enterprise Kubernetes with stability, predictability, and comprehensive support, while maintaining access to Kubernetes innovation.

Debugging DNS in AKS with Inspektor Gadget

Wed, 23 Jul 2025 00:00:00 GMT

If you're reading this, you likely have heard the phrase "It's always DNS." This is a common joke amongst developers that the root of many issues is related to DNS.

In this blog we aim to empower you to identify the root cause of DNS issues and get back to green. You can also watch the video walkthrough from Microsoft Build Breakout Session #181 starting at the 5-minute mark.

In AKS, DNS resolution involves three main components:

Application Pod: The pod running your application that needs to resolve DNS names, e.g. order-service in the following diagram.
kube-dns: The in-cluster DNS service.
Upstream DNS Server: Azure's default or a custom DNS.

Note: The in-cluster DNS service is called kube-dns for historical reasons, but the actual implementation is CoreDNS. That's why the backend pods running the service are named coredns-xxxx.

In practice, DNS resolution in Kubernetes falls into two main categories: resolving local endpoints and resolving external domains.

Local Endpoint Resolution

When your application needs to resolve a cluster-internal service (for example, another-service), it sends a DNS query directly to the in-cluster DNS service, kube-dns. The DNS service looks up the internal service record and returns the service's IP address (the ClusterIP, or the pod IPs for a headless service). The following diagram illustrates this flow, showing the pod issuing the query, kube-dns responding, and the application receiving the IP for the service:

External Endpoint Resolution

For external domains such as myexternalendpoint.com, kube-dns receives the pod's query and generates its own DNS request to the configured upstream DNS server (in our case, a custom DNS). Once the upstream server responds, kube-dns sends that response back to the pod. The following diagram illustrates this end-to-end flow:

Demo Environment & Failing Scenario

Our setup for this demo includes:

An AKS cluster configured to use a custom DNS server at 10.224.0.92.
The custom DNS server is intentionally misconfigured so that it does not respond to queries for the name myexternalendpoint.com. Any other domain name is resolved correctly.
A simple store application where one of the services needs to connect to myexternalendpoint.com.

note

The application code alongside the detailed steps to reproduce the environment (setup-env.md) and the demo (detailed-guide.md) are available at blanquicet/aks-store-demo.

Once the environment is ready, let's deploy the application:

Then, access the application through the store-front Ingress IP, which exposes it to the Internet:

And finally, try the application. You'll notice that when the product Inspektor Gadget is in the cart and you try to check out, the UI hangs and eventually fails with the following message:

And this is the error we will debug.

Troubleshooting the DNS Issue with Inspektor Gadget

This is the part of the application we are focusing on:

Let's start by checking order‑service pod logs:

Finding: The logs show "getaddrinfo EAI_AGAIN myexternalendpoint.com", indicating DNS lookup failures for the external endpoint.
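To make the error code in that log line less opaque, here is a minimal Python sketch of how a getaddrinfo failure can be classified. It is not part of the demo application: the function names and the injectable resolver parameter are our own illustrative choices. The key point is that EAI_AGAIN signals a temporary failure (the resolver timed out or received a server failure), which is different from a name that simply does not exist.

```python
import socket


def classify_gai_error(err: socket.gaierror) -> str:
    """Map a getaddrinfo error code to a coarse, human-readable category."""
    if err.errno == socket.EAI_AGAIN:
        # The resolver gave up temporarily: timeout or server failure.
        return "temporary failure (resolver timed out or got a server failure)"
    if err.errno == socket.EAI_NONAME:
        # The name was looked up successfully and does not exist.
        return "name does not exist"
    return f"other resolver error ({err.errno})"


def check_name(hostname: str, resolver=socket.getaddrinfo) -> str:
    """Resolve hostname and describe the outcome.

    The resolver argument is a hypothetical hook so the function can be
    exercised without real DNS traffic.
    """
    try:
        results = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as err:
        return classify_gai_error(err)
    addresses = sorted({info[4][0] for info in results})
    return "resolved: " + ", ".join(addresses)


if __name__ == "__main__":
    print(check_name("myexternalendpoint.com"))
```

Run from inside a pod, a "temporary failure" result points at the resolution path (kube-dns or the upstream server) rather than at the name itself.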
Now that we know it's a DNS issue, let's trace the DNS traffic using Inspektor Gadget. Inspektor Gadget is a versatile open source tool for observability. In this case, it will allow us to easily view DNS queries and responses to identify the root cause of the issue. If you are not running it yet, deploy it into your cluster by following the installation instructions.

With IG installed, let's run the trace_dns gadget to trace the DNS requests generated by the order-service pod for the myexternalendpoint.com. domain name:

note
Ensure that the filter value is a fully qualified domain name (FQDN) by adding a dot (.) to the end of the name.
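Why does the trailing dot matter? A pod's resolver usually expands short names with the search domains from /etc/resolv.conf before trying the name as-is. The following sketch models that standard search-list behavior. The search domains and ndots value shown are the typical Kubernetes defaults for a pod in the default namespace, which is an assumption about this environment, and candidate_queries is a hypothetical helper name.

```python
def candidate_queries(name, search, ndots=5):
    """Return the fully qualified names a stub resolver would try, in order.

    Simplified model of resolv.conf search-list handling:
    - a name ending in "." is already absolute and is queried as-is;
    - a name with fewer dots than ndots tries the search domains first;
    - otherwise the absolute name is tried first.
    """
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


# Assumed defaults for a pod in the "default" namespace.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

if __name__ == "__main__":
    for query in candidate_queries("myexternalendpoint.com", SEARCH):
        print(query)
```

Under these assumptions, only the last candidate is the absolute name myexternalendpoint.com., so filtering on the FQDN keeps the search-domain attempts out of the trace.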

Observations: The trace output shows multiple DNS queries (QR=Q) from the order-service pod to kube-dns for myexternalendpoint.com., followed by response packets (QR=R) from kube-dns with response code ServerFailure. This indicates that kube-dns received the queries but failed to resolve the name, returning a server failure error.
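If you are curious where the QR and response code fields come from, both live in the first bytes of every DNS message. The sketch below decodes them from a raw header following the standard DNS wire format. It is independent of Inspektor Gadget: decode_header is a hypothetical helper, and the response code labels other than ServerFailure are our own spelling.

```python
import struct

# Response codes from the DNS header (low 4 bits of the flags field).
RCODES = {
    0: "NoError",
    1: "FormatError",
    2: "ServerFailure",
    3: "NameError",
    4: "NotImplemented",
    5: "Refused",
}


def decode_header(message: bytes) -> dict:
    """Decode the ID, QR bit, and response code from a raw DNS message."""
    ident, flags = struct.unpack("!HH", message[:4])
    return {
        "id": ident,
        "qr": "R" if flags & 0x8000 else "Q",  # top bit: 0 = query, 1 = response
        "rcode": RCODES.get(flags & 0x000F, "Other"),
    }


if __name__ == "__main__":
    # Hand-built 12-byte header of a response carrying a server failure.
    header = struct.pack("!HHHHHH", 0x1234, 0x8182, 1, 0, 0, 0)
    print(decode_header(header))
```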
Now, given that it's resolving an external domain, let's check if that failure is coming from the custom DNS or it's being generated by the kube-dns service itself. To do this, we will trace the DNS traffic between kube-dns and the custom DNS server.

The output confirms that the custom DNS server is reachable but does not reply to queries for the myexternalendpoint.com. name:

Wrap-up

With Inspektor Gadget, we were able to follow the entire DNS resolution process and see exactly where things went wrong. Starting from the order-service pod, we saw DNS queries being sent to kube-dns and confirmed that kube-dns was responding with ServerFailure. This alone gave us confidence that the issue wasn't within the application logic or networking rules.

We then followed the trail from kube-dns to the upstream DNS server and confirmed that kube-dns was generating the proper external queries—but the upstream server wasn't replying at all. This clearly explained the server failure responses kube-dns was returning to the application. To validate that the issue was isolated to a specific domain, we traced other DNS queries to the upstream server and saw successful responses, proving the DNS server itself was reachable and generally working.

Having this visibility into DNS traffic at each step of the chain made troubleshooting faster and more reliable. Inspektor Gadget didn't just help us observe behaviors in real time—it let us isolate the issue with confidence and understand the impact across components.

In addition, you can collect networking metrics and logs with the Container Network Observability feature of Advanced Container Networking Services (ACNS). This feature provides networking observability for AKS clusters and also supports Windows node pools.

For a broader guide on diagnosing DNS issues in production clusters, check out this DNS troubleshooting walkthrough on Microsoft Learn.

Resources

Summary of resources used in this article:

Scaling Safely with Azure AKS Spot Node Pools Using Cluster Autoscaler Priority Expander

Thu, 17 Jul 2025 00:00:00 GMT

As engineering teams seek to optimize costs and maintain scalability in the cloud, leveraging Azure Spot Virtual Machines (VMs) in Azure Kubernetes Service (AKS) can help dramatically reduce compute costs for workloads tolerant of interruption.

However, operationalizing spot nodes safely—especially for production or critical workloads—requires deliberate strategies around cluster autoscaling and workload placement.

Here's how to utilize cluster autoscaler's priority expander feature to improve workload availability with spot on AKS.

1. Understanding Azure Spot Node Pools in AKS

Azure Spot VMs provide up to 90% savings compared to pay-as-you-go prices but come with the risk of eviction when Azure needs the compute back. To use Spot VMs with cluster autoscaler in AKS:

Your AKS cluster must use Virtual Machine Scale Sets (VMSS) for its node pools.
You can’t use Spot VMs for the default system node pool; only user node pools can be created as spot node pools.
The priority property of the node pool determines whether it is a Spot pool or a regular pool.

2. Setting Up a Safe Node Pool Architecture

A resilient AKS architecture for spot scaling typically looks like:

Node Pool Type	Purpose	Node VM Priority
System (Default)	Core system workloads	Regular
On-demand	User/service-critical pods	Regular
Spot	Cost-optimized workloads	Spot

The default node pool (system) runs kube-system and other critical pods on regular VMs.
Additional node pools can be created for workload pods: some regular, some spot.
Workloads are assigned to node pools via Kubernetes node selectors and taints/tolerations.
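As an illustration of that last point, here is a small Python sketch that generates a pod manifest which tolerates the Spot taint and requires a Spot node. It relies on the kubernetes.azure.com/scalesetpriority label and taint that AKS applies to Spot node pools. The pod name, container name, and image are hypothetical placeholders.

```python
import json

# Label and taint key that AKS applies to Spot node pools.
SPOT_KEY = "kubernetes.azure.com/scalesetpriority"


def spot_tolerant_pod(name: str, image: str) -> dict:
    """Build a pod manifest that is allowed on, and pinned to, Spot nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{"name": "worker", "image": image}],
            # Tolerate the NoSchedule taint carried by Spot nodes.
            "tolerations": [
                {"key": SPOT_KEY, "operator": "Equal",
                 "value": "spot", "effect": "NoSchedule"}
            ],
            # Require nodes labeled as Spot.
            "affinity": {"nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{"matchExpressions": [
                        {"key": SPOT_KEY, "operator": "In", "values": ["spot"]}
                    ]}]
                }
            }},
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output can be piped to "kubectl apply -f -".
    print(json.dumps(spot_tolerant_pod("batch-worker", "example.azurecr.io/worker:1.0"), indent=2))
```

Critical workloads simply omit the toleration, which keeps them off Spot nodes.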

3. Enabling and Configuring the Cluster Autoscaler and spot VM nodepool

Cluster Autoscaler automatically adjusts the number of nodes to meet pod scheduling needs. On AKS:

You can enable autoscaler when creating a node pool:

az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name spotpool \
  --priority Spot \
  --eviction-policy Delete \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 10

Each pool can scale independently by setting different min/max counts.

4. Using the Priority Expander

The priority expander lets you influence which node pool the cluster autoscaler scales first. For example, you might want the autoscaler to scale spot pools before on-demand pools to optimize for cost, but fall back to regular VMs if no spot capacity is available.

In AKS, set the expander profile to priority and define your node pool priorities:

Add an autoscaler profile to cluster creation or update, specifying expander=priority:

az aks update \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --cluster-autoscaler-profile expander=priority

Configure a ConfigMap with pool patterns and their numeric priorities: higher number = higher priority.
The ConfigMap must be named cluster-autoscaler-priority-expander and placed in the kube-system namespace.
Below is an example of a configuration providing higher priority for spot VM node pools (higher number):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    50:
      - .*spot.*
    10:
      - .*on-demand.*
    1:
      - .*catch-all.*

Apply the above ConfigMap:

kubectl apply -f <configmap-file>.yaml
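To build intuition for how those priorities play out, here is a simplified Python model of the selection step. It is not the autoscaler's actual code: the node group names are hypothetical, and groups that match no pattern are simply ignored in this sketch.

```python
import re

# Same priorities as the ConfigMap above: higher number = higher priority.
PRIORITIES = {
    50: [r".*spot.*"],
    10: [r".*on-demand.*"],
    1: [r".*catch-all.*"],
}


def pick_node_groups(priorities, scalable_groups):
    """Return (priority, groups) for the highest priority with a match.

    scalable_groups lists the node groups that can currently scale up,
    for example because capacity is available in them.
    """
    for priority in sorted(priorities, reverse=True):
        matches = [
            group for group in scalable_groups
            if any(re.search(pattern, group) for pattern in priorities[priority])
        ]
        if matches:
            return priority, matches
    return None, []


if __name__ == "__main__":
    # Spot capacity available: the spot pool wins.
    print(pick_node_groups(PRIORITIES, ["spot-pool-a", "on-demand-pool-a"]))
    # Spot pool cannot scale: fall back to on-demand.
    print(pick_node_groups(PRIORITIES, ["on-demand-pool-a"]))
```

The second call shows the fallback: once the spot group drops out of the scalable set, the next priority tier is chosen.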

5. Best Practices for Spot Node Pool Scaling

Eviction Handling: Create disruption budgets and readiness checks so pods are safely rescheduled if spot nodes are reclaimed.
Hedge capacity across SKUs: Create multiple node pools with different VM families and SKUs to increase the probability of spot capacity being available.
Pod Scheduling: Use nodeSelector, affinity, and taints/tolerations to schedule tolerant workloads onto spot nodes, while keeping critical workloads on on-demand pools.
Zone Redundancy: Distribute node pools across multiple availability zones to minimize simultaneous spot interruptions.
Cost Monitoring: Use Azure monitoring to track node evictions (VMSS activity logs) and pool utilization, and right-size your pools regularly.
Horizontal Pod Autoscaler: Combine with HPA to orchestrate scaling at both node and pod level for optimal elasticity.

6. Failover and Reliability Patterns

If spot capacity runs out or spot nodes are evicted:

The priority expander allows the autoscaler to fall back gracefully, scaling the next-eligible (on-demand) pool.
Application workloads can continue on on-demand nodes, maintaining uptime and minimizing interruption.
Use multiple node pools with appropriate affinity/anti-affinity to balance workloads and risk.

7. Clean-Up and Observation

When scaling down, the autoscaler will cordon and drain underutilized nodes, maintaining minimum pool counts and moving pods as needed. Always validate behavior in test environments before onboarding production workloads.

Summary: Implementing spot node pool scaling in Azure AKS, combined with the cluster autoscaler’s priority expander, brings cost savings and elasticity to Kubernetes workloads—while protecting critical applications from unwanted interruptions through flexible, intelligent failover strategies.

Performance Tuning AKS for Network Intensive Workloads

Tue, 15 Jul 2025 00:00:00 GMT

As more intelligent applications are deployed and hosted on Azure Kubernetes Service (AKS), network performance becomes increasingly critical to ensuring a seamless user experience. For example, a chatbot server running in an AKS cluster needs to handle high volumes of network traffic with low latency, while retrieving contextual data — such as conversation history and user feedback from a database or cache, and interacting with a LLM (Large Language Model) endpoint through prompt requests and streamed inference responses.

In this blog post, we share how we conducted simple benchmarks to evaluate and compare network performance across various VM (Virtual Machine) SKUs and series. We also provide recommendations on key kernel settings to help you explore the trade-offs between network performance and resource usage.

Benchmark

Our methodology involves conducting tests and measurements to identify key factors affecting network performance for applications running on AKS. We simulated a common use case: a pair of pods communicating in TCP protocol across two different nodes within the same cluster. We measured various performance metrics, including throughput, round-trip time (RTT), and retransmission rate in the presence of packet loss at high bandwidth usage.

In our experiment, iperf3 was run as a container within Kubernetes pods in the host network namespace on selected nodes to generate single or multiple TCP streams simulating application traffic. All underlying Kubernetes nodes had identical hardware specifications (48 CPU cores and 192 GB of memory) and were running Linux 5.X kernels. During each test, we also monitored CPU and memory usage of both client and server containers to make sure iperf3 was not resource constrained.

Hardware Matters Most

We compared the test results of Azure's older generation Dsv3 series and newer generation Dsv6 series on AKS, and observed a significant difference in network performance:

Up to 35% higher throughput for Azure Dsv6 compared to Dsv3 when trying to maximize network bandwidth usage.

3x to 6x lower RTT for Azure Dsv6 compared to Dsv3 when tests are limited to the same network bandwidth usage.

The TCP retransmission rate remained consistently at 0% on Azure Dsv6, while it was noticeably higher on Dsv3.

The primary reason for the significant leap in network performance on Dsv6 is that its next-generation network interface, the Microsoft Azure Network Adapter (MANA), supports jumbo frames with an MTU of 9000.

A higher MTU (Maximum Transmission Unit) and MSS (Maximum Segment Size) allow TCP traffic to transmit more data per packet, reducing the total number of packets that need to be processed by network interface and OS kernel. This leads to fewer hardware interrupts, less buffering, and reduced overhead in data movement, ultimately improving overall network efficiency.
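A quick back-of-the-envelope calculation shows the size of the effect. The sketch below assumes IPv4 and TCP headers of 20 bytes each with no options, so the MSS is the MTU minus 40 bytes. The helper names are ours.

```python
IPV4_HEADER_BYTES = 20  # assumption: no IP options
TCP_HEADER_BYTES = 20   # assumption: no TCP options


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one packet for a given MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


def segments_needed(payload_bytes: int, mtu: int) -> int:
    """Number of full-size TCP segments needed to carry the payload."""
    mss = mss_for_mtu(mtu)
    return -(-payload_bytes // mss)  # integer ceiling division


if __name__ == "__main__":
    one_gib = 1024 ** 3
    standard = segments_needed(one_gib, 1500)
    jumbo = segments_needed(one_gib, 9000)
    print(f"MTU 1500: {standard} packets, MTU 9000: {jumbo} packets")
    print(f"reduction: {standard / jumbo:.1f}x fewer packets")
```

Under these assumptions, moving from MTU 1500 to MTU 9000 cuts the packet count for the same payload by roughly a factor of six.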

We realized there is no way to fully abstract hardware from the application — fundamentally, application network performance is dictated by the capabilities of the underlying CPU, memory, and network interfaces on the physical host. Identifying the appropriate VM SKUs and series is essential to ensure the application meets its networking performance requirements.

The MTU on Azure Dsv6-series VMs is not set to 9000 by default. You can manually adjust the MTU for a specific network interface using the following Linux command:

ip link set $device mtu 9000

To enable Jumbo Frames across all nodes in an AKS cluster, you can deploy a DaemonSet, such as this example. In addition, you can optimize MTU settings dynamically using Path MTU Discovery.

Kernel Settings Tuning

If migrating to a newer VM SKU or series like Dsv6 isn’t a viable short-term option due to quota limitations or compatibility concerns, kernel-level tuning remains a practical path to improving network performance. In our testing, we explored several tuning parameters and found that adjusting the ring buffer size on the network interface card (NIC) had the most significant impact. As shown below, increasing the NIC receive buffer size from the default 1024 to 2048 packets on a Dsv3 VM resulted in a ~20% improvement in network throughput.

We also observed reductions in both TCP retransmission rate and round-trip time (RTT) under a combined bandwidth load of 15 Gbps. The benefit was especially pronounced when traffic was split across three parallel TCP streams, each operating at 5 Gbps.

TCP retransmissions often occur when the sender or receiver buffer is too small to handle a surge of packets, resulting in packet drops. For example, if the NIC name is enP28334s1 (typical name for an SR-IOV network interface), you can check how many packets were dropped due to ring buffer overflow with:

ethtool -S enP28334s1 | grep rx_out_of_buffer

If rx_out_of_buffer shows a non-zero value, it indicates that a ring buffer overflow has occurred. To increase the ring buffer size (e.g., to 2048 packets), use:

sudo ethtool -G enP28334s1 rx 2048
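If you want to automate this check across nodes, the counter can be extracted from the ethtool output with a few lines of Python. This is a minimal sketch: the sample text is hypothetical, and the parser only assumes the usual "name: value" layout of ethtool -S output.

```python
def parse_ethtool_stats(text: str) -> dict:
    """Parse "name: value" lines from `ethtool -S` output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        # Skip headers such as "NIC statistics:" that have no numeric value.
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats


def ring_buffer_overflowed(stats: dict) -> bool:
    """True if the NIC reported drops caused by a full receive ring."""
    return stats.get("rx_out_of_buffer", 0) > 0


if __name__ == "__main__":
    # Hypothetical sample; real output has many more counters.
    sample = """NIC statistics:
         rx_packets: 1048576
         rx_out_of_buffer: 42
    """
    stats = parse_ethtool_stats(sample)
    print(stats["rx_out_of_buffer"], ring_buffer_overflowed(stats))
```

In practice the text would come from running ethtool -S against the interface on each node, for example through subprocess.run.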

As with the MTU setting, you can deploy a DaemonSet to enforce a consistent ring buffer size across all nodes in an AKS cluster, following this example.

It's important to note that increasing the NIC ring buffer size has memory usage implications. Allocating larger buffers means the operating system reserves more memory specifically for handling network traffic, which reduces the amount of memory available for user-space applications. For example, doubling the receive buffer size from 1024 to 2048 packets per descriptor across multiple queues and interfaces can lead to a non-trivial (around 100 MB or more) increase in kernel memory usage — especially on systems with high network concurrency or multiple high-throughput NICs. While this tradeoff can significantly improve network performance, especially under high load, it should be balanced against the memory demands of the application workload running on the same VM.
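To see where a figure of that magnitude can come from, here is a rough estimate in Python. The inputs are assumptions for illustration only: one receive queue per CPU core on a 48-core node and 2048-byte receive buffers. Actual queue counts and buffer sizes depend on the NIC driver.

```python
def extra_ring_memory_bytes(old_entries: int, new_entries: int,
                            queues: int, buffer_bytes: int) -> int:
    """Additional receive-buffer memory when growing the ring on every queue."""
    return (new_entries - old_entries) * queues * buffer_bytes


if __name__ == "__main__":
    # Assumed values: 48 receive queues, 2048-byte buffers per ring entry.
    extra = extra_ring_memory_bytes(1024, 2048, queues=48, buffer_bytes=2048)
    print(f"{extra} bytes = {extra / 2**20:.0f} MiB per interface")
```

With these assumed inputs, doubling the ring adds about 96 MiB per interface, which is in line with the estimate above.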

Conclusion

Achieving optimal network performance on AKS requires a combination of choosing the right VM SKUs and series and fine-tuning kernel-level parameters. By understanding the trade-offs and evaluating different options, AKS users can unlock meaningful improvements in network performance and application responsiveness. Next we plan to expand our benchmarks to include newer Linux 6.x kernels, which incorporate recent networking enhancements. We also intend to measure and analyze the performance implications of container networking solutions (such as Cilium), and to explore additional optimization strategies. Please let us know if there are specific scenarios, workloads, or networking configurations you’d like to see included in future work.

Boosting PostgreSQL performance on AKS

Wed, 09 Jul 2025 00:00:00 GMT

PostgreSQL is one of the most popular stateful workloads on Azure Kubernetes Service (AKS). Thanks to the support of a vibrant community, we now have a strong PostgreSQL operator ecosystem that makes it easier for everyone to self-host PostgreSQL on Kubernetes.

One of the leading operators driving PostgreSQL adoption is CloudNativePG, an open-source PostgreSQL operator built from the ground up for Kubernetes. CloudNativePG embraces Kubernetes-native patterns for stateful workloads. It offers built-in support for high availability, rolling updates, backup orchestration, and automated failover---all using native Kubernetes resources.

This tight Kubernetes integration leads to more predictable behavior, easier observability, and a smoother developer experience. CloudNativePG is also a CNCF-hosted project, developed in the open with wide community participation, and backed by upstream PostgreSQL contributors. For teams looking to run production-grade PostgreSQL in Kubernetes without managing custom scripts or sidecars, CloudNativePG provides a straightforward and maintainable approach without retrofitting traditional PostgreSQL management practices into container environments.

However, optimizing PostgreSQL infrastructure performance can still be challenging. In this post, we'll explain how storage affects PostgreSQL performance and share how we dramatically improved performance on AKS by using local NVMe storage with Azure Container Storage.

The big bottleneck: storage

PostgreSQL's performance is tightly bound to storage I/O. To operate optimally, PostgreSQL performs frequent disk writes for transaction logs (WAL) and checkpoints. Even in predominantly read-heavy workloads, any write operations---including inserts, updates, and deletes---must wait for the storage subsystem to confirm that the WAL has been safely persisted, and any delay in storage throughput or latency directly affects query performance and database responsiveness.

This is because PostgreSQL is designed with strong durability guarantees: every transaction must be written to the Write-Ahead Log (WAL) before it is considered committed. If the underlying storage is slow or experiences high latency, these commit operations become a major bottleneck, causing application slowdowns and increased response times.

Additionally, PostgreSQL periodically performs checkpoints, flushing dirty pages from memory to disk to ensure data consistency and enable crash recovery. During these checkpoints, large bursts of I/O can occur, and if the storage cannot keep up, it can lead to increased query latency or even temporary stalls. Background processes like autovacuum and replication also generate significant I/O, further amplifying the dependency on fast, low-latency storage.

In high-concurrency environments, the situation is even more pronounced: with many clients issuing transactions simultaneously, the database's ability to process requests is often limited not by CPU or memory, but by how quickly it can read from and write to disk. As a result, storage IOPS (input/output operations per second) and latency become the primary factors that determine PostgreSQL's throughput and overall performance, especially for write-heavy or latency-sensitive workloads.

Storage options on AKS

AKS supports a variety of storage options through the Azure Disk CSI driver. Let's dive into a few of them:

Premium SSD: These general-purpose SSDs are widely used and support availability features like ZRS (Zone Redundant Storage) and fast snapshotting. They're ideal for many workloads, but IOPS and throughput are still constrained by the VM's limits on remote disk access.
Premium SSD v2: An evolution of Premium SSDs, this option decouples storage size from performance, letting you scale IOPS and throughput independently. With up to 80,000 IOPS and 1,200 MB/s throughput, they're more cost-efficient for I/O-intensive workloads.
Ultra Disk: Azure's highest-performing remote disk offering, Ultra Disk supports up to 400,000 IOPS and 10,000 MB/s. However, achieving this full performance requires very large VMs such as the Standard_E112ibds_v5 because remote disk performance is constrained by the VM's vCPU count and remote disk controller limits. This means you're forced to pay for massive compute resources just to unlock storage performance—even if your workload doesn't need 112 vCPUs.

Benchmarking: PostgreSQL performance with Azure Container Storage

This is where local NVMe storage fundamentally changes the game. Unlike remote disks that scale performance with VM size, local NVMe drives deliver their full performance regardless of vCPU count because they're physically attached to the VM and bypass the remote disk controller entirely.

Consider this stark contrast:

Ultra Disk: To get 400,000 IOPS, you need a 112-vCPU VM (Standard_E112ibds_v5)
Local NVMe: An 8-vCPU (Standard_L8s_v3) VM delivers 400,000 IOPS out of the box

That's 14x fewer vCPUs for the same IOPS performance, dramatically reducing your compute costs. The trade-off is that you're shifting data durability from the storage layer to the application layer, relying on PostgreSQL's WAL-based replication and backup orchestration instead of underlying storage persistence (which we address in the next section).

Historically, Kubernetes couldn't easily use local NVMe disks due to their ephemeral nature and lack of built-in abstraction. Azure Container Storage solves this: it aggregates local NVMe devices across nodes into a storage pool and exposes them through a Kubernetes-compatible storage class. You can reference this class directly when creating PersistentVolumeClaims.
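For illustration, a claim against such a storage class could look like the manifest generated below. The PVC fields are standard Kubernetes, but the storage class name, claim name, and size are hypothetical placeholders. Use the class name that your Azure Container Storage installation actually creates.

```python
import json


def nvme_pvc(name: str, storage_class: str, size: str) -> dict:
    """Build a PersistentVolumeClaim that requests a volume from a storage class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # "example-local-nvme" is a placeholder, not a real class name.
    print(json.dumps(nvme_pvc("pgdata", "example-local-nvme", "100Gi"), indent=2))
```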

info

Hmm, is this something I can do with the Azure Disks CSI driver? Not quite, which is what makes Azure Container Storage unique! Azure Container Storage lets you use advanced block storage products such as local NVMe, temp SSDs, and Azure Elastic SAN to create the PVCs you need to run stateful applications on Kubernetes.

What about persistence?

Yes, local NVMe is ephemeral—data is lost if the node is deallocated or the cluster shuts down. Azure Container Storage provides an annotation for volumes that enables a persistence-aware mode, which helps Kubernetes treat these ephemeral volumes more predictably. This doesn't change the nature of the storage—it simply signals to the platform and your team that you're opting into these trade-offs knowingly.

So how do you make use of it for something as critical as a database?

This is where application-level resilience comes in. PostgreSQL's Write-Ahead Log (WAL) ensures data durability by recording every change before it's applied, enabling point-in-time recovery even if the primary disk is lost. Modern operators like CloudNativePG leverage this by providing PostgreSQL-native high availability (HA), replicating data across nodes. We show you how to set up automatic backups to Azure Blob Storage in our official AKS PostgreSQL deployment guide.

While data on any single NVMe volume is not durable, WAL-based replication and backup orchestration ensure that PostgreSQL can recover to a consistent state and continue functioning even if a node goes down. This approach provides insulation from underlying storage failures and allows you to balance performance with resilience.

Benchmark results

So why go through all this trouble? Because the performance is worth it! Using local NVMe with Azure Container Storage can provide 15,000+ transactions per second and <4.5ms average latency on Standard_L16s_v3 virtual machines.

info

If you're curious about our exact benchmark procedure, you can read CloudNativePG's official benchmarking documentation.

In a nutshell, we initialized the database via kubectl cnpg pgbench stgres-cluster -n postgres --job-name pgbench-init -- -i -s 1000. This command generates a database of 100,000,000 records using a scale factor of 1000. Then, we ran the benchmark command kubectl cnpg pgbench stgres-cluster -n postgres --job-name pgbench -- -c 64 -j 4 -t 10000 -P, which runs the test with 64 concurrent clients, four worker threads, and 10,000 transactions per client (for a total of 640,000). The -P option is a nice added touch for monitoring the progress of the test live. This simulates a medium-to-high load on a production-like system, stressing both throughput and latency.

We also benchmarked our setup on larger L-series VM SKUs, and discovered performance increased to 26,000 transactions per second with 2.3ms average latency on Standard_L64s_v3.
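A bit of arithmetic helps sanity-check these numbers. The sketch below computes the size of the generated dataset, the total number of transactions in a run, and the throughput implied by a given average latency when every client always has one transaction in flight. The helper names are ours, and applying the 64-client setting to both VM sizes is an assumption.

```python
ROWS_PER_SCALE_UNIT = 100_000  # pgbench_accounts rows created per scale-factor unit


def pgbench_rows(scale_factor: int) -> int:
    """Rows in pgbench_accounts for a given scale factor (-s)."""
    return scale_factor * ROWS_PER_SCALE_UNIT


def total_transactions(clients: int, per_client: int) -> int:
    """Total transactions for -c clients each running -t transactions."""
    return clients * per_client


def implied_tps(clients: int, avg_latency_ms: float) -> float:
    """Throughput if each client always has one transaction in flight."""
    return clients / (avg_latency_ms / 1000.0)


if __name__ == "__main__":
    print(pgbench_rows(1000))              # rows generated by -i -s 1000
    print(total_transactions(64, 10_000))  # -c 64 -t 10000
    print(round(implied_tps(64, 4.5)))     # latency bound quoted for Standard_L16s_v3
    print(round(implied_tps(64, 2.3)))     # latency quoted for Standard_L64s_v3
```

With 64 clients, an average latency just under 4.5 ms implies a little over 14,000 transactions per second, and 2.3 ms implies roughly 28,000. Both are in the same range as the reported results.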

At a technical level, PostgreSQL benefits from local NVMe because its architecture involves frequent small writes (WAL), background checkpointing, and page flushing---all of which are extremely sensitive to disk latency. Local NVMe delivers consistent microsecond-scale latency and high IOPS, giving PostgreSQL the I/O headroom it needs to scale under pressure.

We encourage you to benchmark PostgreSQL and Azure Container Storage yourself. For your convenience, we're sharing our setup scripts if you want to give it a shot. We encourage you to experiment with different virtual machines in your nodepool, different PostgreSQL parameters, and other AKS features! Some caveats to remember as you test different scenarios, particularly super large virtual machines:

Our benchmarking tool, pgbench, also stresses CPU and memory, so it's not purely I/O-bound.
High availability via CloudNativePG introduces synchronization overhead that limits maximum throughput. Once storage IOPS exceed a certain threshold, HA and replication become the new bottlenecks.
Those of you with sharp eyes might notice that we separate pgdata and the WAL onto different volumes when using Premium SSD/Premium SSD v2 and the Azure Disk CSI driver. This is a recommended best practice from CloudNativePG, with one key reason being that this configuration doubles the total pool of IOPS by creating two separate disks. But with local NVMe-backed storage pools, all I/O is hitting the same set of NVMe drives, so separate volumes don't add performance.

What's next

We're continuing to invest in simplifying and accelerating stateful workloads on Kubernetes. In our next release of Azure Container Storage, we're reducing PostgreSQL latency even further and increasing throughput, all while keeping the developer experience seamless.

If you're running PostgreSQL on AKS and are looking to squeeze out more performance without overpaying for compute, local NVMe + Azure Container Storage might be the best setup you haven't tried yet.

Want to try everything out for yourself? Visit our newly renovated guide on deploying PostgreSQL in Azure Kubernetes Service with CloudNativePG, as well as the benchmarking scripts we used for this blog post.

From 7B to 70B+: Serving giant LLMs efficiently with KAITO and ACStor v2

Tue, 08 Jul 2025 00:00:00 GMT

XL-size large language models (LLMs) are quickly evolving from experimental tools to essential infrastructure. Their flexibility, ease of integration, and growing range of capabilities are positioning them as core components of modern software systems.

Massive LLMs power virtual assistants and recommendations across social media, UI/UX design tooling and self-learning platforms. But how do they differ from your average language model? And how do you get the best bang for your buck running them at scale?

Let’s unpack why large models matter and how Kubernetes, paired with NVMe local storage, accelerates intelligent app development.

Aren’t LLMs large enough?

Large models aren’t just big for show — they’re smart, efficient, and versatile thanks to:

Efficient transformer architectures powering them
Compatibility with high-performance inference libraries like vLLM
Ability to handle long-context memory effectively, allowing them to score well on tasks like instruction following, coding, math, and multilingual understanding (check out HuggingFace's Open LLM Leaderboard to learn more)

Running these XL-size LLMs at scale can be more cost-effective than relying on commercial APIs, if you play your cards right.

When self-hosting big models makes sense 💡

Self-hosting LLMs on Kubernetes is growing in popularity for organizations that:

Run lots of inference: batch jobs, chatbots, agents, or apps
Have access to commercial GPUs like NVIDIA H100 or A100
Want to avoid per-token API fees (which easily skyrocket costs at scale)
Are keen to fine-tune or customize the model, which is something closed APIs usually block
Have sensitive or proprietary data to keep ring-fenced and protected from accidental exposure through third-party logs

Using KAITO for self-hosting

Self-hosting with the Kubernetes AI Toolchain Operator (KAITO) helps you achieve all this and more! KAITO is a CNCF Sandbox project that simplifies and optimizes your inference and tuning workloads on Kubernetes. By default, it integrates with vLLM, a high-throughput LLM inference engine optimized for serving large models efficiently.

vLLM supports quantized models, reducing memory/GPU requirements drastically without major accuracy trade-offs. KAITO’s modular, plug-and-play setup allows you to go from model selection to production-grade API quickly:

Out-of-the-box OpenAI-compatible API means you can swap in KAITO with minimal application-side changes.
Built-in support for prompt formatting, batching, and streaming responses.
Self-hosting with KAITO on your AKS cluster ensures data never leaves your organization's controlled environment, ideal for highly regulated industries (finance, healthcare, defense) where cloud LLM APIs may be restricted due to compliance.
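To show what "OpenAI-compatible" means in practice, here is a minimal Python sketch that builds a chat completion request using only the standard library. The /v1/chat/completions path follows the OpenAI API convention. The service URL and model name are hypothetical placeholders, so substitute the address of your own inference service.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical in-cluster service address and model name.
    request = build_chat_request("http://workspace-example", "example-model", "Hello!")
    print(request.get_method(), request.full_url)
    # To send it from a pod that can reach the service:
    # with urllib.request.urlopen(request) as response:
    #     print(json.load(response)["choices"][0]["message"]["content"])
```

Because the request shape matches the OpenAI API, existing client code typically only needs its base URL changed.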

The catch? Managing huge model weights 🏋️‍♂️

Some models come with massive weight files, and even when fully quantized can weigh hundreds of gigabytes (based on model type and version). Handling and deploying such model serving workloads isn’t trivial, especially if you want reproducible, scalable workflows on Kubernetes.
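A rough rule of thumb for the weight size is the parameter count multiplied by the bytes stored per parameter. The sketch below is an estimate only: it ignores tokenizer files, metadata, and any layers kept at higher precision, and the function name is ours.

```python
def weights_size_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight size in GB (10^9 bytes) for a given precision."""
    return params_billion * bits_per_param / 8


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"70B parameters at {bits}-bit: ~{weights_size_gb(70, bits):.0f} GB")
```

For a 70-billion-parameter model, 16-bit weights come to about 140 GB, and even 4-bit quantization still leaves about 35 GB to distribute to every node that serves the model.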

KAITO balances simplicity and efficiency by using container images to manage most LLMs - but it can become difficult to distribute large model files that result in heavy-weight images.

Luckily, KAITO inferencing now supports these model weights with the power of a local file cache and striped Non-Volatile Memory Express (NVMe) PersistentVolume managed by Azure Container Storage.

note

A local file cache significantly reduces latency during model downloads and reads, enhances reliability with persistent storage and avoids repetitive downloads after container restarts - all without extra storage fees!

What is Azure Container Storage?

Azure Container Storage (ACStor) is a cloud-based volume management, deployment, and orchestration service built natively for containers. It integrates with Kubernetes, allowing you to dynamically and automatically provision persistent volumes to store data for stateful applications running on Kubernetes clusters.

The latest version of this project, ACStor v2, is purpose-built for AI and high-performance computing (HPC) workloads that demand ultra-fast data processing on local NVMe disks. It delivers performance close to raw NVMe speeds, all while providing seamless Kubernetes-native operations.

We’re excited to share that these capabilities are now available in KAITO through early integration. When deployed with KAITO, ACStor v2 provisions striped volumes across local NVMe disks, serving large model files efficiently.

How does this work?

ACStor v2 aggregates local NVMe drives (available by default in several Azure GPU-enabled VM sizes) as Kubernetes PersistentVolumes (PVs)
Bypasses network storage bottlenecks for ultra-low latency, high IOPS, and high throughput, which is ideal for AI inference at scale
Abstracts multiple NVMe drives into (as low as) a single persistent volume, so pods automatically land on nodes with maximum fast storage
Supports StatefulSets to handle stateful workloads and edge data pipelines smoothly

Why is ACStor v2 ideal for distributed inference with KAITO?

Benefit	Why it Matters
⚡Performance	Max throughput & IOPS of local NVMe SSDs
🎯Data Locality	Pods get scheduled where storage is available, avoiding failures
💸 Zero added costs	Local storage used by default to avoid external storage fees
📦Kubernetes Native	Full CSI support, PVC lifecycle management, and integration with AKS/VMSS
🔁 Repeatability	Ideal for ML pipelines and reproducible runs

To test this out, we ran a performance benchmark on the Llama-3.1-8B-Instruct LLM:

We see over a 5X improvement in model file loading performance when using ACStor v2 with a locally striped NVMe volume, compared to using an ephemeral OS disk!

Ready to dive in? Try LLaMA 3.3 70B today on your AKS cluster

Using this KAITO inference workspace, you can leverage local NVMe storage to serve a model as large as the LLaMA 3.3 70B LLM (140 GB in size) more easily and efficiently than ever on AKS. The Llama 3.3 70B tuned model is optimized for multilingual dialogue use cases and outperforms many of the available open source and closed chat models on common industry benchmarks. You get fast inference, scalable deployments, and complete control over your AI stack, now available in open-source KAITO v0.5.0 and coming soon to the AKS managed add-on experience.

For teams looking to adopt Azure Container Storage v2 directly in their own Kubernetes environment, a standalone Kubernetes extension is scheduled for release by September 2025. Stay tuned for more updates!

We want your feedback

Want your favorite XL-size large language model supported with KAITO? Jump in and submit your feature requests to the upstream project roadmap!

What's New?! Guidance Updates for Stateful Workloads on AKS

Tue, 10 Jun 2025 00:00:00 GMT

Helping you deploy on AKS

Building on our initial announcement for Deploying Open Source Software on Azure, we are excited to announce that we have expanded our library of technical best practice deployment guides for stateful workloads on AKS. We have developed a comprehensive guide for deploying Kafka on AKS, and updated our Postgres guidance with additional storage considerations for data resiliency, performance, and cost. We have also added Terraform templates to our MongoDB and Valkey guides for automated deployments.

These guides are designed to help you accelerate the integration of some of the most critical and heavily adopted open source projects onto Azure, utilizing best practices and optimizations for AKS. Jump to our collection of Stateful and AI guides below.

Introducing Guidance for Deploying Apache Kafka on AKS with the Strimzi Operator

Apache Kafka is an open source distributed event streaming platform designed to handle high-volume, high-throughput, and real-time streaming data. It is widely used by thousands of companies for mission-critical applications, but managing and scaling Kafka clusters on Kubernetes can be challenging. Strimzi simplifies the deployment and management of Kafka on Kubernetes by providing a set of Kubernetes Operators and container images that automate complex Kafka operational tasks.

The Kafka on Azure Kubernetes Service (AKS) guide covers essential storage and compute considerations, ensuring your Kafka deployment meets your needs. Additionally, we provide guidance for tuning the Java Virtual Machine (JVM), which is critical for optimal Kafka broker and controller performance.

On top of basic installation steps, the Kafka guide provides best practice recommendations for configuring networking, monitoring, and managed identity.

Updates to our current Guidance

Aside from writing new guidance, we have also reviewed our existing portfolio of guides to ensure they are accurate and up to date. As new services and features have been developed, we have provided guidance to help you seamlessly integrate them with your stateful applications.

Azure Container Storage integration

Azure Container Storage is a volume management, deployment, and orchestration service built for Kubernetes and based on OpenEBS. Azure Container Storage offers your clusters a significant advantage by effortlessly provisioning high-performance persistent volumes on ephemeral local NVMe drives, surpassing typical CSI drivers.

We have revised our PostgreSQL guidance and Kafka guidance to include Azure Container Storage alongside other storage configuration options, so you can make an informed choice on what best suits your objectives.

For the most durable data resiliency, you can use the Azure Disks CSI driver with Premium SSD disks which provide zone-redundant storage resiliency to your PostgreSQL deployment.
For the best cost savings at scale, you can use Premium SSD v2 disks, which let you set disk capacity independently from performance settings.
For maximum performance, Azure Container Storage and ephemeral disks can provide the extremely low sub-millisecond latency and high input/output operations per second (IOPS) that transactional database workloads benefit from.

Automate Deployment Guides with Terraform templates

We also listened to your requests for Terraform templates alongside the Azure CLI guidance so that you can use Infrastructure as Code for your deployments.

We have updated both MongoDB and Valkey guides with Terraform. Additionally, we have developed an Azure Verified Module for deploying a production-grade AKS Cluster.

Deploy Stateful and AI workloads on Azure Kubernetes Service

Postgres - Create infrastructure for deploying a highly available PostgreSQL database on AKS
Apache Airflow - Create the infrastructure for deploying Apache Airflow on Azure Kubernetes Service (AKS)
Apache Kafka - Prepare the infrastructure for deploying Kafka on Azure Kubernetes Service (AKS)
Ray - Deploy a Ray cluster on Azure Kubernetes Service (AKS)
Valkey - Create the infrastructure for running a Valkey cluster on Azure Kubernetes Service (AKS)
MongoDB - Create the infrastructure for running a MongoDB cluster on Azure Kubernetes Service (AKS)
Kubernetes AI Toolchain Operator (KAITO) - Deploy KAITO on AKS using Terraform

What's next?

We will continue expanding our library of technical best practice guidance and update the remaining guides to also include Terraform templates. Soon, we will develop more guides for databases like Cassandra, which will also include Azure Container Storage integration guidance, helping you choose the best storage option for your needs.

Using Stream Analytics to Filter AKS Control Plane Logs

Fri, 30 May 2025 00:00:00 GMT

While AKS does not provide access to the cluster's managed control plane, it does provide access to the control plane component logs via diagnostic settings. The easiest option to persist and search this data is to send it directly to Azure Log Analytics. However, those logs contain a large amount of data, which can make Log Analytics cost prohibitive. Alternatively, you can send all the data to an Azure Storage Account, but then searching and alerting can be challenging.

To address this challenge, one option is to stream the data to Azure Event Hub. You can then use Azure Stream Analytics to extract the events that you deem important and store the rest in cheaper storage (e.g., Azure Storage) for potential future diagnostic needs.

In this walkthrough we'll create an AKS cluster, enable diagnostic logging to Azure Event Hub, and then demonstrate how to use Azure Stream Analytics to pick out some key records.

Cluster & Stream Analytics Setup

In this setup, the cluster will be a very basic single node AKS cluster that will simply have diagnostic settings enabled. We'll also create the Event Hub instance that will be used in the diagnostic settings.

# Set some environment variables
RG=LogFilteringLab
LOC=eastus2
CLUSTER_NAME=logfilterlab
NAMESPACE_NAME="akscontrolplane$RANDOM"
EVENT_HUB_NAME="logfilterhub$RANDOM"
DIAGNOSTIC_SETTINGS_NAME="demologfilter"

# Create a resource group
az group create -n $RG -l $LOC

# Create the AKS Cluster
az aks create \
-g $RG \
-n $CLUSTER_NAME \
-c 1

# Get the cluster credentials
az aks get-credentials -g $RG -n $CLUSTER_NAME

# Create an Event Hub Namespace
az eventhubs namespace create --name $NAMESPACE_NAME --resource-group $RG -l $LOC

# Create an event hub
az eventhubs eventhub create --name $EVENT_HUB_NAME --resource-group $RG --namespace-name $NAMESPACE_NAME

# Capture the resource IDs needed for the diagnostic settings
AKS_CLUSTER_ID=$(az aks show -g $RG -n $CLUSTER_NAME -o tsv --query id)
EVENT_HUB_NAMESPACE_ID=$(az eventhubs namespace show -g $RG -n $NAMESPACE_NAME -o tsv --query id)

Next, we'll enable the diagnostic settings for the AKS cluster. In the example below, for simplicity, we will only enable the 'kube-audit' logs. You can edit this to include any additional control plane logs you'd like to leverage; the AKS monitoring documentation lists the available log categories.

# Apply the diagnostic settings to the AKS cluster to enable Kubernetes audit log shipping
# to our Event Hub
az monitor diagnostic-settings create \
--resource $AKS_CLUSTER_ID \
-n $DIAGNOSTIC_SETTINGS_NAME \
--event-hub $EVENT_HUB_NAME \
--event-hub-rule "${EVENT_HUB_NAMESPACE_ID}/authorizationrules/RootManageSharedAccessKey" \
--logs '[ { "category": "kube-audit", "enabled": true, "retentionPolicy": { "enabled": false, "days": 0 } } ]'
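Before moving on, it can help to confirm that the diagnostic setting was actually attached to the cluster. The snippet below is a minimal sketch: `list_diag_settings` is a hypothetical helper name (not part of the original walkthrough) that wraps `az monitor diagnostic-settings list`, and it assumes the Azure CLI is installed and logged in.

```shell
# Hypothetical helper: list the diagnostic settings attached to a resource.
# $1 = full Azure resource ID (for example, the value of $AKS_CLUSTER_ID)
list_diag_settings() {
  az monitor diagnostic-settings list \
    --resource "$1" \
    --output json
}

# Usage, after running the commands above:
#   list_diag_settings "$AKS_CLUSTER_ID"
```

The output should include an entry named after `$DIAGNOSTIC_SETTINGS_NAME` with the 'kube-audit' category enabled.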

Stream Analytics

As we'll use Stream Analytics to filter through the log messages for what we want to capture, we'll need to create a Stream Analytics Job. This job will take the Event Hub as its input source, run a query, and send the query results to an output target. This output target can be one of a number of options, but for the purposes of our test we'll write the filtered records out to a Service Bus Queue, which we can watch in real time.

We have the Event Hub already, so now let's create the Azure Service Bus Queue and then the Stream Analytics Job to tie it all together.

Create the Service Bus Queue

SERVICE_BUS_NAMESPACE_NAME=kubecontrolplanelogs
SERVICE_BUS_QUEUE_NAME=kubeaudit

# Create the service bus namespace
az servicebus namespace create --resource-group $RG --name $SERVICE_BUS_NAMESPACE_NAME --location $LOC

# Create the service bus queue
az servicebus queue create --resource-group $RG --namespace-name $SERVICE_BUS_NAMESPACE_NAME --name $SERVICE_BUS_QUEUE_NAME

Stream Analytics Job

For the Stream Analytics Job we'll switch over to the portal, so go ahead and open https://portal.azure.com and navigate to your resource group.

Click on the 'Create' button at the top of your resource group
Search for 'Stream Analytics Job'
Click 'Create' on the Stream Analytics Job search result
Leave all defaults, but provide a name under the 'Instance Details' section and then click 'Review + Create'
After the validation is complete, just click 'Create'. This typically completes very quickly.
Click on 'Go to Resource' or navigate back to your resource group and click on the Stream Analytics Job you just created.
In the stream analytics job, expand 'Job topology' and then click on 'Inputs' so we can add our Event Hub input
Click on 'Add Input' and select 'Event Hub'
The Event Hub's new input creation pane should auto-populate with your Event Hub details as well as default to creation of a new access policy, but verify that all of the details are correct and then click 'Save'.
Now we need to attach the Service Bus we created as the output target, so under 'Job topology' click on 'Outputs'.
In the 'Outputs' window, click on 'Add output' and select 'Service Bus queue'
Again, it should bring up a window with the queue configuration details already pre-populated, but verify all the details and update as needed and then click 'Save'.
To process the records from AKS we'll need to parse some JSON, so we need to add a function to the Stream Analytics Job to parse JSON. Under 'Job topology' click on 'Functions'.
In the functions window, click on 'Add Function' and then select 'JavaScript UDF' (JavaScript User Defined Function)
In the 'Function alias' name the function 'jsonparse' and in the editor window add the following:
```
function main(x) {
  var json = JSON.parse(x);
  return json;
}
```
Click on 'Save' to save the function
Now, under 'Job topology' in the stream analytics job, click on 'Query' to start adding a query. When loaded, the inputs, outputs and functions should pre-populate for you.
We'll first create a basic query to select all records and ship them to the output target. In the query window paste the following, updating the input and output values to match the names of your input and output. The function name should be the same unless you changed it.
```
WITH DynamicCTE AS (
SELECT UDF.jsonparse(individualRecords.ArrayValue.properties.log) AS log
FROM [logfilterhub28026]
CROSS APPLY GetArrayElements(records) AS individualRecords
)
SELECT *
INTO [kubeaudit]
FROM DynamicCTE
```
Click 'Save Query' at the top of the query window
In the top left of the query window, click on 'Start Job' to kick off the stream analytics job.

tip
You may need to click out of and then back into the 'Query' tab before the 'Start job' button becomes active.
In the 'Start job' window, leave the start time set to 'Now' and click 'Start'
Click on the 'Overview' tab in the stream analytics job, and refresh every once in a while until the job 'Status' says 'Running'
Navigate back to your Resource Group and then click on your service bus namespace.
Assuming everything worked as expected you should now be seeing a lot of messages coming through the Service Bus Queue
Click on the queue at the bottom of the screen to open the Queue level view
At the queue level, click on 'Service Bus Explorer' to view the live records
To view the records already created, click on 'Peek from start' and then choose a record to view
Navigate back to the stream analytics job and click on 'Stop job' to stop sending records through to the service bus.

Great! You should now have a very basic stream analytics job that takes the control plane 'kube-audit' log from an AKS cluster through Event Hub, queries that data and then pushes it to a Service Bus Queue. While this is great, the goal is to keep only the records we care about, so let's move on to that!

Set up a test workload to trigger audit log entries

To test out our stream analytics query, we need some test data we can filter on. Let's create some requests to the API server that will be denied. To do that we'll create a service account with no rights and then create a test pod using that service account. We'll then use the service account token to try to reach the Kubernetes API server.

# Create a new namespace
kubectl create ns demo-ns

# Create a service account in the namespace
kubectl create sa demo-user -n demo-ns

# Create a test secret
kubectl create secret generic demo-secret -n demo-ns --from-literal 'message=hey-there'

# Check that you can read the secret
kubectl get secret demo-secret -n demo-ns -o jsonpath='{.data.message}'|base64 --decode

# Create a test pod to try to query the API server
kubectl run curlpod --rm -it \
--image=curlimages/curl -n demo-ns \
--overrides='{ "spec": { "serviceAccount": "demo-user" }  }' -- sh

#############################################
# From within the pod run the following
#############################################
# Point to the internal API server hostname
export APISERVER=https://kubernetes.default.svc

# Path to ServiceAccount token
export SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount

# Read this Pod's namespace
export NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)

# Read the ServiceAccount bearer token
export TOKEN=$(cat ${SERVICEACCOUNT}/token)

# Reference the internal certificate authority (CA)
export CACERT=${SERVICEACCOUNT}/ca.crt

# Explore the API with TOKEN
# This call will pass
curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api

# This call to get secrets will fail
curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api/v1/namespaces/$NAMESPACE/secrets/

# Now run it under a watch to trigger continuous deny errors
watch 'curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api/v1/namespaces/$NAMESPACE/secrets/'
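To see why the secrets call is denied, you can ask the API server directly whether the service account holds that permission. The sketch below is an illustration rather than part of the original walkthrough: `sa_can_i` is a hypothetical helper that wraps `kubectl auth can-i` with impersonation (`--as`), and it is meant to be run from your workstation (not from inside the pod) with an identity that is allowed to impersonate service accounts.

```shell
# Hypothetical helper: check whether a service account may perform an action.
# $1 = verb, $2 = resource, $3 = namespace, $4 = service account name
sa_can_i() {
  kubectl auth can-i "$1" "$2" \
    --namespace "$3" \
    --as "system:serviceaccount:$3:$4"
}

# Usage, from a separate terminal on your workstation:
#   sa_can_i get secrets demo-ns demo-user
```

Because no role was bound to `demo-user`, the check is expected to answer that the action is not allowed, which matches the 403 responses the pod receives.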

Update Stream Analytics to Look for Forbidden Requests

So, we have a user trying to execute requests against our cluster for which they are not authorized. We can easily update our stream analytics query so that it captures only the forbidden requests against our namespace.

Navigate back to your Stream Analytics job in the Azure Portal
If the job is still running, make sure you click 'Stop job' as you cannot edit queries while the job is running
Click on the 'Query' tab

Update the query as follows, so that it selects only the audit messages about our 'demo-ns' namespace that also have a status code of 403 (Forbidden)

tip

Be sure that your 'FROM' still points to your Event Hub input target and that your 'INTO' still points to your Service Bus output target.

 WITH DynamicCTE AS (
 SELECT UDF.jsonparse(individualRecords.ArrayValue.properties.log) AS log
 FROM [logfilterhub28026]
 CROSS APPLY GetArrayElements(records) AS individualRecords
 )
 SELECT *
 INTO [kubeaudit]
 FROM DynamicCTE
 WHERE log.objectRef.namespace = 'demo-ns'
 AND log.responseStatus.code = 403

Click 'Save query'
Once the save completes click 'Start Job'

tip
Again, you may need to click out of and then back into the 'Query' tab before the 'Start job' button becomes active.
Go back to the 'Overview' tab in your Stream Analytics job, and refresh until the job 'Status' shows as 'Running'.

Once your job is running, you should be able to navigate back to your Service Bus Queue and watch the messages flowing through.
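If you prefer the command line to the portal for a quick check, the queue's active message count can also be read with the Azure CLI. This is a minimal sketch under the assumption that the Azure CLI is logged in; `queue_active_count` is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: print the number of active messages in a queue.
# $1 = resource group, $2 = Service Bus namespace, $3 = queue name
queue_active_count() {
  az servicebus queue show \
    --resource-group "$1" \
    --namespace-name "$2" \
    --name "$3" \
    --query "countDetails.activeMessageCount" \
    --output tsv
}

# Usage:
#   queue_active_count "$RG" "$SERVICE_BUS_NAMESPACE_NAME" "$SERVICE_BUS_QUEUE_NAME"
```

Running it a few times while the `watch` loop is active should show the count growing as new 403 records arrive.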

If you've built up a large backlog of messages, you can change from 'Peek Mode' to 'Receive Mode'.

Then when you click on 'Receive Messages', you can set the messages to 'ReceiveAndDelete' and specify the number of messages you'd like to receive and delete to clear out your queue.

Conclusion

Congratulations! You now have an end-to-end fully working Stream Analytics instance that can filter AKS control plane logs to extract specific messages. You can manipulate the diagnostic settings to add additional logs to the input and modify the query to extract the exact messages critical to your cluster's health and security. This is an extremely versatile solution that is also capable of handling log records of multiple clusters across your enterprise.

Azure VM Generations and AKS

Wed, 23 Apr 2025 00:00:00 GMT

What are Virtual Machine Generations?

If you use Azure, you are probably familiar with virtual machines. What you may not know is that Azure offers two generations of virtual machines!

Before going further, let's first break down virtual machines. Azure virtual machines are offered in various "sizes," which are defined by the amount and type of each resource allocated, such as CPU, memory, storage, and network bandwidth. These resources are tied to a portion of a physical server's hardware capabilities. A single physical server's resources may be divided across many different VM size series or configurations.

As the physical hardware ages and newer components become available, older hardware and VMs get retired, while newer generation hardware and VM products are made available.

In this blog, we will go over Generation 1 and newer Generation 2 virtual machines. Both have their own use cases, and picking the right one to suit your workloads is critical in ensuring you get the best possible experience, capabilities, and cost.

Virtual Machine Generation Overview

Azure VM sizes (v5 and older) have largely supported both Generation 1 and Generation 2 VMs. This page gives a thorough breakdown on VM series and the generation they support.

The latest v6 VMs (whether they are Intel, AMD, or ARM) will exclusively support Generation 2 VMs.

Comparing Generation 1 & Generation 2

Generation 2 VMs offer exclusive features over Generation 1 VMs, such as increased memory, improved CPU performance, support for NVMe disks, and support for Trusted Launch. With some exceptions, it is generally recommended to migrate to Generation 2 VMs to take advantage of the newest features and functionalities in Azure VMs.

The table below summarizes some key differences between Generation 1 and Generation 2 VMs. For a more detailed comparison, please refer to this page.

Feature	Generation 2 VM	Generation 1 VM
Firmware Interface	UEFI (Unified Extensible Firmware Interface)-based boot (Additional security features and faster boot times)	BIOS (Basic Input/Output System)-based boot (Legacy)
Latest v6 VM Support	v6 VMs support Generation 2 VMs	v6 VMs do NOT support Generation 1 VMs
Trusted Launch	Can enable Trusted Launch, which includes protections like virtual Trusted Platform Module (vTPM)	Can NOT enable Trusted Launch
NVMe Interface Support	Supports NVMe disks, which require an NVMe-enabled Generation 2 image	Does NOT support NVMe disks

Implications for your Virtual Machines

If you are already running on Generation 2 VMs, you are all set to deploy on the majority of Azure VMs, including the latest v6 VMs. You can also enable Trusted Launch and the NVMe Interface.

If you are running on Generation 1 VMs, you can continue running on most v5 and older Azure VMs. Migration to Generation 2 VMs is recommended, especially if any of the following requirements apply:

You require Trusted Launch for your workloads
You require NVMe interface for your workloads
You want/need to migrate to the latest v6 VMs

VM Generation Support on AKS

Generation 2 Default

AKS supports both Generation 1 and 2 VMs with all operating systems on AKS. The VM size and operating system that you select when creating an AKS node pool determines which VM Generation you will use.

When creating Linux node pools on AKS, the default will be a Generation 2 VM unless the VM size does not support it.
When creating Windows Server 2025 node pools on AKS, the default will be a Generation 2 VM unless the VM size does not support it.
When creating Windows Server 2019 and Windows Server 2022 node pools on AKS, the default will be Generation 1 VM unless the VM size does not support it. To use a Generation 2 VM, you must add --aks-custom-headers UseWindowsGen2VM=true during node pool creation.
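As an illustration of the last case, the sketch below adds a Windows Server 2022 node pool that opts in to Generation 2 using the custom header named above. `add_windows_gen2_pool` is a hypothetical helper, the VM size is passed in by the caller, and the chosen size is assumed to support Generation 2.

```shell
# Hypothetical helper: add a Windows Server 2022 node pool on Generation 2 VMs.
# $1 = resource group, $2 = cluster name,
# $3 = node pool name (Windows pool names are limited to 6 characters),
# $4 = VM size that supports Generation 2
add_windows_gen2_pool() {
  az aks nodepool add \
    --resource-group "$1" \
    --cluster-name "$2" \
    --name "$3" \
    --os-type Windows \
    --os-sku Windows2022 \
    --node-vm-size "$4" \
    --aks-custom-headers UseWindowsGen2VM=true
}

# Usage (all values are placeholders):
#   add_windows_gen2_pool myResourceGroup myCluster wgen2 Standard_D4s_v5
```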

For more information on Generation 2 default behavior on AKS, see AKS documentation.

For a list of supported VM sizes for Generation 1 and Generation 2, please refer to the table on this page.

Generation 1 VM Retirements

When a VM size or series reaches its retirement date, the VM will be deallocated. VM deallocation means that your AKS node pool may experience breakage.

If you would like to confirm whether your Generation 1 VM sizes are retired or are being retired, please search in the Azure Updates page.

Migrating From Retired VM Sizes

If you are using a VM size that is retiring/retired, to prevent any potential disruption to your service, it is recommended to resize your node pool(s) to a supported VM size. AKS does not currently support transitioning to a new VM size within the same node pool, so a new node pool will be created and workloads moved to it during the resizing process.

What VM sizes are my nodes?

To determine the size of your nodes, navigate to the Azure Portal, access your Resource Group, and then select your AKS resource. Within the "Overview" tab, you will find the size of your node pool.

Alternatively, you may run this command in the Azure CLI. Make sure you fill in the names of your resource group and cluster name:

az aks nodepool list \
--resource-group <resource-group-name> \
--cluster-name <cluster-name> \
--query "[].{Name:name, VMSize:vmSize}" \
--output table
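Once you know the VM size, you may also want to know which generations that size supports. The sketch below reads the `HyperVGenerations` capability reported by `az vm list-skus`; `vm_size_generations` is a hypothetical helper, and because `--size` matches partial names, the query simply takes the first matching SKU.

```shell
# Hypothetical helper: print the Hyper-V generations a VM size supports.
# $1 = Azure region, $2 = VM size name
vm_size_generations() {
  az vm list-skus \
    --location "$1" \
    --size "$2" \
    --resource-type virtualMachines \
    --query "[0].capabilities[?name=='HyperVGenerations'].value | [0]" \
    --output tsv
}

# Usage (values are placeholders):
#   vm_size_generations eastus2 Standard_D4s_v5
```

A result that lists only V1 indicates a size limited to Generation 1, which is worth checking against the retirement notices discussed above.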

Resizing your node pools

After you determine the appropriate node pool(s) to take action on, you can resize your node pool(s) to a supported VM size.

When resizing a node pool, you'll go through the process of creating a new node pool with your desired VM size while the existing node pool is cordoned, drained, and ultimately removed.

Depending on the needs of your infrastructure and workloads, when resizing your node pool, please make sure that you pick a new VM size that will best suit your needs.
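The resize flow described above can be sketched as a short sequence of CLI calls. This is an outline under stated assumptions, not the official procedure: `resize_nodepool` is a hypothetical helper, nodes are selected through the `agentpool` label that AKS applies to them, and the drain flags shown may need adjusting for your workloads. If the old pool is your only system node pool, the new pool would also need to be created in system mode first.

```shell
# Hypothetical helper: move workloads from an old node pool to a new one.
# $1 = resource group, $2 = cluster name, $3 = old pool name,
# $4 = new pool name, $5 = new VM size
resize_nodepool() {
  # 1. Create the replacement node pool with the desired VM size
  az aks nodepool add \
    --resource-group "$1" --cluster-name "$2" \
    --name "$4" --node-vm-size "$5" || return 1

  # 2. Stop new pods from being scheduled on the old pool's nodes
  kubectl cordon --selector "agentpool=$3" || return 1

  # 3. Evict the pods so they are rescheduled onto the new pool
  kubectl drain --selector "agentpool=$3" \
    --ignore-daemonsets --delete-emptydir-data || return 1

  # 4. Remove the old node pool
  az aks nodepool delete \
    --resource-group "$1" --cluster-name "$2" --name "$3"
}

# Usage (all values are placeholders):
#   resize_nodepool myResourceGroup myCluster oldpool newpool Standard_D4s_v5
```

Each step stops the sequence on failure so that the old node pool is only deleted after the drain has completed.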

Note that these instructions are in reference to node pools. If you are using Virtual Machines node pools, VMs should all be Generation 2 by default.

Enhancing Your Operating System's Security with OS Security Patches in AKS

Tue, 22 Apr 2025 00:00:00 GMT

Traditional patching and the need for Managed patching

Operating System (OS) security patches are critical for safeguarding systems against vulnerabilities that malicious actors could exploit. These patches help ensure your system remains protected against emerging threats. Traditionally, customers have relied on nightly updates, such as unattended upgrades in Ubuntu or Automatic Guest OS Patching at the virtual machine (VM) level. However, when kernel security packages were updated, a host machine reboot was often required, typically managed using tools like kured.

This approach, while effective, introduced challenges. Untested or uncontrolled security packages occasionally caused outages, emphasizing the need for a more reliable and managed patching mechanism at the node level. Additionally, maintaining reboot daemonsets like kured added operational overhead for many customers. Ideally, customers prefer managed Kubernetes services, such as Azure Kubernetes Service (AKS), to handle OS security patching comprehensively and seamlessly.

Automatic Node OS Security Patching mechanisms at AKS

AKS provides two managed and tested mechanisms to deliver the latest security packages to your Node Operating System.

Automatic Node Image Channel - AKS updates nodes weekly with a newly patched VHD for security and bug fixes. This update follows maintenance windows and upgrade configuration settings. Automatic Node image upgrades are supported as long as the cluster's Kubernetes minor version is in support. These AKS-tested node images are fully managed and applied with safe deployment practices.

OS Security Patch Channel - Some customers need only the security packages for their OS without additional bug fixes and updates. The OS Security Patch channel provides a fully managed, attended, security-only solution for the node OS. It reimages nodes only when necessary and otherwise applies live security patches with zero disruption, respecting planned maintenance windows and following Azure safe deployment practices.

Choosing OS Security Patch or Automatic Node Image Channel?

Choosing between the OS Security Patch channel and the Automatic Node Image channel depends on your specific requirements and operational constraints. Here's a breakdown based on common scenarios:

Speed of Patching is Critical: For urgent CVE fixes, the OS Security Patch channel applies security patches within 5 days, while the Automatic Node Image channel takes approximately 1-2 weeks.
Require Comprehensive Security Fixes and Bug Fixes: For environments where both security patches and additional bug fixes or binaries are essential, the Node Image channel is ideal. It provides a more comprehensive update approach, ensuring both security and functionality improvements.
Workload Sensitive to Multiple Reimages in a Month: If your workloads cannot tolerate frequent disruptions, the OS Security Patch channel is preferable. It minimizes disruptions by focusing solely on security packages, reimaging nodes 60-70% less frequently, and performing live security patching during other times.
Using Windows Environment for Running Workloads: The OS Security Patch channel is not yet available for Windows environments. For Windows environments, the Automatic Node Image channel is the recommended option for consuming OS security fixes.
Operating in Capacity-Constrained Regions or SKUs: In regions or SKUs with limited capacity, the OS Security Patch channel is beneficial as it primarily performs live security patching, avoiding the need for surge nodes unless a reimage is required. If using the Automatic Node Image channel, you can set surge nodes to zero by configuring the MaxUnavailable setting, which is especially useful in capacity-constrained environments.

By carefully evaluating these factors, you can select the channel that best aligns with your operational needs and workload requirements.

How to Enable OS Security Patch Channel?

You can enable the OS Security Patch Channel using the API, CLI, or the AKS portal. For detailed CLI configuration steps, refer to this guide.
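For orientation, the CLI path comes down to setting the node OS upgrade channel on the cluster. The sketch below assumes an existing cluster and a logged-in Azure CLI; `enable_security_patch_channel` and `show_auto_upgrade_profile` are hypothetical helper names used only for this illustration.

```shell
# Hypothetical helper: switch a cluster's node OS upgrade channel to SecurityPatch.
# $1 = resource group, $2 = cluster name
enable_security_patch_channel() {
  az aks update \
    --resource-group "$1" \
    --name "$2" \
    --node-os-upgrade-channel SecurityPatch
}

# Hypothetical helper: show the cluster's auto-upgrade settings to confirm the change.
show_auto_upgrade_profile() {
  az aks show \
    --resource-group "$1" \
    --name "$2" \
    --query "autoUpgradeProfile" \
    --output json
}

# Usage (values are placeholders):
#   enable_security_patch_channel myResourceGroup myCluster
#   show_auto_upgrade_profile myResourceGroup myCluster
```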

Best Practices when using OS Security Patch Channel

The following practices help you get the most from the OS Security Patch channel.

Configure Maintenance Windows: Configure a Planned Maintenance window to apply security patches during periods of low activity. This minimizes the impact on workloads and ensures that updates are applied seamlessly. The OS Security Patch channel does all of its security patching, that is, both live security patching and any reimages, during the maintenance window.
Configure Cluster Auto-Upgrade Channel: To maximize the benefits of OS security patches, it is recommended to enable the SecurityPatch channel alongside the Kubernetes cluster auto upgrade channel. This dual-channel approach ensures that both the control plane and node pools are kept up-to-date with the latest security patches.
Configure PDB and MaxSurge: Use Pod Disruption Budgets (PDB) to protect critical workloads during updates. Adjust patching speed and parallelism with the MaxSurge setting.
Configure Upgrade Monitoring: Regularly monitor the status of ongoing upgrades to ensure that patches are applied successfully. Utilize tools such as the AKS Communication Manager to get periodic notifications on OS Patching updates.
Use Release Tracker: AKS release tracker provides region by region updates on what security patch version runs in a particular region. These real-time updates are crucial for closely tracking CVEs. There is also an AKS CVE status tab.
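The first and third practices above can be sketched with two commands. This is an illustration under stated assumptions rather than a recommended configuration: the helper names are hypothetical, the weekly cadence and 4-hour duration are example values, and `aksManagedNodeOSUpgradeSchedule` is the maintenance configuration name AKS uses for node OS upgrades.

```shell
# Hypothetical helper: create a weekly maintenance window for node OS patching.
# $1 = resource group, $2 = cluster name,
# $3 = day of week (e.g. Sunday), $4 = start time in HH:MM
add_node_os_window() {
  az aks maintenanceconfiguration add \
    --resource-group "$1" \
    --cluster-name "$2" \
    --name aksManagedNodeOSUpgradeSchedule \
    --schedule-type Weekly \
    --day-of-week "$3" \
    --interval-weeks 1 \
    --start-time "$4" \
    --duration 4   # example value, in hours
}

# Hypothetical helper: protect a workload with a Pod Disruption Budget.
# $1 = namespace, $2 = PDB name, $3 = label selector, $4 = minimum available pods
protect_with_pdb() {
  kubectl create poddisruptionbudget "$2" \
    --namespace "$1" \
    --selector "$3" \
    --min-available "$4"
}

# Usage (all values are placeholders):
#   add_node_os_window myResourceGroup myCluster Sunday 02:00
#   protect_with_pdb default web-pdb app=web 2
```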

What Is Coming Next for OS Security Patch?

The future of OS security patches promises several enhancements aimed at improving the patching process and ensuring even greater security:

To learn more about the intricacies of OS security patching, as well as other interesting AKS topics, refer to this YouTube video.

Simplifying InfiniBand on AKS

Fri, 11 Apr 2025 00:00:00 GMT

High performance computing (HPC) workloads, like large-scale distributed AI training and inferencing, often require fast, reliable data transfer and synchronization across the underlying compute. Model training, for example, requires shared memory across GPUs because the parameters and gradients need to be constantly shared. For models with billions of parameters, the available memory in a single GPU node may not be enough, so "pooling" the memory across multiple nodes also requires high memory bandwidth due to the sheer volume of data involved. A common way to achieve this at scale is with a high-speed, low-latency network interconnect technology called InfiniBand (IB).

If you are not yet familiar with InfiniBand, imagine this interconnect network as an incredibly fast highway for data transfer, and the data as cars that need to travel from one city to another. On a typical highway (like traditional IP networks), cars must follow speed limits, obey traffic signals, and sometimes stop in traffic jams, which slows them down. InfiniBand networking, on the other hand, can be considered a highway built just for race cars: it has no speed limits, no traffic lights, and wide lanes, allowing the cars to zoom at top speed without any interruptions. This makes data travel incredibly fast and efficient.

There are two ways to use this fast InfiniBand highway:

Remote Direct Memory Access (RDMA) over InfiniBand: Similar to driving a race car on the fast InfiniBand highway. It maximizes speed and performance but may require specific application design and networking configuration to operate these race cars on the race car highway.
IP over InfiniBand (IPoIB): This is comparable to regular cars using the race car highway - may be easy to implement and compatible with off-the-shelf applications, but you don't get the full speed benefits.

Choosing between these two approaches depends on whether you need compatibility and ease (regular cars on a race car highway) or top-notch speed and performance (race cars on a race car highway).

RDMA over InfiniBand versus IPoIB

RDMA over InfiniBand enables data transfer directly between the memory of different machines without involving the CPU (as shown in the diagram below for GPUs), which significantly reduces latency and improves throughput.

To use this approach, your application needs to be RDMA-aware, meaning that an RDMA API or RDMA-aware message passing framework is used to enable high performance communication. Check out this RDMA programming on NVIDIA guide to learn more.

If your application is not RDMA-aware, IPoIB alternatively can be used to provide an IP network emulation layer on top of InfiniBand networks. IPoIB enables your application to transmit data over IB but may lower performance and increase latency as it relies on the IP protocol stack.

The table below summarizes these key differences:

Feature	RDMA over IB	IPoIB
Data Transfer	Direct Memory Access	IP Protocol Over IB
Latency	Very Low	Higher
Throughput	Very High	Lower
CPU Involvement	Minimal	Significant
Complexity	More specialized (requires RDMA awareness)	Low (easier to implement in existing applications)

InfiniBand on AKS

In the Kubernetes world, there is a range of tools and plugins that support HPC workloads and InfiniBand - so where is a good place to start?

Choosing the right compute in your node pool is an important building block. Consider using Azure HBv3 and HBv4 HPC VM sizes or ND series GPU VM sizes with built-in NVIDIA networking which are all suitable for HPC applications.

When using NVIDIA VM sizes, the Network Operator and GPU Operator are useful tools that package networking and device specific components for ease of installation on Kubernetes. However, setting up your cluster for multi-node distributed HPC workloads may involve tasks such as installing a device plugin, networking component, and node labelling configurations.

As a cluster admin or AI service provider, these steps shouldn't increase time-to-value for your developers or end users! That's why we recently created an open-source InfiniBand on AKS guide to simplify and streamline this setup, walking you through step-by-step instructions to:

Determine the appropriate InfiniBand approach for your new or existing AKS application.
Configure the NVIDIA Network Operator with specific namespace and node labelling to properly schedule your pod deployments.
Apply out-of-box Kubernetes network policy and test the associated pod configuration to achieve maximum performance, resource efficiency, or support non-RDMA aware apps.
Optionally set up the NVIDIA GPU Operator and view an example pod configuration to claim both GPUs and InfiniBand resources created from your selected device plugin managed via Network Operator.
Validate the end-to-end setup with example test scripts on your chosen VM size.

The AKS team is actively building out this repository with examples and updates for new component versions. We encourage you to set up InfiniBand following these best practices for HPC workloads like AI training or inferencing at scale, starting in your AKS test environment(s). Please submit any feedback and/or enhancements by creating a new issue, or review/comment on existing issues in the project!

Optimize AKS Traffic with externalTrafficPolicy Local

Fri, 04 Apr 2025 00:00:00 GMT

Managing external traffic in Kubernetes clusters can be a complex task, especially when striving to maintain service reliability, optimize performance, and ensure seamless user experiences. With the increasing adoption of Kubernetes in production environments, understanding and implementing best practices for external traffic management when using the Azure Load Balancer has become essential.

In this blog, we delve into the intricacies of Kubernetes externalTrafficPolicy=Local setting and explore strategies to gracefully handle pod shutdowns, rolling updates, and pod distribution. By following these best practices, you can enhance the resilience and reliability of your services while optimizing resource utilization across your AKS clusters.

The Advantages of Local ExternalTrafficPolicy

The key differences between externalTrafficPolicy=Local and externalTrafficPolicy=Cluster in traffic routing are:

With Local, only nodes that have healthy pods for the service receive traffic. The node routes the traffic solely to the pods residing on it.
With Cluster, all nodes are behind the Azure Load Balancer. The incoming external traffic is distributed across all nodes in the cluster—even those that don’t have any pods for the service. Each node then routes the traffic internally to the available pods for that service.

These architectural differences give externalTrafficPolicy=Local some key benefits over Cluster mode, including:

1. Localized Impact of Node Downtime: The impact during node downtime is more confined. Specifically:

Local Mode: Traffic is affected only if the downed node is running a service pod, impacting that pod’s share of the traffic (i.e., 1/N, where N is the total number of service pods).
Cluster Mode: Not only is the traffic affected on the node running the service pod (1/N), but downtime on any other node also affects an additional 1/M of the traffic, where M is the total number of nodes.

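2. Preservation of the Client Source IP: The client’s original IP is maintained because traffic is delivered only to nodes hosting healthy pods and is never forwarded to a second node, so no source address translation is applied along the way. This is crucial for security, logging, and analytics.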

To learn more, you can refer to Kubernetes Traffic Policies.

How `externalTrafficPolicy=Local` Works

As detailed above, externalTrafficPolicy=Local routes traffic directly to nodes that host service pods and pass the health check. Below is an illustration of how this policy works in practice:

Let's look at how each of the following components works in Local mode:

When you set a Service's external traffic policy to Local in AKS, you'll see an additional field in the Service description: HealthCheck NodePort. This is a dedicated NodePort (e.g. port number in the 30000+ range) that Azure's Standard Load Balancer uses to verify which nodes have healthy pods for this Service.

Health Probe on Each Node: Azure automatically configures a health probe on the load balancer that targets the HealthCheckNodePort across all nodes in the LB's backend pool. Kubernetes ensures that this port only returns a successful response on nodes that are running at least one ready pod for the Service. Nodes with no pods for that Service will fail the health check.
Load Balancer Backend Pool: With externalTrafficPolicy=Local, all cluster nodes are listed in the LB's backend pool. But due to the health probes, nodes without a Service pod are marked unhealthy and won't receive traffic. Only nodes with healthy pods respond to the probe and remain in rotation. By contrast, in Cluster mode, every node responds (since even if it has no pod, kube-proxy will forward the traffic), so the LB sees all nodes as healthy. The kube-proxy component manages the health check port and ensures it responds in accordance with the selected traffic policy.
IPTables Rules: IPTables rules are configured to only forward incoming traffic from the Azure Load Balancer (ALB) directly to pods running on the same node. These rules ensure that traffic is never forwarded to other nodes. This localized traffic routing reduces latency and ensures that external connections continue to be served even during node update operations.

By combining these mechanisms, externalTrafficPolicy=Local provides a robust way to manage external traffic while maintaining source IP visibility and ensuring traffic is routed to healthy pods directly.
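The probe behavior described above can be modelled in a few lines of Python. This is a toy sketch, not kube-proxy's actual code: the function name and node names are hypothetical, and it captures only the rule that a node passes the probe under Local when it runs at least one ready pod, while under Cluster every node passes.

```python
from typing import Dict, List


def nodes_in_rotation(ready_pods_per_node: Dict[str, int], policy: str) -> List[str]:
    """Return the nodes that would pass the load balancer health probe.

    Toy model (an assumption, not kube-proxy code): ready_pods_per_node maps
    a node name to the number of ready pods it runs for a single Service.
    """
    if policy == "Cluster":
        # Every node answers the probe, because kube-proxy can forward
        # traffic to a pod on another node.
        return sorted(ready_pods_per_node)
    if policy == "Local":
        # Only nodes with at least one ready local pod answer successfully.
        return sorted(n for n, pods in ready_pods_per_node.items() if pods > 0)
    raise ValueError(f"unknown externalTrafficPolicy: {policy!r}")


cluster = {"node-a": 2, "node-b": 0, "node-c": 1}
print(nodes_in_rotation(cluster, "Local"))    # ['node-a', 'node-c']
print(nodes_in_rotation(cluster, "Cluster"))  # ['node-a', 'node-b', 'node-c']
```

In the Local case, node-b stays in the backend pool but receives no traffic because its probe fails.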

Best Practices to gracefully close existing connections and shut down service pods

Gracefully handling pod shutdowns is critical to maintaining service reliability and avoiding disruptions, especially in scenarios involving HTTP keep-alive connections or long-lived client sessions. Without a graceful shutdown process, external customers could see errors such as "connection refused" and "connection reset by peer" during node-related events.

Below are detailed best practices for how pods can handle Kubernetes-initiated termination requests (e.g., pod evictions or a scale-down) to ensure a smooth shutdown process.

Preventing New Connections to an Unhealthy Pod when using externalTrafficPolicy=Local

To avoid routing new requests to a pod that is in the process of shutting down, it is important to manage its health status effectively. The below image shows the timeline for a pod receiving a TERM signal and gracefully shutting down without impact on external traffic:

Immediate Health Check Response: As soon as a pod is marked for deletion, kube-proxy’s healthCheckNodePort begins returning an HTTP 503 error. This immediately signals external load balancers that the pod is no longer healthy and should stop receiving traffic.
Load Balancer Probe Delay: External load balancers will take a few seconds (up to 10 seconds) to detect the unhealthy status of a pod. During this time, the pod might still receive new connections.

To ensure your pods follow a similar timeline and shut down gracefully, make sure to assess the following:

Delay Termination: After receiving the TERM signal, we recommend your application wait for at least 10 seconds before proceeding with shutdown tasks (to address the Load balancer probe delay). This can be achieved using a preStop hook in Kubernetes.

note
When using the annotations service.beta.kubernetes.io/azure-load-balancer-health-probe-interval and/or service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe, consider changing the wait time to cover the values they configure (see documentation).
Announce Readiness as false: The application should update its readiness probe to indicate it is no longer ready to serve traffic after the above delay has completed. This allows Kubernetes to stop routing new connections to the pod after the load balancer removes it from the list of active pods.
Gracefully Close Existing Connections: The application should close all active connections, ensuring that no in-flight requests are dropped.
Exit the Process: Once all shutdown tasks are complete, the application should terminate its process cleanly.
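The four steps above could look roughly like the following in an application. This is a minimal sketch using only the Python standard library, not a production server: the class names, the /readyz path, and port 8080 are hypothetical, and the 10-second delay simply mirrors the recommendation above. It also adds the Connection: close header discussed in the next section for HTTP/1.1 clients.

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Assumption: mirrors the "at least 10 seconds" recommendation in the text.
LB_PROBE_DELAY_SECONDS = 10.0


class GracefulServer(ThreadingHTTPServer):
    # Non-daemon handler threads let server_close() wait for in-flight requests.
    daemon_threads = False

    def __init__(self, address, handler):
        super().__init__(address, handler)
        self.ready = threading.Event()        # what the readiness probe reports
        self.terminating = threading.Event()  # set once TERM has been received
        self.ready.set()


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"
    timeout = 5  # seconds before an idle keep-alive connection is dropped

    def do_GET(self):
        if self.path == "/readyz":  # hypothetical readiness endpoint
            ok = self.server.ready.is_set()
            status, body = (200, b"ready") if ok else (503, b"draining")
        else:
            status, body = 200, b"hello"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        if self.server.terminating.is_set():
            # Ask HTTP/1.1 clients not to reuse this connection.
            self.send_header("Connection", "close")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def graceful_shutdown(server: GracefulServer, delay: float = LB_PROBE_DELAY_SECONDS) -> None:
    server.terminating.set()  # TERM received: keep serving, but mark responses
    time.sleep(delay)         # 1. wait out the load balancer probe delay
    server.ready.clear()      # 2. report not-ready from now on
    server.shutdown()         # 3. stop accepting new connections...
    server.server_close()     #    ...and wait for in-flight requests to finish
    # 4. the caller can now exit the process


def serve(port: int = 8080) -> None:
    server = GracefulServer(("", port), Handler)
    # shutdown() must not run on the thread that is inside serve_forever().
    signal.signal(
        signal.SIGTERM,
        lambda *_: threading.Thread(target=graceful_shutdown, args=(server,)).start(),
    )
    server.serve_forever()
```

Calling serve() from the main thread starts the server. If the delay is implemented with a preStop hook instead, the in-process sleep can be shortened accordingly.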

Gracefully Closing Existing Connections

When a pod is shutting down (receiving the TERM signal), it is essential to ensure that existing client connections are closed properly to avoid abrupt disconnections or errors. Failing to handle this gracefully could result in clients encountering errors like connection reset by peer or connection refused, leading to a poor user experience and potential service disruptions.

For HTTP/1.1 Connections: After receiving the TERM signal, the server should include a Connection: close header in its responses to all active and new incoming requests. This signals to clients that the connection will be closed and should not be reused, allowing idle connections to terminate gracefully. Use Case: Applications serving REST APIs or web traffic where clients rely on persistent connections for performance optimization.

note
In HTTP/1.1, there is a potential race condition where the server might close an idle connection at the same time the client sends a new request. In such cases, the client must handle this scenario by retrying the request on a new connection.
For HTTP/2 Connections: The server should send a GOAWAY frame to notify clients that the connection is being closed. This allows clients to gracefully terminate the connection and retry requests on a new connection if necessary. Use Case: Applications using gRPC or HTTP/2 for high-performance communication between services or with external clients.

By implementing these best practices, you can minimize disruptions during pod shutdowns, maintain a seamless user experience, and ensure the reliability of your services in production environments.

Best Practices for Rolling Updates and Pod Rotation

While the above works when a pod is being taken down in isolation, it does not cover cases like upgrades and rolling restarts which require coordination between the time the pod goes down and a new one comes up, ready to serve traffic. To optimize pod rotation, add the following best practice to your deployment:

Set minReadySeconds: Configure the minReadySeconds parameter in your deployment (we recommend around 10 seconds) to introduce a delay before Kubernetes can mark the pod as "available". A pod is "available" once it has been ready long enough for the rolling upgrade to move on to the next pod, which differs from the "ready" state, where the application is merely ready to receive new connections. This buffer gives the load balancer enough time to register the new pod and start routing traffic to it, while also preventing Kubernetes from deleting the old pod prematurely.

By implementing this strategy, you can achieve smoother rolling updates and maintain a consistent user experience during application changes.
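As an illustration, the minReadySeconds setting and the preStop delay from the shutdown section might be combined in a Deployment like the one below. It is written as a Python dictionary and printed as JSON, which kubectl accepts as well as YAML. The name web, the image, port 8080, the /readyz path, and the 60-second grace period are placeholder assumptions, and the preStop command assumes the image contains a sleep binary.

```python
import json

# All names and values below are illustrative placeholders, not AKS defaults.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 3,
        # A new pod must stay ready this long before it counts as "available".
        "minReadySeconds": 10,
        "selector": {"matchLabels": {"app": "web"}},
        "template": {
            "metadata": {"labels": {"app": "web"}},
            "spec": {
                # Must exceed the preStop delay plus connection draining time.
                "terminationGracePeriodSeconds": 60,
                "containers": [{
                    "name": "web",
                    "image": "example.invalid/web:1.0",
                    "ports": [{"containerPort": 8080}],
                    "readinessProbe": {"httpGet": {"path": "/readyz", "port": 8080}},
                    # Delay TERM so the load balancer probe can notice first.
                    "lifecycle": {"preStop": {"exec": {"command": ["sleep", "10"]}}},
                }],
            },
        },
    },
}

print(json.dumps(deployment, indent=2))
```

The output can be saved to a file and applied with kubectl apply -f.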

Pod Distribution Best Practices

Achieving an even distribution of pods across nodes is important for load balancing and resource utilization, especially for pods receiving external traffic via externalTrafficPolicy=Local. The diagram below demonstrates an example of uneven pod distribution which leads to imbalanced traffic across pods:

In this situation, even though the load balancer divides the traffic evenly between nodes, the pods on the node with 2 replicas serve 25% of the traffic each, while the pod in the single replica node serves the full 50% of the total traffic.
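The arithmetic behind these percentages can be checked with a short calculation. The sketch below assumes, as in the text, that the load balancer splits traffic evenly across nodes that host at least one pod, and that each node splits its share evenly across its local pods. The function and node names are hypothetical.

```python
from typing import Dict


def per_pod_share(pods_per_node: Dict[str, int]) -> Dict[str, float]:
    """Fraction of total traffic served by each pod on a node (Local policy)."""
    # Nodes without pods fail the health probe and receive no traffic.
    serving = {node: pods for node, pods in pods_per_node.items() if pods > 0}
    if not serving:
        return {}
    node_share = 1.0 / len(serving)  # even split across serving nodes
    return {node: node_share / pods for node, pods in serving.items()}


print(per_pod_share({"node-a": 2, "node-b": 1}))
# {'node-a': 0.25, 'node-b': 0.5}
```

With one pod per node, every pod would receive the same share, which is what the practices below aim for.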

Below are some best practices you can follow to evenly distribute pods across your nodes based on your workload needs:

Pod Anti-Affinity:
- Ensures pods with the same label are not scheduled on the same node.
- Requires at least as many nodes as pods, including surge pods during evictions or deployments.
Topology Spread Constraints:
- Distributes pods as evenly as possible (best effort) across available nodes and zones (specified with the topologyKey).
  
  note
  If your application requires strict spreading of pods (i.e., your ideal behavior is to leave a pod pending when an even spread is not possible), you can set whenUnsatisfiable to DoNotSchedule.
MatchLabelKeys:
- Provides fine-grained control over pod scheduling decisions.
- Ensures pods from different deployment versions do not overlap on the same nodes.
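A pod spec fragment that combines a topology spread constraint with matchLabelKeys might look like the sketch below, again written as a Python dictionary rendered to JSON. The helper name and the app label web are hypothetical. The pod-template-hash label is the one Kubernetes adds to pods created by a Deployment, and matchLabelKeys is assumed to be available on your cluster's Kubernetes version.

```python
import json
from typing import Any, Dict


def spread_constraint(app_label: str, strict: bool = False) -> Dict[str, Any]:
    """Build one topologySpreadConstraints entry that spreads pods across nodes."""
    return {
        "maxSkew": 1,
        "topologyKey": "kubernetes.io/hostname",  # spread per node
        # DoNotSchedule leaves pods pending rather than allowing extra skew.
        "whenUnsatisfiable": "DoNotSchedule" if strict else "ScheduleAnyway",
        "labelSelector": {"matchLabels": {"app": app_label}},
        # Only count pods from the same Deployment revision.
        "matchLabelKeys": ["pod-template-hash"],
    }


pod_spec_fragment = {"topologySpreadConstraints": [spread_constraint("web", strict=True)]}
print(json.dumps(pod_spec_fragment, indent=2))
```

Passing strict=False keeps the best-effort behavior described above.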

For additional best practices, you can refer to Deployment and Cluster Reliability Best Practices for AKS

Conclusion

To conclude, while externalTrafficPolicy=Local is a powerful option for optimizing traffic routing in AKS, it also requires careful planning of your pod lifecycle. With a professional, proactive approach to how your services handle start-up and shutdown, you can reap the benefits of Local traffic policy -- getting client IP transparency and efficient routing -- without sacrificing reliability during deployments or scaling events. Kubernetes gives us the knobs; it's up to us as SREs and engineers to turn them correctly for our particular workloads. Happy load balancing!
