
21 posts tagged with "Operations"

Operational best practices and management strategies for AKS.


AKS Control Plane Enhancements

· 5 min read
Kevin Thomas
Product Manager for Azure Kubernetes Service

Azure Kubernetes Service (AKS) now includes several control plane enhancements that help large clusters scale more efficiently and operate more reliably. These enhancements include streaming LIST responses, higher control plane resource limits, API server guard, and etcd defragmentation optimizations.

Dynamic Resource Allocation (DRA) with NVIDIA virtualized GPU (vGPU) on AKS

· 6 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Suraj Deshmukh
Software Engineer at Microsoft

Recently, dynamic resource allocation (DRA) has emerged as the standard mechanism to consume GPU resources in Kubernetes. With DRA, accelerators like GPUs are no longer exposed as static extended resources (for example, nvidia.com/gpu) but are dynamically allocated through DeviceClasses and ResourceClaims. This unlocks richer scheduling semantics and better integration with virtualization technologies like NVIDIA vGPU.

Virtual accelerators such as NVIDIA vGPU are commonly used for smaller workloads because they allow a single physical GPU to be securely partitioned across multiple tenants or apps. This is especially valuable for enterprise AI/ML development environments, fine-tuning, and audio/visual processing. vGPU enables predictable performance profiles while still exposing CUDA capabilities to containerized workloads.

On Azure, the NVadsA10_v5 virtual machine (VM) series is backed by the physical NVIDIA A10 GPU in the host and offers this resource model. Instead of assigning the entire GPU to a single VM, the vGPU technology is used to partition the GPU into multiple fixed-size slices at the hypervisor layer.

In this post, we’ll walk through enabling the NVIDIA DRA driver on a node pool backed by an NVadsA10_v5 series vGPU on Azure Kubernetes Service (AKS).

DRA with fractional A10 vGPU node on AKS

Running more with less: Multi-instance GPU (MIG) with Dynamic Resource Allocation (DRA) on AKS

· 8 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Product Manager at Microsoft

GPUs power a wide range of production Kubernetes workloads across industries. For example, media platforms rely on them for video encoding/transcoding, financial services firms run quantitative risk simulations, and research groups process and visualize large datasets. In each of these scenarios, GPUs significantly improve job throughput, yet individual workloads often consume only a portion of the available device.

By default, Kubernetes schedules GPUs as entire units; when a workload requires only a fraction of a GPU, the remaining capacity can remain unused. Over time, this leads to lower hardware utilization and higher infrastructure costs within a cluster.

Multi-instance GPU (MIG) combined with dynamic resource allocation (DRA) helps address this challenge. MIG partitions a physical GPU into isolated instances with dedicated compute and memory resources, while DRA enables those instances to be provisioned and bound dynamically through Kubernetes resource claims. Rather than treating a GPU as an indivisible resource, the cluster can allocate right-sized GPU partitions to multiple workloads at the same time!
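To make the utilization argument concrete, here is a minimal Python sketch that compares whole-GPU scheduling with partitioned scheduling. Everything in it is an illustrative assumption rather than an AKS or NVIDIA API: the workload sizes are made up, each GPU is assumed to expose seven equal slices, and the packing is a simple first-fit that ignores the placement rules real MIG profiles impose.

```python
def gpus_needed_whole(demands):
    """Whole-GPU scheduling: every workload occupies one full GPU."""
    return len(demands)


def gpus_needed_partitioned(demands, slices_per_gpu=7):
    """Partitioned scheduling: first-fit packing of slice requests onto GPUs.

    `demands` holds the number of slices each (hypothetical) workload requests.
    """
    free = []  # free slices remaining on each GPU already in use
    for need in sorted(demands, reverse=True):
        if need > slices_per_gpu:
            raise ValueError("a workload cannot span more than one GPU in this sketch")
        for i, remaining in enumerate(free):
            if remaining >= need:
                free[i] -= need
                break
        else:
            # No GPU in use has room, so bring another GPU into service.
            free.append(slices_per_gpu - need)
    return len(free)


def utilization(demands, gpus, slices_per_gpu=7):
    """Fraction of provisioned slices that workloads actually requested."""
    return sum(demands) / (gpus * slices_per_gpu)


demands = [1, 2, 3, 1, 2, 4, 1]  # slices requested by seven hypothetical workloads
whole = gpus_needed_whole(demands)
packed = gpus_needed_partitioned(demands)
print(whole, round(utilization(demands, whole), 2))
print(packed, round(utilization(demands, packed), 2))
```

With these made-up demands, whole-GPU scheduling provisions seven GPUs, while packing the same requests as slices fits them onto two.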

Deploying KubeVirt on AKS

· 8 min read
Product Manager at Microsoft
Senior Software Engineer at Microsoft

Many organizations still depend on virtual machines (VMs) to run applications in order to meet technical, regulatory, or operational requirements. While Kubernetes adoption continues to grow, not every workload can or should be redesigned for containers.

KubeVirt is a Cloud Native Computing Foundation (CNCF) incubating open-source project that allows users to run, deploy, and manage VMs in their Kubernetes clusters.

In this post, you will learn how KubeVirt lets you run, deploy, and manage VMs in AKS.

Azure Container Registry Repository Permissions with Attribute-based Access Control (ABAC)

· 7 min read
Johnson Shi
Senior Product Manager at Microsoft

Enterprises are converging on centralized container registries that serve multiple business units and application domains. Azure role-based access control (RBAC) uses role assignments to control access to Azure resources. Each Azure RBAC role assignment specifies an identity (who will gain permissions), an Azure role with Entra actions and data actions (what permissions are granted), and an assignment scope (which resources). For Azure Container Registry (ACR), traditional Azure RBAC scopes are limited to the subscription, resource group, or registry level—meaning permissions apply to all repositories within a registry.

In this shared registry model, traditional Azure RBAC forces an all-or-nothing choice: either grant broad registry-wide permissions or manage separate registries per team. Neither approach aligns with least-privilege principles or modern zero trust architectures.

Microsoft Entra attribute-based access control (ABAC) for Azure Container Registry solves this challenge. ABAC augments Azure RBAC with fine-grained conditions, enabling platform teams to scope permissions precisely to specific repositories or namespaces within a shared registry. CI/CD pipelines and Azure Kubernetes Service (AKS) clusters can now access only their authorized repositories, eliminating overprivileged authorization while maintaining operational simplicity.
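The difference between registry-wide and repository-scoped permissions can be sketched as a toy model in Python. This is a conceptual illustration only, not the Azure RBAC or ABAC API: the `RoleAssignment` class, the `is_allowed` function, the action names, and the prefix-matching rule are all hypothetical stand-ins for a role assignment that carries a repository condition.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class RoleAssignment:
    """Simplified stand-in for a role assignment with an optional condition."""
    identity: str
    actions: FrozenSet[str]            # hypothetical action names such as "pull"
    repo_prefix: Optional[str] = None  # None models a registry-wide grant


def is_allowed(assignments, identity, action, repository):
    """Return True if any assignment grants `action` on `repository`."""
    for a in assignments:
        if a.identity != identity or action not in a.actions:
            continue
        # The condition narrows the grant to repositories under a prefix.
        if a.repo_prefix is None or repository.startswith(a.repo_prefix):
            return True
    return False


assignments = [
    RoleAssignment("team-a-pipeline", frozenset({"pull", "push"}), repo_prefix="team-a/"),
    RoleAssignment("aks-cluster", frozenset({"pull"}), repo_prefix="team-a/"),
]
print(is_allowed(assignments, "aks-cluster", "pull", "team-a/web"))  # True
print(is_allowed(assignments, "aks-cluster", "pull", "team-b/api"))  # False
print(is_allowed(assignments, "aks-cluster", "push", "team-a/web"))  # False
```

In this model, the cluster identity can pull only from repositories under its own prefix, which is the least-privilege outcome the paragraph above describes.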

AKS cluster pulling from ACR with ABAC

Recommendations for container and security optimized OS options on Azure Kubernetes Service (AKS)

· 7 min read
Ally Ford
Product Manager 2 at Microsoft
Thilo Fromm
Principal Software Engineering Manager at Microsoft
Sudhanva Huruli
Principal Program Manager at Microsoft

Selecting an operating system for your Kubernetes deployments may appear straightforward; however, this decision can significantly influence both security and operational complexity. In this blog, we’ll share key recommendations to help you select a container optimized OS for your AKS deployments.

Collecting Custom Metrics on AKS with Telegraf

· 13 min read
Diego Casati
Microsoft App Innovation Global Blackbelt team

What if you need to collect your own custom metrics from workloads or nodes in AKS, but don't want to run a full monitoring stack? In this post, we will discuss how to integrate custom metrics into Azure's managed monitoring stack with minimal setup, using a Telegraf DaemonSet for flexible metric collection, Azure Monitor managed service for Prometheus for scraping and storage, and Azure Managed Grafana for visualization and alerting.

Recommendations for Major OS Version Upgrades with Azure Kubernetes Service (AKS)

· 11 min read
Flora Taagen
Product Manager 2 at Microsoft
Ally Ford
Product Manager 2 at Microsoft

Introduction

Upgrading the operating system version on your AKS nodes is a critical step that can impact workload security, stability, and performance. In this blog, we’ll share key recommendations to help you plan and execute major OS version upgrades smoothly and confidently on AKS.

Azure Monitor dashboards with Grafana in Azure Portal

· 7 min read
Aritra Ghosh
Senior Product Manager at Microsoft
Kayode Prince
Senior Program Manager at Microsoft

Introduction

As Kubernetes adoption accelerates, engineers need streamlined, cost-effective tools for cluster observability. Until now, this often meant deploying and managing separate monitoring stacks. Azure Monitor's latest integration with Grafana changes this: cluster insights are now just a click away in the Azure portal.

Announcing the CLI Agent for AKS: Agentic AI-powered operations and diagnostics at your fingertips

· 9 min read
Pavneet Ahluwalia
Principal PM Lead for the Azure Kubernetes Service
Julia Yin
Product Manager at Microsoft
Aritra Ghosh
Senior Product Manager at Microsoft

At KubeCon India earlier this month, the AKS team shared our newest Agentic AI-powered feature with the broader Kubernetes community: the CLI Agent for AKS. CLI Agent for AKS is a new AI-powered command-line experience designed to help Azure Kubernetes Service (AKS) users troubleshoot, optimize, and operate their clusters with unprecedented ease and intelligence.

Announcing the AKS-MCP Server: Unlock Intelligent Kubernetes Operations

· 9 min read
Pavneet Ahluwalia
Principal PM Lead for the Azure Kubernetes Service

We're excited to announce the launch of the AKS-MCP Server, an open source Model Context Protocol (MCP) server designed to make your Azure Kubernetes Service (AKS) clusters AI-native and more accessible for developers, SREs, and platform engineers through Agentic AI workflows.

AKS-MCP isn't just another integration layer. It empowers cutting-edge AI assistants (such as Claude, Cursor, and GitHub Copilot) to interact with AKS through a secure, standards-based protocol—opening new possibilities for automation, observability, and collaborative cloud operations.

Using Stream Analytics to Filter AKS Control Plane Logs

· 11 min read
Steve Griffith
Microsoft App Innovation Global Blackbelt team

While AKS does not provide access to the cluster's managed control plane, it does provide access to the control plane component logs via diagnostic settings. The easiest option to persist and search this data is to send it directly to Azure Log Analytics; however, the large volume of data in those logs can make Log Analytics cost-prohibitive. Alternatively, you can send all the data to an Azure Storage Account, but then searching and alerting can be challenging.

To address this challenge, one option is to stream the data to Azure Event Hub, which lets you use Azure Stream Analytics to select the events that you deem important and store the rest in cheaper storage (for example, Azure Storage) for potential future diagnostic needs.

In this walkthrough, we'll create an AKS cluster, enable diagnostic logging to Azure Event Hub, and then use Azure Stream Analytics to filter for some key records.
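The routing idea can be sketched in a few lines of Python. This is not Stream Analytics query language and not the walkthrough's actual job definition. It only illustrates the split between events worth indexing and events worth archiving. The record shape, the `route` function, and the choice of which category counts as important are assumptions made for the example.

```python
import json

# Assumption: only this log category is considered worth indexing for search.
IMPORTANT_CATEGORIES = {"kube-audit-admin"}


def route(records, important=IMPORTANT_CATEGORIES):
    """Split records into (searchable, archive) lists by their `category` field."""
    searchable, archive = [], []
    for record in records:
        target = searchable if record.get("category") in important else archive
        target.append(record)
    return searchable, archive


# Sample events, one JSON document per line (made-up payloads).
raw_events = [
    '{"category": "kube-audit-admin", "properties": {"log": "delete pod"}}',
    '{"category": "kube-apiserver", "properties": {"log": "GET /healthz"}}',
    '{"category": "kube-audit", "properties": {"log": "list nodes"}}',
]
records = [json.loads(line) for line in raw_events]
searchable, archive = route(records)
print(len(searchable), len(archive))  # 1 2
```

Only the first sample event lands in the searchable set; the other two would go to cheaper storage for later diagnostics.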

Azure VM Generations and AKS

· 6 min read
Product Manager at Microsoft
Ally Ford
Product Manager 2 at Microsoft
Sarah Zhou
Product Manager at Microsoft

What are Virtual Machine Generations?

If you use Azure, you are probably familiar with virtual machines. What you may not know is that Azure now offers two generations of virtual machines!

Before going further, let's first break down virtual machines. Azure virtual machines are offered in various "sizes," which are defined by the amount and type of each resource allocated, such as CPU, memory, storage, and network bandwidth. These resources are tied to a portion of a physical server's hardware capabilities. A physical server's resources may be divided among many different VM size series or configurations.

As the physical hardware ages and newer components become available, older hardware and VMs get retired, while newer generation hardware and VM products are made available.

In this blog, we will go over Generation 1 and newer Generation 2 virtual machines. Both have their own use cases, and picking the right one to suit your workloads is critical in ensuring you get the best possible experience, capabilities, and cost.

Enhancing Your Operating System's Security with OS Security Patches in AKS

· 6 min read
Kaarthikeyan Subramanian
Senior Product Manager for the Azure Kubernetes Service

Traditional patching and the need for Managed patching

Operating System (OS) security patches are critical for safeguarding systems against vulnerabilities that malicious actors could exploit. These patches help ensure your system remains protected against emerging threats. Traditionally, customers have relied on nightly updates, such as unattended upgrades in Ubuntu or Automatic Guest OS Patching at the virtual machine (VM) level. However, when kernel security packages were updated, a host machine reboot was often required, typically managed using tools like kured.

Limitless Kubernetes Scaling for AI and Data-intensive Workloads: The AKS Fleet Strategy

· 7 min read
Pavneet Ahluwalia
Principal PM Lead for the Azure Kubernetes Service

With the fast-paced advancement of AI workloads, the building and fine-tuning of multimodal models, and extensive batch data processing jobs, more and more enterprises are leaning into Kubernetes platforms to take advantage of their ability to scale and optimize compute resources. With AKS, you can manage up to 5,000 nodes (upstream K8s limit) in a single cluster under optimal conditions, but for some large enterprises, that might not be enough.

Building Community with CRDs: Kube Resource Orchestrator

· 3 min read
Bridget Kromhout
Principal Product Manager at Microsoft Azure

Kube Resource Orchestrator introduces a Kubernetes-native, cloud-agnostic way to define groupings of Kubernetes resources. With kro, you can group your applications and their dependencies as a single resource that can be easily consumed by end users.

Just as we collaborate in upstream Kubernetes, Azure is partnering with AWS and Google Cloud on kro (pronounced “crow”) to make Kubernetes APIs simpler for all Kubernetes users. We’re centering the needs of customers and the cloud native community to offer tooling that works seamlessly no matter where you run your K8s clusters.

Local Development on AKS with mirrord

· 11 min read
Gemma Tipper
Software Engineer at MetalBear
Quentin Petraroia
Product Manager for Azure Kubernetes Service

Developing applications for Kubernetes can mean a lot of time spent waiting and relatively little time spent writing code. Whenever you want to test your code changes in the cluster, you usually have to build your application, deploy it to the cluster, and attach a remote debugger (or add a bunch of logs). These iterations can be incredibly time-consuming. Thankfully, there is a way to bridge the gap between your local environment and a remote cluster, making them feel seamlessly connected. mirrord, which can be used as a plugin for VSCode or IntelliJ or directly in the CLI, is an open-source tool that does exactly that (and much more).

Deploy and take Flyte with an end-to-end ML orchestration solution on AKS

· 7 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Data is often at the heart of application design and development - it fuels user-centric design, provides insights for feature enhancements, and represents the value of an application as a whole. In that case, shouldn’t we use data science tools and workflows that are flexible and scalable on a platform like Kubernetes, for a range of application types?

In collaboration with David Espejo and Shalabh Chaudhri from Union.ai, we’ll dive into an example using Flyte, a platform built on Kubernetes itself. Flyte can help you manage and scale out data processing and machine learning pipelines through a simple user interface.