AI Inference on AKS enabled by Azure Arc: Series Introduction and Scope
This series gives you practical, step-by-step guidance for experimentation with generative and predictive AI inference workloads on Azure Kubernetes Service (AKS) enabled by Azure Arc clusters, using CPUs, GPUs, and neural processing units (NPUs). The scenarios target on‑premises and edge environments, specifically Azure Local, with a focus on repeatable, hands-on experimentation rather than abstract examples.

Introduction
Part 1 covered why running AI inference at the edge matters. This post defines the series scope, ground rules, and shared prerequisites so each tutorial can focus on the hands-on walkthrough.
Scope and expectations
These tutorials are designed for experimentation and learning. The configurations shown are not production-ready and should not be deployed to production environments without additional hardening for security, reliability, and performance, in line with standard practices.
This series assumes familiarity with Kubernetes fundamentals, proficiency with kubectl, Azure CLI, and Helm, and experience using AKS enabled by Azure Arc on Azure Local. The focus is not on model development or training.
All scenarios use the same AKS enabled by Azure Arc environment and follow a consistent structure. Inference execution always occurs locally on the cluster. No managed Azure AI services are used. Each scenario follows the same steps: connect and verify cluster access, prepare the accelerator if required, deploy the inference workload, validate inference with a test request, and clean up resources.
For reference:
- Kubernetes fundamentals
- AKS enabled by Azure Arc overview
- AKS enabled by Azure Arc documentation
- GPU Operator internals
Series outline
The series is designed to evolve. New topics will be added as additional scenarios, runtimes, and architectures are explored.
Topics covered in this series
| Topic | Type | Status |
|---|---|---|
| Ollama: open-source LLM server | Generative | ✅ Available |
| vLLM: high-throughput LLM engine | Generative | ✅ Available |
| Triton + ONNX: ResNet‑50 image classification | Predictive | ✅ Available |
| Triton + TensorRT‑LLM: optimized large-model inference | Generative | ✅ Available |
| Triton + vLLM backend: vision-language model serving | Generative | 🔜 Coming soon |
Prerequisites
All scenarios in this series run on an AKS enabled by Azure Arc cluster deployed on Azure Local. Before you begin, make sure you have the following in place:
- AKS enabled by Azure Arc cluster with a GPU node: A cluster running on Azure Local with at least one GPU node and the appropriate NVIDIA drivers installed. The GPU node needs the NVIDIA device plugin (deployed by the NVIDIA GPU Operator) running so that pods can request `nvidia.com/gpu` resources.
- Azure CLI with Azure Arc extensions: The Azure CLI installed on your admin machine, along with the `aksarc` and `connectedk8s` extensions (for Azure Arc-enabled Kubernetes). Use `az extension list -o table` to confirm they are installed.
- kubectl: The Kubernetes CLI installed on your workstation for applying manifests and managing cluster resources.
- Helm: The Helm package manager (v3) installed, for deploying the GPU Operator and other Helm charts as needed.
- PowerShell 7+ (optional): If you use PowerShell for CLI steps and REST calls, use PowerShell 7.4 or later. Windows PowerShell 5.1 can cause JSON quoting issues in the examples.
- Cluster access: Make sure you can reach your AKS enabled by Azure Arc cluster (for example, from the same network or over a VPN to the Azure Local environment). After signing in to Azure and retrieving cluster credentials, verify access by listing the nodes:
az login
# Use this command when you have the RBAC permissions to export cluster credentials.
az aksarc get-credentials --resource-group <YourResourceGroup> --name <YourClusterName>
# Otherwise, use this command to access the cluster via the proxy without exporting credentials.
# The proxy runs in the foreground, so keep it open and run kubectl from a second terminal.
az connectedk8s proxy --resource-group <YourResourceGroup> --name <YourClusterName>
# This should show your cluster’s nodes, including any GPU node(s).
kubectl get nodes
Note: On Windows 11, you can use winget to quickly install prerequisites. For example:
# Install PowerShell
winget install -e --id Microsoft.PowerShell
pwsh -v
# Install or Update - Azure CLI, Kubectl, Helm, Git
winget install -e --id Microsoft.AzureCLI
winget install -e --id Kubernetes.kubectl
winget install -e --id Helm.Helm
winget install -e --id Git.Git
winget update -e --id Microsoft.AzureCLI
winget update -e --id Kubernetes.kubectl
winget update -e --id Helm.Helm
winget update -e --id Git.Git
# Install or Update – Azure CLI Extensions
az extension add --name aksarc
az extension add --name connectedk8s
az extension update --name aksarc
az extension update --name connectedk8s
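After installing the tools, it can help to confirm that each CLI is on your PATH and that both Azure CLI extensions are present. The following is a minimal sketch, assuming a POSIX shell such as bash (on Windows, Git Bash or WSL). `report_tools` and `missing_extensions` are hypothetical helper names used only for illustration; they are not part of any CLI.

```shell
# Sketch only: report_tools and missing_extensions are hypothetical helpers.

# Print a found/missing line for each CLI name given as an argument.
# Returns non-zero if at least one tool is missing.
report_tools() {
  rc=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      rc=1
    fi
  done
  return "$rc"
}

# Read installed extension names on stdin (one per line) and print
# each required extension (given as arguments) that is not in the list.
missing_extensions() {
  installed=$(tr -d '\r')   # drop carriage returns from Windows output
  for ext in "$@"; do
    echo "$installed" | grep -qxF "$ext" || echo "$ext"
  done
}

report_tools az kubectl helm git || echo "Install the missing tools before continuing."

if command -v az >/dev/null 2>&1; then
  az extension list --query "[].name" -o tsv | missing_extensions aksarc connectedk8s
fi
```

If `missing_extensions` prints nothing, both extensions are already installed; any name it prints still needs `az extension add --name <name>`.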
Install the NVIDIA GPU Operator
Next, install the NVIDIA GPU Operator on the cluster. The operator installs the drivers and the Kubernetes device plugin that expose GPU resources to your workloads. GPU-backed runtimes such as vLLM need the NVIDIA device plugin to access the GPU hardware.
- Add the NVIDIA Helm repository: If you haven’t already, add NVIDIA’s Helm chart repository and update it:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
This adds the official NVIDIA chart source (which contains the GPU Operator chart) to your Helm client.
- Install the GPU Operator: Use Helm to install the NVIDIA GPU Operator onto your cluster:
helm install --wait --generate-name nvidia/gpu-operator
This installs the GPU Operator into the current namespace of your kubectl context (`default`, unless you pass `--namespace`) and waits for all components to be ready. The `--generate-name` flag automatically assigns a name to the Helm release. The operator then sets up the NVIDIA device plugin and drivers on your cluster nodes.
Make sure your cluster nodes have internet connectivity to pull the container images for the operator. The first installation may take a few minutes while the images are downloaded.
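Once the Helm release reports ready, you may want to confirm that the operator pods are running and that at least one node advertises `nvidia.com/gpu`. The following is a minimal sketch, assuming a POSIX shell and a working kubectl context. `gpu_nodes` is a hypothetical helper name, and the pod filter assumes the operator's pod names contain "gpu-operator" or "nvidia".

```shell
# Sketch only: gpu_nodes is a hypothetical helper.

# Read "NODE GPU" rows on stdin (header line first) and print the nodes
# that advertise at least one allocatable nvidia.com/gpu.
gpu_nodes() {
  tr -d '\r' | awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 + 0 > 0 { print $1 }'
}

if command -v kubectl >/dev/null 2>&1; then
  # Operator pods; the namespace depends on how the chart was installed.
  kubectl get pods --all-namespaces | grep -i -e gpu-operator -e nvidia \
    || echo "No GPU Operator pods found yet."

  # Allocatable GPUs per node; nodes without the device plugin show <none>.
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
    | gpu_nodes
fi
```

If the second command prints no node names, the device plugin has not registered any GPUs yet. Wait for the operator pods to finish starting, then run it again.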
