Scaling multi-node LLM inference with NVIDIA Dynamo-Grove on AKS (Part 4)
This blog post is co-authored with Nikhar Maheshwari, Anish Maddipoti, Rohan Varma, Clement Pakkam Isaac, and Stephen Mccoulough from NVIDIA.
In the first three blog posts of this series, we introduced NVIDIA Dynamo on AKS, covered SLO-driven scaling with the Dynamo Planner and Profiler, and explored KV-cache-aware routing to establish the foundations for high-performance LLM serving. In this post, we move up one layer of the inference stack: how to describe and operate the distributed inference workload on a Kubernetes cluster.
As inference deployments evolve from simple model-serving replicas into systems with disaggregated prefill and decode phases, multi-node model instances, and explicit startup dependencies, they benefit from a single API that describes the full serving pipeline: which roles exist, how they relate, and what constitutes a deployable unit of inference-serving capacity.
NVIDIA Grove addresses this layer of the stack with a Kubernetes-native API for describing an AI inference service, so that the platform can schedule, start, and scale the deployment in units that match the inference architecture.









