
Pair llm-d Inference with KAITO RAG Advanced Search to Enhance your AI Workflows

· 10 min read
Ernest Wong
Software Engineer at Microsoft
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service

Overview

In this blog, we'll guide you through setting up an OpenAI API-compatible inference endpoint with llm-d and integrating it with retrieval-augmented generation (RAG) on AKS. We'll showcase its value in a key finance use case: indexing the latest SEC 10-K filings of two S&P 500 companies and querying them. We'll also highlight the benefits llm-d gains from its architecture and its synergy with RAG.
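
As a preview of what "OpenAI API-compatible" buys you, here is a minimal sketch of querying such an endpoint with the official OpenAI Python SDK. The gateway URL, model name, and prompt below are illustrative placeholders, not the actual values used in the walkthrough.

```python
# Minimal sketch: query an OpenAI API-compatible llm-d endpoint with the
# OpenAI Python SDK. The base_url and model name are placeholders; substitute
# the gateway address and the model served by your llm-d deployment on AKS.
from openai import OpenAI

client = OpenAI(
    base_url="http://llm-d-gateway.example.com/v1",  # placeholder llm-d endpoint
    api_key="not-needed",  # placeholder; your deployment may not require a key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": "Summarize the key risk factors in the latest 10-K filing.",
        },
    ],
)
print(response.choices[0].message.content)
```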

Simplifying InfiniBand on AKS

· 5 min read
Sachi Desai
Product Manager for AI/ML, GPU workloads on Azure Kubernetes Service
Suraj Deshmukh
Software Engineer at Microsoft
Ernest Wong
Software Engineer at Microsoft

High performance computing (HPC) workloads, like large-scale distributed AI training and inference, often require fast, reliable data transfer and synchronization across the underlying compute. Model training, for example, requires parameters and gradients to be exchanged constantly across GPUs. For models with billions of parameters, the memory available on a single GPU node may not be enough, so "pooling" memory across multiple nodes also demands high bandwidth given the sheer volume of data involved. A common way to achieve this at scale is with a high-speed, low-latency network interconnect technology called InfiniBand (IB).
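
To make that communication pattern concrete, below is a minimal, hypothetical sketch of the gradient synchronization (an all-reduce) that distributed training performs on every step; with PyTorch's NCCL backend, this traffic rides over InfiniBand when the fabric is available. The launch method and tensor size are illustrative assumptions, not part of this blog's setup.

```python
# Minimal sketch (illustrative, not from the blog): the all-reduce that
# synchronizes gradients across GPUs during distributed training. With the
# NCCL backend, this collective uses InfiniBand/RDMA when the interconnect
# is present. Intended to be launched with:
#   torchrun --nproc_per_node=<gpus_per_node> allreduce_sketch.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in for a shard of gradients: ~1 GiB of float32 values per GPU.
    grads = torch.randn(256 * 1024 * 1024, device="cuda")

    # Sum the "gradients" across all GPUs/nodes; this is the bandwidth-heavy
    # step that benefits from a high-speed interconnect like InfiniBand.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print("all-reduce complete across", dist.get_world_size(), "ranks")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```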