Disaggregated inference separates the AI inference pipeline into independent stages (prefill, decode, and routing), each optimized for different hardware needs and scaling patterns, allowing companies to dramatically improve GPU utilization and reduce infrastructure costs.

Most AI companies today run their large language models (LLMs) using a single monolithic serving process that handles everything from processing input prompts to generating output tokens. But this approach forces fundamentally different computational tasks onto the same hardware, leaving expensive GPUs underutilized and making it nearly impossible to scale efficiently.

## Why Do Prefill and Decode Stages Have Such Different Needs?

To understand the problem, you need to know what happens inside an LLM when it processes a request. When a user submits a prompt, the model first runs a "prefill" stage, which processes the entire input at once. This is compute-intensive work that benefits from raw processing power. The model then enters the "decode" stage, where it generates output tokens one at a time, autoregressively. This stage is memory-bandwidth-bound, meaning it needs fast access to data stored in the GPU's high-bandwidth memory (HBM), not raw computing power.

In traditional aggregated serving, a single process handles both stages sequentially. The GPU alternates between two completely different workloads, never fully optimizing for either. It's like forcing a truck designed for hauling cargo to also function as a sports car: you end up with a vehicle that's mediocre at both tasks.

Disaggregated architectures solve this by splitting the pipeline into distinct, independent services. Each stage runs on hardware optimized for its specific needs, and each can scale independently based on actual demand patterns.

## How Does Disaggregated Inference Actually Work?

In a disaggregated setup, three main components work together.
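Before looking at those components, the asymmetry between the two stages can be made concrete with a toy sketch. This is a minimal numpy illustration, not a real serving engine: prefill is one large batched matmul over the whole prompt, while decode repeatedly does tiny per-token work against a growing KV cache. All names and shapes here are illustrative assumptions.

```python
# Toy sketch (not a real engine): contrasts prefill's one big compute
# burst with decode's step-by-step, memory-traffic-heavy loop.
import numpy as np

d_model, n_prompt = 64, 128
W = np.random.randn(d_model, d_model)  # stand-in for model weights

def prefill(prompt_tokens):
    """Process the whole prompt in one batched matmul (compute-bound):
    a single large GEMM amortizes reading W from memory."""
    return prompt_tokens @ W  # (n_prompt, d_model) produced at once

def decode_step(kv_cache, last_token):
    """Generate one token (memory-bandwidth-bound): every step re-reads
    the weights and the growing KV cache for a single row of output."""
    new_kv = last_token @ W                    # tiny (1, d_model) GEMV
    kv_cache = np.vstack([kv_cache, new_kv])   # KV cache grows each step
    next_token = kv_cache.mean(axis=0, keepdims=True)  # toy "attention"
    return kv_cache, next_token

prompt = np.random.randn(n_prompt, d_model)
kv = prefill(prompt)          # one burst of compute up front
tok = prompt[-1:]
for _ in range(8):            # steady autoregressive decode stream
    kv, tok = decode_step(kv, tok)

print(kv.shape)  # (136, 64): one extra KV row per decoded token
```

In an aggregated server one GPU alternates between these two loops; in a disaggregated one, each loop runs on workers sized for its own bottleneck.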
- Prefill workers process input prompts and are optimized for high throughput and aggressive parallelization.
- Decode workers generate output tokens one at a time and are optimized for memory bandwidth.
- A router or gateway directs incoming requests, manages the Key-Value (KV) cache routing between prefill and decode stages, and handles load balancing across workers.

The practical benefits are substantial. Because each stage is separate, you can match GPU resources, model sharding techniques, and batch sizes to each stage's specific needs rather than compromise on a single approach. A long-context prompt creates a large prefill burst but a steady decode stream, so scaling each stage independently lets you respond to actual demand instead of over-provisioning for worst-case scenarios. Most importantly, separating the stages lets each saturate its target resource (prefill saturates compute; decode saturates memory bandwidth) rather than alternating between both.

## What's the Kubernetes Challenge?

Deploying disaggregated inference on Kubernetes, the industry-standard container orchestration platform, introduces a new layer of complexity. Simply splitting your model into separate services isn't enough: how the Kubernetes scheduler places pods (containerized workloads) across your cluster directly impacts performance. Placing a Tensor Parallel (TP) group's pods on the same rack with high-bandwidth NVIDIA NVLink interconnects can mean the difference between fast inference and a network bottleneck.

Three scheduling capabilities become critical for multi-pod inference performance on Kubernetes. Gang scheduling ensures all pods in a group are placed with all-or-nothing semantics, preventing partial deployments that waste GPUs. Hierarchical gang scheduling extends basic gang scheduling to multi-level workloads, ensuring nested minimum guarantees per component.
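The all-or-nothing semantics of basic gang scheduling can be sketched in a few lines. This is a toy admission loop, not the API of any real scheduler; node names, GPU counts, and the `gang_schedule` helper are all hypothetical.

```python
# Toy illustration of gang scheduling: a pod group is admitted only if
# every member can be placed; otherwise nothing is placed, so no GPUs
# are held by a deployment that cannot serve requests.

def gang_schedule(gang, nodes):
    """Place all pods in `gang` or none. `gang` maps pod -> GPUs needed;
    `nodes` maps node name -> free GPU count (mutated only on success)."""
    free = dict(nodes)  # tentative allocation; commit only if all fit
    placement = {}
    for pod, gpus in gang.items():
        node = next((n for n, f in free.items() if f >= gpus), None)
        if node is None:
            return None  # one unplaceable pod rejects the whole gang
        free[node] -= gpus
        placement[pod] = node
    nodes.update(free)  # all pods fit: commit the allocation
    return placement

nodes = {"node-a": 8, "node-b": 8}
print(gang_schedule({"prefill-0": 4, "prefill-1": 4}, nodes))  # placed
print(gang_schedule({"decode-0": 8, "decode-1": 8}, nodes))    # None
```

The second gang needs 16 GPUs but only 8 remain free, so it is rejected as a whole instead of stranding one decode pod's worth of GPUs, which is exactly the partial-deployment failure mode described above.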
In disaggregated inference, each Tensor Parallel group must be scheduled atomically, and the full system also needs system-level coordination. Without this, one role can consume all available GPUs while the other waits indefinitely, creating a partial deployment that holds resources but cannot serve requests. Topology-aware placement colocates tightly coupled pods on nodes with high-bandwidth interconnects, minimizing inter-node communication latency.

## Steps to Deploy Disaggregated Inference on Kubernetes

- Define Role Structure: Use higher-level workload abstractions like LeaderWorkerSet (LWS) or NVIDIA Grove to declaratively express the structure of your inference application, specifying which roles exist, how they relate to each other, and which topology constraints matter.
- Configure Scheduling Constraints: Translate application-level intent into concrete scheduling constraints, including PodGroups and gang requirements, that determine which gangs to create and when, ensuring the orchestration layer and scheduler work closely together throughout the application lifecycle.
- Implement Advanced Scheduling: Deploy an AI scheduler like KAI Scheduler that supports gang scheduling, hierarchical gang scheduling, and topology-aware placement to satisfy the constraints and optimize pod placement across your cluster.
- Set Up Auto-Scaling: Use application-level autoscalers like NVIDIA Dynamo or the llm-ds workload variant autoscaler to maintain optimal ratios across roles based on inference-specific metrics, scaling per role and per tensor-parallel group independently.
- Monitor and Optimize: Continuously track GPU utilization and inference latency across the prefill and decode stages, adjusting resource allocation and scaling policies based on actual workload patterns rather than theoretical worst-case scenarios.

The orchestration layer plays a critical role in determining what needs to be gang-scheduled and when.
For example, when prefill scales independently, something needs to decide that the new pods form a gang with a minimum availability guarantee, without disrupting existing decode pods. This requires close coordination between the orchestration layer and the scheduler across the entire application lifecycle, handling multi-level auto-scaling, rolling updates, and more to ensure optimal runtime conditions for AI workloads.

Frameworks like NVIDIA Dynamo and llm-ds already implement the disaggregated inference pattern, but how to orchestrate it effectively on Kubernetes is a question the broader ecosystem is still working out. Operators like NVIDIA Grove unify scheduling, scaling, and topology constraints across roles within a single declarative resource, simplifying deployment for teams without deep Kubernetes expertise.

## Why Should Companies Care About This Now?

As LLM inference workloads grow in complexity and scale, the inefficiencies of traditional monolithic serving become increasingly expensive. A single monolithic serving process starts to hit its limits when handling diverse request patterns and long-context prompts. The cost of running inference at scale is dominated by GPU infrastructure, so even modest improvements in utilization translate directly into significant cost savings.

Disaggregated inference offers a path to better resource efficiency, but only if you can orchestrate it correctly on your infrastructure. The challenge is that most teams deploying LLMs today don't have the expertise to manually configure gang scheduling, hierarchical gang scheduling, and topology-aware placement. This is where higher-level abstractions and AI-aware schedulers become essential: they translate application-level intent into concrete scheduling constraints, removing the burden of manual configuration and letting teams focus on their models rather than infrastructure plumbing.
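The idea of scaling each role on its own metric can be sketched with the classic proportional rule that Kubernetes' Horizontal Pod Autoscaler uses (desired = current × observed / target). This is a toy, not the algorithm of any autoscaler named above; the metric values, role names, and `desired_replicas` helper are hypothetical.

```python
# Toy sketch of per-role autoscaling: prefill and decode each scale on
# their own saturation signal instead of sharing one, so a long-context
# prefill burst no longer forces decode capacity to over-provision.
import math

def desired_replicas(current, utilization, target=0.7, max_replicas=32):
    """HPA-style proportional rule: move a role's observed utilization
    toward the target by growing or shrinking its replica count."""
    return max(1, min(max_replicas, math.ceil(current * utilization / target)))

# Hypothetical observed metrics for each role.
metrics = {
    "prefill": {"replicas": 4, "utilization": 0.95},  # long-context burst
    "decode":  {"replicas": 8, "utilization": 0.55},  # steady token stream
}

plan = {role: desired_replicas(m["replicas"], m["utilization"])
        for role, m in metrics.items()}
print(plan)  # prefill scales up while decode scales down, independently
```

In a real deployment each scale-up decision on the prefill side would also have to be expressed as a new gang for the scheduler, which is exactly the orchestration-layer responsibility described above.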
The inference scaling era is here, and disaggregated architectures represent a fundamental shift in how companies will deploy LLMs at scale. The companies that master this orchestration challenge will have a significant cost advantage over those still running monolithic serving processes.