|
| 1 | +<!-- |
| 2 | +SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| 3 | +SPDX-License-Identifier: Apache-2.0 |
| 4 | +
|
| 5 | +Licensed under the Apache License, Version 2.0 (the "License"); |
| 6 | +you may not use this file except in compliance with the License. |
| 7 | +You may obtain a copy of the License at |
| 8 | +
|
| 9 | +http://www.apache.org/licenses/LICENSE-2.0 |
| 10 | +
|
| 11 | +Unless required by applicable law or agreed to in writing, software |
| 12 | +distributed under the License is distributed on an "AS IS" BASIS, |
| 13 | +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 14 | +See the License for the specific language governing permissions and |
| 15 | +limitations under the License. |
| 16 | +--> |
| 17 | + |
| 18 | +# Dynamo Architecture Flow |
| 19 | + |
| 20 | +This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/llm](https://github.com/ai-dynamo/dynamo/tree/main/examples/llm). Color-coded flows indicate different types of operations: |
| 21 | + |
| 22 | +## 🔵 Main Request Flow (Blue) |
| 23 | +The primary user journey through the system: |
| 24 | + |
| 25 | +1. **Discovery (S1)**: Client discovers the service endpoint |
| 26 | +2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000) |
| 27 | +3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing |
| 28 | +4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker |
| 29 | + |
| 30 | +## 🟠 Decision and Allocation Flow (Orange) |
| 31 | +The system's intelligent routing and resource allocation: |
| 32 | + |
| 33 | +4. **Query (S4)**: Decode Worker queries for prefix cache hits to optimize processing |
| 34 | +5. **Disagg Decision (S5)**: Based on prefill length and queue size, the system decides whether it needs remote prefill |
| 35 | +5a. **Allocate (S5a)**: Decode Worker pre-allocates KV cache blocks in its local GPU memory |
| 36 | +6. **Queue (S6)**: If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue |
| 37 | + |
| 38 | +## 🟢 Prefill Worker Flow (Green) |
| 39 | +The dedicated prefill processing pipeline: |
| 40 | + |
| 41 | +7. **NATS Pull (S7)**: PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers |
| 42 | +8. **Load Metadata (S8)**: PrefillWorker loads NIXL metadata from ETCD to establish GPU communication |
| 43 | +9. **Prefill (S9)**: Worker executes the prefill computation on the input tokens |
| 44 | +10. **NIXL Transfer (S10)**: Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker's pre-allocated blocks |
| 45 | + |
| 46 | +## 🟣 Completion Flow (Purple) |
| 47 | +The response generation and delivery: |
| 48 | + |
| 49 | +11. **Notify (S11)**: PrefillWorker sends completion notification to Decode Worker |
| 50 | +12. **Decode (S12)**: Decode Worker decodes from its local KV cache containing prefilled data |
| 51 | +13. **Response (S13)**: The system sends the generated response to the Processor for post-processing, then through the Frontend to the Client |
| 52 | + |
| 53 | +## 🔗 Infrastructure Connections (Dotted lines) |
| 54 | +Coordination and messaging support: |
| 55 | + |
| 56 | +### ETCD Connections (Gray, dotted) |
| 57 | +- **Frontend, Processor, Planner**: Service discovery and registration |
| 58 | +- **Decode Worker, PrefillWorker**: NIXL metadata storage for GPU communication setup |
| 59 | + |
| 60 | +### NATS Connections (Teal, dotted) |
| 61 | +- **PrefillQueue**: JetStream consumer group for reliable work distribution |
| 62 | +- **Processor**: Load balancing across workers |
| 63 | + |
| 64 | +### Planning Connections (Gold, dotted) |
| 65 | +- **Frontend → Planner**: Metrics collection for auto-scaling decisions |
| 66 | +- **Planner → Workers**: Resource scaling commands for both Decode Worker and PrefillWorker |
| 67 | + |
| 68 | +## Technical Implementation Details |
| 69 | + |
| 70 | +### NIXL (NVIDIA Interchange Library): |
| 71 | +- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe |
| 72 | +- Decode Worker publishes GPU metadata to ETCD for coordination |
| 73 | +- PrefillWorker loads metadata to establish direct communication channels |
| 74 | +- Block-based transfers (64–128 tokens per block) for efficient batching |
| 75 | + |
| 76 | +### Disaggregated KV Cache: |
| 77 | +- Each Decode Worker maintains local KV cache in its GPU memory |
| 78 | +- No shared storage bottlenecks—all transfers are direct worker-to-worker |
| 79 | +- Pre-allocated blocks ensure deterministic memory layout and performance |
| 80 | + |
| 81 | +```mermaid |
| 82 | +%%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%% |
| 83 | +graph TD |
| 84 | + %% Top Layer - Client & Frontend |
| 85 | + Client["<b>HTTP Client</b>"] |
| 86 | + S1[["<b>1 DISCOVERY</b>"]] |
| 87 | + Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"] |
| 88 | + S2[["<b>2 REQUEST</b>"]] |
| 89 | +
|
| 90 | + %% Processing Layer |
| 91 | + Processor["<b>Processor</b><br/><i>Request Handler & Router</i>"] |
| 92 | + S3[["<b>3 VALIDATE</b>"]] |
| 93 | +
|
| 94 | + %% Infrastructure - Positioned strategically to minimize crossings |
| 95 | + subgraph INF["<b>Infrastructure Layer</b>"] |
| 96 | + ETCD[("<b>ETCD</b><br/><i>Service Discovery &<br/>NIXL Metadata</i>")] |
| 97 | + NATS[("<b>NATS</b><br/><i>Message Broker</i>")] |
| 98 | + Planner["<b>Planner</b><br/><i>Resource Management<br/>Auto-scaling</i>"] |
| 99 | + end |
| 100 | +
|
| 101 | + %% Worker Layer - Main processing |
| 102 | + subgraph WL["<b>Worker Layer</b>"] |
| 103 | + %% VllmWorker section |
| 104 | + VllmWorker["<b>Decode Worker</b><br/><i>Handles Decoding & Disagg Decisions</i>"] |
| 105 | + S4[["<b>4 QUERY</b>"]] |
| 106 | + S5[["<b>5 DISAGG DECISION</b>"]] |
| 107 | + S5a[["<b>5a ALLOCATE</b>"]] |
| 108 | + S12[["<b>12 DECODE</b>"]] |
| 109 | + S6[["<b>6 QUEUE</b>"]] |
| 110 | + S13[["<b>13 RESPONSE</b>"]] |
| 111 | +
|
| 112 | + %% Storage positioned near workers |
| 113 | + LocalKVCache[("<b>Local KV Cache</b><br/><i>Pre-allocated Blocks</i>")] |
| 114 | +
|
| 115 | + %% Prefill System - Right side to minimize crossings |
| 116 | + subgraph PS["<b>Prefill System</b>"] |
| 117 | + PrefillQueue["<b>Prefill Queue</b><br/><i>NATS JetStream<br/>Consumer Group</i>"] |
| 118 | + PrefillWorker["<b>Prefill Worker</b><br/><i>Dedicated Prefill Processing<br/>(Multiple Instances)</i>"] |
| 119 | + S7[["<b>7 NATS PULL</b>"]] |
| 120 | + S8[["<b>8 LOAD METADATA</b>"]] |
| 121 | + S9[["<b>9 PREFILL</b>"]] |
| 122 | + S10[["<b>10 NIXL TRANSFER</b>"]] |
| 123 | + S11[["<b>11 NOTIFY</b>"]] |
| 124 | + end |
| 125 | + end |
| 126 | +
|
| 127 | + %% Main Request Flow (Blue) - Clean vertical flow |
| 128 | + Client -.-> S1 |
| 129 | + S1 -->|HTTP API Call| Frontend |
| 130 | + Frontend -.-> S2 |
| 131 | + S2 -->|Process & Validate| Processor |
| 132 | + Processor -.-> S3 |
| 133 | + S3 -->|Route to Worker| VllmWorker |
| 134 | +
|
| 135 | + %% VllmWorker Internal Flow (Orange) |
| 136 | + VllmWorker -.-> S4 |
| 137 | + S4 -->|Query Prefix Cache Hit| S5 |
| 138 | + S5 -->|Prefill Length & Queue Check| S5a |
| 139 | + S5a -->|Continue to Decode| S12 |
| 140 | +
|
| 141 | + %% Allocation & Queuing (Orange) - Minimize crossings |
| 142 | + S5a -->|Allocate KV Cache Blocks| LocalKVCache |
| 143 | + VllmWorker --> S6 |
| 144 | + S6 -->|Put RemotePrefillRequest| PrefillQueue |
| 145 | +
|
| 146 | + %% Prefill Worker Flow (Green) - Self-contained within PS |
| 147 | + PrefillQueue -.-> S7 |
| 148 | + S7 -->|Consumer Group Pull| PrefillWorker |
| 149 | + PrefillWorker -.-> S8 |
| 150 | + PrefillWorker -.-> S9 |
| 151 | + S9 -->|Execute Prefill| S10 |
| 152 | + S10 -->|Direct GPU Transfer| LocalKVCache |
| 153 | + PrefillWorker --> S11 |
| 154 | +
|
| 155 | + %% Return Flow (Purple) - Clean return path |
| 156 | + S11 -->|Completion Notification| S12 |
| 157 | + S12 -->|Decode from KV Cache| S13 |
| 158 | + S13 -->|Post-process Response| Processor |
| 159 | + Processor -->|HTTP Response| Frontend |
| 160 | + Frontend -->|Final Response| Client |
| 161 | +
|
| 162 | + %% Infrastructure Connections - Organized to avoid crossings |
| 163 | + %% ETCD Connections - Grouped by proximity |
| 164 | + Frontend -.->|Service Discovery| ETCD |
| 165 | + Processor -.->|Service Discovery| ETCD |
| 166 | + VllmWorker -.->|NIXL Metadata| ETCD |
| 167 | + PrefillWorker -.->|NIXL Metadata| ETCD |
| 168 | + S8 -.->|Load NIXL Metadata| ETCD |
| 169 | + Planner -.->|Service Discovery| ETCD |
| 170 | +
|
| 171 | + %% NATS Connections - Direct to queue system |
| 172 | + PrefillQueue -.->|JetStream| NATS |
| 173 | + Processor -.->|Load Balancing| NATS |
| 174 | +
|
| 175 | + %% Planning Connections - Strategic positioning |
| 176 | + Frontend -.->|Metrics| Planner |
| 177 | + Planner -.->|Auto-scaling| VllmWorker |
| 178 | + Planner -.->|Auto-scaling| PrefillWorker |
| 179 | +
|
| 180 | + %% Styling - Each component with unique colors |
| 181 | + classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px |
| 182 | + classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px |
| 183 | + classDef processor fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px |
| 184 | + classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px |
| 185 | + classDef prefillQueue fill:#fff8e1,stroke:#E65100,stroke-width:3px |
| 186 | + classDef prefillWorker fill:#fce4ec,stroke:#C2185B,stroke-width:3px |
| 187 | + classDef prefillBox fill:#eceff1,stroke:#455A64,stroke-width:3px |
| 188 | + classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px |
| 189 | + classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px |
| 190 | + classDef etcd fill:#fff9c4,stroke:#F9A825,stroke-width:3px |
| 191 | + classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px |
| 192 | + classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px |
| 193 | + classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px |
| 194 | +
|
| 195 | +
|
| 196 | + class Client client |
| 197 | + class Frontend frontend |
| 198 | + class Processor processor |
| 199 | + class VllmWorker worker |
| 200 | + class PrefillQueue prefillQueue |
| 201 | + class PrefillWorker prefillWorker |
| 202 | + class Planner planner |
| 203 | + class LocalKVCache storage |
| 204 | + class ETCD etcd |
| 205 | + class NATS nats |
| 206 | + class PS prefillBox |
| 207 | + class INF infraLayer |
| 208 | + class WL workerLayer |
| 209 | +
|
| 210 | +
|
| 211 | +
|
| 212 | + %% Flow Colors - Different line styles to reduce visual clutter |
| 213 | + %% Main Request Flow - Blue (solid) |
| 214 | + linkStyle 0 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3 |
| 215 | + linkStyle 1 stroke:#1565C0,stroke-width:4px |
| 216 | + linkStyle 2 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3 |
| 217 | + linkStyle 3 stroke:#1565C0,stroke-width:4px |
| 218 | + linkStyle 4 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3 |
| 219 | + linkStyle 5 stroke:#1565C0,stroke-width:4px |
| 220 | +
|
| 221 | + %% Decision & Allocation Flow - Orange (mixed) |
| 222 | + linkStyle 6 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3 |
| 223 | + linkStyle 7 stroke:#E65100,stroke-width:4px |
| 224 | + linkStyle 8 stroke:#E65100,stroke-width:4px |
| 225 | + linkStyle 9 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3 |
| 226 | +
|
| 227 | + %% KV Cache & Queue - Orange (solid) |
| 228 | + linkStyle 10 stroke:#E65100,stroke-width:4px |
| 229 | + linkStyle 11 stroke:#E65100,stroke-width:4px |
| 230 | + linkStyle 12 stroke:#E65100,stroke-width:4px |
| 231 | +
|
| 232 | + %% Prefill Worker Flow - Green (mixed) |
| 233 | + linkStyle 13 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3 |
| 234 | + linkStyle 14 stroke:#2E7D32,stroke-width:4px |
| 235 | + linkStyle 15 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3 |
| 236 | + linkStyle 16 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3 |
| 237 | + linkStyle 17 stroke:#2E7D32,stroke-width:4px |
| 238 | + linkStyle 18 stroke:#2E7D32,stroke-width:4px |
| 239 | + linkStyle 19 stroke:#2E7D32,stroke-width:4px |
| 240 | +
|
| 241 | + %% Completion Flow - Purple (mixed) |
| 242 | + linkStyle 20 stroke:#6A1B9A,stroke-width:4px |
| 243 | + linkStyle 21 stroke:#6A1B9A,stroke-width:3px,stroke-dasharray: 3 3 |
| 244 | + linkStyle 22 stroke:#6A1B9A,stroke-width:4px |
| 245 | + linkStyle 23 stroke:#6A1B9A,stroke-width:4px |
| 246 | + linkStyle 24 stroke:#6A1B9A,stroke-width:4px |
| 247 | +
|
| 248 | + %% Infrastructure Flows - Lighter and dotted to reduce visual noise |
| 249 | + %% ETCD Connections - Gray (dotted, thinner) |
| 250 | + linkStyle 25 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8 |
| 251 | + linkStyle 26 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8 |
| 252 | + linkStyle 27 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8 |
| 253 | + linkStyle 28 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8 |
| 254 | + linkStyle 29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8 |
| 255 | + linkStyle 30 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8 |
| 256 | +
|
| 257 | + %% NATS Connections - Teal (dotted, thinner) |
| 258 | + linkStyle 31 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8 |
| 259 | + linkStyle 32 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8 |
| 260 | +
|
| 261 | + %% Planning Connections - Gold (dotted, thinner) |
| 262 | + linkStyle 33 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8 |
| 263 | + linkStyle 34 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8 |
| 264 | + linkStyle 35 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8 |
| 265 | +``` |
0 commit comments