Skip to content

Commit be9d082

Browse files
arunramankmkelle-nvnealvaidya
authored andcommitted
docs: add Dynamo architecture flow documentation and diagram (#1697)
Signed-off-by: Arun Raman <arunraman@users.noreply.github.com> Co-authored-by: Kristen Kelleher <kkelleher@nvidia.com> Co-authored-by: Neal Vaidya <neal098@gmail.com>
1 parent 286eb8c commit be9d082

File tree

3 files changed

+266
-0
lines changed

3 files changed

+266
-0
lines changed

docs/architecture/dynamo_flow.md

Lines changed: 265 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,265 @@
1+
<!--
2+
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
SPDX-License-Identifier: Apache-2.0
4+
5+
Licensed under the Apache License, Version 2.0 (the "License");
6+
you may not use this file except in compliance with the License.
7+
You may obtain a copy of the License at
8+
9+
http://www.apache.org/licenses/LICENSE-2.0
10+
11+
Unless required by applicable law or agreed to in writing, software
12+
distributed under the License is distributed on an "AS IS" BASIS,
13+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
See the License for the specific language governing permissions and
15+
limitations under the License.
16+
-->
17+
18+
# Dynamo Architecture Flow
19+
20+
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/llm](https://github.com/ai-dynamo/dynamo/tree/main/examples/llm). Color-coded flows indicate different types of operations:
21+
22+
## 🔵 Main Request Flow (Blue)
23+
The primary user journey through the system:
24+
25+
1. **Discovery (S1)**: Client discovers the service endpoint
26+
2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
27+
3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing
28+
4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker
29+
30+
## 🟠 Decision and Allocation Flow (Orange)
31+
The system's intelligent routing and resource allocation:
32+
33+
4. **Query (S4)**: Decode Worker queries for prefix cache hits to optimize processing
34+
5. **Disagg Decision (S5)**: Based on prefill length and queue size, the system decides whether it needs remote prefill
35+
5a. **Allocate (S5a)**: Decode Worker pre-allocates KV cache blocks in its local GPU memory
36+
6. **Queue (S6)**: If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
37+
38+
## 🟢 Prefill Worker Flow (Green)
39+
The dedicated prefill processing pipeline:
40+
41+
7. **NATS Pull (S7)**: PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers
42+
8. **Load Metadata (S8)**: PrefillWorker loads NIXL metadata from ETCD to establish GPU communication
43+
9. **Prefill (S9)**: Worker executes the prefill computation on the input tokens
44+
10. **NIXL Transfer (S10)**: Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker's pre-allocated blocks
45+
46+
## 🟣 Completion Flow (Purple)
47+
The response generation and delivery:
48+
49+
11. **Notify (S11)**: PrefillWorker sends completion notification to Decode Worker
50+
12. **Decode (S12)**: Decode Worker decodes from its local KV cache containing prefilled data
51+
13. **Response (S13)**: The system sends the generated response to the Processor for post-processing, then through the Frontend to the Client
52+
53+
## 🔗 Infrastructure Connections (Dotted lines)
54+
Coordination and messaging support:
55+
56+
### ETCD Connections (Gray, dotted)
57+
- **Frontend, Processor, Planner**: Service discovery and registration
58+
- **Decode Worker, PrefillWorker**: NIXL metadata storage for GPU communication setup
59+
60+
### NATS Connections (Teal, dotted)
61+
- **PrefillQueue**: JetStream consumer group for reliable work distribution
62+
- **Processor**: Load balancing across workers
63+
64+
### Planning Connections (Gold, dotted)
65+
- **Frontend → Planner**: Metrics collection for auto-scaling decisions
66+
- **Planner → Workers**: Resource scaling commands for both Decode Worker and PrefillWorker
67+
68+
## Technical Implementation Details
69+
70+
### NIXL (NVIDIA Interchange Library):
71+
- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
72+
- Decode Worker publishes GPU metadata to ETCD for coordination
73+
- PrefillWorker loads metadata to establish direct communication channels
74+
- Block-based transfers (64–128 tokens per block) for efficient batching
75+
76+
### Disaggregated KV Cache:
77+
- Each Decode Worker maintains local KV cache in its GPU memory
78+
- No shared storage bottlenecks—all transfers are direct worker-to-worker
79+
- Pre-allocated blocks ensure deterministic memory layout and performance
80+
81+
```mermaid
82+
%%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%%
83+
graph TD
84+
%% Top Layer - Client & Frontend
85+
Client["<b>HTTP Client</b>"]
86+
S1[["<b>1 DISCOVERY</b>"]]
87+
Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
88+
S2[["<b>2 REQUEST</b>"]]
89+
90+
%% Processing Layer
91+
Processor["<b>Processor</b><br/><i>Request Handler & Router</i>"]
92+
S3[["<b>3 VALIDATE</b>"]]
93+
94+
%% Infrastructure - Positioned strategically to minimize crossings
95+
subgraph INF["<b>Infrastructure Layer</b>"]
96+
ETCD[("<b>ETCD</b><br/><i>Service Discovery &<br/>NIXL Metadata</i>")]
97+
NATS[("<b>NATS</b><br/><i>Message Broker</i>")]
98+
Planner["<b>Planner</b><br/><i>Resource Management<br/>Auto-scaling</i>"]
99+
end
100+
101+
%% Worker Layer - Main processing
102+
subgraph WL["<b>Worker Layer</b>"]
103+
%% VllmWorker section
104+
VllmWorker["<b>Decode Worker</b><br/><i>Handles Decoding & Disagg Decisions</i>"]
105+
S4[["<b>4 QUERY</b>"]]
106+
S5[["<b>5 DISAGG DECISION</b>"]]
107+
S5a[["<b>5a ALLOCATE</b>"]]
108+
S12[["<b>12 DECODE</b>"]]
109+
S6[["<b>6 QUEUE</b>"]]
110+
S13[["<b>13 RESPONSE</b>"]]
111+
112+
%% Storage positioned near workers
113+
LocalKVCache[("<b>Local KV Cache</b><br/><i>Pre-allocated Blocks</i>")]
114+
115+
%% Prefill System - Right side to minimize crossings
116+
subgraph PS["<b>Prefill System</b>"]
117+
PrefillQueue["<b>Prefill Queue</b><br/><i>NATS JetStream<br/>Consumer Group</i>"]
118+
PrefillWorker["<b>Prefill Worker</b><br/><i>Dedicated Prefill Processing<br/>(Multiple Instances)</i>"]
119+
S7[["<b>7 NATS PULL</b>"]]
120+
S8[["<b>8 LOAD METADATA</b>"]]
121+
S9[["<b>9 PREFILL</b>"]]
122+
S10[["<b>10 NIXL TRANSFER</b>"]]
123+
S11[["<b>11 NOTIFY</b>"]]
124+
end
125+
end
126+
127+
%% Main Request Flow (Blue) - Clean vertical flow
128+
Client -.-> S1
129+
S1 -->|HTTP API Call| Frontend
130+
Frontend -.-> S2
131+
S2 -->|Process & Validate| Processor
132+
Processor -.-> S3
133+
S3 -->|Route to Worker| VllmWorker
134+
135+
%% VllmWorker Internal Flow (Orange)
136+
VllmWorker -.-> S4
137+
S4 -->|Query Prefix Cache Hit| S5
138+
S5 -->|Prefill Length & Queue Check| S5a
139+
S5a -->|Continue to Decode| S12
140+
141+
%% Allocation & Queuing (Orange) - Minimize crossings
142+
S5a -->|Allocate KV Cache Blocks| LocalKVCache
143+
VllmWorker --> S6
144+
S6 -->|Put RemotePrefillRequest| PrefillQueue
145+
146+
%% Prefill Worker Flow (Green) - Self-contained within PS
147+
PrefillQueue -.-> S7
148+
S7 -->|Consumer Group Pull| PrefillWorker
149+
PrefillWorker -.-> S8
150+
PrefillWorker -.-> S9
151+
S9 -->|Execute Prefill| S10
152+
S10 -->|Direct GPU Transfer| LocalKVCache
153+
PrefillWorker --> S11
154+
155+
%% Return Flow (Purple) - Clean return path
156+
S11 -->|Completion Notification| S12
157+
S12 -->|Decode from KV Cache| S13
158+
S13 -->|Post-process Response| Processor
159+
Processor -->|HTTP Response| Frontend
160+
Frontend -->|Final Response| Client
161+
162+
%% Infrastructure Connections - Organized to avoid crossings
163+
%% ETCD Connections - Grouped by proximity
164+
Frontend -.->|Service Discovery| ETCD
165+
Processor -.->|Service Discovery| ETCD
166+
VllmWorker -.->|NIXL Metadata| ETCD
167+
PrefillWorker -.->|NIXL Metadata| ETCD
168+
S8 -.->|Load NIXL Metadata| ETCD
169+
Planner -.->|Service Discovery| ETCD
170+
171+
%% NATS Connections - Direct to queue system
172+
PrefillQueue -.->|JetStream| NATS
173+
Processor -.->|Load Balancing| NATS
174+
175+
%% Planning Connections - Strategic positioning
176+
Frontend -.->|Metrics| Planner
177+
Planner -.->|Auto-scaling| VllmWorker
178+
Planner -.->|Auto-scaling| PrefillWorker
179+
180+
%% Styling - Each component with unique colors
181+
classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px
182+
classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px
183+
classDef processor fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px
184+
classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px
185+
classDef prefillQueue fill:#fff8e1,stroke:#E65100,stroke-width:3px
186+
classDef prefillWorker fill:#fce4ec,stroke:#C2185B,stroke-width:3px
187+
classDef prefillBox fill:#eceff1,stroke:#455A64,stroke-width:3px
188+
classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px
189+
classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px
190+
classDef etcd fill:#fff9c4,stroke:#F9A825,stroke-width:3px
191+
classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px
192+
classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px
193+
classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px
194+
195+
196+
class Client client
197+
class Frontend frontend
198+
class Processor processor
199+
class VllmWorker worker
200+
class PrefillQueue prefillQueue
201+
class PrefillWorker prefillWorker
202+
class Planner planner
203+
class LocalKVCache storage
204+
class ETCD etcd
205+
class NATS nats
206+
class PS prefillBox
207+
class INF infraLayer
208+
class WL workerLayer
209+
210+
211+
212+
%% Flow Colors - Different line styles to reduce visual clutter
213+
%% Main Request Flow - Blue (solid)
214+
linkStyle 0 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
215+
linkStyle 1 stroke:#1565C0,stroke-width:4px
216+
linkStyle 2 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
217+
linkStyle 3 stroke:#1565C0,stroke-width:4px
218+
linkStyle 4 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
219+
linkStyle 5 stroke:#1565C0,stroke-width:4px
220+
221+
%% Decision & Allocation Flow - Orange (mixed)
222+
linkStyle 6 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
223+
linkStyle 7 stroke:#E65100,stroke-width:4px
224+
linkStyle 8 stroke:#E65100,stroke-width:4px
225+
linkStyle 9 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
226+
227+
%% KV Cache & Queue - Orange (solid)
228+
linkStyle 10 stroke:#E65100,stroke-width:4px
229+
linkStyle 11 stroke:#E65100,stroke-width:4px
230+
linkStyle 12 stroke:#E65100,stroke-width:4px
231+
232+
%% Prefill Worker Flow - Green (mixed)
233+
linkStyle 13 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
234+
linkStyle 14 stroke:#2E7D32,stroke-width:4px
235+
linkStyle 15 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
236+
linkStyle 16 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
237+
linkStyle 17 stroke:#2E7D32,stroke-width:4px
238+
linkStyle 18 stroke:#2E7D32,stroke-width:4px
239+
linkStyle 19 stroke:#2E7D32,stroke-width:4px
240+
241+
%% Completion Flow - Purple (mixed)
242+
linkStyle 20 stroke:#6A1B9A,stroke-width:4px
243+
linkStyle 21 stroke:#6A1B9A,stroke-width:3px,stroke-dasharray: 3 3
244+
linkStyle 22 stroke:#6A1B9A,stroke-width:4px
245+
linkStyle 23 stroke:#6A1B9A,stroke-width:4px
246+
linkStyle 24 stroke:#6A1B9A,stroke-width:4px
247+
248+
%% Infrastructure Flows - Lighter and dotted to reduce visual noise
249+
%% ETCD Connections - Gray (dotted, thinner)
250+
linkStyle 25 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
251+
linkStyle 26 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
252+
linkStyle 27 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
253+
linkStyle 28 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
254+
linkStyle 29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
255+
linkStyle 30 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
256+
257+
%% NATS Connections - Teal (dotted, thinner)
258+
linkStyle 31 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
259+
linkStyle 32 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
260+
261+
%% Planning Connections - Gold (dotted, thinner)
262+
linkStyle 33 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
263+
linkStyle 34 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
264+
linkStyle 35 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
265+
```

docs/images/dynamo_flow.png

1.33 MB
Loading

docs/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -82,6 +82,7 @@ The examples below assume you build the latest image yourself from source. If us
8282
KV Block Manager <architecture/kvbm_intro.rst>
8383
KV Cache Routing <architecture/kv_cache_routing.md>
8484
Planner <architecture/planner_intro.rst>
85+
Dynamo Architecture Flow <architecture/dynamo_flow.md>
8586

8687
.. toctree::
8788
:hidden:

0 commit comments

Comments
 (0)