docs/architecture/kv_cache_routing.md
93 additions & 16 deletions
@@ -34,9 +34,100 @@ The main KV-aware routing arguments:
 >
 When `--kv-overlap-score-weight` is set to 0 or `--no-kv-events` is set, no KvIndexer is launched to drain and process KV events. In that case, we recommend stopping your backend workers from relaying events through `KvEventPublisher` so that events do not accumulate in JetStream. Work is in progress to allow disabling KV event publishing completely in these cases.
 
-## Architecture
+## Overview
+
+The KV-aware router operates on two key principles to optimize request routing:
+
+### Global KV Cache State via JetStream
+
+First, KV events from engines are sent to a persistent NATS JetStream. Each KV router/indexer replica acts as a durable consumer, pulling messages from this shared stream to maintain a global view of cached blocks across all engines. This architecture ensures consistency across router replicas and persistence across restarts.
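As a rough illustration of the paragraph above, the sketch below uses an in-memory list as a stand-in for the persistent stream and gives each indexer replica its own named cursor. Everything here is an assumption made for illustration: the `EventStream` and `KvIndexer` classes, the `("stored" | "removed", worker_id, block_hash)` event shape, and the flat block-to-workers map are hypothetical and are not Dynamo's or NATS's actual API. It only shows why replicas that read the same retained stream converge on the same view.

```python
from collections import defaultdict


class EventStream:
    """In-memory stand-in for a persistent stream: events are retained, not consumed."""

    def __init__(self):
        self.events = []
        self.cursors = {}  # consumer name -> index of the next unread event

    def publish(self, event):
        self.events.append(event)

    def pull(self, consumer, batch=10):
        # Each named consumer advances its own cursor independently.
        start = self.cursors.get(consumer, 0)
        pulled = self.events[start:start + batch]
        self.cursors[consumer] = start + len(pulled)
        return pulled


class KvIndexer:
    """Hypothetical indexer replica: maps block hash -> workers caching that block."""

    def __init__(self, name, stream):
        self.name = name
        self.stream = stream
        self.block_owners = defaultdict(set)

    def drain(self):
        # Pull batches until this consumer has caught up with the stream.
        while True:
            pulled = self.stream.pull(self.name)
            if not pulled:
                break
            for kind, worker_id, block_hash in pulled:
                if kind == "stored":
                    self.block_owners[block_hash].add(worker_id)
                elif kind == "removed":
                    self.block_owners[block_hash].discard(worker_id)

    def overlap(self, block_hashes, worker_id):
        # Count the leading blocks of a request that the worker already caches.
        count = 0
        for block_hash in block_hashes:
            if worker_id not in self.block_owners.get(block_hash, ()):
                break
            count += 1
        return count


stream = EventStream()
for event in [("stored", 0, "h1"), ("stored", 0, "h2"),
              ("stored", 1, "h1"), ("removed", 0, "h2")]:
    stream.publish(event)

replica_a = KvIndexer("router-a", stream)
replica_b = KvIndexer("router-b", stream)
replica_a.drain()
replica_b.drain()
```

In this sketch a replica that starts later under a new consumer name replays the stream from the beginning and rebuilds the same map. How the real system restores state after a restart is not described in this section, so treat the replay behaviour as an assumption.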
+### Local Active Block Management with Replica Sync
+
+Second, in addition to cached blocks, each router replica needs to track active blocks (blocks being used for ongoing generation) as load metrics. Since this information is highly time-sensitive, it must be predicted immediately when:
+- The router receives and routes a request
+- The first token is generated (prefill complete)
+- The response ends (request freed)
+
+This is managed locally in each router via a "slot manager". To maintain consistency across the system, router replicas synchronize these local predictions with each other through NATS core messaging.
+
+```mermaid
+sequenceDiagram
+    participant C1 as Client 1
+    participant R1 as Router 1<br/>(Slot Manager)
+    participant R2 as Router 2<br/>(Slot Manager)
+    participant C2 as Client 2
+
+    Note over R1,R2: Router Replica Sync Enabled
+
+    C1->>R1: Request A
+    activate R1
+    R1->>R1: Predict blocks & route to worker
+    R1-->>R2: Sync: AddRequest(A)
+
+    C2->>R2: Request B
+    activate R2
+    R2->>R2: Predict blocks & route to worker
+    R2-->>R1: Sync: AddRequest(B)
+
+    R1->>R1: First token received<br/>(prefill complete)
+    R1-->>R2: Sync: MarkPrefillCompleted(A)
+    R1->>C1: Stream response
 
-Colloquially, we refer to a Dynamo component that serves an endpoint for LLM inference as a **worker**.
+    R2->>R2: First token received<br/>(prefill complete)
+    R2-->>R1: Sync: MarkPrefillCompleted(B)
+    R2->>C2: Stream response
+
+    R1->>R1: Response complete<br/>(free blocks)
+    R1-->>R2: Sync: Free(A)
+    deactivate R1
+
+    R2->>R2: Response complete<br/>(free blocks)
+    R2-->>R1: Sync: Free(B)
+    deactivate R2
+
+    Note over R1,R2: Both routers have consistent<br/>view of active blocks
+```
+
+This dual-layer approach (persistent global KV cache state via JetStream, plus ephemeral active-block synchronization between router replicas) lets the router balance cache reuse against load distribution when making routing decisions.
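The three sync events in the sequence diagram (`AddRequest`, `MarkPrefillCompleted`, `Free`) can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the actual slot manager: the `SlotManager` and `RouterReplica` classes, the tuple event encoding, and the per-request block count are hypothetical, and a direct in-process call stands in for NATS core messaging.

```python
class SlotManager:
    """Hypothetical tracker of predicted active blocks per worker."""

    def __init__(self):
        # request_id -> {"worker": int, "blocks": int, "prefill_done": bool}
        self.requests = {}

    def apply(self, event):
        kind, request_id, *args = event
        if kind == "AddRequest":
            worker_id, blocks = args
            self.requests[request_id] = {
                "worker": worker_id,
                "blocks": blocks,
                "prefill_done": False,
            }
        elif kind == "MarkPrefillCompleted":
            entry = self.requests.get(request_id)
            if entry is not None:
                entry["prefill_done"] = True
        elif kind == "Free":
            self.requests.pop(request_id, None)
        else:
            raise ValueError(f"unknown event kind: {kind}")

    def active_blocks(self, worker_id):
        # Load metric: blocks currently predicted to be in use on this worker.
        return sum(
            entry["blocks"]
            for entry in self.requests.values()
            if entry["worker"] == worker_id
        )


class RouterReplica:
    """Applies each event locally, then forwards it to peer replicas."""

    def __init__(self):
        self.slots = SlotManager()
        self.peers = []

    def publish(self, event):
        self.slots.apply(event)
        for peer in self.peers:
            # Stand-in for the replica sync message sent over the network.
            peer.slots.apply(event)


r1, r2 = RouterReplica(), RouterReplica()
r1.peers.append(r2)
r2.peers.append(r1)

r1.publish(("AddRequest", "A", 0, 4))   # Router 1 routes request A to worker 0
r2.publish(("AddRequest", "B", 1, 2))   # Router 2 routes request B to worker 1
r1.publish(("MarkPrefillCompleted", "A"))
```

After these calls both replicas hold the same entries for requests A and B, which is the property the final note in the diagram describes. Publishing `("Free", "A")` from either replica would remove request A from both.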
 
 ## Basic Routing
 Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
@@ -182,20 +273,6 @@ Example calculation with `overlap_score_weight = 1.0`:
 
 ## Events
 
-Dynamo supports KV Cache Routing across multiple backend implementations through a flexible event system. The KVPublisher component integrates with any framework to emit KV events, while the KVIndexer component maintains a global prefix tree of cached blocks by processing these events from all workers.