Commit 096d117

docs: update router docs (#2148)
1 parent 708d7c3 commit 096d117

2 files changed

+146
-120
lines changed

docs/components/router/README.md

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
1+
<!--
2+
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
SPDX-License-Identifier: Apache-2.0
4+
-->
5+
6+
# KV Router
7+
8+
## Overview
9+
10+
Dynamo's KV Router decides where to send each request by estimating the computational cost of processing it on each worker. The estimate combines the decoding cost (active blocks) and the prefill cost (new blocks that must be computed). Tuning the KV Router is important for achieving high throughput and low latency in a distributed inference setup.
11+
12+
## Quick Start
13+
14+
To launch the Dynamo frontend with the KV Router:
15+
16+
```bash
python -m dynamo.frontend --router-mode kv --http-port 8080
```
19+
20+
This command:
21+
- Launches the Dynamo frontend service with KV routing enabled
22+
- Exposes the service on port 8080 (configurable)
23+
- Automatically handles all backend workers registered to the Dynamo endpoint
24+
25+
Backend workers can register themselves using the `register_llm` API, and the KV Router will automatically include them in its routing decisions. The router will:
26+
- Track the state of all registered workers
27+
- Make intelligent routing decisions based on KV cache overlap
28+
- Balance load across available workers
29+
30+
### Important Arguments
31+
32+
The KV Router supports several key configuration options:
33+
34+
- **`--kv-cache-block-size <size>`**: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration.
35+
36+
- **`--router-temperature <float>`**: Controls routing randomness (default: 0.0)
37+
- `0.0`: Deterministic selection of the best worker
38+
- `> 0.0`: Probabilistic selection using softmax sampling
39+
- Higher values increase randomness, helping prevent worker saturation
40+
41+
- **`--kv-events` / `--no-kv-events`**: Controls how the router tracks cached blocks (default: `--kv-events`)
42+
- `--kv-events`: Uses real-time events from workers for accurate cache tracking
43+
- `--no-kv-events`: Uses approximation based on routing decisions (lower overhead, less accurate)
44+
45+
For a complete list of available options:
46+
```bash
python -m dynamo.frontend --help
```
49+
50+
## KV Router Architecture
51+
52+
The KV Router tracks two key metrics for each worker:
53+
54+
1. **Potential Active Blocks**: The total number of blocks that would be actively used for decoding if a request were routed to that worker. This includes existing active blocks plus new blocks from the incoming request.
55+
56+
2. **Potential New Prefill Blocks**: The number of new blocks that would need to be prefilled (computed from scratch) on that worker, derived from the input tokens not covered by cached overlap:
57+
- New prefill tokens = Total input tokens - (Overlap blocks × Block size)
58+
- Potential prefill blocks = New prefill tokens / Block size
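
The two formulas above can be sketched as a small helper. This is a minimal illustration, not the router's actual code: the function name and the clamp at zero are assumptions, and the result is kept fractional because the router's log output shown later in this document reports fractional block counts.

```python
def potential_prefill_blocks(total_input_tokens: int,
                             overlap_blocks: int,
                             block_size: int) -> float:
    """Blocks that would still need prefilling on a given worker."""
    cached_tokens = overlap_blocks * block_size
    # Assumption: clamp at zero as a safety guard against inconsistent inputs.
    new_prefill_tokens = max(total_input_tokens - cached_tokens, 0)
    return new_prefill_tokens / block_size


# 1000 input tokens, 40 overlapping blocks of 16 tokens each:
# 1000 - 40 * 16 = 360 new tokens, i.e. 360 / 16 = 22.5 blocks.
print(potential_prefill_blocks(1000, overlap_blocks=40, block_size=16))  # 22.5
```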
59+
60+
### Block Tracking Mechanisms
61+
62+
The router maintains block information through two complementary systems:
63+
64+
- **Active Decoding Blocks**: Tracked locally by the router based on the request lifecycle:
65+
- Incremented when a new request is added
66+
- Updated as new tokens are generated
67+
- Decremented when a request completes
68+
69+
- **Cached Blocks**: Maintained globally by the KvIndexer, which builds a prefix tree from KV events reported by workers. This provides accurate overlap information for routing decisions.
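
As a rough mental model of the two mechanisms above, the following sketch pairs a router-local counter with a toy prefix tree. All class and method names are hypothetical, the use of per-block hashes as tree keys is an assumption, and only "block stored" events are modelled. It is not the KvIndexer implementation.

```python
class ActiveBlockTracker:
    """Router-local view of blocks used by in-flight requests."""

    def __init__(self, block_size: int):
        self.block_size = block_size
        self._requests = {}  # request_id -> (worker, tokens so far)

    def add_request(self, request_id, worker, prompt_tokens):
        self._requests[request_id] = (worker, prompt_tokens)

    def on_tokens_generated(self, request_id, count=1):
        worker, tokens = self._requests[request_id]
        self._requests[request_id] = (worker, tokens + count)

    def complete(self, request_id):
        self._requests.pop(request_id, None)

    def active_blocks(self, worker) -> int:
        # Ceiling division: a partially filled block still occupies a block.
        return sum(-(-tokens // self.block_size)
                   for w, tokens in self._requests.values() if w == worker)


class PrefixIndex:
    """Toy prefix tree: each node records which workers cache that block."""

    def __init__(self):
        self._root = {}  # block_hash -> (set of workers, child nodes)

    def on_blocks_stored(self, worker, block_hashes):
        node = self._root
        for block_hash in block_hashes:
            workers, children = node.setdefault(block_hash, (set(), {}))
            workers.add(worker)
            node = children

    def overlap(self, block_hashes) -> dict:
        """Length of the cached prefix, in blocks, for each worker."""
        scores, node, candidates = {}, self._root, None
        for depth, block_hash in enumerate(block_hashes, start=1):
            if block_hash not in node:
                break
            workers, children = node[block_hash]
            candidates = workers if candidates is None else candidates & workers
            if not candidates:
                break
            for worker in candidates:
                scores[worker] = depth
            node = children
        return scores
```

For example, after `on_blocks_stored("w1", ["a", "b", "c"])` and `on_blocks_stored("w2", ["a", "b"])`, calling `overlap(["a", "b", "c", "d"])` reports an overlap of 3 blocks for `w1` and 2 blocks for `w2`.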
70+
71+
## Cost Function
72+
73+
The KV Router's routing decision is based on a simple cost function:
74+
75+
```
logit = kv_overlap_score_weight × potential_prefill_blocks + potential_active_blocks
```
78+
79+
Where:
80+
- Lower logit values are better (less computational cost)
81+
- With a temperature of `0.0` the router picks the lowest-cost worker; with a temperature above `0.0` it selects workers by softmax sampling
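
The cost function and the selection step can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and the softmax is applied to the negated logits divided by the temperature so that cheaper workers receive higher probability. Any normalization the router applies before sampling is not described here and is not modelled.

```python
import math
import random


def worker_logit(prefill_blocks, active_blocks, kv_overlap_score_weight=1.0):
    """Cost of routing to a worker; lower is better."""
    return kv_overlap_score_weight * prefill_blocks + active_blocks


def selection_probabilities(logits: dict, temperature: float) -> dict:
    """Softmax over negated logits (assumed form), so low cost means high probability."""
    lowest = min(logits.values())
    # Subtracting the minimum keeps exp() numerically stable.
    weights = {w: math.exp(-(cost - lowest) / temperature)
               for w, cost in logits.items()}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}


def select_worker(logits: dict, temperature: float = 0.0, rng=random):
    if temperature <= 0.0:
        # Deterministic: pick the cheapest worker.
        return min(logits, key=logits.get)
    probs = selection_probabilities(logits, temperature)
    workers = list(probs)
    return rng.choices(workers, weights=[probs[w] for w in workers], k=1)[0]


logits = {
    "worker_a": worker_logit(prefill_blocks=10, active_blocks=5),   # 15.0
    "worker_b": worker_logit(prefill_blocks=2, active_blocks=20),   # 22.0
}
print(select_worker(logits))  # worker_a
```

With a positive temperature, `worker_b` is still chosen some of the time, which is what spreads load away from a single best-scoring worker.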
82+
83+
### Key Parameter: kv-overlap-score-weight
84+
85+
The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance between prefill and decode optimization:
86+
87+
- **Higher values (> 1.0)**: Emphasize reducing prefill cost
88+
- Prioritizes routing to workers with better cache hits
89+
- Optimizes for Time To First Token (TTFT)
90+
- Best for workloads where initial response latency is critical
91+
92+
- **Lower values (< 1.0)**: Emphasize decode performance
93+
- Distributes active decoding blocks more evenly
94+
- Optimizes for Inter-Token Latency (ITL)
95+
- Best for workloads with long generation sequences
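
To see how the weight shifts the decision, consider two hypothetical workers: `A` has a good cache hit but is busy decoding, while `B` has no useful cache but is mostly idle. The numbers below are invented for illustration only.

```python
def worker_logit(prefill_blocks, active_blocks, kv_overlap_score_weight):
    """Cost function from this document; lower is better."""
    return kv_overlap_score_weight * prefill_blocks + active_blocks


# Hypothetical state: (potential prefill blocks, potential active blocks)
workers = {"A": (10, 60), "B": (40, 20)}

for weight in (0.5, 1.0, 2.0):
    costs = {name: worker_logit(prefill, active, weight)
             for name, (prefill, active) in workers.items()}
    best = min(costs, key=costs.get)
    print(weight, costs, best)

# 0.5 {'A': 65.0, 'B': 40.0} B
# 1.0 {'A': 70.0, 'B': 60.0} B
# 2.0 {'A': 80.0, 'B': 100.0} A
```

At a weight of 2.0 the cache hit on `A` outweighs its decode load, while at 1.0 or below the lighter-loaded `B` wins.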
96+
97+
## KV Events vs. Approximation Mode
98+
99+
By default, the router uses KV events from workers to maintain an accurate global view of cached blocks. However, you can disable this with the `--no-kv-events` flag:
100+
101+
- **With KV Events (default)**:
102+
- Accurate overlap calculation based on actual cached blocks
103+
- Higher accuracy, at the cost of event-processing overhead
104+
- Best for production deployments
105+
106+
- **Without KV Events (--no-kv-events)**:
107+
- Uses the ApproxKvIndexer to approximate cached blocks based on routing decisions
108+
- Assumes that recently routed requests will have their blocks cached
109+
- Lower overhead but potentially less accurate routing
110+
- Useful for testing or environments where event processing is a bottleneck
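
The idea behind approximation mode can be sketched as an index that is updated from routing decisions rather than from worker events. This is not the ApproxKvIndexer itself: the class name, the per-block hashes, and the fixed time-to-live used to model "recently routed" are all assumptions made for illustration.

```python
import time


class ApproxCacheIndex:
    """Assumes blocks of a routed request stay cached on that worker for a while."""

    def __init__(self, ttl_seconds: float = 120.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # hypothetical retention window
        self._clock = clock
        self._expiry = {}  # (worker, block_hash) -> expiry timestamp

    def record_routing(self, worker, block_hashes):
        """Called when the router sends a request to `worker`."""
        expires_at = self._clock() + self.ttl_seconds
        for block_hash in block_hashes:
            self._expiry[(worker, block_hash)] = expires_at

    def overlap(self, worker, block_hashes) -> int:
        """Estimated number of leading blocks still cached on `worker`."""
        now = self._clock()
        count = 0
        for block_hash in block_hashes:
            expires_at = self._expiry.get((worker, block_hash))
            if expires_at is None or expires_at <= now:
                break  # only a contiguous prefix counts as overlap
            count += 1
        return count
```

Because nothing confirms that the worker actually kept the blocks, the estimate can drift from reality under eviction pressure, which is the accuracy trade-off described above.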
111+
112+
## Tuning Guidelines
113+
114+
### 1. Understand Your Workload Characteristics
115+
116+
- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
117+
- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
118+
119+
### 2. Monitor Key Metrics
120+
121+
The router logs the cost calculation for each worker:
122+
```
Formula for worker_1: 125.5 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
```
125+
126+
This shows:
127+
- Total cost (125.5)
128+
- Overlap weight × prefill blocks (1.0 × 100.5)
129+
- Active blocks (25.0)
130+
- Cached blocks that contribute to overlap (15)
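
When monitoring many workers, it can help to parse these lines programmatically. The sketch below assumes the log line has exactly the layout shown above, which may differ between versions. The sample line for `worker_2` is made up for illustration.

```python
import re

# Assumed layout: "Formula for <worker>: <total> = <weight> * <prefill> + <active> (cached_blocks: <n>)"
COST_LINE = re.compile(
    r"Formula for (?P<worker>\S+): (?P<total>[\d.]+) = (?P<weight>[\d.]+) \* "
    r"(?P<prefill>[\d.]+) \+ (?P<active>[\d.]+) \(cached_blocks: (?P<cached>\d+)\)"
)


def parse_cost_line(line: str):
    """Return the cost components as a dict, or None if the line does not match."""
    match = COST_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "worker": fields["worker"],
        "total": float(fields["total"]),
        "weight": float(fields["weight"]),
        "prefill_blocks": float(fields["prefill"]),
        "active_blocks": float(fields["active"]),
        "cached_blocks": int(fields["cached"]),
    }


# Hypothetical sample line, not taken from a real deployment.
sample = "Formula for worker_2: 60.0 = 2.0 * 20.0 + 20.0 (cached_blocks: 8)"
parsed = parse_cost_line(sample)
print(parsed["worker"], parsed["total"], parsed["cached_blocks"])  # worker_2 60.0 8
```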
131+
132+
### 3. Temperature-Based Routing
133+
134+
The `--router-temperature` argument controls routing randomness:
135+
- **0.0 (default)**: Deterministic selection of the best worker
136+
- **> 0.0**: Probabilistic selection, higher values increase randomness
137+
- Useful for preventing worker saturation and improving load distribution
138+
139+
### 4. Iterative Optimization
140+
141+
1. Start with default settings
142+
2. Monitor TTFT and ITL metrics
143+
3. Adjust `kv-overlap-score-weight` based on your optimization goals:
144+
- If TTFT is too high: Increase the weight
145+
- If ITL is too high: Decrease the weight
146+
4. Increase temperature if severe load imbalance occurs
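
The loop above can be expressed as a small decision helper. This is only a sketch of the stated rules: the function name, the step sizes, and the choice to leave the weight unchanged when TTFT and ITL are both too high are assumptions, not recommendations from the router itself.

```python
def suggest_adjustment(weight, temperature,
                       ttft_too_high, itl_too_high, severe_imbalance,
                       weight_step=1.25, temperature_step=0.1):
    """Return a (weight, temperature) pair to try in the next iteration."""
    if ttft_too_high and not itl_too_high:
        weight *= weight_step        # favor cache hits to cut prefill time
    elif itl_too_high and not ttft_too_high:
        weight /= weight_step        # favor even decode load
    # If both are too high the two rules conflict, so the weight is left alone.
    if severe_imbalance:
        temperature += temperature_step
    return weight, temperature


# Start from defaults; TTFT is above target and load is balanced.
print(suggest_adjustment(1.0, 0.0, ttft_too_high=True,
                         itl_too_high=False, severe_imbalance=False))  # (1.25, 0.0)
```

Re-measure TTFT and ITL after each change before applying another step, so that the effect of a single adjustment can be isolated.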

docs/guides/kv_router_perf_tuning.md

Lines changed: 0 additions & 120 deletions
This file was deleted.
