Commit d9d384e

PeaBrane authored and nv-anants committed
feat: python bindings for the entire KvPushRouter + per-request router configs (#2658)
1 parent 072c063 commit d9d384e

18 files changed

+595
-218
lines changed


Cargo.lock

Lines changed: 0 additions & 14 deletions

Cargo.toml

Lines changed: 0 additions & 1 deletion
@@ -4,7 +4,6 @@
 [workspace]
 members = [
     "components/metrics",
-    "components/router",
     "launch/*",
     "lib/llm",
     "lib/runtime",

components/README.md

Lines changed: 0 additions & 9 deletions
@@ -49,15 +49,6 @@ The frontend component provides the HTTP API layer and request processing:
 - **Router** - Routes requests to appropriate workers based on load and KV cache state
 - **Auto-discovery** - Automatically discovers and registers available workers
 
-### [Router](router/)
-
-A high-performance request router written in Rust that:
-
-- Routes incoming requests to optimal workers based on KV cache state
-- Implements KV-aware routing to minimize cache misses
-- Provides load balancing across multiple worker instances
-- Supports both aggregated and disaggregated serving patterns
-
 ### [Planner](planner/)
 
 The planner component monitors system state and dynamically adjusts worker allocation:

components/router/Cargo.toml

Lines changed: 0 additions & 37 deletions
This file was deleted.

components/router/src/main.rs

Lines changed: 0 additions & 98 deletions
This file was deleted.

docs/components/router/README.md

Lines changed: 68 additions & 1 deletion
@@ -143,4 +143,71 @@ The `router_temperature` parameter controls routing randomness:
 3. Adjust `kv-overlap-score-weight` to meet your performance goals:
    - To reduce TTFT: Increase the weight
    - To reduce ITL: Decrease the weight
-4. If you observe severe load imbalance, increase the temperature setting
+4. If you observe severe load imbalance, increase the temperature setting
+
+## Using KvPushRouter Python API
+
+Instead of launching the KV Router via command line, you can create a `KvPushRouter` object directly in Python. This allows per-request routing configuration overrides.
+
+### Setup
+
+First, launch your backend engines:
+```bash
+python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf --endpoint dyn://inference.vllm.generate
+```
+
+### Example Script
+
+```python
+import asyncio
+from dynamo._core import DistributedRuntime, KvPushRouter, KvRouterConfig
+
+async def main():
+    # Get runtime and create endpoint
+    runtime = DistributedRuntime.detached()
+    namespace = runtime.namespace("inference")
+    component = namespace.component("vllm")
+    endpoint = component.endpoint("generate")
+
+    # Create KV router
+    kv_router_config = KvRouterConfig()
+    router = KvPushRouter(
+        endpoint=endpoint,
+        block_size=16,
+        kv_router_config=kv_router_config
+    )
+
+    # Your input tokens
+    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+
+    # Generate with per-request routing override
+    stream = await router.generate(
+        token_ids=token_ids,
+        model="meta-llama/Llama-2-7b-hf",
+        stop_conditions={
+            "max_tokens": 20,  # Generate exactly 20 tokens
+            "ignore_eos": True,  # Don't stop at EOS token
+        },
+        sampling_options={
+            "temperature": 0.7,
+            "top_p": 0.9,
+        },
+        router_config_override={
+            "overlap_score_weight": 2.0,  # Prioritize cache hits for this request
+            "router_temperature": 0.5,  # Add routing randomness
+        }
+    )
+
+    # Collect generated tokens
+    generated_tokens = []
+    async for response in stream:
+        if isinstance(response, dict) and "token_ids" in response:
+            generated_tokens.extend(response["token_ids"])
+
+    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+The `router_config_override` parameter allows you to adjust routing behavior per request without recreating the router. This is useful for implementing different routing strategies based on request characteristics.
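
As a minimal sketch of what "different routing strategies based on request characteristics" could look like, the helper below picks a `router_config_override` dictionary from the prompt length. The function name `choose_router_override`, the 512-token threshold, and the specific weight and temperature values are hypothetical and chosen only for illustration. Only the two keys, `overlap_score_weight` and `router_temperature`, come from the example script in this diff.

```python
from typing import Dict, List


def choose_router_override(
    token_ids: List[int],
    long_prompt_threshold: int = 512,  # hypothetical cutoff, not a recommended value
) -> Dict[str, float]:
    """Pick per-request routing overrides from the prompt length.

    Hypothetical helper: the threshold and the returned values are
    illustrative assumptions, not tuned recommendations.
    """
    if len(token_ids) >= long_prompt_threshold:
        # Long prompts have more prefill work to save, so weight cache
        # overlap more heavily and use less routing randomness.
        return {"overlap_score_weight": 2.0, "router_temperature": 0.1}
    # Short prompts gain less from cache hits, so weight overlap less
    # and allow more randomness to spread load across workers.
    return {"overlap_score_weight": 0.5, "router_temperature": 0.5}
```

Under these assumptions, the result would be passed as `router_config_override=choose_router_override(token_ids)` in a `router.generate(...)` call like the one in the example script.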

docs/hidden_toctree.rst

Lines changed: 0 additions & 1 deletion
@@ -39,7 +39,6 @@
    components/backends/sglang/docs/multinode-examples.md
    components/backends/sglang/docs/sgl-http-server.md
    components/backends/sglang/slurm_jobs/README.md
-   components/router/README.md
    examples/README.md
    guides/dynamo_deploy/create_deployment.md
    guides/dynamo_deploy/sla_planner_deployment.md

examples/basics/multinode/README.md

Lines changed: 0 additions & 2 deletions
@@ -382,8 +382,6 @@ python -m dynamo.frontend \
 
 However, for maximum performance with shared prefixes and multi-turn conversations, KV routing provides significant advantages by minimizing redundant computation.
 
-For detailed router configuration and tuning options, see the [KV Router Documentation](../../../docs/components/router/README.md).
-
 ## Monitoring and Debugging
 
 ### Check Worker Registration

lib/bindings/python/Cargo.lock

Lines changed: 0 additions & 1 deletion

lib/bindings/python/rust/lib.rs

Lines changed: 2 additions & 0 deletions
@@ -111,6 +111,8 @@ fn _core(m: &Bound<'_, PyModule>) -> PyResult<()> {
     m.add_class::<llm::kv::WorkerStats>()?;
     m.add_class::<llm::kv::KvStats>()?;
     m.add_class::<llm::kv::SpecDecodeStats>()?;
+    m.add_class::<llm::kv::KvPushRouter>()?;
+    m.add_class::<llm::kv::KvPushRouterStream>()?;
     m.add_class::<RouterMode>()?;
 
     engine::add_to_module(m)?;
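
The two `add_class` registrations above are what make the router types importable from Python. A quick way to check that a given build exposes them might look like the sketch below. The helper name `missing_bindings` is hypothetical, and the sketch assumes the Python-visible class names match the Rust type names (`KvPushRouter`, `KvPushRouterStream`), which this hunk does not confirm.

```python
import importlib
from typing import Iterable, List


def missing_bindings(module_name: str, names: Iterable[str]) -> List[str]:
    """Return the entries of `names` that `module_name` does not expose."""
    wanted = list(names)
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is not importable, so every name is missing.
        return wanted
    return [name for name in wanted if not hasattr(module, name)]


if __name__ == "__main__":
    # Assumption: the classes are exported under their Rust type names.
    print(missing_bindings("dynamo._core", ["KvPushRouter", "KvPushRouterStream"]))
```

An empty list would suggest both classes are available. A non-empty list names whatever could not be found, including the case where `dynamo._core` is not installed at all.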
