ai-dynamo
diff --git a/Cargo.lock b/Cargo.lock
Lines changed: 0 additions & 14 deletions
diff --git a/Cargo.toml b/Cargo.toml
Lines changed: 0 additions & 1 deletion
diff --git a/components/README.md b/components/README.md
Lines changed: 0 additions & 9 deletions
diff --git a/components/router/Cargo.toml b/components/router/Cargo.toml
Lines changed: 0 additions & 37 deletions
diff --git a/components/router/src/main.rs b/components/router/src/main.rs
Lines changed: 0 additions & 98 deletions
diff --git a/docs/components/router/README.md b/docs/components/router/README.md
Lines changed: 68 additions & 1 deletion
diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
Lines changed: 0 additions & 1 deletion
diff --git a/examples/basics/multinode/README.md b/examples/basics/multinode/README.md
Lines changed: 0 additions & 2 deletions
diff --git a/lib/bindings/python/Cargo.lock b/lib/bindings/python/Cargo.lock
Lines changed: 0 additions & 1 deletion
diff --git a/lib/bindings/python/rust/lib.rs b/lib/bindings/python/rust/lib.rs
Lines changed: 2 additions & 0 deletions
@@ -4,7 +4,6 @@
 [workspace]
 members = [
     "components/metrics",
-    "components/router",
     "launch/*",
     "lib/llm",
     "lib/runtime",
 
@@ -49,15 +49,6 @@ The frontend component provides the HTTP API layer and request processing:
 - **Router** - Routes requests to appropriate workers based on load and KV cache state
 - **Auto-discovery** - Automatically discovers and registers available workers
 
-### [Router](router/)
-
-A high-performance request router written in Rust that:
-
-- Routes incoming requests to optimal workers based on KV cache state
-- Implements KV-aware routing to minimize cache misses
-- Provides load balancing across multiple worker instances
-- Supports both aggregated and disaggregated serving patterns
-
 ### [Planner](planner/)
 
 The planner component monitors system state and dynamically adjusts worker allocation:
 
@@ -143,4 +143,71 @@ The `router_temperature` parameter controls routing randomness:
 3. Adjust `kv-overlap-score-weight` to meet your performance goals:
    - To reduce TTFT: Increase the weight
    - To reduce ITL: Decrease the weight
-4. If you observe severe load imbalance, increase the temperature setting
+4. If you observe severe load imbalance, increase the temperature setting
+
+## Using the KvPushRouter Python API
+
+Instead of launching the KV Router from the command line, you can create a `KvPushRouter` object directly in Python. This lets you override the routing configuration on a per-request basis.
+
+### Setup
+
+First, launch your backend engines:
+```bash
+python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf --endpoint dyn://inference.vllm.generate
+```
+
+### Example Script
+
+```python
+import asyncio
+from dynamo._core import DistributedRuntime, KvPushRouter, KvRouterConfig
+
+async def main():
+    # Get runtime and create endpoint
+    runtime = DistributedRuntime.detached()
+    namespace = runtime.namespace("inference")
+    component = namespace.component("vllm")
+    endpoint = component.endpoint("generate")
+
+    # Create KV router
+    kv_router_config = KvRouterConfig()
+    router = KvPushRouter(
+        endpoint=endpoint,
+        block_size=16,
+        kv_router_config=kv_router_config
+    )
+
+    # Your input tokens
+    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+
+    # Generate with per-request routing override
+    stream = await router.generate(
+        token_ids=token_ids,
+        model="meta-llama/Llama-2-7b-hf",
+        stop_conditions={
+            "max_tokens": 20,        # Generate exactly 20 tokens
+            "ignore_eos": True,      # Don't stop at EOS token
+        },
+        sampling_options={
+            "temperature": 0.7,
+            "top_p": 0.9,
+        },
+        router_config_override={
+            "overlap_score_weight": 2.0,    # Prioritize cache hits for this request
+            "router_temperature": 0.5,       # Add routing randomness
+        }
+    )
+
+    # Collect generated tokens
+    generated_tokens = []
+    async for response in stream:
+        if isinstance(response, dict) and "token_ids" in response:
+            generated_tokens.extend(response["token_ids"])
+
+    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+The `router_config_override` parameter adjusts routing behavior for a single request without recreating the router. This is useful for applying different routing strategies based on request characteristics, for example weighting cache overlap more heavily for requests that are likely to share a prefix with earlier ones.
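
As a rough illustration of choosing an override per request, the sketch below maps two request characteristics to a `router_config_override` dictionary. It is not part of this change: the helper name `pick_router_override`, its arguments, the 2048-token cutoff, and the specific weight and temperature values are all hypothetical. Only the two keys, `overlap_score_weight` and `router_temperature`, come from the example script, and the direction of each adjustment follows the tuning guidance in the README (a higher weight favors cache hits and TTFT; a lower weight and higher temperature spread load).

```python
from typing import Dict


def pick_router_override(
    num_prompt_tokens: int,
    shares_prefix: bool,
    long_prompt_threshold: int = 2048,  # hypothetical cutoff; tune for your workload
) -> Dict[str, float]:
    """Build a per-request routing override (illustrative values only)."""
    if shares_prefix or num_prompt_tokens >= long_prompt_threshold:
        # Long or prefix-sharing prompts benefit most from KV cache reuse,
        # so weight cache overlap more heavily (same values as the example script).
        return {"overlap_score_weight": 2.0, "router_temperature": 0.5}
    # Short one-off prompts gain little from cache reuse: lower the overlap
    # weight and raise the temperature so requests spread across workers.
    return {"overlap_score_weight": 0.5, "router_temperature": 1.0}


if __name__ == "__main__":
    print(pick_router_override(num_prompt_tokens=4096, shares_prefix=False))
    print(pick_router_override(num_prompt_tokens=32, shares_prefix=False))
```

The returned dictionary could then be passed as `router_config_override=pick_router_override(len(token_ids), shares_prefix=True)` in the `router.generate(...)` call from the example script.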
@@ -39,7 +39,6 @@
    components/backends/sglang/docs/multinode-examples.md
    components/backends/sglang/docs/sgl-http-server.md
    components/backends/sglang/slurm_jobs/README.md
-   components/router/README.md
    examples/README.md
    guides/dynamo_deploy/create_deployment.md
    guides/dynamo_deploy/sla_planner_deployment.md
 
@@ -382,8 +382,6 @@ python -m dynamo.frontend \
 
 However, for maximum performance with shared prefixes and multi-turn conversations, KV routing provides significant advantages by minimizing redundant computation.
 
-For detailed router configuration and tuning options, see the [KV Router Documentation](../../../docs/components/router/README.md).
-
 ## Monitoring and Debugging
 
 ### Check Worker Registration
 
@@ -111,6 +111,8 @@ fn _core(m: &Bound<'_, PyModule>) -> PyResult<()> {
     m.add_class::<llm::kv::WorkerStats>()?;
     m.add_class::<llm::kv::KvStats>()?;
     m.add_class::<llm::kv::SpecDecodeStats>()?;
+    m.add_class::<llm::kv::KvPushRouter>()?;
+    m.add_class::<llm::kv::KvPushRouterStream>()?;
     m.add_class::<RouterMode>()?;
 
     engine::add_to_module(m)?;