feat: vllm prefill router #3155
Conversation
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Walkthrough

Default benchmark ports switched to 8000. Router launch scripts parameterized and expanded for prefill/decode workers. Added a Prefill Router service and integrated router-aware prefill selection in the decode path. Introduced KV event publisher setup in vLLM main. Updated Rust KV router/scheduler APIs and Python bindings, including best_worker_id signature changes.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant DecodeHandler
    participant PrefillRouter
    participant PrefillWorker
    participant RR as RoundRobin
    Note over DecodeHandler: New: optional router-aware prefill selection
    Client->>DecodeHandler: generate(request with prompt)
    alt Prefill router available and has instances
        DecodeHandler->>PrefillRouter: best_worker_id(token_ids)
        alt worker_id found
            PrefillRouter-->>DecodeHandler: worker_id, overlap_blocks
            DecodeHandler->>PrefillWorker: prefill.direct(request, worker_id)
            PrefillWorker-->>DecodeHandler: prefill result
        else no decision / error
            PrefillRouter--x DecodeHandler: fallback
            DecodeHandler->>RR: prefill via round_robin
            RR-->>DecodeHandler: prefill result
        end
    else Router unavailable
        DecodeHandler->>RR: prefill via round_robin
        RR-->>DecodeHandler: prefill result
    end
    DecodeHandler-->>Client: streamed generation
```
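The fallback policy in the diagram (prefer the KV router's pick; fall back to round-robin when the router is missing, returns no decision, or errors) can be sketched as a small standalone policy. The class and method names here are illustrative, not project code:

```python
import itertools


class PrefillSelector:
    """Sketch of router-aware prefill selection with round-robin fallback."""

    def __init__(self, worker_ids, router_lookup=None):
        # worker_ids is assumed non-empty; router_lookup is an optional
        # callable: token_ids -> (worker_id, overlap_blocks) or None.
        self.worker_ids = list(worker_ids)
        self.router_lookup = router_lookup
        self._rr = itertools.cycle(self.worker_ids)

    def select(self, token_ids):
        if self.router_lookup is not None:
            try:
                decision = self.router_lookup(token_ids)
                if decision is not None:
                    worker_id, _overlap_blocks = decision
                    return worker_id, "router"
            except Exception:
                pass  # router error: fall through to round-robin
        return next(self._rr), "round_robin"
```

The point of keeping the round-robin cycle inside the selector is that the fallback path stays deterministic and fair even while the router is flapping.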
```mermaid
sequenceDiagram
    autonumber
    participant Main as vLLM Main
    participant Engine as vLLM Engine Setup
    participant Handler as Worker Handler
    participant KV as KV Event Publisher
    Note over Main: New: setup_kv_event_publisher()
    Main->>Engine: setup_vllm_engine()
    Engine-->>Main: engine_client, vllm_config, default_sampling_params
    Main->>Main: setup_kv_event_publisher(config, component, generate_endpoint, vllm_config)
    alt KV enabled
        Main-->>KV: create ZmqKvEventPublisher
        Main->>Handler: attach kv_publisher
    else KV disabled
        Main-->>Handler: no publisher attached
    end
    Main-->>Handler: start serving
```
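The config-gated setup in this diagram reduces to a simple pattern: construct a publisher only when KV events are enabled, and let the handler carry an optional publisher. A minimal sketch with stand-in names (`KvConfig` and `make_publisher` are hypothetical, not the real dynamo objects):

```python
class KvConfig:
    """Toy config standing in for the real vLLM/Dynamo config objects."""

    def __init__(self, kv_events_enabled):
        self.kv_events_enabled = kv_events_enabled


def setup_kv_event_publisher(config, make_publisher):
    """Return a publisher only when KV events are enabled, else None.

    make_publisher stands in for constructing ZmqKvEventPublisher with the
    component/endpoint/lease wiring shown in the diagram.
    """
    if not config.kv_events_enabled:
        return None
    return make_publisher(config)


# The handler then simply carries an optional publisher:
publisher = setup_kv_event_publisher(KvConfig(True), lambda cfg: "zmq-publisher")
```

Returning `None` rather than a no-op publisher keeps the disabled path visible at the call site, matching the "no publisher attached" branch above.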
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Pre-merge checks

✅ Passed checks (3 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/llm/src/kv_router/scheduler.rs (1)
389-458: Fix RNG API usage and softmax normalization (compile/runtime bugs).

`rand = "0.9.0"` found at lib/async-openai/Cargo.toml:41 — switch to `thread_rng()`/`gen_range()`/`gen()` and use proper min-max normalization `(v - min) / (max - min)` before negation.

File: lib/llm/src/kv_router/scheduler.rs Lines: 389-458

```diff
-        let mut rng = rand::rng();
-        let index = rng.random_range(0..min_keys.len());
+        let mut rng = rand::thread_rng();
+        let index = rng.gen_range(0..min_keys.len());
@@
-        let mut rng = rand::rng();
-        let sample: f64 = rng.random();
+        let mut rng = rand::thread_rng();
+        let sample: f64 = rng.gen();
@@
-        let normalized: Vec<_> = values
-            .iter()
-            .map(|&v| {
-                // Lower is better, so negate
-                // Note we don't need to do actual min-max norm here, just off by an offset
-                let norm = v / (max_val - min_val);
-                -norm
-            })
-            .collect();
+        let normalized: Vec<_> = values
+            .iter()
+            .map(|&v| {
+                // Lower is better, so negate after proper min-max normalization
+                let norm = (v - min_val) / (max_val - min_val);
+                -norm
+            })
+            .collect();
```
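On the normalization half of this suggestion: softmax is shift-invariant, so for pure softmax sampling the offset-free form and the full min-max form yield identical probabilities; the min-max form mainly keeps the logits in a predictable [-1, 0] range. A standalone Python check (not project code):

```python
import math


def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


values = [3.0, 7.0, 11.0]  # e.g. per-worker load; lower is better
lo, hi = min(values), max(values)
with_offset = [-(v - lo) / (hi - lo) for v in values]  # proper min-max, negated
without_offset = [-v / (hi - lo) for v in values]      # the original form

p1, p2 = softmax(with_offset), softmax(without_offset)
# The two forms differ only by a constant shift, so softmax agrees:
assert all(abs(a - b) < 1e-12 for a, b in zip(p1, p2))
# Lower value gets higher selection probability either way:
assert p1[0] > p1[1] > p1[2]
```

So the fix here is about readability and keeping the normalized values in a bounded range, not about correcting the sampled distribution itself.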
🧹 Nitpick comments (12)
components/backends/vllm/src/dynamo/vllm_prefill_router/__init__.py (1)
4-12: Consider more specific exception handling for version resolution.

While the fallback version resolution is resilient, catching all exceptions can mask unexpected errors during development.
Apply this diff to use more specific exception types:
```diff
 try:
     from ._version import __version__
-except Exception:
+except ImportError:
     try:
         from importlib.metadata import version as _pkg_version

         __version__ = _pkg_version("ai-dynamo")
-    except Exception:
+    except (ImportError, ModuleNotFoundError, KeyError):
         __version__ = "0.0.0+unknown"
```

components/backends/vllm/launch/agg_router.sh (1)
10-12: Consider parameterizing model and block size configuration.

While centralizing these values improves maintainability, consider making them configurable via environment variables or command-line arguments for flexibility across different deployments.
Apply this diff to make the configuration more flexible:
```diff
 # Common configuration
-MODEL="Qwen/Qwen3-0.6B"
-BLOCK_SIZE=64
+MODEL="${MODEL:-Qwen/Qwen3-0.6B}"
+BLOCK_SIZE="${BLOCK_SIZE:-64}"
```

components/backends/vllm/src/dynamo/vllm/handlers.py (2)
13-13: Fix import sorting issue flagged by pipeline.

The pipeline indicates that imports need to be sorted according to isort rules.
Run isort to fix the import order:
```shell
isort components/backends/vllm/src/dynamo/vllm/handlers.py
```
108-123: Add error recovery mechanism for prefill availability check.

The background task logs errors but doesn't implement any recovery strategy. Consider adding exponential backoff or a circuit breaker pattern for resilience.
Apply this diff to add basic exponential backoff:
```diff
     async def _prefill_check_loop(self):
         """Background task that checks prefill worker availability every 5 seconds."""
+        backoff = 5  # Initial backoff in seconds
+        max_backoff = 60  # Maximum backoff in seconds
         while True:
             try:
                 if self.prefill_worker_client is not None:
                     self.can_prefill = len(self.prefill_worker_client.instance_ids())
                     logger.debug(f"Current Prefill Workers: {self.can_prefill}")
                 else:
                     self.can_prefill = 0
+                backoff = 5  # Reset backoff on success
             except asyncio.CancelledError:
                 logger.warning("Prefill check loop cancelled.")
                 raise
             except Exception as e:
                 logger.error(f"Error in prefill check loop: {e}")
+                backoff = min(backoff * 2, max_backoff)  # Exponential backoff
+                logger.debug(f"Backing off for {backoff} seconds")
-            await asyncio.sleep(5)
+            await asyncio.sleep(backoff)
```

components/backends/vllm/src/dynamo/vllm/main.py (1)
8-8: Fix import sorting issue flagged by pipeline.

The pipeline indicates that imports need to be sorted according to isort rules.
Run isort to fix the import order:
```shell
isort components/backends/vllm/src/dynamo/vllm/main.py
```

lib/llm/src/kv_router/scheduler.rs (2)
524-529: Reduce hot-path log level.

Per-request formula logs at INFO will spam logs. Use DEBUG or TRACE.
```diff
-            tracing::info!(
+            tracing::debug!(
                 "Formula for {worker_id} with {overlap} cached blocks: {logit:.3} \
                  = {overlap_weight:.1} * prefill_blocks + decode_blocks \
                  = {overlap_weight:.1} * {potential_prefill_block:.3} + {decode_block:.3}"
             );
```
42-43: Typo in error string.

```diff
-    #[error("no endpoints aviailable to route work")]
+    #[error("no endpoints available to route work")]
```
68-71: Log full traceback on init failure.

Use `logger.exception(...)` to capture the stack trace.

```diff
-        except Exception as e:
-            logger.error(f"Failed to initialize KvPushRouter: {e}")
+        except Exception:
+            logger.exception("Failed to initialize KvPushRouter")
             raise
```
72-87: Unused `context` arg; rename to `_context` to satisfy linters.

```diff
-    async def best_worker_id(self, request, context):
+    async def best_worker_id(self, request, _context):
```
127-133: Catch broad exceptions with traceback.

Switch to `logger.exception` to preserve details. Consider narrowing the exception types later.

```diff
-        except Exception as e:
-            logger.error(f"Error finding best worker: {e}")
+        except Exception:
+            logger.exception("Error finding best worker")
             yield {
                 "status": "error",
-                "message": str(e),
+                "message": "internal error",
             }
```
198-199: Also log traceback on serve failure.

```diff
-    except Exception as e:
-        logger.error(f"Failed to serve endpoint: {e}")
+    except Exception:
+        logger.exception("Failed to serve endpoint")
```
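For context on why the suggestions above prefer `logger.exception`: inside an `except` block it logs the active exception's traceback automatically, while `logger.error` records only the message. A quick standalone demonstration:

```python
import io
import logging

buf = io.StringIO()
logging.basicConfig(stream=buf, level=logging.ERROR, force=True)
log = logging.getLogger("demo")

try:
    raise ValueError("boom")
except ValueError:
    log.error("plain error")  # message only, no traceback

try:
    raise ValueError("boom")
except ValueError:
    log.exception("with traceback")  # message plus the full traceback

out = buf.getvalue()
assert "Traceback" in out
# Only the .exception() call captured the exception itself:
assert out.count("ValueError: boom") == 1
```

`logger.exception` is equivalent to `logger.error(..., exc_info=True)`, so the same effect is available at other levels via the `exc_info` keyword.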
845-885: Nice: centralize request→stream conversion.

Reduces duplication and isolates Pythonization. Consider closing the sender explicitly after the loop to unblock consumers sooner (minor).

```diff
         tokio::spawn(async move {
             let mut stream = stream;
             while let Some(response) = stream.next().await {
 @@
             }
+            // Explicitly drop sender to signal completion
+            drop(tx);
         });
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- benchmarks/router/ping.sh (1 hunks)
- benchmarks/router/prefix_ratio_benchmark.py (1 hunks)
- benchmarks/router/real_data_benchmark.py (1 hunks)
- benchmarks/router/run_engines.sh (5 hunks)
- components/backends/vllm/launch/agg_router.sh (1 hunks)
- components/backends/vllm/launch/disagg_router.sh (1 hunks)
- components/backends/vllm/src/dynamo/vllm/handlers.py (3 hunks)
- components/backends/vllm/src/dynamo/vllm/main.py (6 hunks)
- components/backends/vllm/src/dynamo/vllm_prefill_router/__init__.py (1 hunks)
- components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py (1 hunks)
- lib/bindings/python/rust/llm/kv.rs (5 hunks)
- lib/bindings/python/src/dynamo/_core.pyi (0 hunks)
- lib/bindings/python/src/dynamo/llm/__init__.py (1 hunks)
- lib/llm/src/kv_router.rs (7 hunks)
- lib/llm/src/kv_router/scheduler.rs (4 hunks)
💤 Files with no reviewable changes (1)
- lib/bindings/python/src/dynamo/_core.pyi
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.330Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3095
File: lib/llm/src/kv_router/subscriber.rs:200-223
Timestamp: 2025-09-17T20:55:41.392Z
Learning: In the dynamo codebase, PeaBrane prefers to maintain consistency with existing etcd key parsing patterns (like splitting on '/' and parsing the last segment) rather than introducing more robust parsing approaches, even when the current approach might be brittle, to keep the codebase aligned and avoid divergent patterns.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.313Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.
📚 Learning: 2025-06-02T19:37:27.666Z
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Applied to files:
lib/llm/src/kv_router.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.
Applied to files:
lib/bindings/python/rust/llm/kv.rs
🧬 Code graph analysis (6)
lib/bindings/python/src/dynamo/llm/__init__.py (2)
- lib/bindings/python/rust/lib.rs (1): _core (69-130)
- lib/bindings/python/src/dynamo/_core.pyi (1): KvPushRouter (1157-1256)

components/backends/vllm/src/dynamo/vllm/handlers.py (4)
- lib/bindings/python/src/dynamo/runtime/logging.py (1): configure_dynamo_logging (77-107)
- lib/bindings/python/src/dynamo/_core.pyi (5): instance_ids (256-263), generate (1178-1210), get (137-150), direct (286-290), round_robin (280-284)
- lib/bindings/python/rust/lib.rs (5): instance_ids (904-906), generate (923-935), get (608-624), direct (1005-1043), round_robin (939-968)
- lib/bindings/python/rust/llm/kv.rs (2): anext (1110-1122), generate (946-1013)

components/backends/vllm/src/dynamo/vllm/main.py (4)
- components/backends/vllm/src/dynamo/vllm/args.py (1): Config (39-64)
- examples/multimodal/utils/args.py (1): Config (25-40)
- lib/bindings/python/src/dynamo/_core.pyi (9): component (191-195), ZmqKvEventPublisher (811-825), endpoint (210-214), ZmqKvEventPublisherConfig (793-809), lease_id (243-247), block_size (625-629), block_size (648-652), namespace (38-42), client (237-241)
- examples/multimodal/components/worker.py (1): setup_vllm_engine (117-174)

components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py (4)
- lib/bindings/python/src/dynamo/_core.pyi (12): KvPushRouter (1157-1256), KvRouterConfig (867-869), Client (249-290), DistributedRuntime (31-62), block_size (625-629), block_size (648-652), component (191-195), endpoint (210-214), client (237-241), best_worker_id (1212-1229), instance_ids (256-263), get (137-150)
- lib/bindings/python/src/dynamo/runtime/__init__.py (1): dynamo_worker (36-62)
- lib/bindings/python/src/dynamo/runtime/logging.py (1): configure_dynamo_logging (77-107)
- lib/bindings/python/rust/llm/kv.rs (3): block_size (436-438), block_size (497-499), best_worker_id (1031-1058)

lib/llm/src/kv_router.rs (6)
- lib/bindings/python/rust/llm/kv.rs (1): drop (228-230)
- lib/llm/src/kv_router/sequence.rs (1): drop (854-864)
- lib/llm/src/kv_router/approx.rs (1): drop (404-406)
- lib/llm/src/block_manager.rs (1): drop (86-88)
- lib/llm/src/kv_router/publisher.rs (1): drop (175-177)
- lib/llm/src/recorder.rs (1): drop (377-379)

lib/bindings/python/rust/llm/kv.rs (4)
- lib/bindings/python/src/dynamo/_core.pyi (3): KvPushRouter (1157-1256), new (117-135), component (191-195)
- components/backends/sglang/src/dynamo/sglang/protocol.py (1): PreprocessedRequest (36-43)
- lib/llm/src/kv_router.rs (3): new (137-161), new (224-321), new (482-487)
- lib/runtime/src/component.rs (2): component (428-430), component (575-581)
🪛 Ruff (0.13.1)
components/backends/vllm/src/dynamo/vllm_prefill_router/__init__.py
6-6: Do not catch blind exception: Exception
(BLE001)
11-11: Do not catch blind exception: Exception
(BLE001)
components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py
69-69: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
72-72: Unused method argument: context
(ARG002)
110-110: Abstract raise to an inner function
(TRY301)
110-110: Avoid specifying long messages outside the exception class
(TRY003)
127-127: Do not catch blind exception: Exception
(BLE001)
128-128: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
198-198: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3155/merge) by PeaBrane.
components/backends/vllm/src/dynamo/vllm/handlers.py
[error] 1-1: isort: files were modified by this hook. Fixing /home/runner/work/dynamo/dynamo/components/backends/vllm/src/dynamo/vllm/handlers.py
components/backends/vllm/src/dynamo/vllm/main.py
[error] 1-1: isort: files were modified by this hook. Fixing /home/runner/work/dynamo/dynamo/components/backends/vllm/src/dynamo/vllm/main.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
🔇 Additional comments (25)
benchmarks/router/real_data_benchmark.py (1)
121-121: Port change to align with benchmark defaults.

The default URL port has been updated from 8080 to 8000, consistent with the broader benchmark tooling changes across the router benchmarks.
benchmarks/router/prefix_ratio_benchmark.py (1)
312-312: Port change to align with benchmark defaults.

The default URL port has been updated from 8080 to 8000, consistent with the broader benchmark tooling changes across the router benchmarks.
benchmarks/router/ping.sh (1)
6-7: Port change to align with benchmark defaults.

The default port has been updated from 8080 to 8000, consistent with the broader benchmark tooling changes across the router benchmarks.
benchmarks/router/run_engines.sh (9)
11-12: New configuration variables for prefill router support.

The addition of `USE_PREFILLS` and `BASE_GPU_OFFSET` variables provides the foundation for configuring prefill workers and GPU allocation offsets.
34-41: CLI argument parsing for prefill router features.

The new command line options `--prefills` and `--base-gpu-offset` enable users to configure prefill worker mode and the GPU allocation offset, respectively.
84-87: Validation for BASE_GPU_OFFSET parameter.

The validation ensures `BASE_GPU_OFFSET` is a non-negative integer, preventing invalid GPU allocation configurations.
91-91: GPU range calculation with offset support.

The `LAST_GPU` calculation now accounts for the `BASE_GPU_OFFSET`, enabling proper GPU allocation when using GPU offsets.
94-94: Enhanced configuration display for prefill workers.

The configuration output now clearly indicates worker type (Prefill vs Decode) and shows the GPU range, improving operational visibility.
Also applies to: 99-99
114-115: Worker type labeling for improved logging.

The `WORKER_TYPE` variable enables consistent labeling throughout the script, making it easier to distinguish between prefill and decode workers in logs.
119-119: Consistent worker type labeling in status messages.

The status messages now clearly indicate whether workers are prefill or decode type, improving operational clarity.
Also applies to: 146-146, 161-161
122-122: GPU allocation with base offset support.

The `START_GPU` calculation now incorporates `BASE_GPU_OFFSET`, enabling proper GPU allocation when using GPU offsets for worker deployment.
148-157: vLLM argument construction with prefill worker support.

The refactored argument building approach uses an array to construct vLLM arguments, conditionally adding `--is-prefill-worker` when prefill mode is enabled. This provides a clean and maintainable way to handle the conditional argument passing.

components/backends/vllm/launch/agg_router.sh (1)
7-8: Good addition of deterministic hashing for KV event IDs.

Setting `PYTHONHASHSEED=0` ensures consistent hash values across runs, which is important for KV event ID generation and debugging.

lib/bindings/python/src/dynamo/llm/__init__.py (1)
28-28: LGTM! KvPushRouter export aligns with updated bindings.

The addition of `KvPushRouter` to the public API is consistent with the broader prefill router changes in the PR.

components/backends/vllm/src/dynamo/vllm/main.py (1)
91-123: Well-structured KV event publisher setup.

The new `setup_kv_event_publisher` helper function properly encapsulates the KV publisher initialization logic with good error handling and configuration.

components/backends/vllm/launch/disagg_router.sh (1)
38-49: Syntax error: Missing backslash for line continuation.

Line 48 is missing a backslash for proper line continuation in the bash script.

Apply this diff to fix the syntax error:

```diff
 CUDA_VISIBLE_DEVICES=3 python3 -m dynamo.vllm \
   --model $MODEL \
   --block-size $BLOCK_SIZE \
-  --enforce-eager \
+  --enforce-eager \
   --is-prefill-worker
```

Likely an incorrect or invalid review comment.
lib/llm/src/kv_router.rs (3)
616-621: Graceful shutdown via cancellation token — LGTM.

Matches existing patterns elsewhere and prevents background task leaks.
354-363: Propagating `context_id` as `Option<String>` to the scheduler — LGTM.

Keeps state updates conditional and aligns with the new `SchedulingRequest` semantics.
323-333: find_best_match signature change — original concern is incorrect.

Internal call sites were updated to the new signature (lib/llm/src/kv_router.rs, calls around lines 454 and 558). The public wrapper still exposes the old two-arg API and delegates with None (lib/llm/src/kv_router.rs:490-497), so the Python binding call at lib/bindings/python/rust/llm/kv.rs:1049-1052 (`.find_best_match(&token_ids, router_config_override.as_ref())`) matches the public API and does not need changes.
Likely an incorrect or invalid review comment.
lib/bindings/python/rust/llm/kv.rs (3)
915-921: Good preflight: prevent KV routing in static mode.

LGTM; avoids subtle runtime errors.
1011-1013: Reuse helper for generate path — LGTM.

Keeps binding thin and consistent.
1030-1057: API alignment verified — no action required.

lib/bindings/python/src/dynamo/_core.pyi defines `async def best_worker_id(self, token_ids: List[int], router_config_override: Optional[JsonLike] = None) -> Tuple[int, int]`, and the pyo3 binding in lib/bindings/python/rust/llm/kv.rs has `#[pyo3(signature = (token_ids, router_config_override=None))]`; the caller components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py invokes `best_worker_id(token_ids=token_ids)` (no override), which is compatible.
lib/llm/src/kv_router/scheduler.rs (2)
251-257: Fail fast when update_states=true and request_id is missing.

Silently skipping state updates biases scheduling — return an error to callers (fail fast) or at minimum log at warn and include the update_states flag in the message. Would you prefer returning Err(KvSchedulerError::AllWorkersBusy) or adding a new KvSchedulerError variant to surface this to callers?
58-71: Resolved — callers verified; no changes required.

Only schedule() is called at lib/llm/src/kv_router.rs:355. generate() calls find_best_match with Some(context_id) and update_states=true; KvPushRouter calls with update_states=false. The scheduler's requirement that update_states=true implies Some(request_id) is satisfied.
components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py (1)
52-62: Confirmed: KvRouterConfig kwargs match binding initializer.

`router_track_active_blocks` and `router_reset_states` are exposed in the Python binding (pyo3 signature in lib/bindings/python/rust/llm/entrypoint.rs) and match the Rust KvRouterConfig fields, so the KvRouterConfig(...) call is valid.
Signed-off-by: PeaBrane <yanrpei@gmail.com> Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com> Signed-off-by: Kyle H <kylhuang@nvidia.com>
Overview:
A vLLM prefill router can be launched with `python -m dynamo.vllm_prefill_router`. Eventually, we need to make this engine agnostic; not dealing with that for now, to keep the PR somewhat small. Closes #1895

Core changes
We first exposed `KvRouter` directly via Python bindings, particularly the `find_best_match` and `free` methods. These should not be used by typical users, so we are not adding docs for them.

Added the `vllm_prefill_router` service, with the flow being:

- `find_best_match` endpoint to get the best prefill worker id via the router client
- `free` endpoint to free the request tracking from the prefill router

Note, this is actually not the ideal flow; it is cleaner to use the `KvPushRouter` directly, so the 3 steps are handled under the hood automatically. However, that would currently require intrusive changes to `PreprocessedRequest` to support remote-prefill-specific extra args, so I am deferring it to a future PR where we would have a unified field for these args, with all 3 frameworks using it consistently. (See #3159)

Benchmarking
Done with 4P4D, on A100, 8b model, concurrency of 20
Done with 5P3D, on A100, 8b model, concurrency of 20
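As a reference for the 3-step flow described under Core changes (query the router for the best prefill worker, dispatch directly to it, then free the router-side tracking), here is a minimal sketch; the client objects and method signatures are illustrative stand-ins, not the real bindings:

```python
async def route_prefill(router_client, prefill_client, request):
    """Sketch of the prefill-router service's 3-step flow."""
    # 1. Ask the KV router for the best prefill worker for these tokens
    worker_id, _overlap_blocks = await router_client.find_best_match(
        request["token_ids"]
    )
    try:
        # 2. Dispatch the request directly to that worker
        return await prefill_client.direct(request, worker_id)
    finally:
        # 3. Free the router-side request tracking, even if dispatch failed
        await router_client.free(request["request_id"])
```

Putting the `free` call in a `finally` block reflects why handling these steps inside `KvPushRouter` is cleaner: the tracking cleanup is easy to forget when the three calls are spread across a service boundary.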