@PeaBrane
Contributor

@PeaBrane PeaBrane commented Sep 20, 2025

Overview:

A vLLM prefill router can be launched with python -m dynamo.vllm_prefill_router. Eventually this needs to be engine-agnostic; that is out of scope here, to keep the PR reasonably small. Closes #1895

Core changes

We first exposed KvRouter directly via Python bindings, in particular the find_best_match and free methods. These are not meant for typical users, so we are not adding docs for them.

Added the vllm_prefill_router service, with the flow being:

  • Call the find_best_match endpoint to get the best prefill worker id via the router client
  • Send the prefill request to said worker id via the worker client
  • Call the free endpoint to free the request tracking from the prefill router

Note that this is not the ideal flow: it would be cleaner to use KvPushRouter directly, so that the 3 steps are handled automatically under the hood. However, this would currently require intrusive changes to PreprocessedRequest to support remote-prefill-specific extra args, so I am deferring it to a future PR that introduces a unified field for these args, used consistently by all 3 frameworks. (See #3159)
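
The three-step flow above can be sketched roughly as follows. This is a minimal illustration, not the service code: FakeRouter, send_prefill and routed_prefill are hypothetical names, and the least-loaded policy is a stand-in assumption for the real KV-overlap-based selection.

```python
import asyncio


class FakeRouter:
    """Hypothetical in-memory stand-in for the prefill router endpoints."""

    def __init__(self, worker_ids):
        self.worker_ids = list(worker_ids)
        self.active = {}  # request_id -> worker_id currently tracked

    async def find_best_match(self, request_id, token_ids):
        # Toy policy (assumption): pick the worker tracking the fewest requests.
        load = {w: 0 for w in self.worker_ids}
        for w in self.active.values():
            load[w] += 1
        worker_id = min(self.worker_ids, key=lambda w: load[w])
        self.active[request_id] = worker_id
        return worker_id

    async def free(self, request_id):
        # Drop the request from the router's tracking state.
        self.active.pop(request_id, None)


async def send_prefill(worker_id, token_ids):
    """Hypothetical worker call; a real client would go over the network."""
    return {"worker_id": worker_id, "num_tokens": len(token_ids)}


async def routed_prefill(router, request_id, token_ids):
    worker_id = await router.find_best_match(request_id, token_ids)  # step 1
    try:
        return await send_prefill(worker_id, token_ids)  # step 2
    finally:
        await router.free(request_id)  # step 3, runs even if the prefill fails
```

Wrapping step 3 in a finally block is a design choice of this sketch: it keeps the router's tracking state from leaking when the prefill call raises.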

Benchmarking

Done with 4P4D, on A100, 8b model, concurrency of 20

plots

Done with 5P3D, on A100, 8b model, concurrency of 20

plots

Signed-off-by: PeaBrane <yanrpei@gmail.com>
@PeaBrane PeaBrane requested review from a team as code owners September 20, 2025 23:31
@github-actions github-actions bot added the feat label Sep 20, 2025
@coderabbitai
Contributor

coderabbitai bot commented Sep 20, 2025

Walkthrough

Default benchmark ports switched to 8000. Router launch scripts parameterized and expanded for prefill/decode workers. Added a Prefill Router service and integrated router-aware prefill selection in decode path. Introduced KV event publisher setup in vLLM main. Updated Rust KV router/scheduler APIs and Python bindings, including best_worker_id signature changes.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Benchmarks: default port to 8000 | benchmarks/router/ping.sh, benchmarks/router/prefix_ratio_benchmark.py, benchmarks/router/real_data_benchmark.py | Change default localhost port from 8080 to 8000; no request/payload changes. |
| Benchmark engine launcher: prefill support and GPU offset | benchmarks/router/run_engines.sh | Add --prefills and --base-gpu-offset options; validate GPU offset; compute GPU ranges with base offset; differentiate prefill/decode workers; refactor vLLM args assembly; expanded logs. |
| vLLM launch scripts: parameterization and prefill topology | components/backends/vllm/launch/agg_router.sh, components/backends/vllm/launch/disagg_router.sh | Introduce PYTHONHASHSEED, MODEL, BLOCK_SIZE; restructure commands; enable eager mode; add prefill router and prefill workers; unify args across workers. |
| Decode handler: router-aware prefill selection | components/backends/vllm/src/dynamo/vllm/handlers.py | Decode path optionally queries prefill router for worker selection; fallback to round-robin; background prefill availability check; constructor adds prefill_router_client parameter. |
| vLLM main: KV publisher setup and wiring | components/backends/vllm/src/dynamo/vllm/main.py | Add setup_kv_event_publisher helper; setup_vllm_engine now returns (engine_client, vllm_config, default_sampling_params); attach KV publisher and prefill_router_client to handlers. |
| Prefill Router service: new package entry | components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py, .../__init__.py | New Prefill Router runtime with best_worker_id endpoint; CLI parsing; initialization and serving; robust __version__ resolution in package init. |
| Python bindings for KV router | lib/bindings/python/rust/llm/kv.rs, lib/bindings/python/src/dynamo/_core.pyi, lib/bindings/python/src/dynamo/llm/__init__.py | Remove context_id from best_worker_id; add generate_from_request and stream helper; enforce primary lease; remove drop-time cancellation; export KvPushRouter in public API; update type stubs. |
| Rust KV router core and scheduler | lib/llm/src/kv_router.rs, lib/llm/src/kv_router/scheduler.rs | Add router cancellation token and Drop; change find_best_match to accept Option<&str> context; update callers; scheduler now carries optional request ID and validates presence. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant DecodeHandler
  participant PrefillRouter
  participant PrefillWorker
  participant RR as RoundRobin

  Note over DecodeHandler: New: optional router-aware prefill selection
  Client->>DecodeHandler: generate(request with prompt)
  alt Prefill router available and has instances
    DecodeHandler->>PrefillRouter: best_worker_id(token_ids)
    alt worker_id found
      PrefillRouter-->>DecodeHandler: worker_id, overlap_blocks
      DecodeHandler->>PrefillWorker: prefill.direct(request, worker_id)
      PrefillWorker-->>DecodeHandler: prefill result
    else no decision / error
      PrefillRouter--x DecodeHandler: fallback
      DecodeHandler->>RR: prefill via round_robin
      RR-->>DecodeHandler: prefill result
    end
  else Router unavailable
    DecodeHandler->>RR: prefill via round_robin
    RR-->>DecodeHandler: prefill result
  end
  DecodeHandler-->>Client: streamed generation
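
The selection logic in the diagram above can be sketched in Python as below. It is an illustration under assumptions: select_prefill_worker and RoundRobin are hypothetical names, the "status": "ok" reply shape is assumed, and only connection-style failures are treated as a reason to fall back.

```python
import asyncio
import itertools


class RoundRobin:
    """Hypothetical fallback that cycles through known prefill workers."""

    def __init__(self, worker_ids):
        self._cycle = itertools.cycle(worker_ids)

    def next_worker(self):
        return next(self._cycle)


async def select_prefill_worker(best_worker_id, token_ids, fallback):
    """Return (worker_id, source), preferring the router's decision."""
    if best_worker_id is not None:
        try:
            reply = await best_worker_id(token_ids)
            # Assumed reply shape: {"status": "ok", "worker_id": ...}.
            if reply.get("status") == "ok" and reply.get("worker_id") is not None:
                return reply["worker_id"], "router"
        except (ConnectionError, TimeoutError):
            pass  # router unreachable: fall through to round-robin
    return fallback.next_worker(), "round_robin"
```

Returning the source alongside the worker id makes it easy to log how often the fallback path is taken.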
sequenceDiagram
  autonumber
  participant Main as vLLM Main
  participant Engine as vLLM Engine Setup
  participant Handler as Worker Handler
  participant KV as KV Event Publisher

  Note over Main: New: setup_kv_event_publisher()
  Main->>Engine: setup_vllm_engine()
  Engine-->>Main: engine_client, vllm_config, default_sampling_params
  Main->>Main: setup_kv_event_publisher(config, component, generate_endpoint, vllm_config)
  alt KV enabled
    Main-->>KV: create ZmqKvEventPublisher
    Main->>Handler: attach kv_publisher
  else KV disabled
    Main-->>Handler: no publisher attached
  end
  Main-->>Handler: start serving

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

I twitch my ears at ports anew, 8000’s where we rendezvous.
I hop between decode and prefill lanes,
A router whispers worker names.
KV bells softly chime in streams,
Rust and Python stitch the seams—
Benchmarks thump: my happiest dreams. 🐇✨

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title "feat: vllm prefill router" succinctly and accurately summarizes the primary change in this PR — addition of a vLLM prefill router and related integration across handlers, bindings, and launch scripts as shown in the changeset. It is concise, specific, and follows conventional commit style, so a teammate scanning history will understand the primary intent.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed The PR description includes a clear Overview, substantive "Core changes" narrative, benchmarking results, and a "Closes #1895" reference, but it does not strictly follow the repository template: the required "Details" section is missing (replaced by "Core changes") and the "Where should the reviewer start?" heading is absent. Overall the content gives reviewers enough context to understand intent and impact, so the description is mostly complete.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/kv_router/scheduler.rs (1)

389-458: Fix RNG API usage and softmax normalization (compile/runtime bugs).

rand = "0.9.0" found at lib/async-openai/Cargo.toml:41 — switch to thread_rng()/gen_range()/gen() and use proper min-max normalization (v - min) / (max - min) before negation.

File: lib/llm/src/kv_router/scheduler.rs Lines: 389-458

-        let mut rng = rand::rng();
-        let index = rng.random_range(0..min_keys.len());
+        let mut rng = rand::thread_rng();
+        let index = rng.gen_range(0..min_keys.len());
@@
-    let mut rng = rand::rng();
-    let sample: f64 = rng.random();
+    let mut rng = rand::thread_rng();
+    let sample: f64 = rng.gen();
@@
-        let normalized: Vec<_> = values
-            .iter()
-            .map(|&v| {
-                // Lower is better, so negate
-                // Note we don't need to do actual min-max norm here, just off by an offset
-                let norm = v / (max_val - min_val);
-                -norm
-            })
-            .collect();
+        let normalized: Vec<_> = values
+            .iter()
+            .map(|&v| {
+                // Lower is better, so negate after proper min-max normalization
+                let norm = (v - min_val) / (max_val - min_val);
+                -norm
+            })
+            .collect();
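
A small numerical sketch of the two normalizations in the diff above, assuming max_val != min_val and omitting any temperature scaling the real sampler may apply. The function names are hypothetical. Softmax is invariant to adding the same constant to every logit, and the two forms differ only by the constant min_val / (max_val - min_val), so they yield the same sampling probabilities.

```python
import math


def softmax(logits):
    # Subtract the max first for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_only(values):
    # Form on the "-" side of the diff: divide by the range, then negate.
    lo, hi = min(values), max(values)
    return [-(v / (hi - lo)) for v in values]


def min_max(values):
    # Form on the "+" side of the diff: full min-max normalization, then negate.
    lo, hi = min(values), max(values)
    return [-((v - lo) / (hi - lo)) for v in values]


values = [3.0, 5.0, 9.0]
probs_scaled = softmax(scaled_only(values))
probs_min_max = softmax(min_max(values))
```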
🧹 Nitpick comments (12)
components/backends/vllm/src/dynamo/vllm_prefill_router/__init__.py (1)

4-12: Consider more specific exception handling for version resolution.

While the fallback version resolution is resilient, catching all exceptions can mask unexpected errors during development.

Apply this diff to use more specific exception types:

 try:
     from ._version import __version__
-except Exception:
+except ImportError:
     try:
         from importlib.metadata import version as _pkg_version
 
         __version__ = _pkg_version("ai-dynamo")
-    except Exception:
+    except (ImportError, ModuleNotFoundError, KeyError):
         __version__ = "0.0.0+unknown"
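
For the metadata lookup specifically, a minimal standalone sketch is shown here. resolve_version is a hypothetical helper, not part of the package. importlib.metadata.version raises PackageNotFoundError, a subclass of ModuleNotFoundError, when the distribution is not installed, so catching that type is enough for the fallback.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name, fallback="0.0.0+unknown"):
    """Return the installed version of dist_name, or fallback if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Distribution metadata not found (e.g. running from a source checkout).
        return fallback
```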
components/backends/vllm/launch/agg_router.sh (1)

10-12: Consider parameterizing model and block size configuration.

While centralizing these values improves maintainability, consider making them configurable via environment variables or command-line arguments for flexibility across different deployments.

Apply this diff to make the configuration more flexible:

 # Common configuration
-MODEL="Qwen/Qwen3-0.6B"
-BLOCK_SIZE=64
+MODEL="${MODEL:-Qwen/Qwen3-0.6B}"
+BLOCK_SIZE="${BLOCK_SIZE:-64}"
components/backends/vllm/src/dynamo/vllm/handlers.py (2)

13-13: Fix import sorting issue flagged by pipeline.

The pipeline indicates that imports need to be sorted according to isort rules.

Run isort to fix the import order:

isort components/backends/vllm/src/dynamo/vllm/handlers.py

108-123: Add error recovery mechanism for prefill availability check.

The background task logs errors but doesn't implement any recovery strategy. Consider adding exponential backoff or circuit breaker pattern for resilience.

Apply this diff to add basic exponential backoff:

     async def _prefill_check_loop(self):
         """Background task that checks prefill worker availability every 5 seconds."""
+        backoff = 5  # Initial backoff in seconds
+        max_backoff = 60  # Maximum backoff in seconds
         while True:
             try:
                 if self.prefill_worker_client is not None:
                     self.can_prefill = len(self.prefill_worker_client.instance_ids())
                     logger.debug(f"Current Prefill Workers: {self.can_prefill}")
                 else:
                     self.can_prefill = 0
+                backoff = 5  # Reset backoff on success
             except asyncio.CancelledError:
                 logger.warning("Prefill check loop cancelled.")
                 raise
             except Exception as e:
                 logger.error(f"Error in prefill check loop: {e}")
+                backoff = min(backoff * 2, max_backoff)  # Exponential backoff
+                logger.debug(f"Backing off for {backoff} seconds")
 
-            await asyncio.sleep(5)
+            await asyncio.sleep(backoff)
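
The backoff schedule suggested above can be exercised in isolation with a sketch like this one. poll_with_backoff, its injectable sleep, and max_iters are hypothetical additions for testability; the 5 second base and 60 second cap come from the diff.

```python
import asyncio


async def poll_with_backoff(check, sleep=asyncio.sleep, base=5, max_backoff=60,
                            max_iters=None):
    """Run check() repeatedly, doubling the delay after each failure."""
    delays = []
    backoff = base
    iters = 0
    while max_iters is None or iters < max_iters:
        try:
            await check()
            backoff = base  # reset on success
        except asyncio.CancelledError:
            raise  # never swallow task cancellation
        except Exception:
            backoff = min(backoff * 2, max_backoff)  # exponential, capped
        delays.append(backoff)
        await sleep(backoff)
        iters += 1
    return delays
```

Injecting sleep lets a test replace the real delay with a no-op and inspect the sequence of delays directly.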
components/backends/vllm/src/dynamo/vllm/main.py (1)

8-8: Fix import sorting issue flagged by pipeline.

The pipeline indicates that imports need to be sorted according to isort rules.

Run isort to fix the import order:

isort components/backends/vllm/src/dynamo/vllm/main.py
lib/llm/src/kv_router/scheduler.rs (2)

524-529: Reduce hot-path log level.

Per-request formula logs at INFO will spam logs. Use DEBUG or TRACE.

-            tracing::info!(
+            tracing::debug!(
                 "Formula for {worker_id} with {overlap} cached blocks: {logit:.3} \
                  = {overlap_weight:.1} * prefill_blocks + decode_blocks \
                  = {overlap_weight:.1} * {potential_prefill_block:.3} + {decode_block:.3}"
             );

42-43: Typo in error string.

-    #[error("no endpoints aviailable to route work")]
+    #[error("no endpoints available to route work")]
components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py (4)

68-71: Log full traceback on init failure.

Use logger.exception(...) to capture stack trace.

-        except Exception as e:
-            logger.error(f"Failed to initialize KvPushRouter: {e}")
+        except Exception:
+            logger.exception("Failed to initialize KvPushRouter")
             raise
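
The difference between the two calls can be seen with a short standard-library sketch. The capture helper and the logger name are hypothetical. logger.exception logs at ERROR level and appends the active traceback, while logger.error with a formatted message records only the message text.

```python
import io
import logging


def capture(log_call):
    """Raise and handle an error, logging it via log_call; return the log text."""
    stream = io.StringIO()
    logger = logging.getLogger("prefill_router_demo")
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.setLevel(logging.ERROR)
    logger.propagate = False
    try:
        try:
            raise RuntimeError("router init failed")
        except RuntimeError as exc:
            log_call(logger, exc)  # called while the exception is being handled
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()


with_error = capture(lambda log, exc: log.error(f"Failed to initialize: {exc}"))
with_exception = capture(lambda log, exc: log.exception("Failed to initialize"))
```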

72-87: Unused context arg; rename to _context to satisfy linters.

-    async def best_worker_id(self, request, context):
+    async def best_worker_id(self, request, _context):

127-133: Catch broad exceptions with traceback.

Switch to logger.exception to preserve details. Consider narrowing exception types later.

-        except Exception as e:
-            logger.error(f"Error finding best worker: {e}")
+        except Exception:
+            logger.exception("Error finding best worker")
             yield {
                 "status": "error",
-                "message": str(e),
+                "message": "internal error",
             }

198-199: Also log traceback on serve failure.

-    except Exception as e:
-        logger.error(f"Failed to serve endpoint: {e}")
+    except Exception:
+        logger.exception("Failed to serve endpoint")
lib/bindings/python/rust/llm/kv.rs (1)

845-885: Nice: centralize request→stream conversion.

Reduces duplication and isolates Pythonization. Consider closing the sender explicitly after loop to unblock consumers sooner (minor).

             tokio::spawn(async move {
                 let mut stream = stream;
                 while let Some(response) = stream.next().await {
@@
                 }
+                // Explicitly drop sender to signal completion
+                drop(tx);
             });
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 007b9d6 and 82f6694.

📒 Files selected for processing (15)
  • benchmarks/router/ping.sh (1 hunks)
  • benchmarks/router/prefix_ratio_benchmark.py (1 hunks)
  • benchmarks/router/real_data_benchmark.py (1 hunks)
  • benchmarks/router/run_engines.sh (5 hunks)
  • components/backends/vllm/launch/agg_router.sh (1 hunks)
  • components/backends/vllm/launch/disagg_router.sh (1 hunks)
  • components/backends/vllm/src/dynamo/vllm/handlers.py (3 hunks)
  • components/backends/vllm/src/dynamo/vllm/main.py (6 hunks)
  • components/backends/vllm/src/dynamo/vllm_prefill_router/__init__.py (1 hunks)
  • components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py (1 hunks)
  • lib/bindings/python/rust/llm/kv.rs (5 hunks)
  • lib/bindings/python/src/dynamo/_core.pyi (0 hunks)
  • lib/bindings/python/src/dynamo/llm/__init__.py (1 hunks)
  • lib/llm/src/kv_router.rs (7 hunks)
  • lib/llm/src/kv_router/scheduler.rs (4 hunks)
💤 Files with no reviewable changes (1)
  • lib/bindings/python/src/dynamo/_core.pyi
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.330Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3095
File: lib/llm/src/kv_router/subscriber.rs:200-223
Timestamp: 2025-09-17T20:55:41.392Z
Learning: In the dynamo codebase, PeaBrane prefers to maintain consistency with existing etcd key parsing patterns (like splitting on '/' and parsing the last segment) rather than introducing more robust parsing approaches, even when the current approach might be brittle, to keep the codebase aligned and avoid divergent patterns.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.313Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.
📚 Learning: 2025-06-02T19:37:27.666Z
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.

Applied to files:

  • lib/llm/src/kv_router.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Applied to files:

  • lib/bindings/python/rust/llm/kv.rs
🧬 Code graph analysis (6)
lib/bindings/python/src/dynamo/llm/__init__.py (2)
lib/bindings/python/rust/lib.rs (1)
  • _core (69-130)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • KvPushRouter (1157-1256)
components/backends/vllm/src/dynamo/vllm/handlers.py (4)
lib/bindings/python/src/dynamo/runtime/logging.py (1)
  • configure_dynamo_logging (77-107)
lib/bindings/python/src/dynamo/_core.pyi (5)
  • instance_ids (256-263)
  • generate (1178-1210)
  • get (137-150)
  • direct (286-290)
  • round_robin (280-284)
lib/bindings/python/rust/lib.rs (5)
  • instance_ids (904-906)
  • generate (923-935)
  • get (608-624)
  • direct (1005-1043)
  • round_robin (939-968)
lib/bindings/python/rust/llm/kv.rs (2)
  • anext (1110-1122)
  • generate (946-1013)
components/backends/vllm/src/dynamo/vllm/main.py (4)
components/backends/vllm/src/dynamo/vllm/args.py (1)
  • Config (39-64)
examples/multimodal/utils/args.py (1)
  • Config (25-40)
lib/bindings/python/src/dynamo/_core.pyi (9)
  • component (191-195)
  • ZmqKvEventPublisher (811-825)
  • endpoint (210-214)
  • ZmqKvEventPublisherConfig (793-809)
  • lease_id (243-247)
  • block_size (625-629)
  • block_size (648-652)
  • namespace (38-42)
  • client (237-241)
examples/multimodal/components/worker.py (1)
  • setup_vllm_engine (117-174)
components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py (4)
lib/bindings/python/src/dynamo/_core.pyi (12)
  • KvPushRouter (1157-1256)
  • KvRouterConfig (867-869)
  • Client (249-290)
  • DistributedRuntime (31-62)
  • block_size (625-629)
  • block_size (648-652)
  • component (191-195)
  • endpoint (210-214)
  • client (237-241)
  • best_worker_id (1212-1229)
  • instance_ids (256-263)
  • get (137-150)
lib/bindings/python/src/dynamo/runtime/__init__.py (1)
  • dynamo_worker (36-62)
lib/bindings/python/src/dynamo/runtime/logging.py (1)
  • configure_dynamo_logging (77-107)
lib/bindings/python/rust/llm/kv.rs (3)
  • block_size (436-438)
  • block_size (497-499)
  • best_worker_id (1031-1058)
lib/llm/src/kv_router.rs (6)
lib/bindings/python/rust/llm/kv.rs (1)
  • drop (228-230)
lib/llm/src/kv_router/sequence.rs (1)
  • drop (854-864)
lib/llm/src/kv_router/approx.rs (1)
  • drop (404-406)
lib/llm/src/block_manager.rs (1)
  • drop (86-88)
lib/llm/src/kv_router/publisher.rs (1)
  • drop (175-177)
lib/llm/src/recorder.rs (1)
  • drop (377-379)
lib/bindings/python/rust/llm/kv.rs (4)
lib/bindings/python/src/dynamo/_core.pyi (3)
  • KvPushRouter (1157-1256)
  • new (117-135)
  • component (191-195)
components/backends/sglang/src/dynamo/sglang/protocol.py (1)
  • PreprocessedRequest (36-43)
lib/llm/src/kv_router.rs (3)
  • new (137-161)
  • new (224-321)
  • new (482-487)
lib/runtime/src/component.rs (2)
  • component (428-430)
  • component (575-581)
🪛 Ruff (0.13.1)
components/backends/vllm/src/dynamo/vllm_prefill_router/__init__.py

6-6: Do not catch blind exception: Exception

(BLE001)


11-11: Do not catch blind exception: Exception

(BLE001)

components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py

69-69: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


72-72: Unused method argument: context

(ARG002)


110-110: Abstract raise to an inner function

(TRY301)


110-110: Avoid specifying long messages outside the exception class

(TRY003)


127-127: Do not catch blind exception: Exception

(BLE001)


128-128: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


198-198: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3155/merge) by PeaBrane.
components/backends/vllm/src/dynamo/vllm/handlers.py

[error] 1-1: isort: files were modified by this hook. Fixing /home/runner/work/dynamo/dynamo/components/backends/vllm/src/dynamo/vllm/handlers.py

components/backends/vllm/src/dynamo/vllm/main.py

[error] 1-1: isort: files were modified by this hook. Fixing /home/runner/work/dynamo/dynamo/components/backends/vllm/src/dynamo/vllm/main.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
🔇 Additional comments (25)
benchmarks/router/real_data_benchmark.py (1)

121-121: Port change to align with benchmark defaults.

The default URL port has been updated from 8080 to 8000, which aligns consistently with the broader benchmark tooling changes across the router benchmarks.

benchmarks/router/prefix_ratio_benchmark.py (1)

312-312: Port change to align with benchmark defaults.

The default URL port has been updated from 8080 to 8000, which aligns consistently with the broader benchmark tooling changes across the router benchmarks.

benchmarks/router/ping.sh (1)

6-7: Port change to align with benchmark defaults.

The default port has been updated from 8080 to 8000, which aligns consistently with the broader benchmark tooling changes across the router benchmarks.

benchmarks/router/run_engines.sh (9)

11-12: New configuration variables for prefill router support.

The addition of USE_PREFILLS and BASE_GPU_OFFSET variables provides the foundation for configuring prefill workers and GPU allocation offsets.


34-41: CLI argument parsing for prefill router features.

The new command line options --prefills and --base-gpu-offset enable users to configure prefill worker mode and GPU allocation offset respectively.


84-87: Validation for BASE_GPU_OFFSET parameter.

The validation ensures BASE_GPU_OFFSET is a non-negative integer, preventing invalid GPU allocation configurations.


91-91: GPU range calculation with offset support.

The LAST_GPU calculation now accounts for the BASE_GPU_OFFSET, enabling proper GPU allocation when using GPU offsets.


94-94: Enhanced configuration display for prefill workers.

The configuration output now clearly indicates worker type (Prefill vs Decode) and shows the GPU range, improving operational visibility.

Also applies to: 99-99


114-115: Worker type labeling for improved logging.

The WORKER_TYPE variable enables consistent labeling throughout the script, making it easier to distinguish between prefill and decode workers in logs.


119-119: Consistent worker type labeling in status messages.

The status messages now clearly indicate whether workers are prefill or decode type, improving operational clarity.

Also applies to: 146-146, 161-161


122-122: GPU allocation with base offset support.

The START_GPU calculation now incorporates BASE_GPU_OFFSET, enabling proper GPU allocation when using GPU offsets for worker deployment.
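
A hypothetical Python rendering of the offset validation and GPU range arithmetic described in these comments. The exact formula in run_engines.sh is not shown in this review, so the contiguous layout below (start = base_offset + worker_index * gpus_per_worker) and all names are assumptions.

```python
def parse_offset(raw):
    """Accept only a non-negative decimal integer, mirroring the validation."""
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError(f"base GPU offset must be a non-negative integer: {raw!r}")
    return int(raw)


def gpu_ranges(num_workers, gpus_per_worker, base_offset=0):
    """Assumed layout: workers occupy contiguous GPU ids after the offset."""
    ranges = []
    for i in range(num_workers):
        start = base_offset + i * gpus_per_worker
        ranges.append(list(range(start, start + gpus_per_worker)))
    return ranges
```

With an offset, a second launcher invocation (for example, prefill workers) can be placed on GPUs that the first invocation (decode workers) does not use.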


148-157: vLLM argument construction with prefill worker support.

The refactored argument building approach uses an array to construct vLLM arguments, conditionally adding --is-prefill-worker when prefill mode is enabled. This provides a clean and maintainable way to handle the conditional argument passing.
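
The same pattern, expressed in Python for illustration only. build_vllm_args is a hypothetical helper; the flags are those that appear in the launch scripts in this PR, and the actual list in run_engines.sh may differ.

```python
def build_vllm_args(model, block_size, is_prefill_worker):
    """Assemble worker arguments as a list, adding the prefill flag on demand."""
    args = ["--model", model, "--block-size", str(block_size), "--enforce-eager"]
    if is_prefill_worker:
        args.append("--is-prefill-worker")
    return args
```

Building a list rather than a single string avoids quoting problems when a value contains spaces.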

components/backends/vllm/launch/agg_router.sh (1)

7-8: Good addition of deterministic hashing for KV event IDs.

Setting PYTHONHASHSEED=0 ensures consistent hash values across runs, which is important for KV event ID generation and debugging.
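
The effect of the seed on Python's built-in string hashing can be checked with a short sketch. hash_in_fresh_interpreter is a hypothetical helper, and whether KV event IDs are derived from hash() is taken from the comment above rather than verified here.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, seed):
    """Return hash(text) as computed by a new interpreter with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout.strip())
```

With the seed fixed at 0, hash randomization is disabled, so two separate processes agree on the hash of the same string. Without a fixed seed, each process typically gets a different value.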

lib/bindings/python/src/dynamo/llm/__init__.py (1)

28-28: LGTM! KvPushRouter export aligns with updated bindings.

The addition of KvPushRouter to the public API is consistent with the broader prefill router changes in the PR.

components/backends/vllm/src/dynamo/vllm/main.py (1)

91-123: Well-structured KV event publisher setup.

The new setup_kv_event_publisher helper function properly encapsulates the KV publisher initialization logic with good error handling and configuration.

components/backends/vllm/launch/disagg_router.sh (1)

38-49: Syntax error: Missing backslash for line continuation.

Line 48 is missing a backslash for proper line continuation in the bash script.

Apply this diff to fix the syntax error:

 CUDA_VISIBLE_DEVICES=3 python3 -m dynamo.vllm \
     --model $MODEL \
     --block-size $BLOCK_SIZE \
-    --enforce-eager \
+    --enforce-eager \
     --is-prefill-worker

Likely an incorrect or invalid review comment.

lib/llm/src/kv_router.rs (3)

616-621: Graceful shutdown via cancellation token — LGTM.

Matches existing patterns elsewhere and prevents background task leaks.


354-363: Propagating context_id as Option<String> to scheduler — LGTM.

Keeps state updates conditional and aligns with new SchedulingRequest semantics.


323-333: find_best_match signature change — original concern is incorrect

Internal call sites were updated to the new signature (lib/llm/src/kv_router.rs — calls around lines 454 and 558). The public wrapper still exposes the old two-arg API and delegates with None (lib/llm/src/kv_router.rs:490–497), so the Python binding call at lib/bindings/python/rust/llm/kv.rs:1049–1052 (.find_best_match(&token_ids, router_config_override.as_ref())) matches the public API and does not need changes.

Likely an incorrect or invalid review comment.

lib/bindings/python/rust/llm/kv.rs (3)

915-921: Good preflight: prevent KV routing in static mode.

LGTM; avoids subtle runtime errors.


1011-1013: Reuse helper for generate path — LGTM.

Keeps binding thin and consistent.


1030-1057: API alignment verified — no action required.

lib/bindings/python/src/dynamo/_core.pyi defines async def best_worker_id(self, token_ids: List[int], router_config_override: Optional[JsonLike] = None) -> Tuple[int, int], and the pyo3 binding in lib/bindings/python/rust/llm/kv.rs has #[pyo3(signature = (token_ids, router_config_override=None))]; the caller components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py invokes best_worker_id(token_ids=token_ids) (no override), which is compatible.

lib/llm/src/kv_router/scheduler.rs (2)

251-257: Fail fast when update_states=true and request_id is missing

Silently skipping state updates biases scheduling — return an error to callers (fail fast) or at minimum log at warn and include the update_states flag in the message. Would you prefer returning Err(KvSchedulerError::AllWorkersBusy) or adding a new KvSchedulerError variant to surface this to callers?


58-71: Resolved — callers verified; no changes required

Only schedule() is called at lib/llm/src/kv_router.rs:355. generate() calls find_best_match with Some(context_id) and update_states=true; KvPushRouter calls with update_states=false. Scheduler's requirement that update_states=true implies Some(request_id) is satisfied.
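
The invariant discussed here (update_states=true implies a request ID is present) can be illustrated with a fail-fast check. This is a Python sketch of the idea only; the Rust SchedulingRequest has more fields, and MissingRequestId and validate are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


class MissingRequestId(ValueError):
    """Hypothetical error for a state-updating request without an ID."""


@dataclass
class SchedulingRequest:
    request_id: Optional[str]
    update_states: bool


def validate(req):
    """Reject requests that ask for state updates but carry no request ID."""
    if req.update_states and req.request_id is None:
        raise MissingRequestId("update_states=True requires a request_id")
    return req
```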

components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py (1)

52-62: Confirmed: KvRouterConfig kwargs match binding initializer

router_track_active_blocks and router_reset_states are exposed in the Python binding (pyo3 signature in lib/bindings/python/rust/llm/entrypoint.rs) and match the Rust KvRouterConfig fields, so the KvRouterConfig(...) call is valid.

@PeaBrane PeaBrane self-assigned this Sep 22, 2025
@PeaBrane PeaBrane merged commit 031590f into main Sep 22, 2025
18 checks passed
@PeaBrane PeaBrane deleted the rupei/prefill-router branch September 22, 2025 23:38
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
kylehh pushed a commit that referenced this pull request Sep 25, 2025
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: Kyle H <kylhuang@nvidia.com>