Skip to content

Conversation

@PeaBrane
Copy link
Contributor

@PeaBrane PeaBrane commented Sep 19, 2025

Core changes:

Allow sequence.rs to handle a None variant of token_sequence, in which case active blocks will not be tracked, and only active prefill tokens will be. This is in preparation for routing to prefill-only engines (in disagg). This can be enabled by the new router_track_active_blocks flag in KvRouterConfig, which will obviously be propagated both downstream and upstream to the Python-binding level. However, it is currently not exposed at the top Python CLI level, as we do not want the frontend to directly address the prefill engines (as of now). In addition, prefill_counter.rs is nuked as it is basically a subset duplication of sequence.rs, and we don't want to maintain both.

Not tracking cached blocks is easier. We simply do not use (or create) the KvIndexer at all if overlap_score_weight = 0, which can be specified when we launch the frontend (targeting the decode workers). A caveat here is that the slot manager will still try to bookkeep the active prefill tokens (uncached), but this is a very small overhead and not worth adding more logic to disable.

The reviewer should only need to focus on everything before line 420 in sequence.rs, and the rest is really not super important.

Non-core/unrelated/peripheral changes:

  1. Previously, the tests in sequence.rs weren't actually configured properly to test the router replica sync mechanism. This is now fixed. In addition, added a new test to test handling the None variant of token_sequence
  2. Tied the uuid of the slot manager to the consumer UUID of the router (this is technically not needed, but will tame the proliferation of random UUIDs everywhere)
  3. Explicitly disallow (panic on) double add in the slot manager, as it cannot be gracefully handled yet.
  4. A small optimization to prevent unnecessary clones of token_sequence for the KvScheduler and the ApproxKvIndexer; only clone if actually needed.
  5. Turned the subscriber loop in the slot manager (for replica syncing) into tokio::select! so it can be shut down on the cancellation token

Signed-off-by: PeaBrane <yanrpei@gmail.com>
@PeaBrane PeaBrane requested a review from a team as a code owner September 19, 2025 04:30
@github-actions github-actions bot added the feat label Sep 19, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 19, 2025

Walkthrough

Adds a per-router configuration flag to optionally track active blocks, threads it through CLI and Python bindings, and updates KvRouter, Scheduler, Sequence, and Protocols to carry optional token sequences. Removes the prefill_counter module. KvScheduler::start now takes a router UUID, and several public signatures are expanded or modified accordingly.

Changes

Cohort / File(s) Summary
Router config and construction
lib/llm/src/kv_router.rs
Adds KvRouterConfig.router_track_active_blocks (default true) and accepts it in KvRouterConfig::new; KvRouter now stores KvRouterConfig; integrates router UUID into KvScheduler::start; conditionally computes/passes sequence hashes based on router_track_active_blocks; removes pub mod prefill_counter.
Launch flags
launch/dynamo-run/src/flags.rs
Adds router_track_active_blocks: Option<bool> to Flags and passes it to KvRouterConfig::new at the new parameter position.
Python binding
lib/bindings/python/rust/llm/entrypoint.rs
Extends PyO3 KvRouterConfig constructor to include router_track_active_blocks (default true) and wires it to the inner Rust config.
Scheduler and events
lib/llm/src/kv_router/scheduler.rs, lib/llm/src/kv_router/protocols.rs
Makes token sequences optional across SchedulingRequest and ActiveSequenceEventData::AddRequest; updates respond signature to &mut self; KvScheduler::start adds router_uuid: String; propagates Option<Vec> through scheduling APIs.
Sequences and workers
lib/llm/src/kv_router/sequence.rs
Allows optional token sequences in ActiveSequences; updates add_request and potential_blocks_and_tokens to accept Option; adds router UUID to ActiveSequencesMultiWorker::new; refines free/remove semantics; updates event subscription with cancellation handling.
Removal
lib/llm/src/kv_router/prefill_counter.rs
Deletes the prefill token tracking subsystem and its tests; removes related public types and helpers.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Router as KvRouter
  participant Scheduler
  participant Seqs as ActiveSequences(MW)

  Client->>Router: add_request(input, isl, overlap, ...)
  alt track_active_blocks = true
    Router->>Router: compute seq_hashes
    Router->>Scheduler: add_request(token_seq=Some(hashes), isl, overlap)
  else track_active_blocks = false
    Router->>Scheduler: add_request(token_seq=None, isl, overlap)
  end
  Scheduler->>Seqs: UpdateSequences::AddRequest(token_sequence=Option, isl, overlap)
  Seqs-->>Scheduler: affected_request_ids
  Scheduler-->>Router: scheduling_response
  Router-->>Client: response
Loading
sequenceDiagram
  autonumber
  participant Router as KvRouter
  participant Scheduler
  participant Seqs as ActiveSequences(MW)

  Router->>Router: get_potential_loads(request)
  alt track_active_blocks = true
    Router->>Router: compute seq_hashes
    Router->>Scheduler: get_potential_loads(token_seq=Some(hashes), isl, overlap)
  else
    Router->>Scheduler: get_potential_loads(token_seq=None, isl, overlap)
  end
  Scheduler->>Seqs: PotentialBlocksAndTokens(token_sequence=Option, isl, overlap)
  Seqs-->>Scheduler: (blocks_by_worker, tokens_by_worker)
  Scheduler-->>Router: loads
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

A toggle flips, the blocks may hide,
I thump my paw and glide with pride.
Tokens whisper, “Optional today!”
Routers wink, “We’ll find the way.”
Prefill ghosts hop off the trail—
UUIDs mark every tail.
Carrots queued, the schedule’s hale. 🥕

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 82.61% which is sufficient. The required threshold is 80.00%.
Title Check ✅ Passed The title clearly highlights the primary change — adding an option to allow the router/scheduler to skip tracking active blocks (useful for prefill workers) — and is concise and actionable; however the added phrase "and to not track cached blocks (decode)" is not clearly evidenced in the provided changes and could be misleading without further detail.
Description Check ✅ Passed The PR description provides a clear technical explanation of the core change (making token_sequence optional to support prefill-only flows), lists implementation and test changes, and tells reviewers which file/lines to focus on; however it does not use the repository's required template headings exactly and it omits an explicit "Related Issues" section. The narrative is detailed and actionable for reviewers but deviates from the repo template structure. Because the substantive content is present the description is largely adequate, but it should be reformatted to match the required template.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
lib/bindings/python/rust/llm/entrypoint.rs (1)

45-45: Consider updating the parameter order in the PyO3 signature for consistency.

The new router_track_active_blocks parameter is placed between router_replica_sync and router_snapshot_threshold. Based on the parameter ordering convention in the Rust-side implementation, consider placing it after router_replica_sync consistently, but before router_snapshot_threshold seems appropriate.

lib/llm/src/kv_router.rs (2)

136-136: Consider removing the clippy allow directive.

The #[allow(clippy::too_many_arguments)] directive suppresses a legitimate warning about function complexity. Consider using a builder pattern or configuration struct to reduce the parameter count instead of suppressing the warning.

-#[allow(clippy::too_many_arguments)]
 pub fn new(
-    overlap_score_weight: Option<f64>,
-    temperature: Option<f64>,
-    use_kv_events: Option<bool>,
-    replica_sync: Option<bool>,
-    track_active_blocks: Option<bool>,
-    max_num_batched_tokens: Option<u32>,
-    router_snapshot_threshold: Option<Option<u32>>,
-    router_reset_states: Option<bool>,
+    config: KvRouterConfigBuilder,
 ) -> Self {
-    let default = Self::default();
-    Self {
-        overlap_score_weight: overlap_score_weight.unwrap_or(default.overlap_score_weight),
-        // ... rest of the fields
-    }
+    config.build()
 }

Consider implementing a builder pattern for KvRouterConfig to make the API more ergonomic and avoid the need for the clippy suppression.


399-402: Potential duplicate computation of block hashes.

In get_potential_loads, the block hashes are computed twice - once on line 396 and again on line 400 inside the conditional block.

Apply this diff to avoid duplicate computation:

 pub async fn get_potential_loads(&self, tokens: &[u32]) -> Result<Vec<PotentialLoad>> {
     let isl_tokens = tokens.len();
     let block_hashes = compute_block_hash_for_seq(tokens, self.block_size);
     let overlap_scores = self.indexer.find_matches(block_hashes).await?;

     let maybe_seq_hashes = self.kv_router_config.router_track_active_blocks.then(|| {
-        let block_hashes = compute_block_hash_for_seq(tokens, self.block_size);
-        compute_seq_hash_for_block(&block_hashes)
+        compute_seq_hash_for_block(&block_hashes)
     });

     Ok(self
         .scheduler
         .get_potential_loads(maybe_seq_hashes, isl_tokens, overlap_scores)
         .await)
 }
lib/llm/src/kv_router/sequence.rs (2)

198-202: Consider differentiating between prefill-only and missing requests.

The trace log at Line 200 treats both cases (prefill-only requests and genuinely missing requests) the same way. Consider adding more context to help with debugging.

 let Some(token_seq) = self.active_seqs.get(request_id) else {
-    // this may happen if active_seqs were never added with token_seq (prefill-only)
-    tracing::trace!("Trying to free non-existent request {request_id}");
+    // this may happen if active_seqs were never added with token_seq (prefill-only)
+    tracing::trace!(
+        "Request {request_id} not found in active_seqs (likely prefill-only request)"
+    );
     return 0;
 };

409-409: Consider using deref() for clarity.

Using as_deref() would be more idiomatic than as_ref().map(|v| v.as_slice()).

-token_sequence.as_ref().map(|v| v.as_slice()),
+token_sequence.as_deref(),
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10bfb73 and 11b2b94.

📒 Files selected for processing (7)
  • launch/dynamo-run/src/flags.rs (2 hunks)
  • lib/bindings/python/rust/llm/entrypoint.rs (2 hunks)
  • lib/llm/src/kv_router.rs (11 hunks)
  • lib/llm/src/kv_router/prefill_counter.rs (0 hunks)
  • lib/llm/src/kv_router/protocols.rs (1 hunks)
  • lib/llm/src/kv_router/scheduler.rs (6 hunks)
  • lib/llm/src/kv_router/sequence.rs (15 hunks)
💤 Files with no reviewable changes (1)
  • lib/llm/src/kv_router/prefill_counter.rs
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.330Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3095
File: lib/llm/src/kv_router/subscriber.rs:200-223
Timestamp: 2025-09-17T20:55:41.392Z
Learning: In the dynamo codebase, PeaBrane prefers to maintain consistency with existing etcd key parsing patterns (like splitting on '/' and parsing the last segment) rather than introducing more robust parsing approaches, even when the current approach might be brittle, to keep the codebase aligned and avoid divergent patterns.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.313Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.
📚 Learning: 2025-09-03T19:31:32.621Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.

Applied to files:

  • lib/llm/src/kv_router/protocols.rs
  • lib/llm/src/kv_router/sequence.rs
📚 Learning: 2025-08-29T10:08:18.434Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.434Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.

Applied to files:

  • lib/bindings/python/rust/llm/entrypoint.rs
  • lib/llm/src/kv_router.rs
🧬 Code graph analysis (3)
lib/bindings/python/rust/llm/entrypoint.rs (1)
lib/llm/src/kv_router.rs (3)
  • new (137-161)
  • new (204-297)
  • new (460-465)
lib/llm/src/kv_router.rs (2)
lib/llm/src/kv_router/scheduler.rs (1)
  • new (455-459)
lib/llm/src/kv_router/indexer.rs (4)
  • default (243-245)
  • default (625-627)
  • compute_block_hash_for_seq (125-137)
  • compute_seq_hash_for_block (152-171)
lib/llm/src/kv_router/sequence.rs (2)
lib/llm/src/kv_router.rs (4)
  • block_size (389-391)
  • new (137-161)
  • new (204-297)
  • new (460-465)
lib/runtime/src/utils/tasks/tracker.rs (1)
  • cancel_token (1972-1975)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (20)
lib/llm/src/kv_router/protocols.rs (1)

165-169: LGTM!

The change from Vec<SequenceHash> to Option<Vec<SequenceHash>> for token_sequence is properly implemented and aligns with the PR objective to allow the KV scheduler to optionally skip tracking active blocks for prefill workers.

lib/bindings/python/rust/llm/entrypoint.rs (1)

51-51: LGTM! Python binding properly wired.

The new router_track_active_blocks: bool parameter is correctly added to the constructor signature and properly propagated to the inner RsKvRouterConfig.

launch/dynamo-run/src/flags.rs (2)

102-107: LGTM! CLI documentation is clear and accurate.

The documentation properly explains the purpose of the router_track_active_blocks flag and its trade-offs. The default value is correctly documented as true.


238-238: LGTM!

The router_track_active_blocks flag is correctly passed to KvRouterConfig::new as the 5th argument, maintaining proper parameter ordering.

lib/llm/src/kv_router/scheduler.rs (3)

59-59: LGTM! Proper migration to optional token sequences.

The change from Vec<SequenceHash> to Option<Vec<SequenceHash>> for the token_seq field aligns with the PR objective and is consistently applied throughout the scheduler.


217-218: LGTM! Proper handling of optional token sequences.

The scheduler correctly passes request.token_seq.clone() (which is now Option<Vec<SequenceHash>>) to potential_blocks_and_tokens.


72-83: No concurrent-respond risk — Mutex not needed

SchedulingRequest is moved into the scheduler's single background task (request_rx created at lib/llm/src/kv_router/scheduler.rs:192 and handled in the spawned task) and request.respond(response) is invoked there (lib/llm/src/kv_router/scheduler.rs:243), so resp_tx is owned by the moved request and cannot be accessed concurrently.

Likely an incorrect or invalid review comment.

lib/llm/src/kv_router.rs (4)

104-105: LGTM! Configuration field properly added.

The router_track_active_blocks field is correctly added with a clear documentation comment explaining its purpose.


316-327: Excellent optimization of sequence hash cloning!

The conditional cloning logic based on who needs the sequence hashes is a smart optimization that avoids unnecessary memory allocations. The use of pattern matching makes the intent very clear.


365-368: LGTM! Consistent application of conditional sequence tracking.

The conditional computation of sequence hashes based on router_track_active_blocks is properly implemented using then(), which is idiomatic Rust.


269-269: LGTM! Proper threading of router UUID to scheduler.

The consumer_uuid is correctly passed to KvScheduler::start for per-router identification.

lib/llm/src/kv_router/sequence.rs (9)

107-110: Change from panic to silent return is reasonable for prefill-only flows.

The removal logic now returns early when a block is missing instead of panicking. This aligns with the prefill-only mode where blocks aren't tracked, making the system more robust.


122-146: Optional token sequence handling correctly implements prefill-only mode.

The conditional block tracking based on token_sequence presence is well-implemented. When None, only prefill tokens are tracked without adding blocks or storing the sequence, which aligns with the PR objective to skip tracking active blocks for prefill workers.


298-305: Router UUID parsing with fallback is handled appropriately.

The UUID parsing includes proper error handling with a warning log and fallback to a new UUID when parsing fails, ensuring the system remains operational.


458-532: Event subscription loop with cancellation is well-structured.

The implementation properly uses tokio::select! to handle both events and cancellation, ensuring clean shutdown. The event handling logic correctly propagates the optional token sequences and maintains worker mapping.


498-503: Worker not found warning is appropriate.

The warning for missing workers provides good visibility into potential configuration issues without crashing the system.


779-826: Method signature update aligns with optional token sequences.

The potential_blocks_and_tokens method now accepts an optional token sequence, which correctly supports both standard and prefill-only modes.


861-998: Test updates demonstrate cross-instance synchronization.

The tests have been properly migrated to use tokio and test cross-instance scenarios with optional token sequences. The test coverage includes both standard flows with token sequences and prefill-only flows without them.


1000-1127: Prefill-only test validates the new functionality.

The test specifically validates the prefill-only mode where token sequences are not provided (None), properly testing the mark_prefill_completed flow and ensuring tokens are tracked without blocks.


169-176: Verified: potential_blocks_and_tokens intentionally ignores new blocks when token_sequence is None.

Function (lib/llm/src/kv_router/sequence.rs:163–176) computes potential_blocks = new_blocks(token_seq) + active_blocks when Some(token_seq), otherwise active_blocks; potential_tokens always includes new_tokens. Scheduler call sites (lib/llm/src/kv_router/scheduler.rs near lines 216 and 351) pass request.token_seq — if that is None callers will get active-only block counts.

Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
@PeaBrane PeaBrane changed the title feat: allow KV scheduler to not track active blocks (for prefill workers) feat: allow router to not track active blocks (prefill), and to not track cached blocks (decode) Sep 19, 2025
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
@PeaBrane
Copy link
Contributor Author

/ok to test 4c043f4

@PeaBrane PeaBrane merged commit 5b19a39 into main Sep 19, 2025
18 of 19 checks passed
@PeaBrane PeaBrane deleted the rupei/router-no-track-blocks branch September 19, 2025 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants