-
Notifications
You must be signed in to change notification settings - Fork 676
feat: allow router to not track active blocks (prefill), and to not track cached blocks (decode) #3135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: PeaBrane <yanrpei@gmail.com>
WalkthroughAdds a per-router configuration flag to optionally track active blocks, threads it through CLI and Python bindings, and updates KvRouter, Scheduler, Sequence, and Protocols to carry optional token sequences. Removes the prefill_counter module. KvScheduler::start now takes a router UUID, and several public signatures are expanded or modified accordingly. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant Router as KvRouter
participant Scheduler
participant Seqs as ActiveSequences(MW)
Client->>Router: add_request(input, isl, overlap, ...)
alt track_active_blocks = true
Router->>Router: compute seq_hashes
Router->>Scheduler: add_request(token_seq=Some(hashes), isl, overlap)
else track_active_blocks = false
Router->>Scheduler: add_request(token_seq=None, isl, overlap)
end
Scheduler->>Seqs: UpdateSequences::AddRequest(token_sequence=Option, isl, overlap)
Seqs-->>Scheduler: affected_request_ids
Scheduler-->>Router: scheduling_response
Router-->>Client: response
sequenceDiagram
autonumber
participant Router as KvRouter
participant Scheduler
participant Seqs as ActiveSequences(MW)
Router->>Router: get_potential_loads(request)
alt track_active_blocks = true
Router->>Router: compute seq_hashes
Router->>Scheduler: get_potential_loads(token_seq=Some(hashes), isl, overlap)
else
Router->>Scheduler: get_potential_loads(token_seq=None, isl, overlap)
end
Scheduler->>Seqs: PotentialBlocksAndTokens(token_sequence=Option, isl, overlap)
Seqs-->>Scheduler: (blocks_by_worker, tokens_by_worker)
Scheduler-->>Router: loads
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
Pre-merge checks✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (5)
lib/bindings/python/rust/llm/entrypoint.rs (1)
45-45: Consider updating the parameter order in the PyO3 signature for consistency.The new
router_track_active_blocksparameter is placed betweenrouter_replica_syncandrouter_snapshot_threshold. Based on the parameter ordering convention in the Rust-side implementation, consider placing it afterrouter_replica_syncconsistently, but beforerouter_snapshot_thresholdseems appropriate.lib/llm/src/kv_router.rs (2)
136-136: Consider removing the clippy allow directive.The
#[allow(clippy::too_many_arguments)]directive suppresses a legitimate warning about function complexity. Consider using a builder pattern or configuration struct to reduce the parameter count instead of suppressing the warning.-#[allow(clippy::too_many_arguments)] pub fn new( - overlap_score_weight: Option<f64>, - temperature: Option<f64>, - use_kv_events: Option<bool>, - replica_sync: Option<bool>, - track_active_blocks: Option<bool>, - max_num_batched_tokens: Option<u32>, - router_snapshot_threshold: Option<Option<u32>>, - router_reset_states: Option<bool>, + config: KvRouterConfigBuilder, ) -> Self { - let default = Self::default(); - Self { - overlap_score_weight: overlap_score_weight.unwrap_or(default.overlap_score_weight), - // ... rest of the fields - } + config.build() }Consider implementing a builder pattern for
KvRouterConfigto make the API more ergonomic and avoid the need for the clippy suppression.
399-402: Potential duplicate computation of block hashes.In
get_potential_loads, the block hashes are computed twice - once on line 396 and again on line 400 inside the conditional block.Apply this diff to avoid duplicate computation:
pub async fn get_potential_loads(&self, tokens: &[u32]) -> Result<Vec<PotentialLoad>> { let isl_tokens = tokens.len(); let block_hashes = compute_block_hash_for_seq(tokens, self.block_size); let overlap_scores = self.indexer.find_matches(block_hashes).await?; let maybe_seq_hashes = self.kv_router_config.router_track_active_blocks.then(|| { - let block_hashes = compute_block_hash_for_seq(tokens, self.block_size); - compute_seq_hash_for_block(&block_hashes) + compute_seq_hash_for_block(&block_hashes) }); Ok(self .scheduler .get_potential_loads(maybe_seq_hashes, isl_tokens, overlap_scores) .await) }lib/llm/src/kv_router/sequence.rs (2)
198-202: Consider differentiating between prefill-only and missing requests.The trace log at Line 200 treats both cases (prefill-only requests and genuinely missing requests) the same way. Consider adding more context to help with debugging.
let Some(token_seq) = self.active_seqs.get(request_id) else { - // this may happen if active_seqs were never added with token_seq (prefill-only) - tracing::trace!("Trying to free non-existent request {request_id}"); + // this may happen if active_seqs were never added with token_seq (prefill-only) + tracing::trace!( + "Request {request_id} not found in active_seqs (likely prefill-only request)" + ); return 0; };
409-409: Consider using deref() for clarity.Using
as_deref()would be more idiomatic thanas_ref().map(|v| v.as_slice()).-token_sequence.as_ref().map(|v| v.as_slice()), +token_sequence.as_deref(),
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
launch/dynamo-run/src/flags.rs(2 hunks)lib/bindings/python/rust/llm/entrypoint.rs(2 hunks)lib/llm/src/kv_router.rs(11 hunks)lib/llm/src/kv_router/prefill_counter.rs(0 hunks)lib/llm/src/kv_router/protocols.rs(1 hunks)lib/llm/src/kv_router/scheduler.rs(6 hunks)lib/llm/src/kv_router/sequence.rs(15 hunks)
💤 Files with no reviewable changes (1)
- lib/llm/src/kv_router/prefill_counter.rs
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.330Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3095
File: lib/llm/src/kv_router/subscriber.rs:200-223
Timestamp: 2025-09-17T20:55:41.392Z
Learning: In the dynamo codebase, PeaBrane prefers to maintain consistency with existing etcd key parsing patterns (like splitting on '/' and parsing the last segment) rather than introducing more robust parsing approaches, even when the current approach might be brittle, to keep the codebase aligned and avoid divergent patterns.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.313Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.
📚 Learning: 2025-09-03T19:31:32.621Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.
Applied to files:
lib/llm/src/kv_router/protocols.rslib/llm/src/kv_router/sequence.rs
📚 Learning: 2025-08-29T10:08:18.434Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.434Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.
Applied to files:
lib/bindings/python/rust/llm/entrypoint.rslib/llm/src/kv_router.rs
🧬 Code graph analysis (3)
lib/bindings/python/rust/llm/entrypoint.rs (1)
lib/llm/src/kv_router.rs (3)
new(137-161)new(204-297)new(460-465)
lib/llm/src/kv_router.rs (2)
lib/llm/src/kv_router/scheduler.rs (1)
new(455-459)lib/llm/src/kv_router/indexer.rs (4)
default(243-245)default(625-627)compute_block_hash_for_seq(125-137)compute_seq_hash_for_block(152-171)
lib/llm/src/kv_router/sequence.rs (2)
lib/llm/src/kv_router.rs (4)
block_size(389-391)new(137-161)new(204-297)new(460-465)lib/runtime/src/utils/tasks/tracker.rs (1)
cancel_token(1972-1975)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (20)
lib/llm/src/kv_router/protocols.rs (1)
165-169: LGTM!The change from
Vec<SequenceHash>toOption<Vec<SequenceHash>>fortoken_sequenceis properly implemented and aligns with the PR objective to allow the KV scheduler to optionally skip tracking active blocks for prefill workers.lib/bindings/python/rust/llm/entrypoint.rs (1)
51-51: LGTM! Python binding properly wired.The new
router_track_active_blocks: boolparameter is correctly added to the constructor signature and properly propagated to the innerRsKvRouterConfig.launch/dynamo-run/src/flags.rs (2)
102-107: LGTM! CLI documentation is clear and accurate.The documentation properly explains the purpose of the
router_track_active_blocksflag and its trade-offs. The default value is correctly documented astrue.
238-238: LGTM!The
router_track_active_blocksflag is correctly passed toKvRouterConfig::newas the 5th argument, maintaining proper parameter ordering.lib/llm/src/kv_router/scheduler.rs (3)
59-59: LGTM! Proper migration to optional token sequences.The change from
Vec<SequenceHash>toOption<Vec<SequenceHash>>for thetoken_seqfield aligns with the PR objective and is consistently applied throughout the scheduler.
217-218: LGTM! Proper handling of optional token sequences.The scheduler correctly passes
request.token_seq.clone()(which is nowOption<Vec<SequenceHash>>) topotential_blocks_and_tokens.
72-83: No concurrent-respond risk — Mutex not neededSchedulingRequest is moved into the scheduler's single background task (request_rx created at lib/llm/src/kv_router/scheduler.rs:192 and handled in the spawned task) and request.respond(response) is invoked there (lib/llm/src/kv_router/scheduler.rs:243), so resp_tx is owned by the moved request and cannot be accessed concurrently.
Likely an incorrect or invalid review comment.
lib/llm/src/kv_router.rs (4)
104-105: LGTM! Configuration field properly added.The
router_track_active_blocksfield is correctly added with a clear documentation comment explaining its purpose.
316-327: Excellent optimization of sequence hash cloning!The conditional cloning logic based on who needs the sequence hashes is a smart optimization that avoids unnecessary memory allocations. The use of pattern matching makes the intent very clear.
365-368: LGTM! Consistent application of conditional sequence tracking.The conditional computation of sequence hashes based on
router_track_active_blocksis properly implemented usingthen(), which is idiomatic Rust.
269-269: LGTM! Proper threading of router UUID to scheduler.The
consumer_uuidis correctly passed toKvScheduler::startfor per-router identification.lib/llm/src/kv_router/sequence.rs (9)
107-110: Change from panic to silent return is reasonable for prefill-only flows.The removal logic now returns early when a block is missing instead of panicking. This aligns with the prefill-only mode where blocks aren't tracked, making the system more robust.
122-146: Optional token sequence handling correctly implements prefill-only mode.The conditional block tracking based on
token_sequencepresence is well-implemented. WhenNone, only prefill tokens are tracked without adding blocks or storing the sequence, which aligns with the PR objective to skip tracking active blocks for prefill workers.
298-305: Router UUID parsing with fallback is handled appropriately.The UUID parsing includes proper error handling with a warning log and fallback to a new UUID when parsing fails, ensuring the system remains operational.
458-532: Event subscription loop with cancellation is well-structured.The implementation properly uses
tokio::select!to handle both events and cancellation, ensuring clean shutdown. The event handling logic correctly propagates the optional token sequences and maintains worker mapping.
498-503: Worker not found warning is appropriate.The warning for missing workers provides good visibility into potential configuration issues without crashing the system.
779-826: Method signature update aligns with optional token sequences.The
potential_blocks_and_tokensmethod now accepts an optional token sequence, which correctly supports both standard and prefill-only modes.
861-998: Test updates demonstrate cross-instance synchronization.The tests have been properly migrated to use tokio and test cross-instance scenarios with optional token sequences. The test coverage includes both standard flows with token sequences and prefill-only flows without them.
1000-1127: Prefill-only test validates the new functionality.The test specifically validates the prefill-only mode where token sequences are not provided (
None), properly testing the mark_prefill_completed flow and ensuring tokens are tracked without blocks.
169-176: Verified: potential_blocks_and_tokens intentionally ignores new blocks when token_sequence is None.Function (lib/llm/src/kv_router/sequence.rs:163–176) computes potential_blocks = new_blocks(token_seq) + active_blocks when Some(token_seq), otherwise active_blocks; potential_tokens always includes new_tokens. Scheduler call sites (lib/llm/src/kv_router/scheduler.rs near lines 216 and 351) pass request.token_seq — if that is None callers will get active-only block counts.
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
|
/ok to test 4c043f4 |
Core changes:
Allow
sequence.rsto handle a None variant oftoken_sequence, in which case active blocks will not be tracked, and only active prefill tokens will be. This is in preparation for routing to prefill-only engines (in disagg). This can be enabled by the newrouter_track_active_blocksflag inKvRouterConfig, which will obviously be propagated both downstream and upstream to the Python-binding level. However, it is currently not exposed at the top Python CLI level, as we do not want the frontend to directly address the prefill engines (as of now). In addition,prefill_counter.rsis nuked as it is basically a subset duplication ofsequence.rs, and we don't want to maintain both.Not tracking cached blocks is easier. We simply do not use (or create) the
KvIndexerat all ifoverlap_score_weight = 0, which can be specified when we launch the frontend (targeting the decode workers). A caveat here is that the slot manager will still try to bookkeep the active prefill tokens (uncached), but this is a very small overhead and not worth adding more logic to disable.The reviewer should only need to focus on everything before line 420 in
sequence.rs, and the rest is really not super important.Non-core/unrelated/peripheral changes:
sequence.rsweren't actually configured properly to test the router replica sync mechanism. This is now fixed. In addition, added a new test to test handling the None variant oftoken_sequenceuuidof the slot manager to the consumer UUID of the router (this is technically not needed, but will tame the proliferation of random UUIDs everywhere)token_sequencefor theKvSchedulerand theApproxKvIndexer; only clone if actually needed.tokio::select!so it can be shut down on the cancellation token