-
Notifications
You must be signed in to change notification settings - Fork 676
feat: don't modify kv scheduler states on query + more python binding #2798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
WalkthroughAdds an update_states flag across routing and scheduling paths, propagating into KvScheduler::schedule. Introduces query_instance_id annotation handling that bypasses routing and state updates, returning instance metadata early. Non-annotated flows continue to route as before, now optionally updating states based on the flag. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Client
participant KvRouter
participant Scheduler
participant Worker
rect rgba(230,240,255,0.5)
note over Client,KvRouter: PreprocessedRequest path
Client->>KvRouter: request (may have annotation)
alt has annotation "query_instance_id"
KvRouter->>Scheduler: find_best_match(update_states=false)
Scheduler-->>KvRouter: worker_instance_id, token_data
KvRouter-->>Client: early response (no routing)
else no annotation
KvRouter->>Scheduler: find_best_match(update_states=true)
Scheduler-->>KvRouter: best match (instance, overlap)
KvRouter->>Worker: route request (with estimated_prefix_hit_num_blocks)
Worker-->>KvRouter: stream/response
KvRouter-->>Client: forward response
end
end
sequenceDiagram
autonumber
participant KvRouter
participant Scheduler
participant StateStore as SchedulerState
KvRouter->>Scheduler: schedule(..., update_states)
Scheduler-->>KvRouter: SchedulingResponse
alt update_states == true
Scheduler->>StateStore: add_request(...)
StateStore-->>Scheduler: ok
else update_states == false
note over Scheduler: Skip state update
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/llm/src/kv_router/scheduler.rs (1)
31-34: Typo in user-facing error message."aviailable" → "available".
Apply:
- #[error("no endpoints aviailable to route work")] + #[error("no endpoints available to route work")]
🧹 Nitpick comments (3)
lib/llm/src/kv_router/scheduler.rs (1)
313-315: Usethread_rng()withgen_range/genfor randomnessWorkspace depends on rand v0.9.0, so
rng.random_rangeandrng.randomcompile, but for consistency with the current rand API, replace withthread_rng()+gen_range/gen:- let mut rng = rand::rng(); - let index = rng.random_range(0..min_keys.len()); + let mut rng = rand::thread_rng(); + let index = rand::Rng::gen_range(&mut rng, 0..min_keys.len());- let mut rng = rand::rng(); - let sample: f64 = rng.random(); + let mut rng = rand::thread_rng(); + let sample: f64 = rand::Rng::gen(&mut rng);lib/llm/src/kv_router.rs (2)
371-374: Annotation string: de-dupe and centralize to avoid typos.Use a single constant for "query_instance_id".
Example:
+const QUERY_INSTANCE_ID: &str = "query_instance_id"; ... - let query_instance_id = request.has_annotation("query_instance_id"); + let query_instance_id = request.has_annotation(QUERY_INSTANCE_ID);
399-407: Minimize sensitive token logging.Even at trace, logging raw token IDs can be sensitive and noisy.
Apply:
- tracing::trace!( - "Tokens requested in the response through the query_instance_id annotation: {:?}", - response_tokens - ); + tracing::trace!("query_instance_id: returning worker_instance_id and token_data (len={})", + request.token_ids.len());
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
lib/llm/src/kv_router.rs(5 hunks)lib/llm/src/kv_router/scheduler.rs(4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.303Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
📚 Learning: 2025-08-29T10:08:18.414Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.414Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.
Applied to files:
lib/llm/src/kv_router.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Build and Test - vllm
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (8)
lib/llm/src/kv_router/scheduler.rs (4)
58-60: Good: explicit gate for state updates.Adding
update_stateson the request clearly separates "query-only" from "mutating" flows.
65-75: Correct fix: make respond idempotent.Switching to
&mut selfwithresp_tx.take()prevents double-send and preserves ownership.
209-220: Meets PR objective: skip state mutations on query.Conditionally calling
add_requestonly whenupdate_statesis true avoids corrupting slot accounting.If helpful, I can add a unit/integration test that asserts
add_requestis not called whenupdate_states=falseusing a test double forActiveSequencesMultiWorker.
248-256: Signature change verified—onlyschedule()call site inlib/llm/src/kv_router.rswas updated and correctly passesupdate_states.lib/llm/src/kv_router.rs (4)
271-277: API: explicitupdate_statesparameter is appropriate.Threading
update_statesintofind_best_matchaligns the router with scheduler behavior.
381-386: Correct: do not update states for query-only requests.
!query_instance_idcleanly flips the flag based on the annotation.
297-303: Confirm intent: ApproxKvIndexer still updates on query.
process_routing_decision(...)runs regardless ofupdate_states. If the PR’s “don’t modify states on query” only targets slot accounting, this is fine. If it also aims to avoid indexer feedback on queries, guard this withupdate_states.Do you want me to gate this call behind
if update_states { ... }?
333-336: Use update_states=false in deprecatedgeneratepath
Passingtruetofind_best_matchhere allocates slot state that is never freed. Change tofalseto prevent leaking state in this deprecated AsyncEngine implementation.
Location: lib/llm/src/kv_router.rs:333
Please confirm no downstream consumers rely on state mutation in this path.⛔ Skipped due to learnings
Learnt from: PeaBrane PR: ai-dynamo/dynamo#2756 File: lib/bindings/python/rust/llm/kv.rs:401-436 Timestamp: 2025-08-29T10:08:18.414Z Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
894e1d4 to
18075dd
Compare
Signed-off-by: PeaBrane <yanrpei@gmail.com>
…#2798) Signed-off-by: PeaBrane <yanrpei@gmail.com>
…#2798) Signed-off-by: PeaBrane <yanrpei@gmail.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
Overview:
As titled, if the request is just to query the best worker id, do not add the request tokens to the slot manager/tracker, otherwise it will throw the accounting off.
Also exposed (recovered) more functionalities to Python land:
best_worker_idwithout routingworker_idSummary by CodeRabbit