@PeaBrane PeaBrane commented Aug 30, 2025

Overview:

As titled: if a request only queries the best worker ID, do not add its tokens to the slot manager/tracker. Otherwise the tracker counts tokens for a request that is never dispatched, which throws the load accounting off.

Also exposed (restored) more functionality to the Python bindings:

  • query/probe best_worker_id without routing
  • direct a request to a worker by specifying the worker_id
  • get the potential engine loads (prefill tokens + active blocks) for the user to write their own routing decision
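As an illustration of the last item, a user-written routing decision over per-worker potential loads could look roughly like the sketch below. The `PotentialLoad` struct, its field names, the `pick_worker` helper, and the cost formula are all assumptions made for this example; they are not the types or logic exposed by the actual bindings.

```rust
/// Hypothetical per-worker load snapshot. The field names are assumptions
/// for illustration, not the real binding's return type.
#[derive(Debug, Clone, PartialEq)]
pub struct PotentialLoad {
    pub worker_id: i64,
    /// Prefill tokens the worker would need to process if it took the request.
    pub potential_prefill_tokens: u64,
    /// KV blocks that would be active on the worker if it took the request.
    pub potential_active_blocks: u64,
}

/// Pick the worker with the lowest combined cost.
/// `block_size` converts blocks into token units so both terms are comparable.
/// The weighting is an illustrative choice, not the router's actual formula.
pub fn pick_worker(loads: &[PotentialLoad], block_size: u64) -> Option<i64> {
    loads
        .iter()
        .min_by_key(|l| l.potential_prefill_tokens + l.potential_active_blocks * block_size)
        .map(|l| l.worker_id)
}
```

With a custom decision like this, the chosen `worker_id` would then be passed to the "direct a request to a worker" path listed above.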

Summary by CodeRabbit

  • New Features
    • Added support for a query_instance_id request annotation to retrieve the target worker instance and token metadata without dispatching the request or altering routing state.
  • Refactor
    • Routing now conditionally updates internal state only when appropriate, preventing side effects from query-only requests.
  • Documentation
    • Clarified behavior: query-like requests can avoid scheduling side effects.
  • Notes
    • No changes to public APIs; existing integrations continue to work as before.

Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
@PeaBrane PeaBrane requested a review from a team as a code owner August 30, 2025 23:27
@PeaBrane PeaBrane requested a review from atchernych August 30, 2025 23:27
@github-actions github-actions bot added the feat label Aug 30, 2025
@PeaBrane PeaBrane requested a review from biswapanda August 30, 2025 23:27
@PeaBrane PeaBrane self-assigned this Aug 30, 2025
coderabbitai bot commented Aug 30, 2025

Walkthrough

Adds an update_states flag across routing and scheduling paths, propagating into KvScheduler::schedule. Introduces query_instance_id annotation handling that bypasses routing and state updates, returning instance metadata early. Non-annotated flows continue to route as before, now optionally updating states based on the flag.
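The gating described in this walkthrough can be sketched in a few lines. This is a simplified model under stated assumptions, not the code in lib/llm/src/kv_router.rs: only the annotation name `query_instance_id` and the `update_states` flag come from the PR, while `Request`, `Scheduler`, `Outcome`, `handle`, and the least-loaded selection rule are hypothetical stand-ins.

```rust
use std::collections::HashMap;

/// Annotation name from the PR; everything else below is a simplified stand-in.
pub const QUERY_INSTANCE_ID: &str = "query_instance_id";

/// Hypothetical request carrying token IDs and string annotations.
pub struct Request {
    pub token_ids: Vec<u32>,
    pub annotations: Vec<String>,
}

impl Request {
    pub fn has_annotation(&self, name: &str) -> bool {
        self.annotations.iter().any(|a| a == name)
    }
}

/// Toy scheduler that tracks the number of active tokens per worker.
pub struct Scheduler {
    pub active_tokens: HashMap<i64, usize>,
}

impl Scheduler {
    /// Select the least-loaded worker (ties broken by lowest ID).
    /// The load is recorded only when `update_states` is true.
    pub fn schedule(&mut self, token_count: usize, update_states: bool) -> Option<i64> {
        let best = self
            .active_tokens
            .iter()
            .min_by_key(|(id, load)| (**load, **id))
            .map(|(id, _)| *id)?;
        if update_states {
            *self.active_tokens.entry(best).or_insert(0) += token_count;
        }
        Some(best)
    }
}

pub enum Outcome {
    /// Early return for query-only requests: nothing is dispatched.
    QueryOnly { worker_instance_id: i64, token_data: Vec<u32> },
    /// Normal path: the request would be forwarded to this worker.
    Routed { worker_instance_id: i64 },
}

pub fn handle(scheduler: &mut Scheduler, request: Request) -> Option<Outcome> {
    let query_only = request.has_annotation(QUERY_INSTANCE_ID);
    // Query-only requests must not mutate scheduler state.
    let worker = scheduler.schedule(request.token_ids.len(), !query_only)?;
    Some(if query_only {
        Outcome::QueryOnly { worker_instance_id: worker, token_data: request.token_ids }
    } else {
        Outcome::Routed { worker_instance_id: worker }
    })
}
```

In this model, a query followed by a real request sees the same scheduler state that a real request alone would have seen, which is the property the PR is after.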

Changes

Cohort / File(s): Router state-update gating and query_instance_id handling (lib/llm/src/kv_router.rs)
Summary:
  • KvRouter::find_best_match gains update_states: bool and passes it to scheduler.schedule
  • RouterRequest path calls find_best_match with update_states = true
  • PreprocessedRequest and KvPushRouter detect has_annotation("query_instance_id"); if present, call find_best_match with update_states = false and return early with worker_instance_id and token_data; otherwise proceed with routing and set estimated_prefix_hit_num_blocks

Cohort / File(s): Scheduler API and conditional state updates (lib/llm/src/kv_router/scheduler.rs)
Summary:
  • SchedulingRequest adds pub update_states: bool
  • KvScheduler::schedule signature adds update_states: bool and populates it
  • State update via add_request is conditional on request.update_states
  • SchedulingRequest::respond now takes &mut self and sends via taken resp_tx

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant KvRouter
  participant Scheduler
  participant Worker

  rect rgba(230,240,255,0.5)
  note over Client,KvRouter: PreprocessedRequest path
  Client->>KvRouter: request (may have annotation)
  alt has annotation "query_instance_id"
    KvRouter->>Scheduler: find_best_match(update_states=false)
    Scheduler-->>KvRouter: worker_instance_id, token_data
    KvRouter-->>Client: early response (no routing)
  else no annotation
    KvRouter->>Scheduler: find_best_match(update_states=true)
    Scheduler-->>KvRouter: best match (instance, overlap)
    KvRouter->>Worker: route request (with estimated_prefix_hit_num_blocks)
    Worker-->>KvRouter: stream/response
    KvRouter-->>Client: forward response
  end
  end
sequenceDiagram
  autonumber
  participant KvRouter
  participant Scheduler
  participant StateStore as SchedulerState

  KvRouter->>Scheduler: schedule(..., update_states)
  Scheduler-->>KvRouter: SchedulingResponse
  alt update_states == true
    Scheduler->>StateStore: add_request(...)
    StateStore-->>Scheduler: ok
  else update_states == false
    note over Scheduler: Skip state update
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A thump of paws, a flag set true,
I sniff the routes for something new.
Query whispers, “shh—don’t change the slate,”
I hop back quick, avoid the state.
When paths are clear, I leap and send—
Carrots queued, requests ascend. 🥕✨

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/kv_router/scheduler.rs (1)

31-34: Typo in user-facing error message.

"aviailable" → "available".

Apply:

-    #[error("no endpoints aviailable to route work")]
+    #[error("no endpoints available to route work")]
🧹 Nitpick comments (3)
lib/llm/src/kv_router/scheduler.rs (1)

313-315: Use thread_rng() with gen_range/gen for randomness

Workspace depends on rand v0.9.0, so rng.random_range and rng.random compile, but for consistency with the current rand API, replace with thread_rng() + gen_range/gen:

-        let mut rng = rand::rng();
-        let index = rng.random_range(0..min_keys.len());
+        let mut rng = rand::thread_rng();
+        let index = rand::Rng::gen_range(&mut rng, 0..min_keys.len());
-    let mut rng = rand::rng();
-    let sample: f64 = rng.random();
+    let mut rng = rand::thread_rng();
+    let sample: f64 = rand::Rng::gen(&mut rng);
lib/llm/src/kv_router.rs (2)

371-374: Annotation string: de-dupe and centralize to avoid typos.

Use a single constant for "query_instance_id".

Example:

+const QUERY_INSTANCE_ID: &str = "query_instance_id";
...
-                let query_instance_id = request.has_annotation("query_instance_id");
+                let query_instance_id = request.has_annotation(QUERY_INSTANCE_ID);

399-407: Minimize sensitive token logging.

Even at trace, logging raw token IDs can be sensitive and noisy.

Apply:

-                    tracing::trace!(
-                        "Tokens requested in the response through the query_instance_id annotation: {:?}",
-                        response_tokens
-                    );
+                    tracing::trace!("query_instance_id: returning worker_instance_id and token_data (len={})",
+                        request.token_ids.len());
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3f09c39 and 7c0468e.

📒 Files selected for processing (2)
  • lib/llm/src/kv_router.rs (5 hunks)
  • lib/llm/src/kv_router/scheduler.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.303Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
📚 Learning: 2025-08-29T10:08:18.414Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.414Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.

Applied to files:

  • lib/llm/src/kv_router.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (8)
lib/llm/src/kv_router/scheduler.rs (4)

58-60: Good: explicit gate for state updates.

Adding update_states on the request clearly separates "query-only" from "mutating" flows.


65-75: Correct fix: make respond idempotent.

Switching to &mut self with resp_tx.take() prevents double-send and preserves ownership.
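The take-then-send pattern mentioned here can be sketched as follows. This is a minimal model, not the PR's code: the channel type used by the real `SchedulingRequest` is not shown in this thread, so `std::sync::mpsc` and the `i64` payload are assumptions, as are the `new` constructor and the `bool` return value.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Simplified stand-in for the PR's SchedulingRequest.
/// Assumption: the response channel is modeled with std::sync::mpsc.
pub struct SchedulingRequest {
    pub update_states: bool,
    /// Wrapped in Option so the sender can be moved out through `&mut self`.
    resp_tx: Option<Sender<i64>>,
}

impl SchedulingRequest {
    pub fn new(update_states: bool) -> (Self, Receiver<i64>) {
        let (tx, rx) = channel();
        (Self { update_states, resp_tx: Some(tx) }, rx)
    }

    /// Send the response at most once.
    /// Returns false if a response was already sent or the receiver is gone.
    pub fn respond(&mut self, worker_id: i64) -> bool {
        match self.resp_tx.take() {
            Some(tx) => tx.send(worker_id).is_ok(),
            None => false,
        }
    }
}
```

Because `take()` leaves `None` behind, a second call to `respond` becomes a no-op instead of a second send, and the request itself stays usable (for example, to read `update_states`) after responding.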


209-220: Meets PR objective: skip state mutations on query.

Conditionally calling add_request only when update_states is true avoids corrupting slot accounting.

If helpful, I can add a unit/integration test that asserts add_request is not called when update_states=false using a test double for ActiveSequencesMultiWorker.
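A test along the lines offered above might be shaped like the following sketch. It assumes the slot tracker can be abstracted behind a trait; `SlotTracker`, `RecordingTracker`, and `schedule_with_tracker` are hypothetical names, and the `add_request` signature is a simplified guess rather than the one on ActiveSequencesMultiWorker.

```rust
/// Hypothetical trait standing in for the slot tracker
/// (ActiveSequencesMultiWorker in the PR); the signature is simplified.
pub trait SlotTracker {
    fn add_request(&mut self, request_id: &str, worker_id: i64, token_count: usize);
}

/// Test double that records every add_request call.
#[derive(Default)]
pub struct RecordingTracker {
    pub calls: Vec<(String, i64, usize)>,
}

impl SlotTracker for RecordingTracker {
    fn add_request(&mut self, request_id: &str, worker_id: i64, token_count: usize) {
        self.calls.push((request_id.to_string(), worker_id, token_count));
    }
}

/// Minimal stand-in for the scheduling step. Worker selection is elided:
/// the caller passes the already-selected worker, so only the gating is modeled.
pub fn schedule_with_tracker<T: SlotTracker>(
    tracker: &mut T,
    request_id: &str,
    selected_worker: i64,
    token_count: usize,
    update_states: bool,
) -> i64 {
    if update_states {
        tracker.add_request(request_id, selected_worker, token_count);
    }
    selected_worker
}
```

A test would then call `schedule_with_tracker` once with `update_states = false` and assert that `calls` is empty, and once with `true` and assert that exactly one call was recorded.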


248-256: Signature change verified—only schedule() call site in lib/llm/src/kv_router.rs was updated and correctly passes update_states.

lib/llm/src/kv_router.rs (4)

271-277: API: explicit update_states parameter is appropriate.

Threading update_states into find_best_match aligns the router with scheduler behavior.


381-386: Correct: do not update states for query-only requests.

!query_instance_id cleanly flips the flag based on the annotation.


297-303: Confirm intent: ApproxKvIndexer still updates on query.

process_routing_decision(...) runs regardless of update_states. If the PR’s “don’t modify states on query” only targets slot accounting, this is fine. If it also aims to avoid indexer feedback on queries, guard this with update_states.

Do you want me to gate this call behind if update_states { ... }?


333-336: Use update_states=false in deprecated generate path
Passing true to find_best_match here allocates slot state that is never freed. Change to false to prevent leaking state in this deprecated AsyncEngine implementation.
Location: lib/llm/src/kv_router.rs:333
Please confirm no downstream consumers rely on state mutation in this path.

⛔ Skipped due to learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.414Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.

Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
@PeaBrane PeaBrane changed the title feat: don't modify kv scheduler states on query feat: don't modify kv scheduler states on query + more python binding Sep 2, 2025
@PeaBrane PeaBrane disabled auto-merge September 2, 2025 23:03
@PeaBrane PeaBrane force-pushed the rupei/no-modify-states branch from 894e1d4 to 18075dd Compare September 2, 2025 23:07
Signed-off-by: PeaBrane <yanrpei@gmail.com>
@PeaBrane PeaBrane merged commit 383e3b3 into main Sep 3, 2025
16 of 18 checks passed
@PeaBrane PeaBrane deleted the rupei/no-modify-states branch September 3, 2025 00:09
dillon-cullinan pushed a commit that referenced this pull request Sep 5, 2025
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025
…#2798)

Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
4 participants