
Conversation

Contributor

@PeaBrane PeaBrane commented Aug 4, 2025

Where should the reviewer start

Core:

  • sequence.rs: the local accounting of decode blocks and prefill tokens is now broadcast (to remote Router replicas) via NATS, so that all Router replicas stay in sync

"Dead code" for now, but useful for future:

  • publisher.rs: push-based publishing of load metrics over NATS, triggered by changes to core metrics
  • prefill_counter.rs: a simplified version of sequence.rs (the slot simulator) that tracks only prefill tokens, in case we later revert to tracking active blocks from load events

Minor:

  • scheduler.rs: simplified the background scheduling loop by removing the two nested loops, in preparation for request queueing, migration, and throttling. Also moved the state updates into the scheduling loop instead of separate methods.

All other changes are trivial, mainly propagating the replica_sync flag (defaulting to false).

Description

Slot tracking is synced between Router replicas

Polling active prefill tokens and decode blocks from the engine is neither desirable nor easy: these are very latency-sensitive metrics, and extracting them from the backend engine is fairly intrusive (and annoying). However, it is fairly easy to track them locally within the Router based on the request-response cycle. For example, we know how many new prefill tokens a request needs from its ISL and the overlap_blocks reported by the KvIndexer, and we can treat the first output token as the signal that prefill for that request has completed. The hard part, then, is syncing this state between Router replicas.

The proposal is for each Router to emit an event on every local update, which the other Router replicas listen to. The overhead should be fairly small: there are only two events per request, and the event body is tiny (just the request ID and the number of new tokens). The data structure mapping request_id to new_tokens is a DashMap, which allows concurrent direct local updates and event-driven updates from remote Routers.
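A minimal sketch of this bookkeeping (hypothetical names, not the actual types in prefill_counter.rs):

```rust
use dashmap::DashMap;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical sketch: a concurrent map from request_id to outstanding prefill
// tokens, plus a running total. Both local updates and NATS-delivered remote
// updates can mutate it without a global lock.
struct PrefillTracker {
    tokens_map: DashMap<String, usize>, // request_id -> new prefill tokens
    running_sum: AtomicUsize,           // total outstanding prefill tokens
}

impl PrefillTracker {
    // Called when routing a request locally, or on a remote NewPrefill event.
    fn add_prefill(&self, request_id: String, new_tokens: usize) {
        if let Some(old) = self.tokens_map.insert(request_id, new_tokens) {
            // Entry was replaced; back out the stale count first.
            self.running_sum.fetch_sub(old, Ordering::Relaxed);
        }
        self.running_sum.fetch_add(new_tokens, Ordering::Relaxed);
    }

    // Called when the first output token arrives (prefill is done),
    // locally or via a remote CompletePrefill event.
    fn complete_prefill(&self, request_id: &str) {
        if let Some((_, tokens)) = self.tokens_map.remove(request_id) {
            self.running_sum.fetch_sub(tokens, Ordering::Relaxed);
        }
    }
}
```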

Load metrics changed to push-based

Previously, load metrics were polled from the backend engines at a 100 ms interval, which was too slow. More recently, they were predicted/accounted for directly on the Router side, but that made the Router impossible to replicate, since the active block states were not naturally synced by events.

Now, the proposal is to make load metrics publishing push-based. During WorkerMetricsPublisher.create_endpoint(), a background process starts relaying a LoadEvent over NATS whenever the key load metrics (kv_active_blocks and num_requests_waiting) change, debounced so that an event is sent only once the values have been stable for 1 ms. This balances low latency against not overloading the NATS server (e.g. by sending an event on every engine forward pass).
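A minimal sketch of that debounce pattern, assuming a tokio watch channel feeds the latest metrics (names illustrative, not the actual publisher code):

```rust
use std::time::Duration;
use tokio::{select, sync::watch, time::sleep};

// Hypothetical debounced relay: every metrics change re-arms a 1 ms timer;
// a LoadEvent is published only once the values stay unchanged for that window.
async fn relay_load_events(mut rx: watch::Receiver<(u64, u64)>) {
    let mut pending: Option<(u64, u64)> = None; // (kv_active_blocks, num_requests_waiting)
    loop {
        select! {
            changed = rx.changed() => {
                if changed.is_err() { break; } // sender dropped: shut down
                pending = Some(*rx.borrow()); // stash the latest values
            }
            // Timer branch is only armed while a change is pending; because the
            // sleep is re-created each loop iteration, further changes reset it.
            _ = sleep(Duration::from_millis(1)), if pending.is_some() => {
                let (kv_active_blocks, num_requests_waiting) = pending.take().unwrap();
                // ... publish LoadEvent { kv_active_blocks, num_requests_waiting } over NATS ...
                println!("publish: {kv_active_blocks} blocks, {num_requests_waiting} waiting");
            }
        }
    }
}
```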

Benchmarks

(benchmark plots attached to the PR)

Summary by CodeRabbit

  • New Features

    • Introduced a system for tracking and synchronizing prefill tokens across distributed workers, enhancing visibility into token usage and resource allocation.
    • Added event-driven metrics publishing for load metrics, enabling real-time monitoring via NATS messaging.
    • Expanded protocol support with new event types for prefill and active block state changes.
  • Refactor

    • Simplified and improved scheduling and resource tracking logic, reducing synchronization overhead and improving efficiency.
    • Replaced and streamlined internal data structures for active block and token tracking.
    • Updated and renamed several public structures and methods for clarity and consistency.
  • Bug Fixes

    • Corrected naming inconsistencies in metrics endpoint handlers.
  • Tests

    • Added comprehensive tests for distributed token tracking and metrics publishing behavior.

@PeaBrane PeaBrane changed the title from "feat: push-based load metrics + prefill tokens with sync" to "feat: allow co-aware Router replicas = push-based load metrics + prefill tokens with sync" Aug 4, 2025
@PeaBrane PeaBrane marked this pull request as ready for review August 4, 2025 04:34
@PeaBrane PeaBrane requested a review from a team as a code owner August 4, 2025 04:34
Contributor

coderabbitai bot commented Aug 4, 2025

Walkthrough

This change refactors the KV router and scheduler system by removing sequence-based tracking and mutex locking, introducing a new event-driven prefill token and active block tracking mechanism using PrefillCounter and NATS events. Several new data structures and protocols are added, and metrics publishing is implemented for distributed coordination.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Dependency Addition: `lib/llm/Cargo.toml` | Adds `dashmap` v6 as a new dependency. |
| Endpoint Import Refactor: `components/metrics/src/lib.rs`, `lib/llm/src/kv_router/metrics_aggregator.rs` | Changes the `Endpoint` import to use the local `scoring::Endpoint` struct. |
| KV Router Core Refactor: `lib/llm/src/kv_router.rs` | Removes mutex and free logic, updates method signatures, integrates the new prefill counter, and adjusts scheduling logic. Adds a new public module and constants, and removes resource freeing. |
| Prefill Counter Implementation: `lib/llm/src/kv_router/prefill_counter.rs` | Introduces `PrefillCounter` and `PrefillCountersMultiWorker` for distributed prefill token tracking with NATS synchronization, plus tests. |
| Protocol Extensions: `lib/llm/src/kv_router/protocols.rs` | Adds `PrefillEvent`, `ActiveBlockEvent`, and related enums for structured event-driven coordination. |
| Metrics Publishing and Endpoint Handler Fixes: `lib/llm/src/kv_router/publisher.rs` | Adds a background task for NATS metrics publishing, corrects endpoint handler naming, and adds an async test for publishing behavior. |
| Scheduler Refactor: `lib/llm/src/kv_router/scheduler.rs` | Replaces sequence tracking with prefill/active-block event-driven tracking, updates scheduling logic, and simplifies background tasks. |
| Scoring Structs Introduction: `lib/llm/src/kv_router/scoring.rs` | Defines new public `LoadEvent` and `Endpoint` structs, implements worker ID extraction, and replaces previous imports. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant KvRouter
    participant KvScheduler
    participant PrefillCounter
    participant NATS
    participant Worker

    Client->>KvRouter: Request scheduling (context_id, tokens, overlaps)
    KvRouter->>KvScheduler: schedule(context_id, tokens, overlaps)
    KvScheduler->>PrefillCounter: Query prefill tokens
    KvScheduler->>NATS: Listen for LoadEvent (active blocks)
    KvScheduler->>KvRouter: Return selected worker_id
    KvRouter->>Worker: Route request to selected worker

    Worker->>PrefillCounter: Publish NewPrefill/UpdatePrefill/CompletePrefill via NATS
    PrefillCounter->>NATS: Broadcast prefill events
    PrefillCounter->>PrefillCounter: Synchronize state on event receipt
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

Poem

In the warren where tokens leap and bound,
Prefill counters now keep track all around.
With NATS they chatter, events in the air,
No more mutexes, less code to repair!
Metrics hop swiftly from worker to den—
A rabbit’s delight: robust code once again! 🐇✨


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
lib/llm/src/kv_router/prefill_counter.rs (3)

113-114: Address the TODO: Filter self-generated events.

The TODO comment indicates a need to filter out events generated by the same instance to avoid processing own events. This could lead to unnecessary processing and potential issues.

Would you like me to implement a solution that includes the instance ID in events and filters them during subscription?


103-108: Consider handling background task lifecycle.

The spawned task for event subscription doesn't have a clean shutdown mechanism. While the task will log errors, it might continue running even after the PrefillCounter is dropped.

Consider storing a cancellation token to enable clean shutdown:

```diff
 pub struct PrefillCounter {
     state: Arc<RwLock<PrefillCounterState>>,
     component: Component,
+    cancellation_token: CancellationToken,
 }
```

140-153: Improve error handling for missing entries in UpdatePrefill.

When handling UpdatePrefill, if the entry doesn't exist, the code silently continues. Consider logging this case as it might indicate an ordering issue or lost events.

```diff
 PrefillEventData::UpdatePrefill(new_tokens) => {
     let state_write = state.write().await;
     let Some(old_tokens_ref) = state_write.tokens_map.get(&event.request_id) else {
+        tracing::debug!("Received UpdatePrefill for unknown request_id: {}", event.request_id);
         continue;
     };
```
lib/llm/src/kv_router/scheduler.rs (1)

226-230: Consider error handling for prefill token updates.

The error from add_prefill is silently ignored with let _. Consider at least logging failures for debugging purposes.

```diff
-        let _ = self
-            .prefill_tokens
-            .add_prefill(best_worker_id, request_id, response.new_tokens)
-            .await;
+        if let Err(e) = self
+            .prefill_tokens
+            .add_prefill(best_worker_id, request_id, response.new_tokens)
+            .await {
+            tracing::warn!("Failed to add prefill tokens: {}", e);
+        }
```
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dbb4caa and d937608.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • lib/bindings/python/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • components/metrics/src/lib.rs (1 hunks)
  • lib/llm/Cargo.toml (1 hunks)
  • lib/llm/src/kv_router.rs (5 hunks)
  • lib/llm/src/kv_router/metrics_aggregator.rs (1 hunks)
  • lib/llm/src/kv_router/prefill_counter.rs (1 hunks)
  • lib/llm/src/kv_router/protocols.rs (2 hunks)
  • lib/llm/src/kv_router/publisher.rs (4 hunks)
  • lib/llm/src/kv_router/scheduler.rs (7 hunks)
  • lib/llm/src/kv_router/scoring.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (15)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
📚 Learning: in lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating lo...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Applied to files:

  • lib/llm/src/kv_router/metrics_aggregator.rs
  • components/metrics/src/lib.rs
  • lib/llm/src/kv_router/scoring.rs
  • lib/llm/src/kv_router/scheduler.rs
  • lib/llm/src/kv_router/publisher.rs
  • lib/llm/src/kv_router.rs
📚 Learning: in lib/llm/src/kv_router/scoring.rs, peabrane prefers panic-based early failure over result-based er...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Applied to files:

  • lib/llm/src/kv_router/metrics_aggregator.rs
  • components/metrics/src/lib.rs
  • lib/llm/src/kv_router/scoring.rs
  • lib/llm/src/kv_router/scheduler.rs
  • lib/llm/src/kv_router/publisher.rs
  • lib/llm/src/kv_router.rs
📚 Learning: in the kv router scheduler code, peabrane prefers fail-fast behavior over silent failure handling. w...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.

Applied to files:

  • lib/llm/src/kv_router/metrics_aggregator.rs
  • components/metrics/src/lib.rs
  • lib/llm/src/kv_router/scoring.rs
  • lib/llm/src/kv_router/scheduler.rs
  • lib/llm/src/kv_router/publisher.rs
  • lib/llm/src/kv_router.rs
📚 Learning: the `create_endpoint` method in `workermetricspublisher` has backward compatibility maintained throu...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.

Applied to files:

  • lib/llm/src/kv_router/metrics_aggregator.rs
  • components/metrics/src/lib.rs
  • lib/llm/src/kv_router/publisher.rs
📚 Learning: in docs/support_matrix.md, the nixl version difference between runtime dependencies (0.5.0) and buil...
Learnt from: dmitry-tokarev-nv
PR: ai-dynamo/dynamo#2179
File: docs/support_matrix.md:61-63
Timestamp: 2025-07-30T00:34:35.810Z
Learning: In docs/support_matrix.md, the NIXL version difference between runtime dependencies (0.5.0) and build dependencies (0.4.0) is intentional and expected, not an error that needs to be corrected.

Applied to files:

  • lib/llm/Cargo.toml
📚 Learning: in lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block...
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

Applied to files:

  • lib/llm/src/kv_router/protocols.rs
  • lib/llm/src/kv_router/scoring.rs
  • lib/llm/src/kv_router/prefill_counter.rs
  • lib/llm/src/kv_router/scheduler.rs
  • lib/llm/src/kv_router/publisher.rs
  • lib/llm/src/kv_router.rs
📚 Learning: in lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_so...
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Applied to files:

  • lib/llm/src/kv_router/prefill_counter.rs
  • lib/llm/src/kv_router.rs
📚 Learning: in the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the `...
Learnt from: jthomson04
PR: ai-dynamo/dynamo#1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the `wait_for_key_count` function correctly uses exact equality (`==`) instead of greater-than-or-equal (`>=`) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.

Applied to files:

  • lib/llm/src/kv_router/prefill_counter.rs
  • lib/llm/src/kv_router/publisher.rs
📚 Learning: in rust/tokio applications, when background tasks use channels for communication, dropping the sende...
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.

Applied to files:

  • lib/llm/src/kv_router/prefill_counter.rs
  • lib/llm/src/kv_router/scheduler.rs
  • lib/llm/src/kv_router/publisher.rs
  • lib/llm/src/kv_router.rs
📚 Learning: in lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister ...
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Applied to files:

  • lib/llm/src/kv_router/prefill_counter.rs
  • lib/llm/src/kv_router/scheduler.rs
  • lib/llm/src/kv_router/publisher.rs
  • lib/llm/src/kv_router.rs
📚 Learning: the codebase uses async-nats version 0.40, not the older nats crate. error handling should use async...
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Applied to files:

  • lib/llm/src/kv_router/publisher.rs
📚 Learning: in rust async code, when an arc<mutex<_>> is used solely to transfer ownership of a resource (like a...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.

Applied to files:

  • lib/llm/src/kv_router/publisher.rs
  • lib/llm/src/kv_router.rs
📚 Learning: in async-nats, the "no responders" error is represented as async_nats::error::requesterrorkind::nore...
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::error::RequestErrorKind::NoResponders. Use err.downcast_ref::<async_nats::error::RequestError>() and then check req_err.kind() against RequestErrorKind::NoResponders to handle this error properly.

Applied to files:

  • lib/llm/src/kv_router/publisher.rs
📚 Learning: the asyncenginecontextprovider trait in lib/runtime/src/engine.rs was intentionally changed from `se...
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Applied to files:

  • lib/llm/src/kv_router.rs
🔇 Additional comments (25)
lib/llm/Cargo.toml (1)

88-88: LGTM! Appropriate dependency for concurrent data structures.

The dashmap = "6" dependency addition correctly supports the new PrefillCountersMultiWorker implementation that requires thread-safe concurrent HashMap operations.

lib/llm/src/kv_router/metrics_aggregator.rs (1)

21-21: Import path correctly updated for architectural refactoring.

The change from scheduler::Endpoint to scoring::Endpoint aligns with the new local Endpoint definition that includes non-optional LoadMetrics data and worker ID extraction functionality.

components/metrics/src/lib.rs (1)

87-87: Consistent import path update.

The import change from scheduler::Endpoint to scoring::Endpoint maintains consistency with the architectural refactoring across the codebase.

lib/llm/src/kv_router/scoring.rs (2)

22-26: Well-designed event structure for load metrics transport.

The LoadEvent struct provides a clean abstraction for transporting worker-specific load metrics in the new event-driven architecture.


28-50: Solid implementation with appropriate fail-fast behavior.

The new Endpoint struct and worker_id() method are well-designed:

  • Non-optional data field ensures clean metrics handling
  • The worker_id() method correctly extracts and parses hexadecimal worker IDs from subject strings
  • Panic-based error handling with expect() aligns with the preferred fail-fast approach for catching invalid data early
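For illustration, a fail-fast extraction in that spirit might look like the following sketch (the subject layout, with the worker ID as a trailing hex segment, is an assumption):

```rust
// Hypothetical sketch: pull the trailing hex-encoded worker ID out of a NATS
// subject such as "ns.component.endpoint-694d9381ab4c3f10", panicking on
// malformed input to fail fast.
fn worker_id_from_subject(subject: &str) -> i64 {
    let (_, hex) = subject
        .rsplit_once('-')
        .expect("subject must contain a worker ID suffix");
    i64::from_str_radix(hex, 16).expect("worker ID suffix must be valid hex")
}
```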
lib/llm/src/kv_router/protocols.rs (3)

4-4: Appropriate import expansion for event protocol needs.

Adding both Token and SequenceHash imports supports the new event structures that track token prefilling and block lifecycle events.


119-134: Well-designed prefill event protocol.

The PrefillEvent structure and PrefillEventData enum provide a clean protocol for tracking prefill token lifecycle:

  • Clear event stages: NewPrefill, UpdatePrefill, CompletePrefill
  • Proper request identification with request_id
  • Token count tracking with usize values
  • Complete trait implementations for serialization and debugging

136-146: Solid active block event protocol design.

The ActiveBlockEvent structure and ActiveBlockEventData enum effectively handle active block lifecycle:

  • Clear block operations: NewBlock with sequence hashes, FreeBlock for cleanup
  • Consistent request tracking pattern
  • Proper use of Vec<SequenceHash> for block identification
  • Complete trait implementations
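Pieced together from the fields these two comments mention, the event types presumably look roughly like the sketch below (not the exact definitions in protocols.rs; the SequenceHash alias is assumed):

```rust
use serde::{Deserialize, Serialize};

type SequenceHash = u64; // assumed alias; the real one lives elsewhere in the crate

// Sketch of the prefill lifecycle events described above.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PrefillEvent {
    pub request_id: String,
    pub data: PrefillEventData,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum PrefillEventData {
    NewPrefill(usize),    // prefill tokens for a newly routed request
    UpdatePrefill(usize), // corrected token count for an in-flight request
    CompletePrefill,      // first output token observed; prefill is done
}

// Sketch of the active-block lifecycle events.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ActiveBlockEvent {
    pub request_id: String,
    pub data: ActiveBlockEventData,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum ActiveBlockEventData {
    NewBlock(Vec<SequenceHash>), // blocks newly active for this request
    FreeBlock,                   // request finished; its blocks can be freed
}
```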
lib/llm/src/kv_router.rs (5)

22-23: LGTM! Well-organized event subjects.

The new constants KV_METRICS_SUBJECT and PREFILL_SUBJECT follow the existing naming convention and clearly indicate their purpose for event-driven communication.

Also applies to: 54-55


228-231: Efficient scheduling without sequence tracking.

The removal of seq_hashes from the schedule call aligns well with the new event-driven architecture. The deferral of seq_hashes computation until after scheduling (lines 233-239) is a good optimization.


250-252: Parameter type improvement.

Changing from &String to &str is a good practice that makes the API more flexible and avoids unnecessary allocations.


333-334: Good architectural improvement.

Moving the prefill completion notification before yielding the first item ensures that prefill tracking is updated as soon as the first response is received, which aligns well with the event-driven design.


176-177: Double-Check Component Lifecycle in KvScheduler

The scheduler now takes a full Component clone instead of just the namespace string, increasing its strong-reference count. We’ve found multiple other long-lived objects also holding component.clone(), but no impl Drop for Component in this crate. Please verify that:

  • Component does not internally retain strong references back to any of these routers/schedulers, which could form an Arc cycle.
  • If you only need the namespace or a subset of the context, consider passing that directly or exposing a Weak<Component> instead.

Review all component.clone() usages in:

  • lib/llm/src/kv_router.rs (KvScheduler::start)
  • lib/llm/src/kv_router/scheduler.rs
  • lib/llm/src/kv_router/publisher.rs
  • lib/llm/src/kv_router/prefill_counter.rs
  • lib/llm/src/kv_router/metrics_aggregator.rs

Ensuring no circular references will prevent unintended memory leaks or lifecycle issues.

lib/llm/src/kv_router/prefill_counter.rs (2)

55-85: Well-designed concurrent state management.

The use of DashMap for tokens_map with atomic operations for running_sum provides excellent concurrency guarantees. The implementation correctly handles both insert and remove operations with proper atomic updates.


252-259: Warning log for existing request reassignment.

Good defensive programming with the warning log when a request is reassigned to a different worker. This helps detect potential bugs in the routing logic.

lib/llm/src/kv_router/publisher.rs (5)

506-514: Good defensive programming for worker ID retrieval.

The implementation properly handles both dynamic (with lease) and static components, with appropriate warning for the static case.


606-606: Fixed typo in struct name.

Good catch on fixing the typo from KvLoadEndpoingHander to KvLoadEndpointHandler.

Also applies to: 610-610, 618-618


533-603: Well-implemented debounced metrics publishing.

The implementation correctly:

  • Debounces metrics updates by 1ms to avoid NATS overload
  • Only publishes when specific load metrics change
  • Properly handles the watch channel lifecycle
  • Uses select! for concurrent timer and change handling

146-147: Integer overflow protection in delta calculation.

The delta calculation correctly uses isize to handle the case where new_tokens might be less than old_tokens, preventing potential underflow.
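Illustratively, the pattern amounts to:

```rust
// Illustrative only: a signed delta avoids underflow when new_tokens < old_tokens.
fn token_delta(old_tokens: usize, new_tokens: usize) -> isize {
    new_tokens as isize - old_tokens as isize
}
```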


972-1083: Comprehensive integration test.

The test thoroughly validates the debouncing behavior:

  • Verifies that only the last metric in a rapid sequence is published
  • Confirms that unchanged load metrics don't trigger publications
  • Uses appropriate timeouts and assertions
lib/llm/src/kv_router/scheduler.rs (5)

104-105: Good use of DashMap for concurrent state.

Using DashMap for active_blocks is the right choice as it allows concurrent reads and writes from multiple tasks without explicit locking.


351-363: Clear and correct worker selection logic.

The new formula for calculating logits based on prefill tokens and decode blocks is well-documented and correctly implemented. The removal of max_logit normalization simplifies the algorithm without affecting correctness.


277-280: Good optimization in softmax normalization.

The comment clarifies that full min-max normalization isn't necessary since we only need relative values for probability calculation. This is a good optimization.


137-146: Efficient non-blocking instance updates.

The scheduler efficiently checks for instance updates without blocking, ensuring scheduling requests are processed promptly while keeping the instance list current.
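A minimal sketch of that shape (hypothetical channel types, not the scheduler's actual signature):

```rust
use tokio::sync::mpsc;

// Hypothetical sketch: await the next scheduling request, then drain any
// queued instance-list updates without blocking before selecting a worker.
async fn scheduling_loop(
    mut request_rx: mpsc::Receiver<String>,
    mut instances_rx: mpsc::Receiver<Vec<i64>>,
) {
    let mut instances: Vec<i64> = Vec::new();
    while let Some(request_id) = request_rx.recv().await {
        // Non-blocking: pick up the freshest worker list if one is queued.
        while let Ok(update) = instances_rx.try_recv() {
            instances = update;
        }
        // ... score `instances` and route `request_id` to the best worker ...
        let _ = (&instances, &request_id);
    }
}
```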


110-127: Consider handling subscription errors more gracefully.

The subscription task uses expect() which will panic if subscription fails. Consider more graceful error handling with retry logic.

```diff
-        let mut load_subscriber = ns_clone
-            .subscribe_with_type::<LoadEvent>(KV_METRICS_SUBJECT)
-            .await
-            .expect("Cannot launch load subscriber");
+        let load_subscriber_result = ns_clone
+            .subscribe_with_type::<LoadEvent>(KV_METRICS_SUBJECT)
+            .await;
+
+        let mut load_subscriber = match load_subscriber_result {
+            Ok(subscriber) => subscriber,
+            Err(e) => {
+                tracing::error!("Failed to subscribe to load events: {}", e);
+                return;
+            }
+        };
```
⛔ Skipped due to learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::client::RequestErrorKind::NoResponders, not async_nats::Error::NoResponders. Use err.downcast_ref::<async_nats::client::RequestError>() and then check request_err.kind() against RequestErrorKind::NoResponders.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::error::RequestErrorKind::NoResponders. Use err.downcast_ref::<async_nats::error::RequestError>() and then check req_err.kind() against RequestErrorKind::NoResponders to handle this error properly.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.

@tedzhouhk
Copy link
Contributor

I'm OK to merge this as the results show better perf.
Some high-level thoughts:

  1. I don't think a broadcast-based system would scale well. However, we can run some more benchmarks to see whether this, or something else, becomes the bottleneck.
  2. To solve the lagging-metric problem, shall we run more benchmarks and see where requests actually get stuck in the pipeline? Based on the results, we may be able to design a simpler algorithm to track the load.

Contributor

@tedzhouhk tedzhouhk left a comment


approve to unblock
