feat: Router replicas with state-sharing #2264
Conversation
Walkthrough

This change refactors the KV router and scheduler system by removing sequence-based tracking and mutex locking, introducing a new event-driven prefill-token and active-block tracking mechanism using `PrefillCounter` and NATS events. Several new data structures and protocols are added, and metrics publishing is implemented for distributed coordination.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant KvRouter
    participant KvScheduler
    participant PrefillCounter
    participant NATS
    participant Worker

    Client->>KvRouter: Request scheduling (context_id, tokens, overlaps)
    KvRouter->>KvScheduler: schedule(context_id, tokens, overlaps)
    KvScheduler->>PrefillCounter: Query prefill tokens
    KvScheduler->>NATS: Listen for LoadEvent (active blocks)
    KvScheduler->>KvRouter: Return selected worker_id
    KvRouter->>Worker: Route request to selected worker
    Worker->>PrefillCounter: Publish NewPrefill/UpdatePrefill/CompletePrefill via NATS
    PrefillCounter->>NATS: Broadcast prefill events
    PrefillCounter->>PrefillCounter: Synchronize state on event receipt
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
Actionable comments posted: 2
🧹 Nitpick comments (4)
lib/llm/src/kv_router/prefill_counter.rs (3)
113-114: Address the TODO: filter self-generated events.

The TODO comment indicates a need to filter out events generated by the same instance. Without such a filter, each replica reprocesses its own published events, causing redundant work and potentially double-applying updates.
Would you like me to implement a solution that includes the instance ID in events and filters them during subscription?
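One possible approach, sketched with a hypothetical `instance_id` field and the `uuid` crate (with its `serde` feature); none of these names are from the PR:

```rust
use serde::{Deserialize, Serialize};
use uuid::Uuid;

// Hypothetical envelope: tag every event with the publisher's instance ID.
#[derive(Debug, Serialize, Deserialize)]
struct TaggedEvent<T> {
    instance_id: Uuid,
    payload: T,
}

// During subscription, drop events this instance published itself.
fn should_process<T>(event: &TaggedEvent<T>, own_id: &Uuid) -> bool {
    event.instance_id != *own_id
}
```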
103-108: Consider handling background task lifecycle.

The spawned task for event subscription doesn't have a clean shutdown mechanism. While the task will log errors, it might continue running even after the `PrefillCounter` is dropped.
Consider storing a cancellation token to enable clean shutdown:
```diff
 pub struct PrefillCounter {
     state: Arc<RwLock<PrefillCounterState>>,
     component: Component,
+    cancellation_token: CancellationToken,
 }
```
140-153: Improve error handling for missing entries in UpdatePrefill.

When handling `UpdatePrefill`, if the entry doesn't exist, the code silently continues. Consider logging this case, as it might indicate an ordering issue or lost events.

```diff
 PrefillEventData::UpdatePrefill(new_tokens) => {
     let state_write = state.write().await;
     let Some(old_tokens_ref) = state_write.tokens_map.get(&event.request_id) else {
+        tracing::debug!("Received UpdatePrefill for unknown request_id: {}", event.request_id);
         continue;
     };
```

lib/llm/src/kv_router/scheduler.rs (1)
226-230: Consider error handling for prefill token updates.

The error from `add_prefill` is silently ignored with `let _`. Consider at least logging failures for debugging purposes.

```diff
-        let _ = self
-            .prefill_tokens
-            .add_prefill(best_worker_id, request_id, response.new_tokens)
-            .await;
+        if let Err(e) = self
+            .prefill_tokens
+            .add_prefill(best_worker_id, request_id, response.new_tokens)
+            .await
+        {
+            tracing::warn!("Failed to add prefill tokens: {}", e);
+        }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `Cargo.lock` is excluded by `!**/*.lock`
- `lib/bindings/python/Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (9)
- `components/metrics/src/lib.rs` (1 hunks)
- `lib/llm/Cargo.toml` (1 hunks)
- `lib/llm/src/kv_router.rs` (5 hunks)
- `lib/llm/src/kv_router/metrics_aggregator.rs` (1 hunks)
- `lib/llm/src/kv_router/prefill_counter.rs` (1 hunks)
- `lib/llm/src/kv_router/protocols.rs` (2 hunks)
- `lib/llm/src/kv_router/publisher.rs` (4 hunks)
- `lib/llm/src/kv_router/scheduler.rs` (7 hunks)
- `lib/llm/src/kv_router/scoring.rs` (1 hunks)
🧰 Additional context used
🧠 Learnings (15)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
📚 Learning: in lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating lo...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Applied to files:
- `lib/llm/src/kv_router/metrics_aggregator.rs`
- `components/metrics/src/lib.rs`
- `lib/llm/src/kv_router/scoring.rs`
- `lib/llm/src/kv_router/scheduler.rs`
- `lib/llm/src/kv_router/publisher.rs`
- `lib/llm/src/kv_router.rs`
📚 Learning: in lib/llm/src/kv_router/scoring.rs, peabrane prefers panic-based early failure over result-based er...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Applied to files:
- `lib/llm/src/kv_router/metrics_aggregator.rs`
- `components/metrics/src/lib.rs`
- `lib/llm/src/kv_router/scoring.rs`
- `lib/llm/src/kv_router/scheduler.rs`
- `lib/llm/src/kv_router/publisher.rs`
- `lib/llm/src/kv_router.rs`
📚 Learning: in the kv router scheduler code, peabrane prefers fail-fast behavior over silent failure handling. w...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Applied to files:
- `lib/llm/src/kv_router/metrics_aggregator.rs`
- `components/metrics/src/lib.rs`
- `lib/llm/src/kv_router/scoring.rs`
- `lib/llm/src/kv_router/scheduler.rs`
- `lib/llm/src/kv_router/publisher.rs`
- `lib/llm/src/kv_router.rs`
📚 Learning: the `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained throu...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
Applied to files:
- `lib/llm/src/kv_router/metrics_aggregator.rs`
- `components/metrics/src/lib.rs`
- `lib/llm/src/kv_router/publisher.rs`
📚 Learning: in docs/support_matrix.md, the nixl version difference between runtime dependencies (0.5.0) and buil...
Learnt from: dmitry-tokarev-nv
PR: ai-dynamo/dynamo#2179
File: docs/support_matrix.md:61-63
Timestamp: 2025-07-30T00:34:35.810Z
Learning: In docs/support_matrix.md, the NIXL version difference between runtime dependencies (0.5.0) and build dependencies (0.4.0) is intentional and expected, not an error that needs to be corrected.
Applied to files:
lib/llm/Cargo.toml
📚 Learning: in lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block...
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Applied to files:
- `lib/llm/src/kv_router/protocols.rs`
- `lib/llm/src/kv_router/scoring.rs`
- `lib/llm/src/kv_router/prefill_counter.rs`
- `lib/llm/src/kv_router/scheduler.rs`
- `lib/llm/src/kv_router/publisher.rs`
- `lib/llm/src/kv_router.rs`
📚 Learning: in lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_so...
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Applied to files:
- `lib/llm/src/kv_router/prefill_counter.rs`
- `lib/llm/src/kv_router.rs`
📚 Learning: in the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the `...
Learnt from: jthomson04
PR: ai-dynamo/dynamo#1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the `wait_for_key_count` function correctly uses exact equality (`==`) instead of greater-than-or-equal (`>=`) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.
Applied to files:
- `lib/llm/src/kv_router/prefill_counter.rs`
- `lib/llm/src/kv_router/publisher.rs`
📚 Learning: in rust/tokio applications, when background tasks use channels for communication, dropping the sende...
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Applied to files:
- `lib/llm/src/kv_router/prefill_counter.rs`
- `lib/llm/src/kv_router/scheduler.rs`
- `lib/llm/src/kv_router/publisher.rs`
- `lib/llm/src/kv_router.rs`
📚 Learning: in lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister ...
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Applied to files:
- `lib/llm/src/kv_router/prefill_counter.rs`
- `lib/llm/src/kv_router/scheduler.rs`
- `lib/llm/src/kv_router/publisher.rs`
- `lib/llm/src/kv_router.rs`
📚 Learning: the codebase uses async-nats version 0.40, not the older nats crate. error handling should use async...
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Applied to files:
lib/llm/src/kv_router/publisher.rs
📚 Learning: in Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a...
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
Applied to files:
- `lib/llm/src/kv_router/publisher.rs`
- `lib/llm/src/kv_router.rs`
📚 Learning: in async-nats, the "no responders" error is represented as async_nats::error::RequestErrorKind::NoRe...
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::error::RequestErrorKind::NoResponders. Use err.downcast_ref::<async_nats::error::RequestError>() and then check req_err.kind() against RequestErrorKind::NoResponders to handle this error properly.
Applied to files:
lib/llm/src/kv_router/publisher.rs
📚 Learning: the AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Se...
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Applied to files:
lib/llm/src/kv_router.rs
🔇 Additional comments (25)
lib/llm/Cargo.toml (1)
88-88: LGTM! Appropriate dependency for concurrent data structures.

The `dashmap = "6"` dependency addition correctly supports the new `PrefillCountersMultiWorker` implementation, which requires thread-safe concurrent HashMap operations.

lib/llm/src/kv_router/metrics_aggregator.rs (1)
21-21: Import path correctly updated for architectural refactoring.

The change from `scheduler::Endpoint` to `scoring::Endpoint` aligns with the new local `Endpoint` definition that includes non-optional `LoadMetrics` data and worker-ID extraction functionality.

components/metrics/src/lib.rs (1)
87-87: Consistent import path update.

The import change from `scheduler::Endpoint` to `scoring::Endpoint` maintains consistency with the architectural refactoring across the codebase.

lib/llm/src/kv_router/scoring.rs (2)
22-26: Well-designed event structure for load metrics transport.

The `LoadEvent` struct provides a clean abstraction for transporting worker-specific load metrics in the new event-driven architecture.
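For orientation, a plausible shape for such an event, using the two load metrics named later in this PR description; the field names and types here are assumptions, not the crate's actual definition:

```rust
use serde::{Deserialize, Serialize};

/// Illustrative payload; the PR names kv_active_blocks and
/// num_requests_waiting as the key load metrics.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct LoadMetrics {
    pub kv_active_blocks: u64,
    pub num_requests_waiting: u64,
}

/// Event broadcast over NATS, tagged with the emitting worker.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct LoadEvent {
    pub worker_id: i64,
    pub data: LoadMetrics,
}
```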
28-50: Solid implementation with appropriate fail-fast behavior.

The new `Endpoint` struct and `worker_id()` method are well-designed:

- Non-optional `data` field ensures clean metrics handling
- The `worker_id()` method correctly extracts and parses hexadecimal worker IDs from subject strings
- Panic-based error handling with `expect()` aligns with the preferred fail-fast approach for catching invalid data early
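A minimal sketch of fail-fast hex worker-ID parsing in this spirit; the subject layout (a trailing `-<hex>` segment) is an assumption, not taken from the PR:

```rust
/// Parse the trailing hex segment of a subject string into a worker ID,
/// panicking on malformed input per the fail-fast preference noted above.
fn worker_id_from_subject(subject: &str) -> i64 {
    let hex = subject
        .rsplit('-')
        .next()
        .expect("subject must end in a worker-id segment");
    i64::from_str_radix(hex, 16).expect("worker-id segment must be valid hex")
}

fn main() {
    assert_eq!(worker_id_from_subject("kv_metrics.backend-4d2"), 0x4d2);
}
```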
lib/llm/src/kv_router/protocols.rs (3)
4-4: Appropriate import expansion for event protocol needs.

Adding both `Token` and `SequenceHash` imports supports the new event structures that track token prefilling and block lifecycle events.
119-134: Well-designed prefill event protocol.

The `PrefillEvent` structure and `PrefillEventData` enum provide a clean protocol for tracking the prefill-token lifecycle:

- Clear event stages: `NewPrefill`, `UpdatePrefill`, `CompletePrefill`
- Proper request identification with `request_id`
- Token count tracking with `usize` values
- Complete trait implementations for serialization and debugging
136-146: Solid active block event protocol design.

The `ActiveBlockEvent` structure and `ActiveBlockEventData` enum effectively handle the active-block lifecycle:

- Clear block operations: `NewBlock` with sequence hashes, `FreeBlock` for cleanup
- Consistent request-tracking pattern
- Proper use of `Vec<SequenceHash>` for block identification
- Complete trait implementations
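Taken together, the two event protocols reviewed above might look roughly like this sketch; the variant and field names come from this review, while the exact layouts and the `SequenceHash` alias are assumptions:

```rust
use serde::{Deserialize, Serialize};

// Assumed alias; the crate imports SequenceHash from its protocols module.
type SequenceHash = u64;

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum PrefillEventData {
    NewPrefill(usize),    // a new prefill with this many tokens
    UpdatePrefill(usize), // revised token count for an in-flight prefill
    CompletePrefill,      // first output token observed; prefill done
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PrefillEvent {
    pub request_id: String,
    pub data: PrefillEventData,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum ActiveBlockEventData {
    NewBlock(Vec<SequenceHash>), // blocks newly active for this request
    FreeBlock,                   // release this request's active blocks
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ActiveBlockEvent {
    pub request_id: String,
    pub data: ActiveBlockEventData,
}
```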
lib/llm/src/kv_router.rs (5)
22-23: LGTM! Well-organized event subjects.

The new constants `KV_METRICS_SUBJECT` and `PREFILL_SUBJECT` follow the existing naming convention and clearly indicate their purpose for event-driven communication.

Also applies to: 54-55
228-231: Efficient scheduling without sequence tracking.

The removal of `seq_hashes` from the schedule call aligns well with the new event-driven architecture. Deferring the `seq_hashes` computation until after scheduling (lines 233-239) is a good optimization.
250-252: Parameter type improvement.

Changing from `&String` to `&str` is a good practice that makes the API more flexible and avoids unnecessary allocations.
333-334: Good architectural improvement.

Moving the prefill completion notification before yielding the first item ensures that prefill tracking is updated as soon as the first response is received, which aligns well with the event-driven design.
176-177: Double-check Component lifecycle in KvScheduler.

The scheduler now takes a full `Component` clone instead of just the namespace string, increasing its strong-reference count. We've found multiple other long-lived objects also holding `component.clone()`, but no `impl Drop for Component` in this crate. Please verify that:

- `Component` does not internally retain strong references back to any of these routers/schedulers, which could form an `Arc` cycle.
- If you only need the namespace or a subset of the context, consider passing that directly or exposing a `Weak<Component>` instead.

Review all `component.clone()` usages in:

- lib/llm/src/kv_router.rs (KvScheduler::start)
- lib/llm/src/kv_router/scheduler.rs
- lib/llm/src/kv_router/publisher.rs
- lib/llm/src/kv_router/prefill_counter.rs
- lib/llm/src/kv_router/metrics_aggregator.rs

Ensuring there are no circular references will prevent unintended memory leaks or lifecycle issues.
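A minimal, self-contained illustration of the `Weak<Component>` suggestion, using a stand-in `Component` type rather than the crate's real one:

```rust
use std::sync::{Arc, Weak};

struct Component { namespace: String }

struct Scheduler { component: Weak<Component> }

impl Scheduler {
    fn namespace(&self) -> Option<String> {
        // Upgrade only for the duration of the call; no strong cycle is held.
        self.component.upgrade().map(|c| c.namespace.clone())
    }
}

fn main() {
    let component = Arc::new(Component { namespace: "ns".into() });
    let scheduler = Scheduler { component: Arc::downgrade(&component) };
    assert_eq!(scheduler.namespace().as_deref(), Some("ns"));
    drop(component);
    assert!(scheduler.namespace().is_none()); // component freed; no leak
}
```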
lib/llm/src/kv_router/prefill_counter.rs (2)
55-85: Well-designed concurrent state management.

The use of `DashMap` for `tokens_map` with atomic operations for `running_sum` provides excellent concurrency guarantees. The implementation correctly handles both insert and remove operations with proper atomic updates.
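The described pattern might look like the following sketch; the exact types (e.g., `AtomicIsize` for the running sum) are assumptions:

```rust
use std::sync::atomic::{AtomicIsize, Ordering};
use dashmap::DashMap;

#[derive(Default)]
struct PrefillCounterState {
    tokens_map: DashMap<String, usize>, // request_id -> new prefill tokens
    running_sum: AtomicIsize,           // total prefill tokens across requests
}

impl PrefillCounterState {
    fn insert(&self, request_id: String, tokens: usize) {
        // Account for a possible replacement of an existing entry.
        let old = self.tokens_map.insert(request_id, tokens).unwrap_or(0);
        let delta = tokens as isize - old as isize;
        self.running_sum.fetch_add(delta, Ordering::Relaxed);
    }

    fn remove(&self, request_id: &str) {
        if let Some((_, tokens)) = self.tokens_map.remove(request_id) {
            self.running_sum.fetch_sub(tokens as isize, Ordering::Relaxed);
        }
    }
}
```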
252-259: Warning log for existing request reassignment.

Good defensive programming with the warning log when a request is reassigned to a different worker. This helps detect potential bugs in the routing logic.
lib/llm/src/kv_router/publisher.rs (5)
506-514: Good defensive programming for worker ID retrieval.

The implementation properly handles both dynamic (with lease) and static components, with an appropriate warning for the static case.
606-606: Fixed typo in struct name.

Good catch fixing the typo from `KvLoadEndpoingHander` to `KvLoadEndpointHandler`.

Also applies to: 610-610, 618-618
533-603: Well-implemented debounced metrics publishing.

The implementation correctly:
- Debounces metrics updates by 1ms to avoid NATS overload
- Only publishes when specific load metrics change
- Properly handles the watch channel lifecycle
- Uses select! for concurrent timer and change handling
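A sketch of such a debounce loop, using `tokio::sync::watch` and `tokio::select!`; the payload type and function shape are illustrative rather than the PR's actual code:

```rust
use std::time::Duration;
use tokio::sync::watch;

#[derive(Clone, PartialEq)]
struct LoadMetrics {
    kv_active_blocks: u64,
    num_requests_waiting: u64,
}

async fn debounce_and_publish(
    mut rx: watch::Receiver<LoadMetrics>,
    publish: impl Fn(LoadMetrics),
) {
    let mut last_published: Option<LoadMetrics> = None;
    loop {
        // Wait for a change; exit when the sender side is dropped.
        if rx.changed().await.is_err() {
            return;
        }
        // Wait until the metrics have been quiet for 1 ms; each new
        // change restarts the window.
        loop {
            tokio::select! {
                changed = rx.changed() => {
                    if changed.is_err() { return; }
                }
                _ = tokio::time::sleep(Duration::from_millis(1)) => break,
            }
        }
        let current: LoadMetrics = rx.borrow().clone();
        // Publish only if the tracked metrics actually differ.
        if last_published.as_ref() != Some(&current) {
            publish(current.clone());
            last_published = Some(current);
        }
    }
}
```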
146-147: Integer overflow protection in delta calculation.

The delta calculation correctly uses `isize` to handle the case where `new_tokens` might be less than `old_tokens`, preventing potential underflow.
972-1083: Comprehensive integration test.

The test thoroughly validates the debouncing behavior:
- Verifies that only the last metric in a rapid sequence is published
- Confirms that unchanged load metrics don't trigger publications
- Uses appropriate timeouts and assertions
lib/llm/src/kv_router/scheduler.rs (5)
104-105: Good use of DashMap for concurrent state.

Using `DashMap` for `active_blocks` is the right choice, as it allows concurrent reads and writes from multiple tasks without explicit locking.
351-363: Clear and correct worker selection logic.

The new formula for calculating logits based on prefill tokens and decode blocks is well-documented and correctly implemented. The removal of max_logit normalization simplifies the algorithm without affecting correctness.
277-280: Good optimization in softmax normalization.

The comment clarifies that full min-max normalization isn't necessary, since we only need relative values for probability calculation. This is a good optimization.
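Softmax is shift-invariant, which is why dropping the normalization cannot change the resulting selection probabilities; a small self-contained demonstration:

```rust
/// Softmax with an arbitrary constant shift applied to every logit.
fn softmax(logits: &[f64], shift: f64) -> Vec<f64> {
    let exps: Vec<f64> = logits.iter().map(|&l| (l - shift).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    let logits = [1.0, 2.5, 0.3];
    let plain = softmax(&logits, 0.0);
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let shifted = softmax(&logits, max); // max-logit "normalization"
    for (a, b) in plain.iter().zip(&shifted) {
        assert!((a - b).abs() < 1e-12); // identical up to float rounding
    }
}
```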
137-146: Efficient non-blocking instance updates.

The scheduler efficiently checks for instance updates without blocking, ensuring scheduling requests are processed promptly while keeping the instance list current.
110-127: Consider handling subscription errors more gracefully.

The subscription task uses `expect()`, which will panic if subscription fails. Consider more graceful error handling with retry logic.

```diff
-    let mut load_subscriber = ns_clone
-        .subscribe_with_type::<LoadEvent>(KV_METRICS_SUBJECT)
-        .await
-        .expect("Cannot launch load subscriber");
+    let load_subscriber_result = ns_clone
+        .subscribe_with_type::<LoadEvent>(KV_METRICS_SUBJECT)
+        .await;
+
+    let mut load_subscriber = match load_subscriber_result {
+        Ok(subscriber) => subscriber,
+        Err(e) => {
+            tracing::error!("Failed to subscribe to load events: {}", e);
+            return;
+        }
+    };
```

⛔ Skipped due to learnings
I'm OK to merge this as the results show better perf.
tedzhouhk left a comment:
approve to unblock
Where should the reviewer start
Core:

- `sequence.rs`: the local accounting of decode blocks and prefill tokens is now broadcast (to remote Router replicas) via NATS, so that all the Router replicas can stay in sync

"Dead code" for now, but useful for the future:

- `publisher.rs`: push-based load-metrics publishing over NATS on changes of core metrics
- `prefill_counter.rs`: a simplified version of `sequence.rs` (slot simulator) that only tracks prefill tokens, in case we wish to revert tracking active blocks from load events

Minor:

- `scheduler.rs`: simplified the background scheduling loop to remove the two nested loops and to prepare for request queueing, migration, and throttling. Also moved the state updates into the scheduling loop instead of separate methods.

All other changes are very trivial, mainly involving propagation of the `replica_sync` flag (defaulting to `false`).

Description
Slot tracking is synced between Router replicas
It is not desirable or easy to poll active prefill tokens and decode blocks from the engine: they are very latency-sensitive metrics, and extracting this info from the backend engine is fairly intrusive (and annoying). However, it is fairly easy to track this locally within the Router based on the request-response cycle. For example, we know how many new prefill tokens are required based on the ISL and the `overlap_blocks` from the `KvIndexer`, and we can listen for the first output token as a signal that the prefill process for a request is completed. The hard part, then, is to sync this state between Router replicas.

The proposal is to have the Router send out an event on every local update for other Router replicas to listen to. The overhead should be fairly small, as there are only two events per request, and the event body is very small (containing only the request ID and the number of new tokens). The actual data structure mapping `request_id` to `new_tokens` is a dashmap, allowing concurrent local direct updates and event-based updates informed by remote Routers.
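As a toy illustration of this accounting (the helper and the 64-token block size are hypothetical):

```rust
/// New tokens that must be prefilled, given the input sequence length (ISL)
/// and how many prefix blocks already overlap with the worker's KV cache.
fn new_prefill_tokens(isl: usize, overlap_blocks: usize, block_size: usize) -> usize {
    isl.saturating_sub(overlap_blocks * block_size)
}

fn main() {
    // ISL of 1000 tokens, 12 cached blocks of 64 tokens => 232 tokens to prefill.
    assert_eq!(new_prefill_tokens(1000, 12, 64), 232);
}
```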
Load metrics changed to push-based
Previously, load metrics were polled from the backend engines at a 100 ms interval, which was too slow. More recently, this was changed to being predicted/accounted for directly on the Router end, but that made the Router unable to replicate, as the active-block states were not synced naturally by events.

Now, the proposal is to change load-metrics publishing to a push-based pattern. During `WorkerMetricsPublisher.create_endpoint()`, a background process starts relaying a `LoadEvent` over NATS whenever the key load metrics (`kv_active_blocks` and `num_requests_waiting`) change, debounced so an event is only sent once the metrics have been stable for 1 ms. This is a balance between having low latency and not overloading the NATS server (e.g., sending an event on every engine forward pass).

Benchmarks
Summary by CodeRabbit

- New Features
- Refactor
- Bug Fixes
- Tests