-
Notifications
You must be signed in to change notification settings - Fork 679
chore: Add Request Migration docs and minor enhancements #2038
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis update introduces a new Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant HTTP_Server
participant Worker_Engine_A
participant Worker_Engine_B
Client->>HTTP_Server: Send request
HTTP_Server->>Worker_Engine_A: Forward request
alt Connectivity failure on Worker_Engine_A
HTTP_Server->>Worker_Engine_B: Migrate request (if migration-limit not reached)
Worker_Engine_B->>HTTP_Server: Continue processing (with updated max_tokens)
end
HTTP_Server->>Client: Return response
Estimated code review effort3 (~45 minutes) Possibly related PRs
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
docs/guides/dynamo_run.md(1 hunks)launch/dynamo-run/src/main.rs(1 hunks)lib/llm/src/migration.rs(12 hunks)
🧠 Learnings (2)
docs/guides/dynamo_run.md (1)
Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The @dynamo_worker() decorator in the dynamo codebase returns a wrapper that automatically injects the runtime parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature async def get_metrics(runtime, log_dir) decorated with @dynamo_worker() can be called as get_metrics(log_dir) because the decorator wrapper injects the runtime parameter.
lib/llm/src/migration.rs (4)
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::error::RequestErrorKind::NoResponders. Use err.downcast_ref::<async_nats::error::RequestError>() and then check req_err.kind() against RequestErrorKind::NoResponders to handle this error properly.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::client::RequestErrorKind::NoResponders, not async_nats::Error::NoResponders. Use err.downcast_ref::<async_nats::client::RequestError>() and then check request_err.kind() against RequestErrorKind::NoResponders.
Learnt from: ishandhanani
PR: #1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the sampling_options call in the preprocess_request method is placed in the common section after the match statement on request.prompt_input_type(), meaning it applies to both PromptInput::Tokens and PromptInput::Text request types.
🧰 Additional context used
🧠 Learnings (2)
docs/guides/dynamo_run.md (1)
Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The @dynamo_worker() decorator in the dynamo codebase returns a wrapper that automatically injects the runtime parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature async def get_metrics(runtime, log_dir) decorated with @dynamo_worker() can be called as get_metrics(log_dir) because the decorator wrapper injects the runtime parameter.
lib/llm/src/migration.rs (4)
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::error::RequestErrorKind::NoResponders. Use err.downcast_ref::<async_nats::error::RequestError>() and then check req_err.kind() against RequestErrorKind::NoResponders to handle this error properly.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::client::RequestErrorKind::NoResponders, not async_nats::Error::NoResponders. Use err.downcast_ref::<async_nats::client::RequestError>() and then check request_err.kind() against RequestErrorKind::NoResponders.
Learnt from: ishandhanani
PR: #1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the sampling_options call in the preprocess_request method is placed in the common section after the match statement on request.prompt_input_type(), meaning it applies to both PromptInput::Tokens and PromptInput::Text request types.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (8)
launch/dynamo-run/src/main.rs (1)
27-27: LGTM! The migration-limit flag follows the established CLI pattern.The addition of
--migration-limit=0to the usage string is consistent with other optional flags and properly documents the new fault tolerance feature.docs/guides/dynamo_run.md (1)
212-227: Well-documented feature explanation!The new "Fault Tolerance and Request Migration" section clearly explains the migration behavior, the
--migration-limitflag usage, and provides a practical example. The documentation effectively communicates how the feature enhances reliability in distributed deployments.lib/llm/src/migration.rs (6)
116-118: Appropriate logging level upgrade for migration events.Elevating these log messages from info to warning level is correct - stream disconnections and recreation failures are operational concerns that warrant higher visibility.
141-141: Consistent logging level for retry events.Good change - retry attempts due to NoResponders errors are warning-level events that operators need to monitor.
168-171: Correct implementation of token limit tracking during migration.The logic properly tracks and updates the remaining token budget by subtracting already-generated tokens. Using
saturating_subis a good defensive programming practice to handle edge cases.
188-195: Test helper properly updated to support max_tokens testing.The parameterization of
max_tokensin the test helper enables comprehensive testing of the token tracking feature.
274-285: Excellent test coverage for the token tracking feature.The assertion comprehensively validates that
max_tokensis correctly adjusted based on already-generated responses, ensuring the migration logic respects token limits.
473-473: Test updates are consistent with the new max_tokens parameter.All test cases have been properly updated to specify appropriate
max_tokensvalues, maintaining test coverage while adapting to the new API.Also applies to: 504-504, 535-535, 571-571, 591-591, 638-638
84e88d8 to
88f9726
Compare
rmccorm4
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment on out=vllm being removed in dynamo-run (I believe), but LGTM otherwise
Overview:
Complete the documentations for the Request Migration feature, and update
max_tokenson migrated requests accounting for responses already generated.Details:
--migration-limitflag todynamo-run --help.max_tokensfor responses already generated during migration.--migration-limitflag to vllm, trtllm, sglang and llama_cpp backends and their READMEs.Where should the reviewer start?
Start with the documentation changes first, and then look at the
max_tokenssubtraction feature and then the updated unit test.Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
N/A
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Style