feat(request cancellation): add raii guard for monitoring the connection (http server , openai.rs) #2152
Conversation
👋 Hi michaelfeil! Thank you for contributing to ai-dynamo/dynamo.
@ryanolson I saw you are potentially working on a similar item: https://basetenlabs.slack.com/archives/C04BUDD86FR/p1753721286150779?thread_ts=1753289366.278079&cid=C04BUDD86FR for additional context.
Walkthrough

A new RAII guard struct, `CtxDropGuard`, is introduced in `lib/llm/src/http/service/openai.rs`. It holds the request's `AsyncEngineContext` and calls `stop_generating()` on drop unless defused, so that client disconnects cancel in-flight requests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HTTP Handler
    participant CtxDropGuard
    participant AsyncEngineContext

    Client->>HTTP Handler: Send request
    HTTP Handler->>CtxDropGuard: Create guard with context
    CtxDropGuard->>AsyncEngineContext: Hold reference
    alt Streaming response
        loop Each streamed event
            HTTP Handler->>CtxDropGuard: mute()
        end
    else Non-streaming
        HTTP Handler->>CtxDropGuard: defuse() after success
    end
    alt Client disconnects prematurely
        CtxDropGuard->>AsyncEngineContext: stop_generating() on drop
    else Request completes
        CtxDropGuard->>CtxDropGuard: defused, no stop
    end
```
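For orientation, here is a minimal sketch of what such a guard could look like, reconstructed from the method and field names that appear later in this review (`defuse`, `mute`, `issue_stop_generating`, `verbose`, `stop_generating`). It is an illustration under those assumptions, not the exact code in this PR:

```rust
use std::sync::Arc;

// Stub of the context trait for illustration only; the real trait lives in
// lib/runtime/src/engine.rs and has a richer interface.
trait AsyncEngineContext: Send + Sync {
    fn stop_generating(&self);
}

struct CtxDropGuard {
    context: Arc<dyn AsyncEngineContext>,
    issue_stop_generating: bool,
    verbose: bool,
}

impl CtxDropGuard {
    fn new(context: Arc<dyn AsyncEngineContext>) -> Self {
        Self {
            context,
            issue_stop_generating: true,
            verbose: true,
        }
    }

    // Request succeeded; do not stop generation on drop.
    fn defuse(&mut self) {
        self.issue_stop_generating = false;
        self.verbose = false;
    }

    // Suppress verbose logging when the guard moves into a streaming task.
    fn mute(&mut self) {
        self.verbose = false;
    }
}

impl Drop for CtxDropGuard {
    fn drop(&mut self) {
        if self.issue_stop_generating {
            if self.verbose {
                eprintln!("request dropped before completion; stopping generation");
            }
            // Client likely disconnected: propagate cancellation to the engine.
            self.context.stop_generating();
        }
    }
}
```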
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 0
🔭 Outside diff range comments (1)
lib/llm/src/http/service/openai.rs (1)
672-782: Add RAII `CtxDropGuard` to the `responses` handler

The `responses` function in `lib/llm/src/http/service/openai.rs` (around line 673) currently omits the `CtxDropGuard` pattern. For consistency with the other endpoints (see lines 283, 385, 494) and to ensure client disconnects are handled correctly, wrap the request context in a guard and defuse it before returning.

• File: `lib/llm/src/http/service/openai.rs`
• Handler start: line 673 (`async fn responses`)
• Existing guards in this module at lines 283, 385, 494

Suggested changes:

```diff
--- a/lib/llm/src/http/service/openai.rs
+++ b/lib/llm/src/http/service/openai.rs
@@ async fn responses(
     let (request, context) = request.into_parts();
+    let mut drop_guard = CtxDropGuard::new(context.clone());
     let mut request: NvCreateChatCompletionRequest = request.try_into().map_err(|e| {
         …
     })?;
     …
     inflight_guard.mark_ok();
+    drop_guard.defuse();
     Ok(Json(response).into_response())
 }
```

Please add these lines to ensure the guard is active during processing and defused on success.
🧹 Nitpick comments (1)
lib/llm/src/http/service/openai.rs (1)
158-188: Fix misleading method comments.

The comments for the `defuse()` and `mute()` methods are misleading:

- Line 178: `defuse()` doesn't take ownership; it takes a mutable reference
- Line 184: `mute()` is not a no-op; it modifies the `verbose` flag

Apply this diff to fix the comments:

```diff
-    // request succeeded, no need to stop generating
-    // takes ownership
+    // Request succeeded, no need to stop generating
     fn defuse(&mut self) {
         self.issue_stop_generating = false;
         self.verbose = false;
     }

-    // no-op, moves the guard to a thread.
+    // Suppress verbose logging when the guard is moved to a streaming thread
     fn mute(&mut self) {
         self.verbose = false;
     }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`lib/llm/src/http/service/openai.rs` (10 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
lib/llm/src/http/service/openai.rs (5)
Learnt from: ryanolson
PR: #1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from Send + Sync + Debug to Send + Debug because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: #1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Learnt from: ryanolson
PR: #1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: kthui
PR: #1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: t-ob
PR: #1290
File: launch/dynamo-run/src/subprocess/sglang_inc.py:80-110
Timestamp: 2025-06-03T10:17:51.711Z
Learning: The sglang async_encode method does not support streaming options, so collecting all embeddings before yielding is the correct approach for embedding requests.
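The first learning above describes a general Rust/Tokio pattern worth illustrating: once every sender of a channel is dropped, the receiver gets `None`, which a background task can use as its termination signal. A minimal self-contained sketch of that pattern (not the actual `block_manager` test code):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (batch_tx, mut rx) = mpsc::channel::<u32>(8);

    let worker = tokio::spawn(async move {
        // recv() returns None once every sender has been dropped.
        while let Some(item) = rx.recv().await {
            println!("processing batch item {item}");
        }
        println!("sender dropped; cleaning up and terminating");
    });

    batch_tx.send(1).await.unwrap();
    drop(batch_tx); // dropping the sender signals termination to the worker
    worker.await.unwrap();
}
```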
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (5)
lib/llm/src/http/service/openai.rs (5)
21-21: LGTM!

The import of `AsyncEngineContext` is correctly added to support the new RAII guard functionality.
190-204: Well-implemented RAII cleanup logic.

The Drop implementation correctly handles lifecycle management with appropriate logging levels. This ensures `stop_generating()` is called when requests are dropped due to client disconnections, addressing the PR's core objective.
283-283: Correct RAII guard usage in completions handler.

The guard is properly instantiated and controlled:
- Created early to catch any premature drops
- Muted in streaming scenarios to avoid log spam
- Defused after successful non-streaming responses
This pattern correctly addresses the PR objectives for both streaming and non-streaming payloads.
Also applies to: 320-320, 346-346
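As a hedged sketch of the handler pattern being described here (placeholder types and function names, not the exact signatures in openai.rs), reusing the `CtxDropGuard` sketch above:

```rust
use std::sync::Arc;

// `AsyncEngineContext` and `CtxDropGuard` as sketched earlier; the engine
// call and event forwarding are placeholders for the real handler body.
async fn completions_sketch(ctx: Arc<dyn AsyncEngineContext>, stream: bool) {
    // Create the guard first so any early return or client disconnect
    // triggers stop_generating() via Drop.
    let mut drop_guard = CtxDropGuard::new(ctx.clone());

    if stream {
        // The guard moves into the task forwarding events to the client;
        // mute it so routine end-of-stream drops don't spam the logs.
        drop_guard.mute();
        tokio::spawn(async move {
            // ... forward streamed events; if the client disconnects, this
            // task is dropped and the guard calls stop_generating().
            let _guard = drop_guard;
        });
    } else {
        // ... await the full (non-streaming) response ...
        // Success: disarm the guard so Drop does not cancel anything.
        drop_guard.defuse();
    }
}
```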
385-385: Appropriate guard usage for non-streaming embeddings.

The RAII guard correctly handles the embeddings endpoint, which only supports non-streaming responses. The guard is defused after successful completion, ensuring `stop_generating()` is only called on actual disconnections.

Also applies to: 407-407
494-494: Consistent RAII guard pattern in chat completions.

The guard usage follows the same correct pattern as the completions handler, properly handling both streaming and non-streaming scenarios. This ensures consistent behavior across all completion endpoints.
Also applies to: 551-551, 580-580
Hi @michaelfeil,

Can you review the latest changes from this PR and see if they address your problem? #2014

dynamo/lib/llm/src/http/service/disconnect.rs, lines 9 to 14 in d5a9897
@kthui @ryanolson to take a look
@rmccorm4 Can you review the latest changes from this PR and see if they address your problem? #2014
Follow up: can …
Okay, I think overall the implementation that was merged 2 weeks ago is more solid.
Concerns: Is there an issue to mark the request as cancelled after it has reached the worker? Can cancellation be logged somehow? (/metrics would be nice but hard to manage; still, dropped-connection counts could be useful for discovering network stack failures.)
Any thoughts, @rmccorm4?
Hi @michaelfeil - @kthui is going to help take a look.
Hi @michaelfeil, thanks for contributing to Dynamo! I looked over the current state of cancellation and found the following:
This implementation definitely appears more complicated, but I think @ryanolson can provide some insights into the design of http disconnect. We merged http disconnect and request migration on the same day, and I think it caused a gap, which I found when the http request sets …

With request migration, the stream the http server is consuming is not the same one producing responses on the worker, because we want the http server stream to be uninterrupted if we choose to switch workers while responses are being generated. The gap is that we need to pass the stop signal from the frontend's engine context through to the migration stream.

I experimented with the simple code change below and found this change alone is sufficient at bridging the gap:

```rust
#[async_trait]
impl
    Operator<
        SingleIn<PreprocessedRequest>,
        ManyOut<Annotated<LLMEngineOutput>>,
        SingleIn<PreprocessedRequest>,
        ManyOut<Annotated<LLMEngineOutput>>,
    > for Migration
{
    async fn generate(
        &self,
        request: SingleIn<PreprocessedRequest>,
        next: ServerStreamingEngine<PreprocessedRequest, Annotated<LLMEngineOutput>>,
    ) -> Result<ManyOut<Annotated<LLMEngineOutput>>> {
        let (preprocessed_request, context) = request.transfer(());
        let engine_ctx = context.context();
        let engine_ctx_ = engine_ctx.clone();
        let retry_manager =
            RetryManager::build(preprocessed_request, next, self.migration_limit).await?;
        let response_stream = stream::unfold(retry_manager, move |mut retry_manager| {
            let engine_ctx = engine_ctx_.clone();
            async move {
                if engine_ctx.is_stopped() || engine_ctx.is_killed() {
                    return None;
                }
                retry_manager
                    .next()
                    .await
                    .map(|response| (response, retry_manager))
            }
        });
        Ok(ResponseStream::new(Box::pin(response_stream), engine_ctx))
    }
}
```
Would you be able to test the above change on top of the main branch to see if that gives you the expected cancellation behavior? The steps I followed when testing:

```
DYN_LOG=debug python -m dynamo.frontend
DYN_LOG=debug python3 -m dynamo.vllm --model ...

curl localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "...",
  "messages": [{
    "role": "user",
    "content": "Tell me a long long long story about yourself"
  }],
  "stream": true/false,
  "max_tokens": 16000
}'
```

I cancelled the curl request, and the vLLM worker indicated a successful cancellation.
@kthui Thanks, I think the request migration is overall useful. We needed to implement that too, but in Python, by retrying failed requests and forcing a new routing decision. The gap between OSS dynamo and my version of dynamo is quite large, not including the RetryManager / request migration. I think the change in disconnect.rs is sufficient.
For the most part, we are trying to get the cancellation directly into the Python side, so we can effectively terminate the request at the GPU level. #2158
@michaelfeil It is great to hear request migration is overall useful! I wonder if you would be able to share your use-case that requires the migration to be done in Python and couldn't use the Rust frontend implementation? |
@michaelfeil Would you be able to update this PR such that it bridges the gap between the streams? Then, we can merge this PR for the cancellation fix.
Use-case: #2152 (comment)
I am closing this, as I think the frontend-based detection is sufficient.
Will bring in the non-streaming patch with PR #2350.
Thanks, I missed actually closing the PR!
Overview:
The current `monitor for disconnects` is not suitable for:

- payloads that have `stream=False`
- cancellation that needs to happen BEFORE the first token is streamed back / the stream is established. Currently, it waits for the first token to be streamed / acceptance of the request.
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit

New Features

- Automatic cancellation of in-flight requests when the client disconnects, for both streaming and non-streaming endpoints.