-
Notifications
You must be signed in to change notification settings - Fork 725
fix: use channel to avoid race condition from async nats registration task #4758
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughRefactors NATS service registration from fire-and-forget to an awaitable signaling pattern: Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes
Poem
Pre-merge checks✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
lib/runtime/src/component.rs(1 hunks)lib/runtime/src/distributed.rs(4 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
📚 Learning: 2025-06-13T22:07:24.843Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Applied to files:
lib/runtime/src/component.rslib/runtime/src/distributed.rs
📚 Learning: 2025-07-16T12:41:12.543Z
Learnt from: grahamking
Repo: ai-dynamo/dynamo PR: 1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Applied to files:
lib/runtime/src/component.rs
📚 Learning: 2025-07-14T21:25:56.930Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Applied to files:
lib/runtime/src/distributed.rs
📚 Learning: 2025-05-29T06:20:12.901Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Applied to files:
lib/runtime/src/distributed.rs
🧬 Code graph analysis (1)
lib/runtime/src/component.rs (3)
lib/runtime/src/system_status_server.rs (1)
drt(89-91)lib/runtime/src/traits.rs (2)
drt(14-14)drt(27-29)lib/llm/src/local_model.rs (1)
service_name(366-368)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: operator (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: Build and Test - dynamo
- GitHub Check: clippy (.)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: tests (.)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: clippy (launch/dynamo-run)
🔇 Additional comments (3)
lib/runtime/src/distributed.rs (3)
420-425: LGTM - Channel-based signaling is appropriate for this use case.The API change to return a
Receiver<Result<(), String>>and using an mpsc channel with capacity 1 as a oneshot-style signal is a clean approach to synchronize registration completion.
471-488: Confirm both branches signal completion correctly.The code correctly sends
Ok(())after the if-else block (line 488), which covers both:
- The normal case where the service is inserted.
- The race case where another task inserted the service first (lines 479-485).
This ensures the caller unblocks regardless of which path is taken.
443-444: All exit paths signal the channel — no risk of receiver hanging.Each return path in the spawned task sends a result before exiting:
- Line 444: early return when service exists →
Ok(())- Lines 450-452: no NATS client →
Err(...)- Line 466: build failure →
Err(...)- Line 488: successful registration or race →
Ok(())This guarantees the caller's
blocking_recv()will always unblock.Also applies to: 450-452, 466-466, 487-488
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/runtime/src/component.rs (1)
276-312: Race condition fix correctly implemented withblock_in_place().The channel-based synchronization pattern properly addresses the race condition by ensuring NATS service registration completes before
serve_endpoint()attempts a lookup. The use ofblock_in_place()correctly handles blocking operations from async contexts.Two notes on the implementation:
The production runtime is configured as multi-threaded (via
tokio::runtime::Builder::new_multi_thread()inconfig.rs), which is required forblock_in_place()to work safely. It will panic on a single-threaded runtime.There is no timeout on
blocking_recv(). If service registration hangs, this will block indefinitely. Consider adding a timeout with a clear error message for production debugging, as the codebase uses configurable timeouts for similar blocking operations elsewhere (health checks, transport connections, etc.).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/runtime/src/component.rs(1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
📚 Learning: 2025-07-16T12:41:12.543Z
Learnt from: grahamking
Repo: ai-dynamo/dynamo PR: 1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Applied to files:
lib/runtime/src/component.rs
📚 Learning: 2025-07-14T21:25:56.930Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Applied to files:
lib/runtime/src/component.rs
📚 Learning: 2025-06-13T22:07:24.843Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Applied to files:
lib/runtime/src/component.rs
📚 Learning: 2025-05-29T06:20:12.901Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Applied to files:
lib/runtime/src/component.rs
📚 Learning: 2025-09-18T21:47:44.143Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
Applied to files:
lib/runtime/src/component.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: Validate PR title and add label
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: clippy (.)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: tests (.)
- GitHub Check: Build and Test - dynamo
Overview:
This PR fixes race condition with nats stats service initializations
fixes: #4753
closes: DYN-1541
Details
Write pathhappens in a background job in asynchronously (fire & fortget task) andRead pathduring endpoint registration may/may-not find the service entry due to async / non-blocking nature of write/read path.in PR ( #4513) write path for add_stats_service was moved from
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.