Skip to content

Conversation

@biswapanda
Copy link
Contributor

@biswapanda biswapanda commented Dec 4, 2025

Overview:

This PR fixes race condition with nats stats service initializations

fixes: #4753
closes: DYN-1541

Details

  • Root cause : Write path happens in a background job in asynchronously (fire & fortget task) and Read path during endpoint registration may/may-not find the service entry due to async / non-blocking nature of write/read path.

in PR ( #4513) write path for add_stats_service was moved from

component.add_stats_service().await?
component.drt().runtime().secondary().spawn(async move {
                if let Err(err) = c.add_stats_service().await {

Summary by CodeRabbit

  • Bug Fixes
    • Improved NATS service registration handling to ensure proper initialization before services are accessed, enhancing reliability in distributed deployments.

✏️ Tip: You can customize this high-level summary in your review settings.

Summary by CodeRabbit

  • Bug Fixes
    • Service registration now provides a completion signal and can be awaited, returning success or a clear error when registration fails.
    • Improves reliability during startup by ensuring services are registered before being used and surfaces registration errors for easier troubleshooting.

✏️ Tip: You can customize this high-level summary in your review settings.

@biswapanda biswapanda requested a review from a team as a code owner December 4, 2025 22:42
@github-actions github-actions bot added the fix label Dec 4, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 4, 2025

Walkthrough

Refactors NATS service registration from fire-and-forget to an awaitable signaling pattern: distributed.rs makes register_nats_service() return an mpsc receiver that reports success or error; component.rs blocks and waits on that receiver after building the service to avoid a lookup race.

Changes

Cohort / File(s) Summary
NATS registration API & signaling
lib/runtime/src/distributed.rs
register_nats_service() signature changed to return tokio::sync::mpsc::Receiver<Result<(), String>>. Spawns a background task that sends Ok(()) or Err(...) on all critical paths (existing service, missing NATS client, build failure, post-registration).
Synchronous wait on registration
lib/runtime/src/component.rs
After building the NATS service, code now registers the service and blocks (via block_in_place + blocking_recv) on the returned receiver, handling Some(Ok(())), Some(Err(e)), and None outcomes to surface errors or prevent races.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Review API signature change and ensure all callers are updated to handle the returned receiver.
  • Verify blocking wait usage (block_in_place / blocking_recv) does not introduce deadlocks or violate runtime/threading expectations.
  • Confirm sender lifetimes and all code paths in the background task send a signal.
  • Validate error propagation and logging paths in component.rs for the three receiver outcomes.

Poem

🐰 I built a bridge of tiny hops,
A channel hums, the waiting stops.
No racing feet, no frantic pace—
We tune the beat, we find our place. 🥕✨

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the main change: using a channel mechanism to fix a race condition in async NATS registration, which directly matches the code changes shown in the summary.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The PR description includes all required sections: Overview with issue references, Details explaining the root cause and the fix approach, and mentions specific files modified.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5942ef8 and 1d1f340.

📒 Files selected for processing (2)
  • lib/runtime/src/component.rs (1 hunks)
  • lib/runtime/src/distributed.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
📚 Learning: 2025-06-13T22:07:24.843Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Applied to files:

  • lib/runtime/src/component.rs
  • lib/runtime/src/distributed.rs
📚 Learning: 2025-07-16T12:41:12.543Z
Learnt from: grahamking
Repo: ai-dynamo/dynamo PR: 1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Applied to files:

  • lib/runtime/src/component.rs
📚 Learning: 2025-07-14T21:25:56.930Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Applied to files:

  • lib/runtime/src/distributed.rs
📚 Learning: 2025-05-29T06:20:12.901Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Applied to files:

  • lib/runtime/src/distributed.rs
🧬 Code graph analysis (1)
lib/runtime/src/component.rs (3)
lib/runtime/src/system_status_server.rs (1)
  • drt (89-91)
lib/runtime/src/traits.rs (2)
  • drt (14-14)
  • drt (27-29)
lib/llm/src/local_model.rs (1)
  • service_name (366-368)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (.)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (launch/dynamo-run)
🔇 Additional comments (3)
lib/runtime/src/distributed.rs (3)

420-425: LGTM - Channel-based signaling is appropriate for this use case.

The API change to return a Receiver<Result<(), String>> and using an mpsc channel with capacity 1 as a oneshot-style signal is a clean approach to synchronize registration completion.


471-488: Confirm both branches signal completion correctly.

The code correctly sends Ok(()) after the if-else block (line 488), which covers both:

  1. The normal case where the service is inserted.
  2. The race case where another task inserted the service first (lines 479-485).

This ensures the caller unblocks regardless of which path is taken.


443-444: All exit paths signal the channel — no risk of receiver hanging.

Each return path in the spawned task sends a result before exiting:

  • Line 444: early return when service exists → Ok(())
  • Lines 450-452: no NATS client → Err(...)
  • Line 466: build failure → Err(...)
  • Line 488: successful registration or race → Ok(())

This guarantees the caller's blocking_recv() will always unblock.

Also applies to: 450-452, 466-466, 487-488

@biswapanda biswapanda self-assigned this Dec 4, 2025
@biswapanda biswapanda enabled auto-merge (squash) December 5, 2025 00:01
@keivenchang keivenchang self-requested a review December 5, 2025 00:04
@biswapanda
Copy link
Contributor Author

@CodeRabbit review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 5, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/runtime/src/component.rs (1)

276-312: Race condition fix correctly implemented with block_in_place().

The channel-based synchronization pattern properly addresses the race condition by ensuring NATS service registration completes before serve_endpoint() attempts a lookup. The use of block_in_place() correctly handles blocking operations from async contexts.

Two notes on the implementation:

  1. The production runtime is configured as multi-threaded (via tokio::runtime::Builder::new_multi_thread() in config.rs), which is required for block_in_place() to work safely. It will panic on a single-threaded runtime.

  2. There is no timeout on blocking_recv(). If service registration hangs, this will block indefinitely. Consider adding a timeout with a clear error message for production debugging, as the codebase uses configurable timeouts for similar blocking operations elsewhere (health checks, transport connections, etc.).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1d1f340 and 67aa299.

📒 Files selected for processing (1)
  • lib/runtime/src/component.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
📚 Learning: 2025-07-16T12:41:12.543Z
Learnt from: grahamking
Repo: ai-dynamo/dynamo PR: 1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Applied to files:

  • lib/runtime/src/component.rs
📚 Learning: 2025-07-14T21:25:56.930Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Applied to files:

  • lib/runtime/src/component.rs
📚 Learning: 2025-06-13T22:07:24.843Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Applied to files:

  • lib/runtime/src/component.rs
📚 Learning: 2025-05-29T06:20:12.901Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Applied to files:

  • lib/runtime/src/component.rs
📚 Learning: 2025-09-18T21:47:44.143Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.

Applied to files:

  • lib/runtime/src/component.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Validate PR title and add label
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (.)
  • GitHub Check: Build and Test - dynamo

@biswapanda biswapanda merged commit 501ef02 into main Dec 5, 2025
44 of 47 checks passed
@biswapanda biswapanda deleted the bis/fix-nats-race-cond branch December 5, 2025 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: NATS request-plane race condition on registration maybe

3 participants