Skip to content

Conversation

@PeaBrane
Copy link
Contributor

@PeaBrane PeaBrane commented Aug 31, 2025

Overview:

  • We cannot delete the KV events jetstream because it would remove the durable consumers tied to previously launched Router replicas. This fixes a tight loop where the invalid Router replica (with no durable consumer) keeps trying to dequeue a KV event but never gets one.
  • In addition, change the default behavior to not reset states, to prevent user errors of accidentally bring existing Router replicas into bad states with a new replica launch.
  • Put the Router event handling and state syncing background processes back into tokio::spawn instead of the secondary runtime to prevent CI flakiness. These tasks are also fairly critical.
  • Cleaned up the cli args and the routing docs.

The only substantive change to review is nats.rs

Summary by CodeRabbit

  • New Features

    • Reset now purges existing NATS streams instead of deleting/recreating when enabled, with outcome logging.
    • Added a 1-hour inactivity threshold for pull consumers to enable automatic cleanup.
  • Chores

    • Reduced log noise by downgrading stream-creation conflict messages to debug and improving error formatting.
    • Added an internal TODO regarding durable consumer cleanup during ungraceful shutdown.

Signed-off-by: PeaBrane <yanrpei@gmail.com>
@PeaBrane PeaBrane requested a review from a team as a code owner August 31, 2025 10:53
@github-actions github-actions bot added the fix label Aug 31, 2025
@PeaBrane PeaBrane self-assigned this Aug 31, 2025
Signed-off-by: PeaBrane <yanrpei@gmail.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 31, 2025

Walkthrough

  • Added a TODO comment about durable consumer cleanup on ungraceful shutdown in the KV router subscriber.
  • Modified NATS transport reset flow: removed stream deletion, added purge-on-reset after failed create, adjusted log levels/messages, and set a 1-hour inactive threshold on pull consumers.

Changes

Cohort / File(s) Summary of Changes
KV Router Subscriber comments
lib/llm/src/kv_router/subscriber.rs
Inserted a TODO in the cancellation branch of start_kv_router_background regarding inability to guarantee durable consumer cleanup on ungraceful shutdown. No functional changes.
NATS transport reset/purge and consumer config
lib/runtime/src/transports/nats.rs
Removed pre-deletion of stream on reset; on create failure with reset_stream=true, fetch existing stream and purge it; updated log levels/messages; added pull consumer inactive_threshold of 3600s; purge success logs count, failures warn.

Sequence Diagram(s)

sequenceDiagram
  participant App as Runtime/NATS Transport
  participant NATS as NATS JetStream

  Note over App: Reset/Connect sequence

  App->>NATS: create_stream()
  alt create succeeds
    NATS-->>App: OK
  else create fails
    App-->>App: check reset_stream == true
    alt reset_stream == true
      App->>NATS: get_stream()
      NATS-->>App: Stream handle or error
      opt on handle
        App->>NATS: purge_stream()
        NATS-->>App: PurgeResult(count) or Error
        Note over App,NATS: Log purge outcome (debug on success, warn on failure)
      end
    else reset_stream == false
      Note over App: Log at debug and continue
    end
  end

  rect rgb(245,245,255)
  Note over App,NATS: Consumer setup (unchanged flow with tweak)
  App->>NATS: create/configure pull consumer<br/>(inactive_threshold = 3600s)
  NATS-->>App: OK or Error
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

In burrows of bytes I twitch my nose,
Streams don’t delete—now purge, it goes.
A quiet hour for consumers to nap,
Debug whispers where warnings would yap.
Thump-thump logs mark the trail I tread—
Carrots cached, stale messages shed. 🥕✨

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
lib/llm/src/kv_router/subscriber.rs (1)

181-188: Clarify cleanup guarantees on crash vs. graceful shutdown

Good note. Since the consumer now has an inactive_threshold (1h) on the server side, consider referencing that here to make the cleanup story explicit (graceful: delete consumer; ungraceful: rely on inactivity_threshold and/or etcd-driven cleanup). No functional change requested.

Would you confirm that 3600s is acceptable for worst-case replica downtime before auto-cleanup kicks in?

lib/runtime/src/transports/nats.rs (3)

544-547: Reword debug to avoid over-asserting root cause

The create_stream failure isn’t always “already exists.” Minor wording tweak keeps logs accurate while preserving intent.

Apply:

-                    log::debug!(
-                        "Failed to create NATS stream '{}': {e}. Stream likely already exists, continuing...",
-                        self.stream_name
-                    );
+                    log::debug!(
+                        "Failed to create NATS stream '{}': {e}. Assuming it may already exist; continuing (if reset_stream=true we'll attempt a purge).",
+                        self.stream_name
+                    );

549-572: Comment nit: say “existing stream,” not “newly created”

We purge only when create_stream fails (i.e., an existing stream is present).

Apply:

-                    // If reset_stream is true, purge all messages from the newly created stream
+                    // If reset_stream is true, purge all messages from the existing stream (create_stream failed)

580-580: Review inactivity threshold SLA and guard destructive stream deletions

  • Confirm 1 h inactive_threshold is acceptable in production; consider making it configurable via NATS_CONSUMER_INACTIVE_SECS.
  • Ensure delete_stream calls at lib/runtime/src/transports/nats.rs:1124 and 1307 (which drop entire streams) are limited to tests/dev or gated by configuration to avoid accidental data loss.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 488c870 and 181fdd1.

📒 Files selected for processing (2)
  • lib/llm/src/kv_router/subscriber.rs (1 hunks)
  • lib/runtime/src/transports/nats.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.303Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
📚 Learning: 2025-08-27T17:56:14.648Z
Learnt from: kthui
PR: ai-dynamo/dynamo#2500
File: lib/llm/src/migration.rs:58-77
Timestamp: 2025-08-27T17:56:14.648Z
Learning: In lib/llm/src/migration.rs, the cancellation visibility in the Migration operator is intentionally one-way - it checks engine_ctx.is_stopped()/is_killed() to stop pulling from streams but doesn't link newly created streams as child contexts to the parent. This is a conscious architectural decision with plans for future enhancement.

Applied to files:

  • lib/llm/src/kv_router/subscriber.rs
📚 Learning: 2025-05-29T00:02:35.018Z
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

Applied to files:

  • lib/llm/src/kv_router/subscriber.rs
📚 Learning: 2025-06-02T19:37:27.666Z
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.

Applied to files:

  • lib/llm/src/kv_router/subscriber.rs
📚 Learning: 2025-06-13T22:07:24.843Z
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.

Applied to files:

  • lib/runtime/src/transports/nats.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)

Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
This reverts commit 524ff4c.

Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com>
@PeaBrane PeaBrane enabled auto-merge (squash) September 1, 2025 20:53
@PeaBrane PeaBrane merged commit 7fabe7b into main Sep 1, 2025
16 of 17 checks passed
@PeaBrane PeaBrane deleted the rupei/do-not-delete-stream branch September 1, 2025 21:25
KrishnanPrash pushed a commit that referenced this pull request Sep 2, 2025
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
dillon-cullinan pushed a commit that referenced this pull request Sep 5, 2025
Signed-off-by: PeaBrane <yanrpei@gmail.com>
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants