Skip to content

Conversation

@grahamking
Copy link
Contributor

@grahamking grahamking commented Dec 1, 2025

Previously the keep-alive thread ran until the process stopped. In the normal case that's fine.

In some testing setups there are multiple FileStore, and their lifetime is not tied to the process, so now the keep-alive thread will exit cleanly.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced file-based key-value store with improved cancellation support for background maintenance operations. Shutdown signals are now properly recognized and handled, enabling clean resource release and graceful termination of storage processes. This significantly improves overall runtime stability and reliability, ensuring safe and predictable shutdown behavior.

✏️ Tip: You can customize this high-level summary in your review settings.

@grahamking grahamking requested a review from a team as a code owner December 1, 2025 18:01
@github-actions github-actions bot added the fix label Dec 1, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 1, 2025

Walkthrough

The changes introduce cancellation token support to the file-based KV store backend. The Manager::file constructor is updated to accept a CancellationToken parameter, which is propagated through FileStore and used by the expiry thread for graceful cancellation. The distributed runtime is updated to pass its primary token when initializing the file backend.

Changes

Cohort / File(s) Summary
KV File Store Backend
lib/runtime/src/storage/kv/file.rs
Added cancel_token field to FileStore struct; updated FileStore::new constructor signature to accept CancellationToken as first parameter; integrated cancellation token into expiry thread logic for periodic cancellation checks; updated test instantiations to provide CancellationToken::new()
KV Manager API
lib/runtime/src/storage/kv.rs
Updated Manager::file constructor signature to accept CancellationToken as first parameter before root path; propagated token to FileStore::new invocation
Runtime KV Integration
lib/runtime/src/distributed.rs
Updated kv::Manager::file invocation to pass runtime.primary_token() as first argument alongside root path

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes

  • Cancellation token lifecycle verification: Ensure the token is properly propagated through the constructor chain and has correct lifetime semantics
  • Expiry thread cancellation logic: Verify the periodic cancellation check implementation is correct and doesn't introduce race conditions or premature exits
  • Test coverage: Confirm all test instantiations of FileStore are updated consistently and cover cancellation scenarios
  • API surface changes: Review the public signature changes to Manager::file and FileStore::new for breaking changes and downstream impact

Poem

🐰 With tokens held tight in paws so small,
The file store dances at cancellation's call,
Each expiry thread bows when whispered to cease,
Graceful shutdowns bring distributed peace! 🌿✨

Pre-merge checks

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Description check ❓ Inconclusive The description explains the motivation and problem being solved, but lacks adherence to the repository template with missing sections like Details and Where should the reviewer start. Expand the description to follow the template: add a Details section explaining the implementation changes, specify which files to review closely, and add Related Issues section.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding cancellation token support to stop the FileStore keep-alive thread, which is the primary focus of the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
lib/runtime/src/storage/kv/file.rs (2)

64-77: Consider checking cancellation before sleep as well.

Currently, the cancellation check happens only after the sleep completes. If cancellation is signaled at the start of a sleep interval, the thread will wait the full keep_alive_interval (1-3.3s) before exiting. While the docstring acknowledges this is acceptable, checking at both ends of the loop could improve responsiveness with minimal code change:

     fn expiry_thread(&self) {
         loop {
             let ttl = self.shortest_ttl();
             let keep_alive_interval = cmp::max(ttl / 3, MIN_KEEP_ALIVE);

+            if self.cancel_token.is_cancelled() {
+                break;
+            }
             thread::sleep(keep_alive_interval);
             if self.cancel_token.is_cancelled() {
                 break;
             }

That said, given the stated design intent and test-focused use case, the current implementation is reasonable.


484-503: Test updated correctly, but orphaned thread may accumulate.

The test creates a FileStore with a fresh CancellationToken that is never cancelled, leaving the expiry thread running after the test completes. For a single test this is fine, but in a larger test suite with many FileStore instances, these orphaned threads could accumulate.

Consider cancelling the token at test end if this becomes problematic:

#[tokio::test]
async fn test_entries_full_path() {
    let t = tempfile::tempdir().unwrap();
    let cancel = CancellationToken::new();

    let m = FileStore::new(cancel.clone(), t.path());
    // ... test logic ...

    cancel.cancel(); // Clean shutdown of expiry thread
}

For now, the current approach is acceptable.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f6dca6 and 86b5bd2.

📒 Files selected for processing (3)
  • lib/runtime/src/distributed.rs (1 hunks)
  • lib/runtime/src/storage/kv.rs (1 hunks)
  • lib/runtime/src/storage/kv/file.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane suggested using tokio::select! arm ordering with the existing biased directive in the indexer to create a natural barrier for dump requests, ensuring KV events are drained before snapshotting. This approach leverages existing architecture (biased select) to solve race conditions with minimal code changes, which aligns with their preference for contained solutions.

Applied to files:

  • lib/runtime/src/storage/kv/file.rs
  • lib/runtime/src/distributed.rs
📚 Learning: 2025-08-15T23:51:04.958Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, `PrefixWatcher` uses `#[derive(Dissolve)]` to generate a `dissolve()` method. The pattern `let (_, _watcher, mut events_rx) = prefix_watcher.dissolve();` is the standard and intended usage throughout the codebase. The `mpsc::Receiver<WatchEvent>` maintains the etcd watch stream independently, so the `Watcher` handle can be safely dropped. This pattern is used consistently in critical infrastructure modules like component/client.rs, utils/leader_worker_barrier.rs, and entrypoint/input/http.rs.

Applied to files:

  • lib/runtime/src/storage/kv/file.rs
📚 Learning: 2025-09-25T00:54:01.369Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: tests/fault_tolerance/cancellation/test_trtllm.py:4-204
Timestamp: 2025-09-25T00:54:01.369Z
Learning: The fault tolerance tests in tests/fault_tolerance/cancellation/ run in a controlled container environment where files written to /workspace are automatically cleaned up after test completion, and tests execute sequentially without concurrency concerns, so temporary file management for config files is not necessary.

Applied to files:

  • lib/runtime/src/storage/kv/file.rs
🧬 Code graph analysis (3)
lib/runtime/src/storage/kv/file.rs (3)
lib/runtime/src/distributed.rs (1)
  • new (98-293)
lib/runtime/src/discovery/kv_store.rs (1)
  • new (27-32)
lib/runtime/src/utils/tasks/tracker.rs (1)
  • cancel_token (1951-1954)
lib/runtime/src/distributed.rs (1)
lib/runtime/src/storage/kv.rs (1)
  • file (268-270)
lib/runtime/src/storage/kv.rs (2)
lib/runtime/src/distributed.rs (1)
  • new (98-293)
lib/runtime/src/storage/kv/file.rs (2)
  • new (46-56)
  • new (185-199)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: tests (.)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
lib/runtime/src/distributed.rs (1)

111-111: LGTM!

The change correctly passes runtime.primary_token() to the file-based KV store, consistent with how cancellation tokens are propagated to other components in this file (e.g., KubeDiscoveryClient at line 152, KVStoreDiscovery at line 166).

lib/runtime/src/storage/kv/file.rs (2)

36-56: LGTM!

The CancellationToken is correctly integrated into the struct and constructor. Using CancellationToken::is_cancelled() from a std::thread is safe since the check is synchronous and the token uses Arc internally for thread-safe state sharing.


58-62: Good documentation.

The docstring clearly explains the cancellation behavior and rationale for using a real thread instead of a tokio task.

lib/runtime/src/storage/kv.rs (1)

268-270: LGTM!

The Manager::file API correctly propagates the CancellationToken to FileStore::new. The change is consistent with the updated constructor signature and the caller in distributed.rs has been updated accordingly.

Previously the keep-alive thread ran until the process stopped. In the
normal case that's fine.

In some testing setups there are multiple FileStore, and their lifetime
is not tied to the process, so now the keep-alive thread will exit cleanly.

Signed-off-by: Graham King <grahamk@nvidia.com>
Signed-off-by: Graham King <grahamk@nvidia.com>
@grahamking grahamking force-pushed the gk-filestore-keepalive-cancel-token branch from 97333cb to 7fbb235 Compare December 2, 2025 13:30
@grahamking grahamking enabled auto-merge (squash) December 2, 2025 13:30
@grahamking grahamking merged commit c9fdc2e into main Dec 2, 2025
38 of 39 checks passed
@grahamking grahamking deleted the gk-filestore-keepalive-cancel-token branch December 2, 2025 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants