-
Notifications
You must be signed in to change notification settings - Fork 720
fix(FileStore): Stop keep-alive thread on cancellation token. #4666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe changes introduce cancellation token support to the file-based KV store backend. The Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~35 minutes
Poem
Pre-merge checks❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
lib/runtime/src/storage/kv/file.rs (2)
64-77: Consider checking cancellation before sleep as well.Currently, the cancellation check happens only after the sleep completes. If cancellation is signaled at the start of a sleep interval, the thread will wait the full
keep_alive_interval(1-3.3s) before exiting. While the docstring acknowledges this is acceptable, checking at both ends of the loop could improve responsiveness with minimal code change:fn expiry_thread(&self) { loop { let ttl = self.shortest_ttl(); let keep_alive_interval = cmp::max(ttl / 3, MIN_KEEP_ALIVE); + if self.cancel_token.is_cancelled() { + break; + } thread::sleep(keep_alive_interval); if self.cancel_token.is_cancelled() { break; }That said, given the stated design intent and test-focused use case, the current implementation is reasonable.
484-503: Test updated correctly, but orphaned thread may accumulate.The test creates a
FileStorewith a freshCancellationTokenthat is never cancelled, leaving the expiry thread running after the test completes. For a single test this is fine, but in a larger test suite with manyFileStoreinstances, these orphaned threads could accumulate.Consider cancelling the token at test end if this becomes problematic:
#[tokio::test] async fn test_entries_full_path() { let t = tempfile::tempdir().unwrap(); let cancel = CancellationToken::new(); let m = FileStore::new(cancel.clone(), t.path()); // ... test logic ... cancel.cancel(); // Clean shutdown of expiry thread }For now, the current approach is acceptable.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
lib/runtime/src/distributed.rs(1 hunks)lib/runtime/src/storage/kv.rs(1 hunks)lib/runtime/src/storage/kv/file.rs(5 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane suggested using tokio::select! arm ordering with the existing biased directive in the indexer to create a natural barrier for dump requests, ensuring KV events are drained before snapshotting. This approach leverages existing architecture (biased select) to solve race conditions with minimal code changes, which aligns with their preference for contained solutions.
Applied to files:
lib/runtime/src/storage/kv/file.rslib/runtime/src/distributed.rs
📚 Learning: 2025-08-15T23:51:04.958Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, `PrefixWatcher` uses `#[derive(Dissolve)]` to generate a `dissolve()` method. The pattern `let (_, _watcher, mut events_rx) = prefix_watcher.dissolve();` is the standard and intended usage throughout the codebase. The `mpsc::Receiver<WatchEvent>` maintains the etcd watch stream independently, so the `Watcher` handle can be safely dropped. This pattern is used consistently in critical infrastructure modules like component/client.rs, utils/leader_worker_barrier.rs, and entrypoint/input/http.rs.
Applied to files:
lib/runtime/src/storage/kv/file.rs
📚 Learning: 2025-09-25T00:54:01.369Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: tests/fault_tolerance/cancellation/test_trtllm.py:4-204
Timestamp: 2025-09-25T00:54:01.369Z
Learning: The fault tolerance tests in tests/fault_tolerance/cancellation/ run in a controlled container environment where files written to /workspace are automatically cleaned up after test completion, and tests execute sequentially without concurrency concerns, so temporary file management for config files is not necessary.
Applied to files:
lib/runtime/src/storage/kv/file.rs
🧬 Code graph analysis (3)
lib/runtime/src/storage/kv/file.rs (3)
lib/runtime/src/distributed.rs (1)
new(98-293)lib/runtime/src/discovery/kv_store.rs (1)
new(27-32)lib/runtime/src/utils/tasks/tracker.rs (1)
cancel_token(1951-1954)
lib/runtime/src/distributed.rs (1)
lib/runtime/src/storage/kv.rs (1)
file(268-270)
lib/runtime/src/storage/kv.rs (2)
lib/runtime/src/distributed.rs (1)
new(98-293)lib/runtime/src/storage/kv/file.rs (2)
new(46-56)new(185-199)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (.)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: tests (.)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
lib/runtime/src/distributed.rs (1)
111-111: LGTM!The change correctly passes
runtime.primary_token()to the file-based KV store, consistent with how cancellation tokens are propagated to other components in this file (e.g.,KubeDiscoveryClientat line 152,KVStoreDiscoveryat line 166).lib/runtime/src/storage/kv/file.rs (2)
36-56: LGTM!The
CancellationTokenis correctly integrated into the struct and constructor. UsingCancellationToken::is_cancelled()from astd::threadis safe since the check is synchronous and the token usesArcinternally for thread-safe state sharing.
58-62: Good documentation.The docstring clearly explains the cancellation behavior and rationale for using a real thread instead of a tokio task.
lib/runtime/src/storage/kv.rs (1)
268-270: LGTM!The
Manager::fileAPI correctly propagates theCancellationTokentoFileStore::new. The change is consistent with the updated constructor signature and the caller indistributed.rshas been updated accordingly.
Previously the keep-alive thread ran until the process stopped. In the normal case that's fine. In some testing setups there are multiple FileStore, and their lifetime is not tied to the process, so now the keep-alive thread will exit cleanly. Signed-off-by: Graham King <grahamk@nvidia.com>
Signed-off-by: Graham King <grahamk@nvidia.com>
97333cb to
7fbb235
Compare
Previously the keep-alive thread ran until the process stopped. In the normal case that's fine.
In some testing setups there are multiple FileStore, and their lifetime is not tied to the process, so now the keep-alive thread will exit cleanly.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.