refactor: kvbm modularity DIS-657 Eliminate ETCD from the leader-worker initialization #3202

richardhuo-nv · 2025-09-24T15:49:39Z

Overview:

To modularize KVBM and eliminate ETCD from leader–worker initialization, the leader and workers will communicate exclusively through ZMQ.

Details:

Remove the ETCD barrier from leader–worker synchronization and replace it with direct ZMQ communication.
Add worker_metadata and leader_metadata message types to support metadata exchange between workers and the leader.
Use variable gating to ensure correctness during both blocking and non-blocking initialization.

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

New Features
- Introduced environment-based configuration for leader publish/acknowledge URLs, replacing the previous barrier prefix setting.
- Added a startup handshake that exchanges worker/leader metadata and validates readiness.
- Background worker initialization with gated ping acknowledgements for smoother bring-up.
Refactor
- Streamlined distributed startup flow with timeouts, retries, and payload collection for improved robustness.
- Removed barrier-based synchronization across leader/worker components.
Chores
- Added a binary serialization dependency to support metadata exchange.

copy-pr-bot · 2025-09-24T15:49:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-09-24T15:52:23Z

Walkthrough

Replaces barrier-based coordination with explicit ZMQ pub/ack URL configuration and a staged leader–worker handshake. Updates Rust core and Python bindings to use leader_pub_url and leader_ack_url. Adds metadata structs and bincode payload handling, revises ZMQ messaging, leader/worker init flows, and removes barrier synchronization paths. Minor config cloning fix and adds bincode dependency.

Changes

Cohort / File(s)	Summary
Bindings: distributed re-exports and configs `lib/bindings/python/rust/llm/block_manager/distributed.rs`, `.../distributed/leader.rs`, `.../distributed/worker.rs`	Replace `get_barrier_id_prefix` with `get_leader_zmq_pub_url` and `get_leader_zmq_ack_url`. Leader/worker builders now use leader URLs; barrier prefix usage removed.
Bindings: vLLM connector adjustments `lib/bindings/python/rust/llm/block_manager/vllm/connector.rs`, `.../connector/leader.rs`, `.../connector/leader/recorder.rs`, `.../connector/trtllm_leader.rs`, `.../connector/trtllm_worker.rs`, `.../connector/worker.rs`	Drop `ConnectorOperation` and barrier readiness calls; update worker config to leader URLs; remove barrier spawn/run calls.
Core LLM: distributed leader/worker overhaul `lib/llm/src/block_manager/distributed/leader.rs`, `.../distributed/worker.rs`	Remove barrier- and DRT-based paths. Add leader_pub_url/leader_ack_url in configs. Introduce async worker allocation readiness, gated ping ACKs, metadata handlers, and background allocation/build of transfer handler.
Core LLM: ZMQ protocol and startup `lib/llm/src/block_manager/distributed/zmq.rs`	Introduce handshake startup (`new_with_handshake`), `broadcast_collect`, payload replies, updated `LeaderSockets` construction with separate pub/ack URLs, message handle API changes (public fields, reply/mark_handled), staged timeouts/retries, remove Ping handler.
Core LLM: distributed utils (metadata) `lib/llm/src/block_manager/distributed/utils.rs`	Add ZMQ metadata message constants and `WorkerMetadata`/`LeaderMetadata` (Serialize/Deserialize).
Core LLM: tests/config usage `lib/llm/src/block_manager/distributed.rs`	Remove barrier ID generation and propagation in tests.
Offload config clone fix `lib/llm/src/block_manager/offload.rs`	Clone layout config for disk pool build to preserve original config.
Cargo dependency `lib/llm/Cargo.toml`	Add `bincode = "1"` for metadata payload serialization.
Bindings: distributed utils (env/config) `lib/bindings/python/rust/llm/block_manager/distributed/utils.rs`	Replace barrier prefix env lookup with env-driven leader ZMQ host/port parsing and URL builders; add validation helpers; expose `get_leader_zmq_pub_url`/`get_leader_zmq_ack_url`.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant L as Leader
  participant Pub as ZMQ Pub (bind)
  participant Ack as ZMQ Ack (bind)
  participant W1 as Worker 1
  participant Wn as Worker N

  rect rgba(230,245,255,0.6)
  note over L, Ack: Startup & Bind (new_with_handshake)
  L->>Pub: Bind leader_pub_url
  L->>Ack: Bind leader_ack_url
  end

  rect rgba(240,255,230,0.6)
  note over L,Wn: Round 1: Collect WorkerMetadata
  L->>W1: broadcast "worker_metadata?"
  L->>Wn: broadcast "worker_metadata?"
  W1-->>L: reply WorkerMetadata (bincode)
  Wn-->>L: reply WorkerMetadata (bincode)
  L->>L: make LeaderMetadata from collected WorkerMetadata
  end

  rect rgba(255,245,230,0.6)
  note over L,Wn: Round 2: Send LeaderMetadata & Await ACKs
  L->>W1: broadcast "leader_metadata" (bincode payload)
  L->>Wn: broadcast "leader_metadata" (bincode payload)
  W1-->>L: ACK
  Wn-->>L: ACK
  end

  rect rgba(245,230,255,0.6)
  note over L,Wn: Round 3: Final readiness ping/ack
  L->>W1: ping_ready?
  L->>Wn: ping_ready?
  W1-->>L: ACK (after background allocation ready)
  Wn-->>L: ACK (after background allocation ready)
  L->>L: workers_allocation_ready = true
  end

sequenceDiagram
  autonumber
  participant Old as Old Flow
  participant New as New Flow

  rect rgba(255,240,240,0.6)
  note over Old: Barrier-based sync
  Old->>Old: Use barrier_id_prefix
  Old->>Old: Spawn/run readiness barrier
  Old->>Old: Proceed after barrier completion
  end

  rect rgba(240,255,240,0.6)
  note over New: URL + Handshake-based sync
  New->>New: Use leader_pub_url/leader_ack_url
  New->>New: Handshake: collect WorkerMetadata
  New->>New: Broadcast LeaderMetadata
  New->>New: Await final readiness ACKs
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Possibly related PRs

feat: DIS-373 dynamo KVBM connector API integration with TRTLLM #2544 — Touches the same distributed bindings and utils, adjusting leader/worker initialization and re-exports in a similar area.
feat: kvbm + connector #2258 — Introduced/modified distributed leader/worker utils that this PR now updates from barrier ID to leader URL configuration.
merge with main #2252 — Changes in the same distributed modules replacing barrier-based config with leader ZMQ URL wiring.

Poem

A rabbit taps the socket drum—thump, thump—
No barriers now, just URLs to jump.
Metadata whispers in bincode bytes,
Leaders and workers align their lights.
Handshakes done, ACKs cascade—hooray!
Carrots for all: we’re distributed today. 🥕✨

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 54.55% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title clearly summarizes the primary change: refactoring KVBM to remove ETCD from leader–worker initialization and switch to ZMQ, and it includes the tracking ticket (DIS-657) for context; this matches the PR objectives and the diffs that remove the ETCD/barrier logic. It is a single, specific sentence without file lists or ambiguous phrasing, so a reviewer scanning history can understand the main intent. Therefore the title meets the repository's clarity and specificity criteria.
Description Check	✅ Passed	The PR description follows the repository template and clearly states the Overview and Details (removal of ETCD barrier, addition of worker/leader metadata, and gating changes), so the intent and scope are easy to understand. The "Where should the reviewer start?" section is present but empty and the Related Issues field still uses a placeholder ("#xxx"), which should be filled in to help reviewers and link the PR to the issue. These omissions are non-critical, so the description is largely complete but would benefit from those additions to streamline review.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (6)

lib/llm/Cargo.toml (1)

105-105: Confirm bincode usage and pin appropriately

Adding bincode makes sense for the new ZMQ payloads. Consider pinning to a specific minor/patch to avoid surprise upgrades, and ensure mixed use with serde_json is intentional and consistent across components.

Would you like a quick sweep to identify where bincode vs serde_json is used in the distributed paths to confirm consistency?
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs (1)
13-13: Worker config migrated to leader URLs — good. Small robustness tweak suggested for event sync

The URL-based init aligns with the new handshake. In save_kv_layer, avoid potential panic on indexing by using last():
-            event_sync_blocking(self.layer_events[self.layers_complete - 1]);
+            if let Some(&ev) = self.layer_events.last() {
+                event_sync_blocking(ev);
+            }
Note: bind_connector_meta currently JSON-decodes ConnectorMetadata; if the rest of the path standardizes on bincode for metadata, consider aligning for consistency.

Also applies to: 138-140
lib/bindings/python/rust/llm/block_manager/distributed/utils.rs (2)
23-35: Warns on privileged ports — consider adding a duplicate-port guard upstream.

Ports validation is good. One additional safeguard: detect if PUB and ACK ports are configured equal (bind will fail).

Would you like a small helper (or sanity_check in leader config) to error if pub == ack ports? It avoids opaque bind failures.

37-64: URL builders are fine; minor type nit.

Returning ports as String is OK, but returning u16 and formatting later reduces conversions. Optional, non-blocking.
-fn get_leader_zmq_pub_port() -> String {
-    validated_port_from_env("DYN_KVBM_LEADER_ZMQ_PUB_PORT", DEFAULT_LEADER_ZMQ_PUB_PORT).to_string()
-}
+fn get_leader_zmq_pub_port() -> u16 {
+    validated_port_from_env("DYN_KVBM_LEADER_ZMQ_PUB_PORT", DEFAULT_LEADER_ZMQ_PUB_PORT)
+}
 
-fn get_leader_zmq_ack_port() -> String {
-    validated_port_from_env("DYN_KVBM_LEADER_ZMQ_ACK_PORT", DEFAULT_LEADER_ZMQ_ACK_PORT).to_string()
-}
+fn get_leader_zmq_ack_port() -> u16 {
+    validated_port_from_env("DYN_KVBM_LEADER_ZMQ_ACK_PORT", DEFAULT_LEADER_ZMQ_ACK_PORT)
+}
 
 pub fn get_leader_zmq_pub_url() -> String {
     format!(
         "tcp://{}:{}",
         get_leader_zmq_host(),
-        get_leader_zmq_pub_port()
+        get_leader_zmq_pub_port()
     )
 }
 
 pub fn get_leader_zmq_ack_url() -> String {
     format!(
         "tcp://{}:{}",
         get_leader_zmq_host(),
-        get_leader_zmq_ack_port()
+        get_leader_zmq_ack_port()
     )
 }
lib/llm/src/block_manager/distributed/leader.rs (1)
66-94: Prefer returning errors over panics in sanity_check.

This is a user-misconfig; don’t crash the process.
-                panic!(
+                anyhow::bail!(
                     "KVBM Configuration Error: No CPU memory configured.\n\
 ...
-                panic!(
+                anyhow::bail!(
                     "KVBM Configuration Error: CPU memory must be configured before disk memory.\n\
lib/llm/src/block_manager/distributed/zmq.rs (1)
401-406: Expose only what you need on MessageHandle.

Making message_id/push_handle public broadens surface area. Consider getters instead.
-pub struct MessageHandle {
-    pub message_id: usize,
+pub struct MessageHandle {
+    message_id: usize,
@@
-    pub push_handle: Arc<Mutex<Push>>,
+    push_handle: Arc<Mutex<Push>>,
Add accessors if required.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70f9938 and 7b3418f.

⛔ Files ignored due to path filters (2)

Cargo.lock is excluded by !**/*.lock
lib/bindings/python/Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (17)

lib/bindings/python/rust/llm/block_manager/distributed.rs (1 hunks)
lib/bindings/python/rust/llm/block_manager/distributed/leader.rs (2 hunks)
lib/bindings/python/rust/llm/block_manager/distributed/utils.rs (1 hunks)
lib/bindings/python/rust/llm/block_manager/distributed/worker.rs (2 hunks)
lib/bindings/python/rust/llm/block_manager/vllm/connector.rs (1 hunks)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (0 hunks)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (0 hunks)
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (0 hunks)
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs (2 hunks)
lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (2 hunks)
lib/llm/Cargo.toml (1 hunks)
lib/llm/src/block_manager/distributed.rs (0 hunks)
lib/llm/src/block_manager/distributed/leader.rs (4 hunks)
lib/llm/src/block_manager/distributed/utils.rs (1 hunks)
lib/llm/src/block_manager/distributed/worker.rs (11 hunks)
lib/llm/src/block_manager/distributed/zmq.rs (10 hunks)
lib/llm/src/block_manager/offload.rs (1 hunks)

💤 Files with no reviewable changes (4)

lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs
lib/llm/src/block_manager/distributed.rs

🧰 Additional context used

🧠 Learnings (5)

📓 Common learnings

Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

📚 Learning: 2025-08-18T20:51:51.324Z

Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2465
File: lib/runtime/src/pipeline/network/egress/push_router.rs:0-0
Timestamp: 2025-08-18T20:51:51.324Z
Learning: The runtime crate cannot depend on the llm crate due to architectural dependency constraints, preventing imports from lib/llm into lib/runtime.

Applied to files:

lib/llm/Cargo.toml

📚 Learning: 2025-06-08T03:12:03.985Z

Learnt from: jthomson04
PR: ai-dynamo/dynamo#1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the `wait_for_key_count` function correctly uses exact equality (`==`) instead of greater-than-or-equal (`>=`) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.

Applied to files:

lib/bindings/python/rust/llm/block_manager/distributed.rs

📚 Learning: 2025-05-29T06:20:12.901Z

Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Applied to files:

lib/llm/src/block_manager/distributed/worker.rs

📚 Learning: 2025-09-17T01:00:50.937Z

Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Applied to files:

lib/llm/src/block_manager/distributed/zmq.rs

🧬 Code graph analysis (10)

lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (1)

lib/bindings/python/rust/llm/block_manager/distributed/utils.rs (2)

get_leader_zmq_ack_url (58-64)

get_leader_zmq_pub_url (50-56)

lib/bindings/python/rust/llm/block_manager/distributed/utils.rs (1)

lib/bindings/python/rust/llm/block_manager/distributed/leader.rs (3)

std (21-21)

std (25-27)

v (52-52)

lib/bindings/python/rust/llm/block_manager/distributed.rs (1)

lib/bindings/python/rust/llm/block_manager/distributed/utils.rs (2)

get_leader_zmq_ack_url (58-64)

get_leader_zmq_pub_url (50-56)

lib/llm/src/block_manager/distributed/leader.rs (3)

lib/bindings/python/rust/llm/block_manager/distributed/leader.rs (1)

new (73-98)

lib/bindings/python/rust/llm/block_manager/distributed/worker.rs (2)

new (29-58)

new (112-159)

lib/llm/src/block_manager/distributed/zmq.rs (2)

new_leader_sockets (45-67)

new_with_handshake (87-229)

lib/bindings/python/rust/llm/block_manager/distributed/worker.rs (1)

lib/bindings/python/rust/llm/block_manager/distributed/utils.rs (2)

get_leader_zmq_ack_url (58-64)

get_leader_zmq_pub_url (50-56)

lib/llm/src/block_manager/distributed/worker.rs (2)

lib/llm/src/block_manager/distributed/zmq.rs (2)

handle (500-500)

bincode (171-171)

lib/bindings/python/rust/llm/block_manager/distributed/worker.rs (2)

new (29-58)

new (112-159)

lib/llm/src/block_manager/offload.rs (1)

lib/llm/src/block_manager/layout.rs (3)

config (196-196)

config (509-511)

config (740-742)

lib/bindings/python/rust/llm/block_manager/distributed/leader.rs (1)

lib/bindings/python/rust/llm/block_manager/distributed/utils.rs (2)

get_leader_zmq_ack_url (58-64)

get_leader_zmq_pub_url (50-56)

lib/llm/src/block_manager/distributed/zmq.rs (2)

lib/bindings/python/rust/llm/block_manager/distributed/worker.rs (2)

new (29-58)

new (112-159)

lib/llm/src/block_manager/distributed/leader.rs (1)

new (118-131)

lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs (1)

lib/bindings/python/rust/llm/block_manager/distributed/utils.rs (2)

get_leader_zmq_ack_url (58-64)

get_leader_zmq_pub_url (50-56)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: Build and Test - dynamo
GitHub Check: Build and Test - sglang
GitHub Check: Build and Test - vllm
GitHub Check: Build and Test - trtllm
GitHub Check: pre-merge-rust (lib/runtime/examples)
GitHub Check: pre-merge-rust (lib/bindings/python)
GitHub Check: pre-merge-rust (.)

🔇 Additional comments (23)

lib/bindings/python/rust/llm/block_manager/vllm/connector.rs (1)

5-6: LGTM: import cleanup aligns with new transfer protocol

Switching to WorkerTransferRequest and dropping the unused import looks consistent with downstream usage.

lib/llm/src/block_manager/offload.rs (1)

742-742: Good fix: clone config when building disk layout

Cloning preserves the original LayoutConfig for subsequent mutations. Matches how device/host are handled and avoids accidental consumption.

lib/llm/src/block_manager/distributed/utils.rs (1)

10-11: LGTM: explicit ZMQ metadata message subjects

Clear subject strings for metadata exchanges improve readability of the wire protocol.

lib/bindings/python/rust/llm/block_manager/distributed/leader.rs (1)

9-9: Leader now configured via ZMQ URLs — matches PR objective

Supplying leader_pub_url and leader_ack_url through utils is the right abstraction. Sanity check retention is good.

Also applies to: 82-84

lib/bindings/python/rust/llm/block_manager/distributed/worker.rs (2)

4-5: LGTM: imports for leader URL helpers

Sourcing URLs from utils keeps Python bindings thin and consistent with core.

143-145: LGTM: worker builder now uses leader URLs

KvbmWorkerConfig builder matches the leader/worker ZMQ flow.

lib/bindings/python/rust/llm/block_manager/distributed.rs (1)

11-11: Exports updated to URL-based init — no lingering barrier API usage found

Searched the repo for get_barrier_id_prefix, barrier_id_prefix, and get_barrier_id — no matches found.

lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (2)

14-14: Switch to leader URLs looks correct.

Importing get_leader_zmq_pub_url/get_leader_zmq_ack_url aligns with the new API and removes the ETCD dependency.

169-171: Config now uses explicit ZMQ URLs — good. Verify env plumbing.

Builder wiring to leader_pub_url/leader_ack_url is consistent. Please verify the env-based getters return reachable endpoints in your deployment.

Run a quick check at runtime to log the resolved URLs before calling KvbmWorker::new to catch misconfig:

tracing::info!("pub={}, ack={}", get_leader_zmq_pub_url(), get_leader_zmq_ack_url());

lib/bindings/python/rust/llm/block_manager/distributed/utils.rs (2)

9-14: Env read helper is solid.

Trim + empty-string filtering prevents common misconfig pitfalls.

16-21: Port parsing guards invalid ranges.

Using u32 parse with 1..=65535 check is fine.

lib/llm/src/block_manager/distributed/leader.rs (2)

146-176: Double-check bytes_per_block aggregation policy.

Summing bytes_per_block across workers assumes TP and may over-allocate for DP/PP. If mixed parallelism is possible, consider deriving the correct aggregation from topology.

Would you like me to gate this with a topology flag (sum for TP, min/max for others)?

219-226: Handshake readiness wait logic LGTM.

Biased select with timeout is straightforward and avoids indefinite waits.

lib/llm/src/block_manager/distributed/worker.rs (3)

35-51: Readiness gate is simple and effective.

Using AtomicBool for ping gating is fine here.

110-183: Allocation helper is well-factored.

Good separation of concerns; no obvious issues.

205-312: ACK-before-validate is intentional; ensure leader relies on final ping for correctness.

You ACK leader_metadata before validating/allocating, which is fine because final readiness is enforced via ping. Just confirm no downstream assumes allocation succeeded merely on this ACK.

lib/llm/src/block_manager/distributed/zmq.rs (7)

45-67: Two-socket binding API is clear.

Explicit pub/ack separation is good and removes ETCD barrier assumptions.

87-121: Handshake constructor is well-structured.

Good separation of pull worker and broadcast phases.

122-167: Metadata collection loop is robust.

Rounds with bounded timeouts and exact worker-count requirement look correct.

174-209: Allocation-config broadcast loop is sound.

Rebroadcast on incomplete rounds; aligns with worker ACK semantics.

211-229: Final ping loop is fine; see cleanup note below.

Works with gated ACKs on workers. See pending-message cleanup concern next.

353-371: ACK vs REPLY handling looks correct.

Frame-shape distinction and payload collection are aligned with MessageHandle::reply protocol.

455-483: reply/mark_handled are good additions.

These unblock Drop panics and enable payload workflows.

lib/llm/src/block_manager/distributed/leader.rs

lib/llm/src/block_manager/distributed/utils.rs

lib/llm/src/block_manager/distributed/worker.rs

lib/llm/src/block_manager/distributed/zmq.rs

richardhuo-nv · 2025-10-07T05:33:40Z

/ok to test 423724b

richardhuo-nv · 2025-10-07T15:56:22Z

/ok to test c6145dc

richardhuo-nv · 2025-10-07T16:53:09Z

/ok to test add228f

lib/llm/src/block_manager/distributed/leader.rs

fix fix fix fix fix fix fix fix fix fix Signed-off-by: richardhuo-nv <rihuo@nvidia.com>

Signed-off-by: richardhuo-nv <rihuo@nvidia.com>

richardhuo-nv · 2025-10-22T23:12:58Z

/ok to test 85cf9aa

richardhuo-nv requested a review from a team as a code owner September 24, 2025 15:49

pull-request-size bot added the size/XXL label Sep 24, 2025

github-actions bot added the refactor label Sep 24, 2025

richardhuo-nv force-pushed the rihuo/kvbm_no_etcd branch 2 times, most recently from 9050659 to 7b3418f Compare September 24, 2025 15:50

coderabbitai bot reviewed Sep 24, 2025

View reviewed changes

richardhuo-nv force-pushed the rihuo/kvbm_no_etcd branch from 7b3418f to ae5b89e Compare October 2, 2025 00:21

richardhuo-nv requested a review from nv-kmcgill53 October 2, 2025 00:27

rmccorm4 requested a review from ryanolson October 6, 2025 18:52

richardhuo-nv force-pushed the rihuo/kvbm_no_etcd branch 2 times, most recently from f9892ea to 423724b Compare October 7, 2025 05:33

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 05:33 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 05:34 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 15:56 Inactive

richardhuo-nv force-pushed the rihuo/kvbm_no_etcd branch 2 times, most recently from 93b95b3 to add228f Compare October 7, 2025 16:52

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 16:52 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 7, 2025 17:00 Inactive

richardhuo-nv mentioned this pull request Oct 22, 2025

chore: KVBM pip wheel #3826

Merged

ziqifan617 approved these changes Oct 22, 2025

View reviewed changes

nv-kmcgill53 reviewed Oct 22, 2025

View reviewed changes

lib/llm/src/block_manager/distributed/leader.rs Show resolved Hide resolved

richardhuo-nv added 3 commits October 22, 2025 16:00

remove etcd into kvbm leader-worker

5a68a22

fix fix fix fix fix fix fix fix fix fix Signed-off-by: richardhuo-nv <rihuo@nvidia.com>

resolve comments

f84da6a

Signed-off-by: richardhuo-nv <rihuo@nvidia.com>

resolve fmt

37189e7

Signed-off-by: richardhuo-nv <rihuo@nvidia.com>

fix

4c477c8

Signed-off-by: richardhuo-nv <rihuo@nvidia.com>

richardhuo-nv force-pushed the rihuo/kvbm_no_etcd branch from add228f to 4c477c8 Compare October 22, 2025 23:08

copy-pr-bot bot temporarily deployed to GITLAB October 22, 2025 23:08 Inactive

doc update

85cf9aa

copy-pr-bot bot temporarily deployed to GITLAB October 22, 2025 23:12 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 22, 2025 23:24 Inactive

richardhuo-nv merged commit 94aa2a7 into main Oct 23, 2025
30 of 31 checks passed

richardhuo-nv deleted the rihuo/kvbm_no_etcd branch October 23, 2025 07:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor: kvbm modularity DIS-657 Eliminate ETCD from the leader-worker initialization #3202

refactor: kvbm modularity DIS-657 Eliminate ETCD from the leader-worker initialization #3202

Uh oh!

richardhuo-nv commented Sep 24, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

copy-pr-bot bot commented Sep 24, 2025

Uh oh!

coderabbitai bot commented Sep 24, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

richardhuo-nv commented Oct 7, 2025

Uh oh!

richardhuo-nv commented Oct 7, 2025

Uh oh!

richardhuo-nv commented Oct 7, 2025

Uh oh!

Uh oh!

richardhuo-nv commented Oct 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

refactor: kvbm modularity DIS-657 Eliminate ETCD from the leader-worker initialization #3202

refactor: kvbm modularity DIS-657 Eliminate ETCD from the leader-worker initialization #3202

Uh oh!

Conversation

richardhuo-nv commented Sep 24, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

copy-pr-bot bot commented Sep 24, 2025

Uh oh!

coderabbitai bot commented Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

richardhuo-nv commented Oct 7, 2025

Uh oh!

richardhuo-nv commented Oct 7, 2025

Uh oh!

richardhuo-nv commented Oct 7, 2025

Uh oh!

Uh oh!

richardhuo-nv commented Oct 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

richardhuo-nv commented Sep 24, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 24, 2025 •

edited

Loading