-
Notifications
You must be signed in to change notification settings - Fork 692
feat: DIS-678 kvbm modularity: standalone metrics endpoint #3433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughIntroduces standalone KVBM metrics mode served on a single port (default 6881) gated by environment variables. Updates Rust leaders/recorders and Python connectors to branch on this mode, removes worker-side metric instrumentation, adjusts Grafana queries and panels, drops Prometheus leader scrape, and revises docs accordingly. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant User as User/Operator
participant Env as Env Vars
participant Py as Python Connector (Leader)
participant Rust as Rust Leader
participant Reg as KvbmMetricsRegistry
participant Srv as Metrics HTTP Server
participant Prom as Prometheus
User->>Env: Set DYN_KVBM_METRICS_STANDALONE [=1] and optional DYN_KVBM_METRICS_PORT
Py->>Env: Read standalone flag
alt Standalone enabled
Py->>Rust: KvConnectorLeader::new(..., metrics_standalone=true)
Rust->>Env: parse_dyn_kvbm_metrics_port() (default 6881)
Rust->>Reg: Init registry and counters (kvbm_* prefixed)
Reg-->>Rust: Registry handle
Rust->>Srv: Spawn /metrics on selected port (background)
Prom-->>Srv: Scrape /metrics
Srv-->>Prom: Expose kvbm_* metrics
else Namespace mode
Py->>Rust: KvConnectorLeader::new(..., metrics_standalone=false)
Rust->>Rust: Use namespace-based metrics
end
Note over Rust,Srv: Drop KvbmMetrics triggers graceful shutdown of server
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
Pre-merge checks❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
lib/bindings/python/src/dynamo/llm/utils.py (1)
36-44: Rename helper to fix typoLine 36 spells the helper
is_standslone_kvbm_metrics_enabled, which is easy to misread and will propagate the typo across future call sites. Rename it (and the imports) tois_standalone_kvbm_metrics_enabledwhile the surface area is still tiny.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
deploy/metrics/grafana_dashboards/grafana-kvbm-dashboard.json(7 hunks)deploy/metrics/prometheus.yml(0 hunks)docs/guides/run_kvbm_in_trtllm.md(2 hunks)docs/guides/run_kvbm_in_vllm.md(1 hunks)lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs(5 hunks)lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs(2 hunks)lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs(4 hunks)lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs(0 hunks)lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs(0 hunks)lib/bindings/python/src/dynamo/llm/trtllm_integration/connector/kvbm_connector_leader.py(2 hunks)lib/bindings/python/src/dynamo/llm/trtllm_integration/connector/kvbm_connector_worker.py(0 hunks)lib/bindings/python/src/dynamo/llm/utils.py(1 hunks)lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py(2 hunks)lib/bindings/python/src/dynamo/llm/vllm_integration/connector_worker.py(0 hunks)lib/llm/src/block_manager/metrics_kvbm.rs(3 hunks)
💤 Files with no reviewable changes (5)
- lib/bindings/python/src/dynamo/llm/vllm_integration/connector_worker.py
- lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs
- lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs
- lib/bindings/python/src/dynamo/llm/trtllm_integration/connector/kvbm_connector_worker.py
- deploy/metrics/prometheus.yml
🧰 Additional context used
🧬 Code graph analysis (6)
lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py (1)
lib/bindings/python/src/dynamo/llm/utils.py (2)
find_and_set_available_port_from_env(9-31)is_standslone_kvbm_metrics_enabled(36-44)
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (4)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (3)
parse_dyn_kvbm_metrics_port(637-651)new(91-184)new(559-590)lib/llm/src/block_manager/metrics_kvbm.rs (3)
new_with_standalone(85-163)default(241-243)new(41-81)lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
new(88-206)lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (7)
new(190-227)new(340-371)new(1085-1098)new(1109-1122)new(1132-1142)new(1458-1463)new(1467-1472)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (3)
lib/llm/src/block_manager/metrics_kvbm.rs (4)
new_with_standalone(85-163)default(241-243)new(41-81)new(211-216)lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
new(88-206)lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (2)
new(65-148)new(455-473)
lib/llm/src/block_manager/metrics_kvbm.rs (3)
lib/runtime/src/metrics/prometheus_names.rs (1)
sanitize_prometheus_name(381-407)lib/runtime/src/pipeline/context.rs (1)
registry(237-239)lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (2)
new(91-184)new(559-590)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (2)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (3)
parse_dyn_kvbm_metrics_port(637-651)new(91-184)new(559-590)lib/llm/src/block_manager/metrics_kvbm.rs (4)
new_with_standalone(85-163)default(241-243)new(41-81)new(211-216)
lib/bindings/python/src/dynamo/llm/trtllm_integration/connector/kvbm_connector_leader.py (1)
lib/bindings/python/src/dynamo/llm/utils.py (2)
find_and_set_available_port_from_env(9-31)is_standslone_kvbm_metrics_enabled(36-44)
🪛 markdownlint-cli2 (0.18.1)
docs/guides/run_kvbm_in_vllm.md
91-91: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Build and Test - dynamo
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (.)
- GitHub Check: tests (.)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: clippy (launch/dynamo-run)
keivenchang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the metric names, can you use the common prometheus_names.rs?
|
/ok to test 30d911d |
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs
Outdated
Show resolved
Hide resolved
lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py
Outdated
Show resolved
Hide resolved
|
/ok to test 4b9b9dd |
sure, added |
3416649 to
51a52d4
Compare
Signed-off-by: richardhuo-nv <rihuo@nvidia.com> metrics implementation Signed-off-by: richardhuo-nv <rihuo@nvidia.com> add env for metrics Signed-off-by: richardhuo-nv <rihuo@nvidia.com> add port handling Signed-off-by: richardhuo-nv <rihuo@nvidia.com> add port handling Signed-off-by: richardhuo-nv <rihuo@nvidia.com> fix fix fix fix fix Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
51a52d4 to
e6607d8
Compare
|
/ok to test e6607d8 |
|
/ok to test 765c697 |
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
biswapanda
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm from deploy prespective
Overview:
KVBM need a standalone metrics port and registry to achieve the modularity. So that user could bypass the distributed runtime requirements for getting the metrics.
Also, removed the kvbm worker metrics, since it will cause port conflict issue when we have multiple TPs.
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit