-
Notifications
You must be signed in to change notification settings - Fork 686
fix: frontend metrics to be renamed from nv_llm_http_service_* => dynamo_frontend_* #2176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: frontend metrics to be renamed from nv_llm_http_service_* => dynamo_frontend_* #2176
Conversation
WalkthroughThis change updates metric naming conventions and documentation across several components, replacing the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant MetricsRegistry
participant Prometheus
participant Grafana
User->>MetricsRegistry: Exposes metrics at /metrics (dynamo_frontend_*)
Prometheus->>MetricsRegistry: Scrapes metrics (port 8081)
Grafana->>Prometheus: Queries dynamo_frontend_* metrics
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
components/planner/src/dynamo/planner/utils/prometheus.py (1)
30-35: Division by zero still possibleIf
*_countdoes not increase within the interval the query returnsNaN, causingfloat(...)to raise. You already blanket-catch, but consider addingor 0in PromQL to avoid NaN (e.g.,... / clamp_min(increase(..._count[...]),1)).
🧹 Nitpick comments (10)
lib/llm/src/http/service/metrics.rs (2)
102-110: Docstring prefix is inconsistent with the actual metric namesThe comment says the constructor hard-codes the
"dynamo"prefix, yet every metric below is created with the"dynamo_frontend_"prefix.
Either update the wording ("dynamo_frontend"), or adjust the metric names if"dynamo"is the intended canonical prefix.
112-200: Consider de-duplicating the literal prefix & add error context instead ofunwrap()The same
"dynamo_frontend_"literal is repeated seven times. A singleconst PREFIX: &str = "dynamo_frontend";andformat!("{PREFIX}_requests_total")style construction would reduce copy-paste risk if the prefix changes again.While touching this block, replace
unwrap()withexpect("failed to create <metric_name>")(or propagate the error) to avoid a panic without context during service start-up.deploy/metrics/prometheus.yml (1)
50-52: Deprecation note is helpful – consider adding a removal timelineAdding the comment clarifies the migration. Mentioning when the job will be removed (e.g., “to be removed in vX.Y”) would set clear expectations for operators.
deploy/metrics/grafana_dashboards/grafana-dynamo-dashboard.json (1)
30-110: Dashboard description strings still reference old metric unitsPanel 17 (
Frontend Avg Request Duration) description uses"dynamo_frontend_request_duration (sum/count)"while the query is on*_seconds_. Consider appending “_seconds” in the description to stay consistent with the metric family.components/planner/src/dynamo/planner/utils/prometheus.py (2)
30-35: Prefix duplication – define onceEach query embeds the literal
"dynamo_frontend". Extract it to a module-level constant to avoid divergence with future renames.-PREFIX = "dynamo_frontend" +PREFIX = "dynamo_frontend" def _sum(metric): return f"{PREFIX}_{metric}_sum" def _cnt(metric): return f"{PREFIX}_{metric}_count"
45-47: Repeated pattern – helper function could cut boilerplateAll six helpers share identical
try/exceptstructure differing only by the query string. A private_query_float(expr)util would DRY this up and make logging uniform.Also applies to: 56-58, 81-83, 92-94
deploy/metrics/grafana_dashboards/grafana-llm-metrics.json (1)
29-35: Dashboard not clearly marked as deprecated inside Grafana UIThe JSON adds a textual deprecation notice, but users skimming dashboards in Grafana will largely rely on title / tags / UID rather than this hidden copyright array.
Consider renaming eithertitleoruid(or adding an extra tag) to include “-DEPRECATED” so it surfaces in the dashboard list & API responses.- "title": "LLM Worker Metrics", + "title": "LLM Worker Metrics (DEPRECATED)", ... - "uid": "llm-worker-metrics", + "uid": "llm-worker-metrics-deprecated", + "tags": ["llm", "metrics", "deprecated"],components/metrics/README.md (2)
3-16: Deprecation section lacks concrete removal timelineThe notice is helpful, but many users want to know when the component disappears (next minor vs. next major).
Add an approximate target release or link to the tracking issue to set expectations.
23-23: Prefix note may mis-lead readersThe bullet still states demo metrics use the
"llm"prefix while production will use"dynamo".
With the wider rename todynamo_frontend_*, the production prefix statement is no longer precise.- In this demo the metrics names use the prefix "llm", but in production they will be prefixed with "dynamo" + In this demo the metrics names use the prefix "llm", but production metrics are now prefixed with "dynamo_frontend"deploy/metrics/README.md (1)
121-135: Duplicate information—risk of readers missing the warningThe
DEPRECATED METRICS NOTICEblock repeats part of the earlier deprecation text.
You can collapse the duplication and keep a single, highly-visible notice to reduce maintenance overhead.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
components/metrics/README.md(1 hunks)components/planner/src/dynamo/planner/utils/prometheus.py(6 hunks)deploy/metrics/README.md(2 hunks)deploy/metrics/grafana_dashboards/grafana-dynamo-dashboard.json(11 hunks)deploy/metrics/grafana_dashboards/grafana-llm-metrics.json(1 hunks)deploy/metrics/prometheus.yml(1 hunks)lib/llm/src/http/service/metrics.rs(7 hunks)lib/llm/tests/http-service.rs(1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: biswapanda
PR: ai-dynamo/dynamo#1412
File: lib/bindings/python/src/dynamo/runtime/logging.py:100-100
Timestamp: 2025-06-06T21:48:35.214Z
Learning: In the Dynamo codebase, BentoML has been completely removed from all executable code, with only documentation and attribution references remaining. The error_loggers configuration in lib/bindings/python/src/dynamo/runtime/logging.py should not include "bentoml" since those modules no longer exist.
deploy/metrics/prometheus.yml (1)
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
components/metrics/README.md (1)
Learnt from: PeaBrane
PR: #1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The create_endpoint method in WorkerMetricsPublisher has backward compatibility maintained through pyo3 signature annotation #[pyo3(signature = (component, dp_rank = None))], making the dp_rank parameter optional with a default value of None.
deploy/metrics/README.md (5)
Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
Learnt from: nnshah1
PR: #2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
Learnt from: biswapanda
PR: #1890
File: examples/vllm/deploy/agg.yaml:63-70
Timestamp: 2025-07-14T23:01:16.218Z
Learning: In vLLM worker deployments, grep-based log checks for "VllmWorker.*has been initialized" are appropriate for readiness probes to verify worker startup, but should not be used for liveness probes which need to detect ongoing worker health.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: #1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
🔇 Additional comments (1)
lib/llm/tests/http-service.rs (1)
358-362: LGTM ‑ test updated to the new histogram nameThe assertion now targets
dynamo_frontend_request_duration_seconds; matches the production rename.
1eb1991 to
072f489
Compare
7c5a0c6 to
26e206c
Compare
rmccorm4
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
…age to component/metrics" This reverts commit c38b756.
030a862 to
499264a
Compare
Overview:
Renames frontend metrics from
nv_llm_http_service_*todynamo_frontend_*and hardcodes the "dynamo" prefix in the metrics system for consistency and simplified API.Details:
nv_llm_http_service_*todynamo_frontend_*across Rust code, Python utilities, and Grafana dashboardscomponents/metricscomponent andllm_kv_*metricsWhere should the reviewer start?
lib/llm/src/http/service/metrics.rs- Core metrics implementation changescomponents/planner/src/dynamo/planner/utils/prometheus.py- Python utility updatesdeploy/metrics/grafana_dashboards/grafana-dynamo-dashboard.json- Dashboard metric name updatesdocs/guides/metrics.md- Documentation updatesRelated Issues:
Relates to DIS-337
Summary by CodeRabbit
Documentation
Refactor
Tests