-
Notifications
You must be signed in to change notification settings - Fork 692
feat: add a new composite SW/HW grafana (DYN-678) #1788
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis update reorganizes and expands metrics monitoring and visualization infrastructure. It introduces new Grafana dashboards, updates Prometheus and docker-compose configurations, clarifies documentation, and standardizes naming conventions. New files reference centralized configuration locations, and service images and networking are pinned and made explicit for improved deployment control and monitoring integration. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Grafana
participant Prometheus
participant Exporters (NATS/DCGM)
participant Dynamo Metrics Service
User->>Grafana: Access dashboards (port 3001)
Grafana->>Prometheus: Query metrics
Prometheus->>Exporters (NATS/DCGM): Scrape metrics (per config)
Prometheus->>Dynamo Metrics Service: Scrape /metrics endpoint
Exporters (NATS/DCGM)-->>Prometheus: Return hardware/software metrics
Dynamo Metrics Service-->>Prometheus: Return aggregated LLM/Dynamo metrics
Prometheus-->>Grafana: Serve metrics data
Grafana-->>User: Display visualizations
Possibly related PRs
Poem
Warning Review ran into problems🔥 ProblemsCheck-run timed out after 90 seconds. Some checks/pipelines were still in progress when the timeout was reached. Consider increasing the reviews.tools.github-checks.timeout_ms value in your CodeRabbit configuration to allow more time for checks to complete. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🔭 Outside diff range comments (2)
deploy/metrics/docker-compose.yml (1)
90-101: Prometheus tagv3.4.1is invalid – will fail to pull.Up-stream
prom/prometheushas not published anyv3.*images (latest stable is stillv2.*). Compose-up will error withmanifest unknown.
Use a valid tag (e.g.v2.52.0) or pin tolatestif you really need the bleeding edge.- image: prom/prometheus:v3.4.1 + # NOTE: Prometheus has not released v3 – use a valid v2 tag. + image: prom/prometheus:v2.52.0lib/runtime/docker-compose.yml (1)
90-108: Same invalid Prometheus tag as indeploy/metrics– compose up will fail.Please align with the fix suggested earlier.
- image: prom/prometheus:v3.4.1 + image: prom/prometheus:v2.52.0
🧹 Nitpick comments (11)
lib/runtime/metrics/prometheus.yml (1)
36-41:host.docker.internalis not portable across all Docker hosts
Linux installations prior to Docker 20.10 or custom DNS setups won’t resolve this hostname, causing thellm-demoscrape job to fail silently. Consider parameterising the target or documenting the requirement explicitly.lib/runtime/grafana-datasources.yml (1)
16-24: Optional: expose datasource URL via environment variableHard-coding
http://prometheus:9090couples the runtime image to an internal Docker DNS name.
Expose it through an env-substitution (${PROM_URL:http://prometheus:9090}) so integrators can override without editing the file.deploy/metrics/README.md (1)
111-115: Markdown bullet list renders as a single wrapped paragraphEmpty line between the list items and preceding paragraph is missing, so GitHub will not interpret the dashes as a list.
-## Required Files -The following configuration files should be present in this directory: +## Required Files + +The following configuration files should be present in this directory:Minor but improves readability.
deploy/metrics/prometheus.yml (1)
36-50: Scraping host services from inside container relies onhost.docker.internal– not portable
host.docker.internalonly resolves on Docker Desktop & recent Docker‐CE builds for macOS/Windows; many Linux flavours require an extra--add-hostor a bridge alias. Users following the README on plain Docker-Engine will get scrape failures.Consider:
- Instructing users to run Prometheus with
extra_hosts: ["host.docker.internal:host-gateway"](already in compose?)- Or replace targets with environment-interpolated addresses (
${HOST_IP}:8000).Add a short note in the README / comments so users are not trapped.
deploy/metrics/grafana_dashboards/grafana-dynamo-dashboard.json (1)
1-924: Dashboard JSON should live under source control but large autogenerated blobs hinder diffsStoring the full rendered JSON (924 lines) makes reviews painful and small tweaks unreadable.
A common pattern is:
- Keep the JSON in Grafana and export only on releases.
- Or commit a trimmed, formatted version generated with
grafonnet,grafana-builder, orhelm jsonnet, so diffs stay semantic.Not blocking, but consider tooling to generate dashboards programmatically in future PRs.
deploy/metrics/docker-compose.yml (2)
87-89: Typo in comment (“te firewall”).Minor but worth fixing to keep docs professional.
- # To access Prometheus from another machine, you may need to disable te firewall on your host. On Ubuntu: + # To access Prometheus from another machine, you may need to disable the firewall on your host. On Ubuntu:
124-127: Mount dashboards as read-only unless write access is required.Grafana never needs to write to provisioning JSON at runtime. Using
:roprevents accidental modifications from inside the container.- - ./grafana_dashboards:/etc/grafana/provisioning/dashboards:rw + - ./grafana_dashboards:/etc/grafana/provisioning/dashboards:rocomponents/metrics/README.md (2)
9-12: Missing commas and double hyphen break the sentence.-This is a demo implementation. The metrics component is currently under active development and this documentation will change as the implementation evolves. -- In this demo the metrics names use the prefix "llm", but in production they will be prefixed with "nv_llm" (e.g., the HTTP `/metrics` endpoint will serve metrics with "nv_llm" prefixes) -- This demo will only work when using examples/llm/configs/agg.yml-- other configurations will not work +This is a demo implementation, and the component is under active development; the documentation will evolve accordingly. +* In this demo the metric names use the prefix `llm`, but in production they will be prefixed with `nv_llm` (e.g., the `/metrics` endpoint will serve metrics with `nv_llm` prefixes). +* This demo only works with `examples/llm/configs/agg.yml` — other configurations are not yet supported.
58-64: Grammar: “a mock workers”.-Step 1: Launch a mock workers via the following command (if already built): +Step 1: Launch a mock worker via the following command (if already built):lib/runtime/docker-compose.yml (2)
124-126: Dashboard mount should be read-only for runtime safety.- - ./grafana_dashboards:/etc/grafana/provisioning/dashboards:rw + - ./grafana_dashboards:/etc/grafana/provisioning/dashboards:ro
16-24: High maintenance overhead keeping two compose files “in sync”.Consider extracting the common monitoring stack into a single compose file and using Compose’s
extendsor a CI check to prevent drift.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
deploy/metrics/grafana-dynamo-composite.pngis excluded by!**/*.png
📒 Files selected for processing (10)
components/metrics/README.md(8 hunks)components/metrics/src/bin/mock_worker.rs(1 hunks)deploy/metrics/README.md(1 hunks)deploy/metrics/docker-compose.yml(3 hunks)deploy/metrics/grafana_dashboards/grafana-dynamo-dashboard.json(1 hunks)deploy/metrics/prometheus.yml(1 hunks)lib/runtime/docker-compose.yml(2 hunks)lib/runtime/grafana-datasources.yml(1 hunks)lib/runtime/grafana_dashboards(1 hunks)lib/runtime/metrics/prometheus.yml(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
components/metrics/src/bin/mock_worker.rs (1)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
components/metrics/README.md (1)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
deploy/metrics/docker-compose.yml (1)
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.
🪛 LanguageTool
components/metrics/README.md
[uncategorized] ~9-~9: Possible missing comma found.
Context: ...ics component is currently under active development and this documentation will change as t...
(AI_HYDRA_LEO_MISSING_COMMA)
[typographical] ~10-~10: It appears that a comma is missing.
Context: ...s the implementation evolves. - In this demo the metrics names use the prefix "llm",...
(DURING_THAT_TIME_COMMA)
[grammar] ~58-~58: The plural noun “workers” cannot be used with the article “a”. Did you mean “a mock worker” or “mock workers”?
Context: ...ate event-based metrics Step 1: Launch a mock workers via the following command (if already b...
(A_NNS)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - vllm
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (3)
lib/runtime/grafana_dashboards (1)
1-1: Confirmed:lib/runtime/grafana_dashboardsis a valid symlink
git ls-files -s lib/runtime/grafana_dashboardsreports mode 120000ls -lshows it points to../../deploy/metrics/grafana_dashboards- Relative path resolves correctly from
lib/runtimeNo further changes needed.
deploy/metrics/README.md (1)
103-103: Verify asset path
grafana-dynamo-composite.pngmust exist indeploy/metrics/. Missing assets render as broken images in GitHub & Docs sites.components/metrics/README.md (1)
130-137: Keep example output consistent with earlier changes.The sample Prometheus metrics still show the
llm_prefix even after the earlier note that production will switch tonv_llm. Consider adding a footnote or updating the sample to avoid confusion.
Overview:
Add a Dynamo composite graph that includes Dynamo (ForwardPassMetrics), DCGM HW stats, NATS, and etcd. Update README.md for updated instructions.
Details:
Where should the reviewer start?
Start by reviewing the file changes in README.md.
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Documentation
Chores