@keivenchang keivenchang commented Jul 7, 2025

Overview:

Add a Dynamo composite Grafana dashboard that covers Dynamo (ForwardPassMetrics), DCGM hardware stats, NATS, and etcd. Update README.md with the revised instructions.

Details:

  • Adds common sections to the YAML files for shared config.
  • Updates worker/service config references to use common-configs.
  • Makes the hardware requirements and notes in the README more precise.

Where should the reviewer start?

Start by reviewing the file changes in README.md.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Introduced a comprehensive Grafana dashboard for monitoring Dynamo runtime, GPU, messaging, and storage metrics.
    • Added new Docker Compose services and configurations for enhanced monitoring, including exporters, Prometheus, and Grafana with improved network isolation and version pinning.
  • Documentation

    • Expanded and clarified metrics and deployment documentation, including updated instructions, new dashboard images, and improved configuration details.
    • Improved README clarity and consistency in naming conventions for metrics components and examples.
  • Chores

    • Added reference files for Grafana dashboards and Prometheus configuration to streamline setup and maintenance.

copy-pr-bot bot commented Jul 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Jul 7, 2025

Walkthrough

This update reorganizes and expands metrics monitoring and visualization infrastructure. It introduces new Grafana dashboards, updates Prometheus and docker-compose configurations, clarifies documentation, and standardizes naming conventions. New files reference centralized configuration locations, and service images and networking are pinned and made explicit for improved deployment control and monitoring integration.

Changes

| File(s) | Change Summary |
|---|---|
| components/metrics/README.md, components/metrics/src/bin/mock_worker.rs | Updated documentation for clarity, naming consistency (e.g., MyComponent), and expanded mock worker usage; minor code update to match component name casing. |
| deploy/metrics/README.md, deploy/metrics/docker-compose.yml | Documentation and configuration improved: dashboard image updated, dashboard files reorganized under grafana_dashboards/, volume mounts simplified, and comments added for firewall and credentials. |
| deploy/metrics/grafana_dashboards/grafana-dynamo-dashboard.json | Added new Grafana dashboard JSON for composite Dynamo, DCGM, NATS, and etcd metrics visualization. |
| deploy/metrics/prometheus.yml | Added new Prometheus scrape jobs for demo services, with explanatory comments and updated targets. |
| lib/runtime/docker-compose.yml | Overhauled docker-compose: explicit version pinning, new bridge networks, added monitoring/exporter services, moved Prometheus/Grafana to bridged networking, updated ports, dependencies, and environment variables. |
| lib/runtime/grafana-datasources.yml, lib/runtime/grafana_dashboards, lib/runtime/metrics/prometheus.yml | New files referencing shared configuration/data locations for Grafana datasources, dashboards, and Prometheus config. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Grafana
    participant Prometheus
    participant Exporters (NATS/DCGM)
    participant Dynamo Metrics Service

    User->>Grafana: Access dashboards (port 3001)
    Grafana->>Prometheus: Query metrics
    Prometheus->>Exporters (NATS/DCGM): Scrape metrics (per config)
    Prometheus->>Dynamo Metrics Service: Scrape /metrics endpoint
    Exporters (NATS/DCGM)-->>Prometheus: Return hardware/software metrics
    Dynamo Metrics Service-->>Prometheus: Return aggregated LLM/Dynamo metrics
    Prometheus-->>Grafana: Serve metrics data
    Grafana-->>User: Display visualizations
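
The scrape flow in the diagram can be sketched as a minimal Prometheus configuration. This is an illustration only: the job names, service hostnames, and ports below are assumptions (9400 and 7777 are the usual defaults for dcgm-exporter and prometheus-nats-exporter, and `dynamo-metrics:9091` is hypothetical). The values actually used by this PR are in deploy/metrics/prometheus.yml.

```yaml
# Sketch only: job names, hostnames, and ports are illustrative assumptions,
# not copied from deploy/metrics/prometheus.yml.
global:
  scrape_interval: 10s

scrape_configs:
  # GPU hardware metrics from the DCGM exporter
  - job_name: dcgm-exporter
    static_configs:
      - targets: ["dcgm-exporter:9400"]

  # NATS server metrics via the Prometheus NATS exporter
  - job_name: nats-exporter
    static_configs:
      - targets: ["nats-prometheus-exporter:7777"]

  # etcd serves its own /metrics on the client port
  - job_name: etcd
    static_configs:
      - targets: ["etcd-server:2379"]

  # Aggregated LLM/Dynamo metrics from the Dynamo metrics service
  - job_name: dynamo-metrics
    metrics_path: /metrics
    static_configs:
      - targets: ["dynamo-metrics:9091"]  # hypothetical host and port
```

Grafana then needs only a single Prometheus datasource, since every exporter is reached through Prometheus rather than queried directly.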

Possibly related PRs

Poem

In the warren of metrics, dashboards bloom bright,
With Prometheus and Grafana, our data takes flight.
Networks now bridged, exporters in tow,
Dynamo’s pulse beats where the dashboards glow.
🐰 With configs aligned and services in sync,
This monitoring garden is fresher than you think!

Warning

Review ran into problems

🔥 Problems

Check-run timed out after 90 seconds. Some checks/pipelines were still in progress when the timeout was reached. Consider increasing the reviews.tools.github-checks.timeout_ms value in your CodeRabbit configuration to allow more time for checks to complete.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🔭 Outside diff range comments (2)
deploy/metrics/docker-compose.yml (1)

90-101: Prometheus tag v3.4.1 is invalid – will fail to pull.

Up-stream prom/prometheus has not published any v3.* images (latest stable is still v2.*). Compose-up will error with manifest unknown.
Use a valid tag (e.g. v2.52.0) or pin to latest if you really need the bleeding edge.

-    image: prom/prometheus:v3.4.1
+    # NOTE: Prometheus has not released v3 – use a valid v2 tag.
+    image: prom/prometheus:v2.52.0
lib/runtime/docker-compose.yml (1)

90-108: Same invalid Prometheus tag as in deploy/metrics – compose up will fail.

Please align with the fix suggested earlier.

-    image: prom/prometheus:v3.4.1
+    image: prom/prometheus:v2.52.0
🧹 Nitpick comments (11)
lib/runtime/metrics/prometheus.yml (1)

36-41: host.docker.internal is not portable across all Docker hosts
Linux installations prior to Docker 20.10 or custom DNS setups won’t resolve this hostname, causing the llm-demo scrape job to fail silently. Consider parameterising the target or documenting the requirement explicitly.

lib/runtime/grafana-datasources.yml (1)

16-24: Optional: expose datasource URL via environment variable

Hard-coding http://prometheus:9090 couples the runtime image to an internal Docker DNS name.
Expose it through an env-substitution (${PROM_URL:http://prometheus:9090}) so integrators can override without editing the file.
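
A minimal sketch of this override, assuming a hypothetical `PROM_URL` variable and a Compose service named `grafana`. Grafana provisioning files interpolate `$VAR` / `${VAR}`; to my knowledge they do not accept an inline default, so the fallback value is supplied from the Compose side here.

```yaml
# --- lib/runtime/grafana-datasources.yml (sketch) ---
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # PROM_URL is a hypothetical variable name; Grafana expands it at startup.
    url: ${PROM_URL}
    isDefault: true
---
# --- docker-compose.yml fragment (sketch) ---
services:
  grafana:
    environment:
      # Compose supplies the default when PROM_URL is unset on the host.
      - PROM_URL=${PROM_URL:-http://prometheus:9090}
```

With this arrangement, integrators can run `PROM_URL=http://other-host:9090 docker compose up` without editing either file.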

deploy/metrics/README.md (1)

111-115: Markdown bullet list renders as a single wrapped paragraph

Empty line between the list items and preceding paragraph is missing, so GitHub will not interpret the dashes as a list.

-## Required Files
-The following configuration files should be present in this directory:
+## Required Files
+
+The following configuration files should be present in this directory:

Minor but improves readability.

deploy/metrics/prometheus.yml (1)

36-50: Scraping host services from inside container relies on host.docker.internal – not portable

host.docker.internal resolves out of the box only on Docker Desktop (macOS/Windows); on Linux with plain Docker Engine it typically requires an extra --add-host entry or a bridge alias. Users following the README on plain Docker Engine will get scrape failures.

Consider:

  1. Instructing users to run Prometheus with extra_hosts: ["host.docker.internal:host-gateway"] (already in compose?)
  2. Or replace targets with environment-interpolated addresses (${HOST_IP}:8000).

Add a short note in the README / comments so users are not trapped.
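
Option 1 above can be sketched as the following Compose fragment. The service name `prometheus` is an assumption, and the `host-gateway` value requires Docker Engine 20.10 or newer.

```yaml
# docker-compose.yml fragment (sketch; service name is assumed)
services:
  prometheus:
    extra_hosts:
      # Maps host.docker.internal to the host's gateway IP on Linux,
      # matching what Docker Desktop provides automatically.
      - "host.docker.internal:host-gateway"
```

The scrape targets in prometheus.yml can then keep using `host.docker.internal:<port>` unchanged on both Docker Desktop and plain Docker Engine.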

deploy/metrics/grafana_dashboards/grafana-dynamo-dashboard.json (1)

1-924: Dashboard JSON should live under source control but large autogenerated blobs hinder diffs

Storing the full rendered JSON (924 lines) makes reviews painful and small tweaks unreadable.
A common pattern is:

  1. Keep the JSON in Grafana and export only on releases.
  2. Or commit a trimmed, formatted version generated with grafonnet, grafana-builder, or helm jsonnet, so diffs stay semantic.

Not blocking, but consider tooling to generate dashboards programmatically in future PRs.

deploy/metrics/docker-compose.yml (2)

87-89: Typo in comment (“te firewall”).

Minor but worth fixing to keep docs professional.

-  # To access Prometheus from another machine, you may need to disable te firewall on your host. On Ubuntu:
+  # To access Prometheus from another machine, you may need to disable the firewall on your host. On Ubuntu:

124-127: Mount dashboards as read-only unless write access is required.

Grafana never needs to write to provisioning JSON at runtime. Using :ro prevents accidental modifications from inside the container.

-      - ./grafana_dashboards:/etc/grafana/provisioning/dashboards:rw
+      - ./grafana_dashboards:/etc/grafana/provisioning/dashboards:ro
components/metrics/README.md (2)

9-12: Missing commas and double hyphen break the sentence.

-This is a demo implementation. The metrics component is currently under active development and this documentation will change as the implementation evolves.
-- In this demo the metrics names use the prefix "llm", but in production they will be prefixed with "nv_llm" (e.g., the HTTP `/metrics` endpoint will serve metrics with "nv_llm" prefixes)
-- This demo will only work when using examples/llm/configs/agg.yml-- other configurations will not work
+This is a demo implementation, and the component is under active development; the documentation will evolve accordingly.
+* In this demo the metric names use the prefix `llm`, but in production they will be prefixed with `nv_llm` (e.g., the `/metrics` endpoint will serve metrics with `nv_llm` prefixes).
+* This demo only works with `examples/llm/configs/agg.yml` — other configurations are not yet supported.

58-64: Grammar: “a mock workers”.

-Step 1: Launch a mock workers via the following command (if already built):
+Step 1: Launch a mock worker via the following command (if already built):
lib/runtime/docker-compose.yml (2)

124-126: Dashboard mount should be read-only for runtime safety.

-      - ./grafana_dashboards:/etc/grafana/provisioning/dashboards:rw
+      - ./grafana_dashboards:/etc/grafana/provisioning/dashboards:ro

16-24: High maintenance overhead keeping two compose files “in sync”.

Consider extracting the common monitoring stack into a single compose file and using Compose’s extends or a CI check to prevent drift.
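
One way the `extends` approach might look is sketched below. It assumes both files define services named `prometheus` and `grafana`, which is not confirmed here, and it reuses the relative path that the lib/runtime symlinks already point at.

```yaml
# lib/runtime/docker-compose.yml (sketch; service names are assumed)
services:
  prometheus:
    extends:
      file: ../../deploy/metrics/docker-compose.yml
      service: prometheus

  grafana:
    extends:
      file: ../../deploy/metrics/docker-compose.yml
      service: grafana
    # Runtime-specific overrides, if any, go here; everything else
    # is inherited from the shared definition.
```

Any change to the shared definitions in deploy/metrics would then apply to both stacks, so there is no second copy to drift.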

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1630f8b and d81e8a1.

⛔ Files ignored due to path filters (1)
  • deploy/metrics/grafana-dynamo-composite.png is excluded by !**/*.png
📒 Files selected for processing (10)
  • components/metrics/README.md (8 hunks)
  • components/metrics/src/bin/mock_worker.rs (1 hunks)
  • deploy/metrics/README.md (1 hunks)
  • deploy/metrics/docker-compose.yml (3 hunks)
  • deploy/metrics/grafana_dashboards/grafana-dynamo-dashboard.json (1 hunks)
  • deploy/metrics/prometheus.yml (1 hunks)
  • lib/runtime/docker-compose.yml (2 hunks)
  • lib/runtime/grafana-datasources.yml (1 hunks)
  • lib/runtime/grafana_dashboards (1 hunks)
  • lib/runtime/metrics/prometheus.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
components/metrics/src/bin/mock_worker.rs (1)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
components/metrics/README.md (1)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
deploy/metrics/docker-compose.yml (1)
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.
🪛 LanguageTool
components/metrics/README.md

[uncategorized] ~9-~9: Possible missing comma found.
Context: ...ics component is currently under active development and this documentation will change as t...

(AI_HYDRA_LEO_MISSING_COMMA)


[typographical] ~10-~10: It appears that a comma is missing.
Context: ...s the implementation evolves. - In this demo the metrics names use the prefix "llm",...

(DURING_THAT_TIME_COMMA)


[grammar] ~58-~58: The plural noun “workers” cannot be used with the article “a”. Did you mean “a mock worker” or “mock workers”?
Context: ...ate event-based metrics Step 1: Launch a mock workers via the following command (if already b...

(A_NNS)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (3)
lib/runtime/grafana_dashboards (1)

1-1: Confirmed: lib/runtime/grafana_dashboards is a valid symlink

  • git ls-files -s lib/runtime/grafana_dashboards reports mode 120000
  • ls -l shows it points to ../../deploy/metrics/grafana_dashboards
  • Relative path resolves correctly from lib/runtime

No further changes needed.

deploy/metrics/README.md (1)

103-103: Verify asset path

grafana-dynamo-composite.png must exist in deploy/metrics/. Missing assets render as broken images in GitHub & Docs sites.

components/metrics/README.md (1)

130-137: Keep example output consistent with earlier changes.

The sample Prometheus metrics still show the llm_ prefix even after the earlier note that production will switch to nv_llm. Consider adding a footnote or updating the sample to avoid confusion.

@keivenchang keivenchang enabled auto-merge (squash) July 8, 2025 01:47
@keivenchang keivenchang merged commit ebd2336 into main Jul 8, 2025
9 checks passed
@keivenchang keivenchang deleted the keivenchang/FT-grafana-metrics_DYN-678 branch July 8, 2025 02:24
atchernych pushed a commit that referenced this pull request Jul 9, 2025
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>