fix: fix metrics docs; add dcgm-exporter #2712

mohammedabdulwahhab · 2025-08-26T17:48:16Z

Overview:

Fixes for metrics docs:

- Adds a fix to ensure PodMonitors are picked up.
- Adds a section for dcgm-exporter.

closes https://linear.app/nvidia/issue/DIS-403/add-dcgm-metricssection-to-k8s-prometheusgrafana-guide

Summary by CodeRabbit

Documentation
- Updated guide to use kube-prometheus-stack instead of Prometheus Operator, with revised prerequisites and install flow.
- Added Helm values for enabling PodMonitors across namespaces.
- Introduced optional DCGM metrics collection guidance and updated dashboard text for DCGM-based GPU metrics.
- Updated Prometheus and Grafana access steps, including new service names, credential retrieval from secrets, and login workflow.
- Adjusted port-forward targets and commands for consistency with kube-prometheus-stack.
- Minor text refinements for clarity and accuracy.

copy-pr-bot · 2025-08-26T17:48:19Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-08-26T17:54:13Z

Walkthrough

The guide updates Kubernetes metrics documentation to use kube-prometheus-stack instead of Prometheus Operator, adds DCGM exporter instructions, adjusts Helm values for PodMonitor discovery, revises Prometheus/Grafana port-forward and credential steps, and updates dashboard and GPU metrics references accordingly.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Docs: kube-prometheus-stack migration & DCGM `docs/guides/dynamo_deploy/k8s_metrics.md` | Replaced Prometheus Operator with kube-prometheus-stack; added Helm values for PodMonitor selection; introduced optional DCGM exporter section; updated Prometheus and Grafana port-forward targets; changed Grafana credential retrieval and login steps; revised dashboard text for DCGM GPU metrics; minor command/notes adjustments. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

- feat: FT enable DCGM and optional Prometheus and Grafana, plus fixes #1488. Introduces DCGM-based GPU metrics and aligns Prometheus/Grafana targets, matching this guide’s DCGM and stack updates.
- feat: add a new composite SW/HW grafana (DYN-678) #1788. Adds monitoring assets and scrape configs for kube-prometheus-stack; complements the doc’s new install and access steps.
- docs: exporting metrics in k8s (dep-302) #2271. Updates operators, PodMonitors, and dashboards; directly related to the guide’s PodMonitor and Grafana instructions.

Poem

I twitch my ears at graphs anew,
Prom stacks hop from old to new.
DCGM carrots glow in night, 🥕
GPUs purr in dashboard light.
Secrets whispered, ports unfurled—
I bound through charts, a metrics world.
Thump-thump: alerts are tightly curled.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (6)

docs/guides/dynamo_deploy/k8s_metrics.md (6)
12-14: Clarify install target namespace and release naming up-front.

You correctly note that kube-prometheus-stack includes Prometheus Operator. To avoid later confusion with the `-n monitoring` usages, explicitly state the intended namespace (`monitoring`) and release name (`prometheus`) here, or introduce env vars (e.g., `MON_NS`, `RELEASE`). See helm command fix below.
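One way the env-var suggestion could look, as a minimal sketch rather than the exact command from the guide: the variable names `MON_NS` and `RELEASE` come from the comment above, the function name `install_monitoring_stack` is hypothetical, and the single `--set` flag is an assumption about how the PodMonitor fix is expressed in the kube-prometheus-stack chart values.

```shell
# Defaults match the release name and namespace the guide assumes;
# override either variable before calling the function.
MON_NS="${MON_NS:-monitoring}"
RELEASE="${RELEASE:-prometheus}"

# Hypothetical helper: installs kube-prometheus-stack and tells Prometheus
# not to restrict PodMonitor selection to the Helm release's own labels.
install_monitoring_stack() {
  helm install "$RELEASE" prometheus-community/kube-prometheus-stack \
    --namespace "$MON_NS" --create-namespace \
    --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
}

# Usage (assumes the prometheus-community Helm repo has already been added):
#   install_monitoring_stack
```

With the defaults, later commands can keep using `-n monitoring` and the `prometheus-` service-name prefix unchanged.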

31-33: Tighten the Note to pin the assumed release/namespace.

Reduce ambiguity by calling out the exact assumptions the rest of the guide makes.
-> The commands enumerated below assume you have installed the kube-prometheus-stack with the installation method listed above. Depending on your installation configuration of the monitoring stack, you may need to modify the `kubectl` commands that follow in this document accordingly (e.g modifying Namespace or Service names accordingly).
+> The commands below assume a Helm release name of `prometheus` in the `monitoring` namespace, installed with the exact flags shown above. If you used different names, adjust subsequent `kubectl`/`port-forward` commands accordingly (e.g., namespaces and service names).
34-44: Polish DCGM section: fix typo and tighten phrasing + install hint.

Spelling: “relataed” → “related”.

Style: avoid repeated “you need to”.

Optional: call out that DaemonSet names may vary (dcgm-exporter or nvidia-dcgm-exporter).
-### DCGM Metrics Collection (Optional)
-
-GPU utilization metrics are collected and exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization relataed to your Dynamo deployment. For that panel to be populated, you need to ensure that the dcgm-exporter is running in your cluster. To check if the dcgm-exporter is running, please run the following command:
+### DCGM Metrics Collection (Optional)
+
+GPU utilization metrics are exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization related to your Dynamo deployment. To populate that panel, ensure dcgm-exporter is running in your cluster. Check with:
@@
-If the output is empty, you need to install the dcgm-exporter. For more information, please consult the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html).
+If the output is empty, install dcgm-exporter; see the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html). Note: depending on how it’s installed, the DaemonSet may be named `dcgm-exporter` or `nvidia-dcgm-exporter`.
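The check described in this suggestion could be sketched as follows. This is an illustration only, not the command from the guide (which is elided in the diff above): the function name `dcgm_exporter_present` is hypothetical, and matching on the substring `dcgm-exporter` is an assumption chosen to cover both DaemonSet names mentioned in the note.

```shell
# Hypothetical helper: exits 0 if any DaemonSet in any namespace has a name
# containing "dcgm-exporter" (this also matches "nvidia-dcgm-exporter").
# It only checks that the DaemonSet exists, not that its pods are ready.
dcgm_exporter_present() {
  kubectl get daemonsets --all-namespaces --no-headers 2>/dev/null \
    | grep -q 'dcgm-exporter'
}

# Usage:
#   if dcgm_exporter_present; then
#     echo "dcgm-exporter DaemonSet found"
#   else
#     echo "dcgm-exporter not found; see the dcgm-exporter documentation"
#   fi
```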
206-206: Service name depends on Helm release; add a quick sanity check.

With the release name prometheus and namespace monitoring, this is correct. If users changed either, the service name changes. Suggest adding a one-liner to discover the service dynamically:
-kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
+kubectl -n monitoring get svc | grep kube-prometheus-prometheus
+# If the service name differs, substitute it below:
+kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
217-225: Fix typo, and avoid printing passwords to stdout.

“credss” → “credentials”.

Avoid echoing admin password in logs/scrollback. Export vars silently and proceed to port-forward.
-# Get Grafana credss
+# Get Grafana credentials
 export GRAFANA_USER=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-user}" | base64 --decode)
 export GRAFANA_PASSWORD=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode)
-echo "Grafana user: $GRAFANA_USER"
-echo "Grafana password: $GRAFANA_PASSWORD"
+echo "Grafana user: $GRAFANA_USER"
+# Password stored in $GRAFANA_PASSWORD (not echoed for security)
 
 # Port forward Grafana service
 kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
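The two near-identical `kubectl get secret` lines could also be folded into one small helper. A sketch under the same assumptions as the suggestion (secret `prometheus-grafana` in namespace `monitoring`); the function name `secret_field` is hypothetical.

```shell
# Hypothetical helper: print one decoded field of a Kubernetes Secret.
# Arguments: namespace, secret name, key under .data
secret_field() {
  kubectl get secret -n "$1" "$2" -o jsonpath="{.data.$3}" | base64 --decode
}

# Usage, with the names assumed above:
#   GRAFANA_USER="$(secret_field monitoring prometheus-grafana admin-user)"
#   GRAFANA_PASSWORD="$(secret_field monitoring prometheus-grafana admin-password)"
#   export GRAFANA_USER GRAFANA_PASSWORD   # nothing is echoed to the terminal
```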
227-230: Replace bare URL and clarify where to find the dashboard.

Satisfies markdownlint (MD034) and improves readability.
-Visit http://localhost:3000 and log in with the credentials captured above.
+Visit [http://localhost:3000](http://localhost:3000) and log in with the credentials captured above.
 
-Once logged in, find the Dynamo dashboard under General.
+Once logged in, find the “Dynamo” dashboard under the “General” folder.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 766d3f2 and 8317f9d.

📒 Files selected for processing (1)

docs/guides/dynamo_deploy/k8s_metrics.md (4 hunks)

🧰 Additional context used

🪛 LanguageTool

docs/guides/dynamo_deploy/k8s_metrics.md

[grammar] ~36-~36: Ensure spelling is correct
Context: ...rd includes a panel for GPU utilization relataed to your Dynamo deployment. For that pan...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

[style] ~42-~42: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...porter ``` If the output is empty, you need to install the dcgm-exporter. For more inf...

(REP_NEED_TO_VB)

🪛 markdownlint-cli2 (0.17.2)

docs/guides/dynamo_deploy/k8s_metrics.md

227-227: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build and Test - dynamo

🔇 Additional comments (2)

docs/guides/dynamo_deploy/k8s_metrics.md (2)

5-5: Good switch to kube-prometheus-stack; concise context.

The overview accurately frames PodMonitor-based discovery with kube-prometheus-stack. No action needed.

200-200: Nice addition calling out GPU utilization via DCGM.

Helps set expectations for the optional DCGM step. No changes needed.

whoisj

LGTM

Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: Hannah Zhang <hannahz@nvidia.com>

Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: ayushag <ayushag@nvidia.com>

Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: Jason Zhou <jasonzho@jasonzho-mlt.client.nvidia.com>

Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>

Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: nnshah1 <neelays@nvidia.com>

fix: fix

8317f9d

pull-request-size bot added the size/M label Aug 26, 2025

github-actions bot added the fix label Aug 26, 2025

coderabbitai bot reviewed Aug 26, 2025

mohammedabdulwahhab changed the title from "fix: fix metrics docs" to "fix: fix metrics docs; add dcgm-exporter" Aug 26, 2025

fix: fix

8edafc9

julienmancuso approved these changes Aug 26, 2025

Merge branch 'main' into mabdulwahhab/metrics-docs-fixes

6c9025a

rmccorm4 reviewed Aug 26, 2025

rmccorm4 approved these changes Aug 26, 2025

Update docs/guides/dynamo_deploy/k8s_metrics.md

a640b71

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>

mohammedabdulwahhab commented Aug 26, 2025

Apply suggestions from code review

aaa5690

Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>

mohammedabdulwahhab enabled auto-merge (squash) August 26, 2025 18:17

whoisj approved these changes Aug 26, 2025

Merge branch 'main' into mabdulwahhab/metrics-docs-fixes

106cd4e

mohammedabdulwahhab merged commit 6cf96e0 into main Aug 26, 2025
9 checks passed

mohammedabdulwahhab deleted the mabdulwahhab/metrics-docs-fixes branch August 26, 2025 20:14

mohammedabdulwahhab added a commit that referenced this pull request Aug 26, 2025

fix: fix metrics docs; add dcgm-exporter (#2712)

1b1a049

Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

mohammedabdulwahhab mentioned this pull request Aug 26, 2025

fix: cp metrics docs fix #2720

Merged

ayushag-nv pushed a commit that referenced this pull request Aug 27, 2025

fix: fix metrics docs; add dcgm-exporter (#2712)

27ef0bc

Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: ayushag <ayushag@nvidia.com>

nnshah1 pushed a commit that referenced this pull request Sep 8, 2025

fix: fix metrics docs; add dcgm-exporter (#2712)

b656f09

Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
