@mohammedabdulwahhab mohammedabdulwahhab commented Aug 26, 2025

Overview:

Fixes for metrics docs:

  • Adds a fix to ensure PodMonitors are picked up by Prometheus
  • Adds a section for dcgm-exporter

closes https://linear.app/nvidia/issue/DIS-403/add-dcgm-metricssection-to-k8s-prometheusgrafana-guide

Summary by CodeRabbit

  • Documentation
    • Updated guide to use kube-prometheus-stack instead of Prometheus Operator, with revised prerequisites and install flow.
    • Added Helm values for enabling PodMonitors across namespaces.
    • Introduced optional DCGM metrics collection guidance and updated dashboard text for DCGM-based GPU metrics.
    • Updated Prometheus and Grafana access steps, including new service names, credential retrieval from secrets, and login workflow.
    • Adjusted port-forward targets and commands for consistency with kube-prometheus-stack.
    • Minor text refinements for clarity and accuracy.

copy-pr-bot bot commented Aug 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Aug 26, 2025

Walkthrough

This PR updates the Kubernetes metrics guide to use kube-prometheus-stack instead of Prometheus Operator, adds DCGM exporter instructions, adjusts Helm values for PodMonitor discovery, revises the Prometheus/Grafana port-forward and credential steps, and updates the dashboard and GPU metrics references accordingly.

Changes

Cohort / File(s) | Summary
Docs: kube-prometheus-stack migration & DCGM (docs/guides/dynamo_deploy/k8s_metrics.md) | Replaced Prometheus Operator with kube-prometheus-stack; added Helm values for PodMonitor selection; introduced optional DCGM exporter section; updated Prometheus and Grafana port-forward targets; changed Grafana credential retrieval and login steps; revised dashboard text for DCGM GPU metrics; minor command/notes adjustments.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I twitch my ears at graphs anew,
Prom stacks hop from old to new.
DCGM carrots glow in night, 🥕
GPUs purr in dashboard light.
Secrets whispered, ports unfurled—
I bound through charts, a metrics world.
Thump-thump: alerts are tightly curled.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (6)
docs/guides/dynamo_deploy/k8s_metrics.md (6)

12-14: Clarify install target namespace and release naming up-front.

You correctly mention that kube-prometheus-stack includes Prometheus Operator. To avoid later confusion with the -n monitoring usages, explicitly state the intended namespace (monitoring) and release name (prometheus) here, or introduce env vars (e.g., MON_NS, RELEASE). See helm command fix below.
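
For illustration, the env-var approach might look like the sketch below. `MON_NS` and `RELEASE` are the example variable names from this comment, and `install_monitoring_stack` is a hypothetical helper. The `podMonitorSelectorNilUsesHelmValues` setting is a real kube-prometheus-stack chart value, but whether the guide uses exactly this flag is an assumption.

```shell
# Assumed defaults matching the names used in this review.
MON_NS="${MON_NS:-monitoring}"
RELEASE="${RELEASE:-prometheus}"

# Hypothetical helper: installs kube-prometheus-stack under the chosen release
# and namespace. Defined as a function so sourcing this file does not touch
# the cluster.
install_monitoring_stack() {
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update
  # Setting this value to false lets Prometheus select PodMonitors that do not
  # carry the Helm release labels.
  helm install "$RELEASE" prometheus-community/kube-prometheus-stack \
    --namespace "$MON_NS" --create-namespace \
    --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
}
```

Later commands in the guide could then reference `-n "$MON_NS"` instead of a hard-coded namespace.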


31-33: Tighten the Note to pin the assumed release/namespace.

Reduce ambiguity by calling out the exact assumptions the rest of the guide makes.

-> The commands enumerated below assume you have installed the kube-prometheus-stack with the installation method listed above. Depending on your installation configuration of the monitoring stack, you may need to modify the `kubectl` commands that follow in this document accordingly (e.g modifying Namespace or Service names accordingly).
+> The commands below assume a Helm release name of `prometheus` in the `monitoring` namespace, installed with the exact flags shown above. If you used different names, adjust subsequent `kubectl`/`port-forward` commands accordingly (e.g., namespaces and service names).

34-44: Polish DCGM section: fix typo and tighten phrasing + install hint.

  • Spelling: “relataed” → “related”.
  • Style: avoid repeated “you need to”.
  • Optional: call out that DaemonSet names may vary (dcgm-exporter or nvidia-dcgm-exporter).
-### DCGM Metrics Collection (Optional)
-
-GPU utilization metrics are collected and exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization relataed to your Dynamo deployment. For that panel to be populated, you need to ensure that the dcgm-exporter is running in your cluster. To check if the dcgm-exporter is running, please run the following command:
+### DCGM Metrics Collection (Optional)
+
+GPU utilization metrics are exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization related to your Dynamo deployment. To populate that panel, ensure dcgm-exporter is running in your cluster. Check with:
@@
-If the output is empty, you need to install the dcgm-exporter. For more information, please consult the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html).
+If the output is empty, install dcgm-exporter; see the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html). Note: depending on how it’s installed, the DaemonSet may be named `dcgm-exporter` or `nvidia-dcgm-exporter`.
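
A rough sketch of a check that tolerates either DaemonSet name follows. Both function names are hypothetical, and the match is a plain substring search on `dcgm-exporter`, so it is a heuristic rather than the guide's own command.

```shell
# Reads `kubectl get daemonsets` output on stdin and succeeds if any line
# mentions dcgm-exporter (this also matches nvidia-dcgm-exporter).
has_dcgm_daemonset() {
  grep -q 'dcgm-exporter'
}

# Hypothetical wrapper: lists DaemonSets in all namespaces and reports
# whether a dcgm-exporter DaemonSet is among them.
check_dcgm_exporter() {
  if kubectl get daemonsets --all-namespaces --no-headers | has_dcgm_daemonset; then
    echo "dcgm-exporter DaemonSet found"
  else
    echo "dcgm-exporter DaemonSet not found; see the dcgm-exporter documentation" >&2
    return 1
  fi
}
```

Running `check_dcgm_exporter` against the cluster returns non-zero when no matching DaemonSet is listed.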

206-206: Service name depends on Helm release; add a quick sanity check.

With the release name prometheus and namespace monitoring, this is correct. If users changed either, the service name changes. Suggest adding a one-liner to discover the service dynamically:

-kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
+kubectl -n monitoring get svc | grep kube-prometheus-prometheus
+# If the service name differs, substitute it below:
+kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
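
If a scripted lookup is preferred over reading the grep output by eye, something like the following could work. Both function names are hypothetical, and picking the service whose name ends in `-prometheus` is a heuristic based on the service name shown above, not a guarantee for every release name.

```shell
# Reads `kubectl get svc -o name` output on stdin and prints the first
# service whose name ends in "-prometheus".
pick_prometheus_service() {
  grep -E -- '-prometheus$' | head -n 1
}

# Hypothetical wrapper: resolves the service in the given namespace
# (default: monitoring) and port-forwards to it.
forward_prometheus() {
  ns="${1:-monitoring}"
  svc="$(kubectl get svc -n "$ns" -o name | pick_prometheus_service)"
  if [ -z "$svc" ]; then
    echo "no Prometheus service found in namespace $ns" >&2
    return 1
  fi
  # `-o name` yields service/<name>, which port-forward accepts directly.
  kubectl port-forward -n "$ns" "$svc" 9090:9090
}
```

Calling `forward_prometheus monitoring` would then forward local port 9090 to whichever service was found.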

217-225: Fix typo, and avoid printing passwords to stdout.

  • “credss” → “credentials”.
  • Avoid echoing admin password in logs/scrollback. Export vars silently and proceed to port-forward.
-# Get Grafana credss
+# Get Grafana credentials
 export GRAFANA_USER=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-user}" | base64 --decode)
 export GRAFANA_PASSWORD=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode)
-echo "Grafana user: $GRAFANA_USER"
-echo "Grafana password: $GRAFANA_PASSWORD"
+echo "Grafana user: $GRAFANA_USER"
+# Password stored in $GRAFANA_PASSWORD (not echoed for security)
 
 # Port forward Grafana service
 kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring

227-230: Replace bare URL and clarify where to find the dashboard.

Satisfies markdownlint (MD034) and improves readability.

-Visit http://localhost:3000 and log in with the credentials captured above.
+Visit [http://localhost:3000](http://localhost:3000) and log in with the credentials captured above.
 
-Once logged in, find the Dynamo dashboard under General.
+Once logged in, find the “Dynamo” dashboard under the “General” folder.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 766d3f2 and 8317f9d.

📒 Files selected for processing (1)
  • docs/guides/dynamo_deploy/k8s_metrics.md (4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/guides/dynamo_deploy/k8s_metrics.md

[grammar] ~36-~36: Ensure spelling is correct
Context: ...rd includes a panel for GPU utilization relataed to your Dynamo deployment. For that pan...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[style] ~42-~42: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...porter ``` If the output is empty, you need to install the dcgm-exporter. For more inf...

(REP_NEED_TO_VB)

🪛 markdownlint-cli2 (0.17.2)
docs/guides/dynamo_deploy/k8s_metrics.md

227-227: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
docs/guides/dynamo_deploy/k8s_metrics.md (2)

5-5: Good switch to kube-prometheus-stack; concise context.

The overview accurately frames PodMonitor-based discovery with kube-prometheus-stack. No action needed.


200-200: Nice addition calling out GPU utilization via DCGM.

Helps set expectations for the optional DCGM step. No changes needed.

@mohammedabdulwahhab mohammedabdulwahhab changed the title from "fix: fix metrics docs" to "fix: fix metrics docs; add dcgm-exporter" Aug 26, 2025
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
@mohammedabdulwahhab mohammedabdulwahhab enabled auto-merge (squash) August 26, 2025 18:17
@whoisj whoisj left a comment

LGTM

@mohammedabdulwahhab mohammedabdulwahhab merged commit 6cf96e0 into main Aug 26, 2025
9 checks passed
@mohammedabdulwahhab mohammedabdulwahhab deleted the mabdulwahhab/metrics-docs-fixes branch August 26, 2025 20:14
mohammedabdulwahhab added a commit that referenced this pull request Aug 26, 2025
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
hhzhang16 pushed a commit that referenced this pull request Aug 27, 2025
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
ayushag-nv pushed a commit that referenced this pull request Aug 27, 2025
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Signed-off-by: ayushag <ayushag@nvidia.com>
jasonqinzhou pushed a commit that referenced this pull request Aug 30, 2025
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Signed-off-by: Jason Zhou <jasonzho@jasonzho-mlt.client.nvidia.com>
KrishnanPrash pushed a commit that referenced this pull request Sep 2, 2025
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>