-
Notifications
You must be signed in to change notification settings - Fork 690
fix: fix metrics docs; add dcgm-exporter #2712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe guide updates Kubernetes metrics documentation to use kube-prometheus-stack instead of Prometheus Operator, adds DCGM exporter instructions, adjusts Helm values for PodMonitor discovery, revises Prometheus/Grafana port-forward and credential steps, and updates dashboard and GPU metrics references accordingly. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Poem
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (6)
docs/guides/dynamo_deploy/k8s_metrics.md (6)
12-14: Clarify install target namespace and release naming up-front.You mention kube-prometheus-stack correctly includes Prometheus Operator. To avoid later confusion with -n monitoring usages, explicitly state the intended namespace (monitoring) and release name (prometheus) here, or introduce env vars (e.g., MON_NS, RELEASE). See helm command fix below.
31-33: Tighten the Note to pin the assumed release/namespace.Reduce ambiguity by calling out the exact assumptions the rest of the guide makes.
-> The commands enumerated below assume you have installed the kube-prometheus-stack with the installation method listed above. Depending on your installation configuration of the monitoring stack, you may need to modify the `kubectl` commands that follow in this document accordingly (e.g modifying Namespace or Service names accordingly). +> The commands below assume a Helm release name of `prometheus` in the `monitoring` namespace, installed with the exact flags shown above. If you used different names, adjust subsequent `kubectl`/`port-forward` commands accordingly (e.g., namespaces and service names).
34-44: Polish DCGM section: fix typo and tighten phrasing + install hint.
- Spelling: “relataed” → “related”.
- Style: avoid repeated “you need to”.
- Optional: call out that DaemonSet names may vary (dcgm-exporter or nvidia-dcgm-exporter).
-### DCGM Metrics Collection (Optional) - -GPU utilization metrics are collected and exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization relataed to your Dynamo deployment. For that panel to be populated, you need to ensure that the dcgm-exporter is running in your cluster. To check if the dcgm-exporter is running, please run the following command: +### DCGM Metrics Collection (Optional) + +GPU utilization metrics are exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization related to your Dynamo deployment. To populate that panel, ensure dcgm-exporter is running in your cluster. Check with: @@ -If the output is empty, you need to install the dcgm-exporter. For more information, please consult the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html). +If the output is empty, install dcgm-exporter; see the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html). Note: depending on how it’s installed, the DaemonSet may be named `dcgm-exporter` or `nvidia-dcgm-exporter`.
206-206: Service name depends on Helm release; add a quick sanity check.With the release name
prometheusand namespacemonitoring, this is correct. If users changed either, the service name changes. Suggest adding a one-liner to discover the service dynamically:-kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring +kubectl -n monitoring get svc | grep kube-prometheus-prometheus +# If the service name differs, substitute it below: +kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
217-225: Fix typo, and avoid printing passwords to stdout.
- “credss” → “credentials”.
- Avoid echoing admin password in logs/scrollback. Export vars silently and proceed to port-forward.
-# Get Grafana credss +# Get Grafana credentials export GRAFANA_USER=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-user}" | base64 --decode) export GRAFANA_PASSWORD=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode) -echo "Grafana user: $GRAFANA_USER" -echo "Grafana password: $GRAFANA_PASSWORD" +echo "Grafana user: $GRAFANA_USER" +# Password stored in $GRAFANA_PASSWORD (not echoed for security) # Port forward Grafana service kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
227-230: Replace bare URL and clarify where to find the dashboard.Satisfies markdownlint (MD034) and improves readability.
-Visit http://localhost:3000 and log in with the credentials captured above. +Visit [http://localhost:3000](http://localhost:3000) and log in with the credentials captured above. -Once logged in, find the Dynamo dashboard under General. +Once logged in, find the “Dynamo” dashboard under the “General” folder.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
docs/guides/dynamo_deploy/k8s_metrics.md(4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/guides/dynamo_deploy/k8s_metrics.md
[grammar] ~36-~36: Ensure spelling is correct
Context: ...rd includes a panel for GPU utilization relataed to your Dynamo deployment. For that pan...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[style] ~42-~42: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...porter ``` If the output is empty, you need to install the dcgm-exporter. For more inf...
(REP_NEED_TO_VB)
🪛 markdownlint-cli2 (0.17.2)
docs/guides/dynamo_deploy/k8s_metrics.md
227-227: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
docs/guides/dynamo_deploy/k8s_metrics.md (2)
5-5: Good switch to kube-prometheus-stack; concise context.The overview accurately frames PodMonitor-based discovery with kube-prometheus-stack. No action needed.
200-200: Nice addition calling out GPU utilization via DCGM.Helps set expectations for the optional DCGM step. No changes needed.
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu>
whoisj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: ayushag <ayushag@nvidia.com>
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: Jason Zhou <jasonzho@jasonzho-mlt.client.nvidia.com>
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
Signed-off-by: mohammedabdulwahhab <furkhan324@berkeley.edu> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
Overview:
Fixes for metrics docs:
closes https://linear.app/nvidia/issue/DIS-403/add-dcgm-metricssection-to-k8s-prometheusgrafana-guide
Summary by CodeRabbit