You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/guides/dynamo_deploy/k8s_metrics.md
+35-8Lines changed: 35 additions & 8 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -2,15 +2,15 @@
2
2
3
3
## Overview
4
4
5
-
This guide provides a walkthrough for collecting and visualizing metrics from Dynamo components using the Prometheus Operator stack. The Prometheus Operator provides a powerful and flexible way to configure monitoring for Kubernetes applications through custom resources like PodMonitors, making it easy to automatically discover and scrape metrics from Dynamo components.
5
+
This guide provides a walkthrough for collecting and visualizing metrics from Dynamo components using the kube-prometheus-stack. The kube-prometheus-stack provides a powerful and flexible way to configure monitoring for Kubernetes applications through custom resources like PodMonitors, making it easy to automatically discover and scrape metrics from Dynamo components.
6
6
7
7
## Prerequisites
8
8
9
9
### Install Dynamo Operator
10
10
Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](../dynamo_deploy/dynamo_cloud.md) for detailed instructions on deploying the Dynamo operator.
11
11
12
-
### Install Prometheus Operator
13
-
If you don't have an existing Prometheus setup, you'll need to install the Prometheus Operator. The Prometheus Operator introduces custom resources that make it easy to deploy and manage Prometheus monitoring in Kubernetes:
12
+
### Install kube-prometheus-stack
13
+
If you don't have an existing Prometheus setup, you'll likely want to install the kube-prometheus-stack. This is a collection of Kubernetes manifests that includes the Prometheus Operator, Prometheus, Grafana, and other monitoring components in a pre-configured setup. The stack introduces custom resources that make it easy to deploy and manage monitoring in Kubernetes:
14
14
15
15
-`PodMonitor`: Automatically discovers and scrapes metrics from pods based on label selectors
16
16
-`ServiceMonitor`: Similar to PodMonitor but works with Services
> The commands enumerated below assume you have installed the kube-prometheus-stack with the installation method listed above. Depending on your installation configuration of the monitoring stack, you may need to modify the `kubectl` commands that follow in this document accordingly (e.g modifying Namespace or Service names accordingly).
32
+
33
+
### DCGM Metrics Collection (Optional)
34
+
35
+
GPU utilization metrics are collected and exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization related to your Dynamo deployment. For that panel to be populated, you need to ensure that the dcgm-exporter is running in your cluster. To check if the dcgm-exporter is running, please run the following command:
36
+
37
+
```bash
38
+
kubectl get daemonset -A | grep dcgm-exporter
39
+
```
40
+
41
+
If the output is empty, you need to install the dcgm-exporter. For more information, please consult the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html).
42
+
43
+
26
44
## Deploy a DynamoGraphDeployment
27
45
28
46
Let's start by deploying a simple vLLM aggregated deployment:
@@ -178,13 +196,13 @@ The dashboard is embedded in the ConfigMap. Since it is labeled with `grafana_da
0 commit comments