components/metrics/README.md: 33 additions & 25 deletions
@@ -1,8 +1,14 @@
 # Metrics
 
 The `metrics` component is a utility that can collect, aggregate, and publish
-metrics from a Dynamo deployment for use in other applications or visualization
-tools like Prometheus and Grafana.
+metrics from a Dynamo deployment. After collecting and aggregating metrics from
+workers, it exposes them via an HTTP `/metrics` endpoint in Prometheus format
+that other applications or visualization tools like Prometheus server and Grafana can
+pull from.
+
+**Note**: This is a demo implementation. The metrics component is currently under active development and this documentation will change as the implementation evolves.
+- In this demo the metric names use the prefix "llm", but in production they will be prefixed with "nv_llm" (e.g., the HTTP `/metrics` endpoint will serve metrics with "nv_llm" prefixes)
+- This demo only works with `examples/llm/configs/agg.yml`; other configurations will not work
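
Because the demo serves metrics under the "llm" prefix while production will use "nv_llm", a consumer of the `/metrics` output may want to accept both. The following is a minimal sketch of such a filter, not part of the component itself. The function name `select_llm_metrics`, the sample metric names, and the `_` separator after the prefix are assumptions made for illustration, since the README only names the prefixes.

```bash
#!/usr/bin/env bash
# Sketch only: the metric names below are hypothetical, and the "_" separator
# after the prefix is an assumption (the README only names the prefixes).

# Keep sample lines whose metric name starts with "llm_" (demo) or
# "nv_llm_" (production); comment lines and other metrics are dropped.
select_llm_metrics() {
  grep -E '^(nv_)?llm_'
}

# Hypothetical Prometheus text-format payload standing in for a real scrape.
sample='# TYPE llm_requests_active gauge
llm_requests_active{component="MyComponent"} 2
nv_llm_requests_active{component="MyComponent"} 2
process_cpu_seconds_total 0.5'

printf '%s\n' "$sample" | select_llm_metrics
```

Against a running `metrics` instance, the same filter could be applied to live output with `curl -s http://localhost:9091/metrics | select_llm_metrics`, assuming the default port that appears in the log output of this README.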
-# 2025-03-17T00:07:05.202558Z INFO metrics: Scraping endpoint dynamo/my_component/my_endpoint for stats
+# 2025-03-17T00:07:05.202558Z INFO metrics: Scraping endpoint dynamo/MyComponent/my_endpoint for stats
 # 2025-03-17T00:07:05.202955Z INFO metrics: Prometheus metrics server started at 0.0.0.0:9091/metrics
 # ...
 ```
 
 With no matching endpoints running to collect stats from, you should see warnings in the logs:
 ```bash
-2025-03-17T00:07:06.204756Z WARN metrics: No endpoints found matching dynamo/my_component/my_endpoint
+2025-03-17T00:07:06.204756Z WARN metrics: No endpoints found matching dynamo/MyComponent/my_endpoint
 ```
 
 After a worker with a matching endpoint gets started, the endpoint
@@ -44,22 +50,23 @@ so below are some examples of workers and how they can be monitored.
 
 ### Mock Worker
 
-For quick testing and debugging, there is a Rust-based
-[mock worker](src/bin/mock_worker.rs) that registers a mock
-`StatsHandler` under an endpoint named
-`dynamo/my_component/my_endpoint` and publishes random data.
+To try out how `metrics` works, there is a demo Rust-based
+[mock worker](src/bin/mock_worker.rs) that provides sample data through two mechanisms:
+1. Exposes a stats handler at `dynamo/MyComponent/my_endpoint` that responds to polling requests (from `metrics`) with randomly generated `ForwardPassMetrics` data
+2. Publishes mock `KVHitRateEvent` data every second to demonstrate event-based metrics
 
+Step 1: Launch a mock worker via the following command (if already built):
 ```bash
-# Can run multiple workers in separate shells to see aggregation as well.
-# Or to build/run from source: cargo run --bin mock_worker
+# or build/run from source: DYN_LOG=DEBUG cargo run --bin mock_worker
 mock_worker
 
-# 2025-03-16T23:49:28.101668Z INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/my_component/my_endpoint
+# 2025-03-16T23:49:28.101668Z INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/MyComponent/my_endpoint
 ```
 
-To monitor the metrics of these mock workers, run:
+Step 2: Monitor the metrics of these mock workers and expose them on the `metrics` component's own Prometheus endpoint,
+`/metrics` on port 9091 (the default when `--port` is not specified):
- [grafana_dashboards/grafana-dynamo-dashboard.json](./grafana_dashboards/grafana-dynamo-dashboard.json): A general Dynamo Dashboard for both SW and HW metrics.
+- [grafana_dashboards/grafana-llm-metrics.json](./grafana_dashboards/grafana-llm-metrics.json): Contains Grafana dashboard configuration for LLM-specific metrics.
+- [grafana_dashboards/grafana-dcgm-metrics.json](./grafana_dashboards/grafana-dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics.