Commit 8d56011

keivenchang authored and atchernych committed
feat: add a new composite SW/HW grafana (DYN-678) (#1788)
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
1 parent be9d082 commit 8d56011

10 files changed: +707 -1850 lines changed


components/metrics/README.md

Lines changed: 33 additions & 25 deletions
@@ -1,8 +1,14 @@
11
# Metrics
22

33
The `metrics` component is a utility that can collect, aggregate, and publish
4-
metrics from a Dynamo deployment for use in other applications or visualization
5-
tools like Prometheus and Grafana.
4+
metrics from a Dynamo deployment. After collecting and aggregating metrics from
5+
workers, it exposes them via an HTTP `/metrics` endpoint in Prometheus format
6+
that other applications or visualization tools like Prometheus server and Grafana can
7+
pull from.
8+
9+
**Note**: This is a demo implementation. The metrics component is currently under active development and this documentation will change as the implementation evolves.
10+
- In this demo the metric names use the prefix "llm", but in production they will be prefixed with "nv_llm" (e.g., the HTTP `/metrics` endpoint will serve metrics with "nv_llm" prefixes)
11+
- This demo only works with examples/llm/configs/agg.yaml; other configurations will not work
612

713
<div align="center">
814
<img src="images/dynamo_metrics_grafana.png" alt="Dynamo Metrics Dashboard"/>
@@ -22,16 +28,16 @@ For example:
2228
```bash
2329
# Default namespace is "dynamo", but can be configured with --namespace
2430
# For more detailed output, try setting the env var: DYN_LOG=debug
25-
metrics --component my_component --endpoint my_endpoint
31+
metrics --component MyComponent --endpoint my_endpoint
2632

27-
# 2025-03-17T00:07:05.202558Z INFO metrics: Scraping endpoint dynamo/my_component/my_endpoint for stats
33+
# 2025-03-17T00:07:05.202558Z INFO metrics: Scraping endpoint dynamo/MyComponent/my_endpoint for stats
2834
# 2025-03-17T00:07:05.202955Z INFO metrics: Prometheus metrics server started at 0.0.0.0:9091/metrics
2935
# ...
3036
```
3137

3238
With no matching endpoints running to collect stats from, you should see warnings in the logs:
3339
```bash
34-
2025-03-17T00:07:06.204756Z WARN metrics: No endpoints found matching dynamo/my_component/my_endpoint
40+
2025-03-17T00:07:06.204756Z WARN metrics: No endpoints found matching dynamo/MyComponent/my_endpoint
3541
```
3642

3743
After a worker with a matching endpoint gets started, the endpoint
@@ -44,22 +50,23 @@ so below are some examples of workers and how they can be monitored.
4450

4551
### Mock Worker
4652

47-
For quick testing and debugging, there is a Rust-based
48-
[mock worker](src/bin/mock_worker.rs) that registers a mock
49-
`StatsHandler` under an endpoint named
50-
`dynamo/my_component/my_endpoint` and publishes random data.
53+
To try out how `metrics` works, there is a demo Rust-based
54+
[mock worker](src/bin/mock_worker.rs) that provides sample data through two mechanisms:
55+
1. Exposes a stats handler at `dynamo/MyComponent/my_endpoint` that responds to polling requests (from `metrics`) with randomly generated `ForwardPassMetrics` data
56+
2. Publishes mock `KVHitRateEvent` data every second to demonstrate event-based metrics
5157

58+
Step 1: Launch a mock worker with the following command (if already built):
5259
```bash
53-
# Can run multiple workers in separate shells to see aggregation as well.
54-
# Or to build/run from source: cargo run --bin mock_worker
60+
# or build/run from source: DYN_LOG=DEBUG cargo run --bin mock_worker
5561
mock_worker
5662

57-
# 2025-03-16T23:49:28.101668Z INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/my_component/my_endpoint
63+
# 2025-03-16T23:49:28.101668Z INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/MyComponent/my_endpoint
5864
```
5965

60-
To monitor the metrics of these mock workers, run:
66+
Step 2: Run `metrics` to monitor these mock workers. It serves its own Prometheus endpoint on
67+
port 9091 (the default when `--port` is not specified) at `/metrics`:
6168
```bash
62-
metrics --component my_component --endpoint my_endpoint
69+
metrics --component MyComponent --endpoint my_endpoint
6370
```
6471

6572
### Real Worker
@@ -69,13 +76,14 @@ see the examples in [examples/llm](../../examples/llm).
6976

7077
For example, for a VLLM + KV Routing based deployment that
7178
exposes statistics on an endpoint labeled
72-
`dynamo/VllmWorker/load_metrics`:
79+
`dynamo/VllmWorker/load_metrics` (note: this does NOT currently work
80+
with any other example such as examples/vllm_v0, vllm_v1, ...):
7381
```bash
7482
cd deploy/examples/llm
75-
dynamo serve <vllm kv routing example args>
83+
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
7684
```
7785

78-
To monitor the metrics of these VllmWorkers, run:
86+
Then, to monitor the metrics of these VllmWorkers, run:
7987
```bash
8088
metrics --component VllmWorker --endpoint load_metrics
8189
```
@@ -105,10 +113,10 @@ Prometheus server or curl client can pull from:
105113

106114
```bash
107115
# Start metrics server on default host (0.0.0.0) and port (9091)
108-
metrics --component my_component --endpoint my_endpoint
116+
metrics --component MyComponent --endpoint my_endpoint
109117

110118
# Or specify a custom port
111-
metrics --component my_component --endpoint my_endpoint --port 9092
119+
metrics --component MyComponent --endpoint my_endpoint --port 9092
112120
```
113121

114122
In pull mode:
@@ -121,12 +129,12 @@ curl localhost:9091/metrics
121129

122130
# # HELP llm_kv_blocks_active Active KV cache blocks
123131
# # TYPE llm_kv_blocks_active gauge
124-
# llm_kv_blocks_active{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033398"} 40
125-
# llm_kv_blocks_active{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033401"} 2
132+
# llm_kv_blocks_active{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033398"} 40
133+
# llm_kv_blocks_active{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033401"} 2
126134
# # HELP llm_kv_blocks_total Total KV cache blocks
127135
# # TYPE llm_kv_blocks_total gauge
128-
# llm_kv_blocks_total{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033398"} 100
129-
# llm_kv_blocks_total{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033401"} 100
136+
# llm_kv_blocks_total{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033398"} 100
137+
# llm_kv_blocks_total{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033401"} 100
130138
```
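
The per-worker samples above can also be combined on the client side. As a minimal sketch, assuming the endpoint output carries no trailing timestamps and label values contain no spaces, the hypothetical helper `sum_metric` (not part of the `metrics` CLI) sums one gauge across all workers:

```bash
# sum_metric is a hypothetical helper, not part of the metrics CLI.
# It reads Prometheus text format on stdin and sums every sample of one metric.
sum_metric() {
  local name="$1"
  awk -v name="$name" '
    # Skip HELP/TYPE comment lines.
    /^#/ { next }
    # Match lines that start with the metric name followed by "{" or a space,
    # so that "llm_kv_blocks_active" does not also match a longer metric name.
    index($0, name "{") == 1 || index($0, name " ") == 1 { total += $NF }
    END { print total + 0 }
  '
}

# Usage against a running metrics component in pull mode (default port):
#   curl -s localhost:9091/metrics | sum_metric llm_kv_blocks_active
```

With the two `llm_kv_blocks_active` samples shown above (40 and 2), the helper would print 42.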
131139

132140
### Push Mode
@@ -145,7 +153,7 @@ Start the metrics component in `--push` mode, specifying the host and port of yo
145153
```bash
146154
# Push metrics to a Prometheus PushGateway every --push-interval seconds
147155
metrics \
148-
--component my_component \
156+
--component MyComponent \
149157
--endpoint my_endpoint \
150158
--host 127.0.0.1 \
151159
--port 9091 \
@@ -173,7 +181,7 @@ For easy iteration while making edits to the metrics component, you can use `car
173181
to build and run with your local changes:
174182

175183
```bash
176-
cargo run --bin metrics -- --component my_component --endpoint my_endpoint
184+
cargo run --bin metrics -- --component MyComponent --endpoint my_endpoint
177185
```
178186

179187

components/metrics/src/bin/mock_worker.rs

Lines changed: 1 addition & 1 deletion
@@ -146,7 +146,7 @@ async fn backend(runtime: DistributedRuntime) -> Result<()> {
146146
let namespace = runtime.namespace("dynamo")?;
147147
// we must first create a service, then we can attach one or more endpoints
148148
let component = namespace
149-
.component("my_component")?
149+
.component("MyComponent")?
150150
.service_builder()
151151
.create()
152152
.await?;

deploy/metrics/README.md

Lines changed: 5 additions & 3 deletions
@@ -100,16 +100,18 @@ Note: You may need to adjust the target based on your host configuration and net
100100
Grafana is pre-configured with:
101101
- Prometheus datasource
102102
- Sample dashboard for visualizing service metrics
103-
![grafana image](./grafana1.png)
103+
![grafana image](./grafana-dynamo-composite.png)
104104

105105
## Required Files
106106

107107
The following configuration files should be present in this directory:
108108
- [docker-compose.yml](./docker-compose.yml): Defines the Prometheus and Grafana services
109109
- [prometheus.yml](./prometheus.yml): Contains Prometheus scraping configuration
110-
- [grafana.json](./grafana.json): Contains Grafana dashboard configuration
111110
- [grafana-datasources.yml](./grafana-datasources.yml): Contains Grafana datasource configuration
112-
- [grafana-dashboard-providers.yml](./grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
111+
- [grafana_dashboards/grafana-dashboard-providers.yml](./grafana_dashboards/grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
112+
- [grafana_dashboards/grafana-dynamo-dashboard.json](./grafana_dashboards/grafana-dynamo-dashboard.json): A general Dynamo Dashboard for both SW and HW metrics.
113+
- [grafana_dashboards/grafana-llm-metrics.json](./grafana_dashboards/grafana-llm-metrics.json): Contains Grafana dashboard configuration for LLM specific metrics.
114+
- [grafana_dashboards/grafana-dcgm-metrics.json](./grafana_dashboards/grafana-dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics
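
Because Grafana now mounts the whole `grafana_dashboards` directory, a missing file only shows up as a missing dashboard at runtime. As a minimal sketch, the hypothetical helper `check_dashboards` (not part of this repository) checks for the four files listed above before starting the stack; the default directory path is an assumption based on that list.

```bash
# check_dashboards is a hypothetical helper; the file names come from the list above.
# It returns 0 when every expected file exists, 1 otherwise.
check_dashboards() {
  local dir="${1:-./grafana_dashboards}" missing=0 f
  for f in grafana-dashboard-providers.yml grafana-dynamo-dashboard.json \
           grafana-llm-metrics.json grafana-dcgm-metrics.json; do
    if [ ! -f "$dir/$f" ]; then
      # Report each missing file on stderr and remember the failure.
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage from deploy/metrics:
#   check_dashboards ./grafana_dashboards && echo "all dashboard files present"
```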
113115

114116
## Running the example `metrics` component
115117

deploy/metrics/docker-compose.yml

Lines changed: 16 additions & 5 deletions
@@ -13,6 +13,7 @@
1313
# See the License for the specific language governing permissions and
1414
# limitations under the License.
1515

16+
# IMPORTANT NOTE: Make sure this is in sync with lib/runtime/docker-compose.yml
1617
networks:
1718
server:
1819
driver: bridge
@@ -83,6 +84,8 @@ services:
8384
networks:
8485
- monitoring
8586

87+
# To access Prometheus from another machine, you may need to open the port in your host's firewall. On Ubuntu:
88+
# sudo ufw allow 9090/tcp
8689
prometheus:
8790
image: prom/prometheus:v3.4.1
8891
container_name: prometheus
@@ -98,35 +101,43 @@ services:
98101
restart: unless-stopped
99102
# Example to pull from the /query endpoint:
100103
# {__name__=~"DCGM.*", job="dcgm-exporter"}
101-
ports:
102-
- "9090:9090"
103104
networks:
104105
- monitoring
106+
ports:
107+
- "9090:9090"
105108
profiles: [metrics]
109+
extra_hosts:
110+
- "host.docker.internal:host-gateway"
106111
depends_on:
107112
- dcgm-exporter
108113
- nats-prometheus-exporter
109114
- etcd-server
110115

111116
# grafana connects to prometheus via the /query endpoint.
112117
# Default credentials are dynamo/dynamo.
118+
# To access Grafana from another machine, you may need to open the port in your host's firewall. On Ubuntu:
119+
# sudo ufw allow 3001/tcp
113120
grafana:
114121
image: grafana/grafana-enterprise:12.0.1
115122
container_name: grafana
116123
volumes:
117-
- ./grafana.json:/etc/grafana/provisioning/dashboards/llm-worker-dashboard.json
118-
- ./grafana-dcgm-dashboard.json:/etc/grafana/provisioning/dashboards/dcgm-dashboard.json
124+
- ./grafana_dashboards:/etc/grafana/provisioning/dashboards
119125
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
120-
- ./grafana-dashboard-providers.yml:/etc/grafana/provisioning/dashboards/dashboard-providers.yml
121126
environment:
122127
# Port 3000 is already used by "dynamo serve", so use 3001
123128
- GF_SERVER_HTTP_PORT=3001
129+
# do not make it admin/admin, because you will be prompted to change the password every time
124130
- GF_SECURITY_ADMIN_USER=dynamo
125131
- GF_SECURITY_ADMIN_PASSWORD=dynamo
126132
- GF_USERS_ALLOW_SIGN_UP=false
127133
- GF_INSTALL_PLUGINS=grafana-piechart-panel
128134
# Default min interval is 5s, but can be configured lower
129135
- GF_DASHBOARDS_MIN_REFRESH_INTERVAL=2s
136+
# Disable password change requirement
137+
- GF_SECURITY_DISABLE_INITIAL_ADMIN_CREATION=false
138+
- GF_SECURITY_ADMIN_PASSWORD_POLICY=false
139+
- GF_AUTH_DISABLE_LOGIN_FORM=false
140+
- GF_AUTH_DISABLE_SIGNOUT_MENU=false
130141
restart: unless-stopped
131142
ports:
132143
- "3001:3001"
