Expose metrics for each service #144

Open · 35 tasks · hackaugusto opened this issue Jan 15, 2024 · 6 comments

hackaugusto commented Jan 15, 2024

Add metrics for each component. This will change, but here is an initial list:

  • All
    • Collect metrics from the event loop, e.g. tokio-metrics
    • Collect metrics from gRPC / tonic / axum
      • Number of requests, with status (200, 404, 500, etc), per request handler
      • Timing for request handlers with percentiles (at least p99, p95, p50)
      • Request/Response sizes with percentiles
    • Timing for downstream requests with percentiles
    • Number of seconds the service has been running
    • Metrics from the control plane
      • Number of operations performed
      • Configuration reload
  • Store
    • Number of blocks persisted so far
    • Number of leaves in the MMR
    • Number of non-empty leaves in the Nullifier tree
    • Number of accounts created
    • Number of transactions persisted so far (not sure if we have this)
    • Size of the chain (could be approximated by the size of the sqlite db file)
    • Percentiles for queries
      • This overlaps with the spans from the distributed tracer, ideally we should reuse the data from the tracer for metrics too
  • Block Producer
    • Number of known provers, with status (e.g. healthy/unhealthy, or responsive/unresponsive)
    • Number of locally proven transactions waiting in the queue
    • Number of chain transactions waiting in the queue
    • Number of batches in the queue
    • Number of accounts with in-flight transactions
    • Number of in-flight notes
    • Proving times with percentiles (at least p99, p95, p50)
      • for the chain transactions
      • for the batches
      • for the block
    • Age of the oldest transaction waiting to be included in a batch (this is probably the best strategy to trigger when to increase the number of batch prover machines)
  • RPC

Node metrics (CPU, memory, disk usage, etc.) should not be exposed here; this should be done by an external agent (e.g. https://prometheus.io/docs/instrumenting/exporters/).
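
For the tokio-metrics item above, here is a minimal sketch of what task-level sampling could look like (this assumes the tokio-metrics crate's TaskMonitor API; the instrumented task and the sampling loop are illustrative only, not the node's actual instrumentation):

```rust
// Assumed dependencies: tokio = { version = "1", features = ["full"] }, tokio-metrics = "0.3".
use std::time::Duration;
use tokio_metrics::TaskMonitor;

#[tokio::main]
async fn main() {
    let monitor = TaskMonitor::new();

    // Instrument the futures we care about; their polls are attributed to this monitor.
    tokio::spawn(monitor.instrument(async {
        loop {
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    }));

    // Each call to `next()` yields the metrics accumulated since the previous sample,
    // which is roughly what a scraper would export on every interval.
    let mut intervals = monitor.intervals();
    for _ in 0..3 {
        tokio::time::sleep(Duration::from_secs(1)).await;
        if let Some(metrics) = intervals.next() {
            println!(
                "polls={} mean_poll_duration={:?}",
                metrics.total_poll_count,
                metrics.mean_poll_duration()
            );
        }
    }
}
```

The per-handler request counters, latencies, and sizes for gRPC/HTTP would most naturally be collected in a tower middleware layer around tonic/axum rather than by hand in each handler.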

hackaugusto commented Jan 15, 2024

The above metrics allow an operator to monitor and troubleshoot the system. For example:

Severe issues can be detected when:

  • The number of blocks in the store doesn't increase
  • The number of non-200 responses increases
  • The number of transactions in the store is not increasing, but the number of transactions in the block producer is
  • The runtime-seconds counter was reset (i.e. the process was killed/restarted)

Performance issues can be detected when:

  • The percentiles of the request handlers are too high
  • The percentiles of the provers are too high
  • The number of transactions in-flight is increasing
  • There are failures when generating proofs, and the cpu/memory of the provers is high

Each of the above scenarios requires a different operational response. The former needs additional debugging, looking over the logs and traces; the latter requires increasing the number of provers, and maybe their sizes.

These metrics can be collected via Prometheus, InfluxDB, or OpenTSDB; alerts can be created via Alertmanager or Opsgenie; the metrics can be inspected via Grafana, and so on.

okcan commented Feb 15, 2024

Can I get details on these metrics, such as data, size, and type? Then I could recommend which of the tools you mentioned (Prometheus, InfluxDB, OpenTSDB, Alertmanager, Opsgenie, Grafana, and so on) to use.

hackaugusto commented

> Can I get details on these metrics, such as data, size, and type?

There is a mix of gauges, counters, histograms, and events. For example:

  • number of transactions in a pool: gauge
  • number of seconds the server is running: counter
  • percentiles: histogram
  • restarts/reconfigure: events

Some of the metrics would benefit from tag metadata, especially the HTTP response status.
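
To make the mapping concrete, here is a minimal sketch using the prometheus crate (an assumption; the metrics facade or an OpenTelemetry exporter would work just as well, and the metric names below are illustrative, not final):

```rust
// Assumed dependency: prometheus = "0.13".
use prometheus::{
    Encoder, HistogramOpts, HistogramVec, IntCounter, IntCounterVec, IntGauge, Opts, Registry,
    TextEncoder,
};

fn main() -> Result<(), prometheus::Error> {
    let registry = Registry::new();

    // Gauge: current number of transactions waiting in a pool/queue.
    let queue_len = IntGauge::new("txs_in_queue", "Transactions waiting in the queue")?;
    // Counter: number of seconds the service has been running (only ever increases).
    let uptime = IntCounter::new("uptime_seconds_total", "Seconds the service has been up")?;
    // Counter with tags: requests per handler and HTTP response status.
    let requests = IntCounterVec::new(
        Opts::new("requests_total", "Requests per handler and status"),
        &["handler", "status"],
    )?;
    // Histogram: request latency, from which p50/p95/p99 can be derived.
    let latency = HistogramVec::new(
        HistogramOpts::new("request_seconds", "Request handler latency")
            .buckets(vec![0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0]),
        &["handler"],
    )?;

    registry.register(Box::new(queue_len.clone()))?;
    registry.register(Box::new(uptime.clone()))?;
    registry.register(Box::new(requests.clone()))?;
    registry.register(Box::new(latency.clone()))?;

    // Simulate some activity.
    queue_len.set(7);
    uptime.inc_by(60);
    requests.with_label_values(&["get_block", "200"]).inc();
    latency.with_label_values(&["get_block"]).observe(0.042);

    // Render the Prometheus text exposition format, i.e. what a /metrics endpoint would serve.
    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf)?;
    println!("{}", String::from_utf8(buf).unwrap());
    Ok(())
}
```

Events such as restarts or configuration reloads don't map directly onto these types; they are usually represented either as a counter incremented on each occurrence or as log/trace events picked up by the tracing pipeline.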

okcan commented Feb 16, 2024

Thanks, got it. What kind of data ingestion do you expect, daily or instantaneous? For example, GB/hour or MB/hour? And what is your expectation for the total size: TB?

hackaugusto commented

> Thanks, got it. What kind of data ingestion do you expect, daily or instantaneous? And what is your expectation for the total size?

Near real time. The metrics should support automatic alerting and incident detection. To capture state transitions, the sampling interval would have to be relative to our batch timeout of Duration::from_secs(2), which would mean roughly one sample every 0.5s. I think that is too high, so instead of state transitions we can just use the metrics to collect trends, and a scrape interval of 10s will do.

As for the size of the data, it depends on the number of metrics, their encoding, the data retention policy, and the number of nodes. For a back-of-the-envelope estimate, let's assume 5 nodes, no compression, and that each data point takes 4 bytes (u32/f32 should be enough). Let's also assume the histograms expose 6 series each (p99, p95, p50, average, sum, count).

A rough count on the number of metrics:

  • 35 metrics for tokio, per node (175 in total)
  • 28 metrics per task; let's estimate the number of tracked tasks as 3 per node (420 in total)
  • Two histograms per endpoint, plus at least 3 buckets for the response status. Let's say there are 10 endpoints per node (the real number is more like 3 for the time being) (750 in total)
  • System metrics, e.g. memory, disk, network, CPU, loadavg. Using node_exporter as a reference on my machine with the defaults, these are 438 metrics (curl http://localhost:9100/metrics | grep -v '^#' | wc); this will change depending on the number of devices/partitions, etc.
  • Store metrics (12 in total)
  • Block producer metrics (25 in total)

The above is about 2k metrics in total, with about 8640 points per metric per day at 4 bytes each; that is less than 100MB a day. I think we can round it up to 1GB a day for good measure, and assume the retention will be one month at full precision, so 30GB a month would be sufficient.
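
For reference, here is a small program reproducing the back-of-the-envelope arithmetic above (the series counts and scrape interval are the assumptions from this comment, not measured values):

```rust
fn main() {
    let nodes: u64 = 5;
    let series_per_node: u64 = 400; // ~2k series across the 5 nodes
    let scrape_interval_secs: u64 = 10;
    let bytes_per_sample: u64 = 4; // u32/f32, no compression

    let samples_per_day = 24 * 60 * 60 / scrape_interval_secs; // 8640
    let bytes_per_day = nodes * series_per_node * samples_per_day * bytes_per_sample;

    println!("samples per series per day: {samples_per_day}");
    println!(
        "total per day: {} bytes (~{} MB)",
        bytes_per_day,
        bytes_per_day / 1_000_000
    );
    // ~69 MB/day, i.e. well under 100MB; rounding up to 1GB/day gives ~30GB
    // for a month of retention at full precision.
}
```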

okcan commented Feb 25, 2024

Thank you @hackaugusto for the great explanation
