Expose metrics for each service #144

Open · 35 tasks · hackaugusto opened this issue Jan 15, 2024 · 6 comments

hackaugusto commented Jan 15, 2024

Add metrics for each component. This will change, but here is an initial list:

  • All
    • Collect metrics from the event loop, e.g. tokio-metrics
    • Collect metrics from gRPC / tonic / axum
      • Number of requests, with status (200, 404, 500, etc), per request handler
      • Timing for request handlers with percentiles (at least p99, p95, p50)
      • Request/Response sizes with percentiles
    • Timing for downstream requests with percentiles
    • Number of seconds the service has been running
    • Metrics from the control plane
      • Number of operations performed
      • Configuration reload
  • Store
    • Number of blocks persisted so far
    • Number of leaves in the MMR
    • Number of non-empty leaves in the Nullifier tree
    • Number of accounts created
    • Number of transactions persisted so far (not sure if we have this)
    • Size of the chain (could be approximated by the size of the sqlite db file)
    • Percentiles for queries
      • This overlaps with the spans from the distributed tracer, ideally we should reuse the data from the tracer for metrics too
  • Block Producer
    • Number of known provers, with status (e.g. healthy/unhealthy, or responsive/unresponsive)
    • Number of locally proven transactions waiting in the queue
    • Number of chain transactions waiting in the queue
    • Number of batches in the queue
    • Number of accounts with in-flight transactions
    • Number of in-flight notes
    • Proving times with percentiles (at least p99, p95, p50)
      • for the chain transactions
      • for the batches
      • for the block
    • Age of the oldest transaction waiting to be included in a batch (this is probably the best strategy to trigger when to increase the number of batch prover machines)
  • RPC

Node metrics (CPU, memory, disk usage, etc.) should not be exposed here; this should be done by an external agent (e.g. https://prometheus.io/docs/instrumenting/exporters/).
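
For the tokio-metrics item above, here is a minimal sketch of what task-level sampling could look like (this assumes the tokio-metrics crate's TaskMonitor API; the instrumented task and the sampling loop are illustrative only, not the node's actual instrumentation):

```rust
// Assumed dependencies: tokio = { version = "1", features = ["full"] }, tokio-metrics = "0.3".
use std::time::Duration;
use tokio_metrics::TaskMonitor;

#[tokio::main]
async fn main() {
    let monitor = TaskMonitor::new();

    // Instrument the futures we care about; their polls are attributed to this monitor.
    tokio::spawn(monitor.instrument(async {
        loop {
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    }));

    // Each call to `next()` yields the metrics accumulated since the previous sample,
    // which is roughly what a scraper would export on every interval.
    let mut intervals = monitor.intervals();
    for _ in 0..3 {
        tokio::time::sleep(Duration::from_secs(1)).await;
        if let Some(metrics) = intervals.next() {
            println!(
                "polls={} mean_poll_duration={:?}",
                metrics.total_poll_count,
                metrics.mean_poll_duration()
            );
        }
    }
}
```

The per-handler request counters, latencies, and sizes for gRPC/HTTP would most naturally be collected in a tower middleware layer around tonic/axum rather than by hand in each handler.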

hackaugusto commented Jan 15, 2024

The above metrics allow an operator to monitor and troubleshoot the system. For example:

Severe issues can be detected when:

  • The number of blocks in the store doesn't increase
  • The number of non-200 responses increases
  • The number of transactions in the store is not increasing, but the number of transactions in the block producer is
  • The runtime-seconds counter was reset (i.e. the process was killed/restarted)

Performance issues can be detected when:

  • The percentiles of the request handlers are too high
  • The percentiles of the provers are too high
  • The number of transactions in-flight is increasing
  • There are failures when generating proofs, and the cpu/memory of the provers is high

Each of the above scenarios requires a different operational response. The former needs additional debugging, looking over the logs and traces; the latter requires increasing the number of provers, and maybe their sizes.

These metrics can be collected via Prometheus, InfluxDB, or OpenTSDB; alerts can be created via Alertmanager or Opsgenie; the metrics can be inspected via Grafana, and so on.

okcan commented Feb 15, 2024

Can I get details on these metrics, such as data, size, and type? Then I could recommend which of the tools you mentioned (Prometheus, InfluxDB, OpenTSDB, Alertmanager, Opsgenie, Grafana, and so on) to use.

hackaugusto commented

> Can I get details on these metrics, such as data, size, and type?

There is a mix of gauges, counters, histograms, and events. For example:

  • number of transactions in a pool: gauge
  • number of seconds the server is running: counter
  • percentiles: histogram
  • restarts/reconfigure: events

Some of the metrics would benefit from tag metadata, especially the HTTP response status.
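
To make the mapping concrete, here is a minimal sketch using the prometheus crate (an assumption; the metrics facade or an OpenTelemetry exporter would work just as well, and the metric names below are illustrative, not final):

```rust
// Assumed dependency: prometheus = "0.13".
use prometheus::{
    Encoder, HistogramOpts, HistogramVec, IntCounter, IntCounterVec, IntGauge, Opts, Registry,
    TextEncoder,
};

fn main() -> Result<(), prometheus::Error> {
    let registry = Registry::new();

    // Gauge: current number of transactions waiting in a pool/queue.
    let queue_len = IntGauge::new("txs_in_queue", "Transactions waiting in the queue")?;
    // Counter: number of seconds the service has been running (only ever increases).
    let uptime = IntCounter::new("uptime_seconds_total", "Seconds the service has been up")?;
    // Counter with tags: requests per handler and HTTP response status.
    let requests = IntCounterVec::new(
        Opts::new("requests_total", "Requests per handler and status"),
        &["handler", "status"],
    )?;
    // Histogram: request latency, from which p50/p95/p99 can be derived.
    let latency = HistogramVec::new(
        HistogramOpts::new("request_seconds", "Request handler latency")
            .buckets(vec![0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0]),
        &["handler"],
    )?;

    registry.register(Box::new(queue_len.clone()))?;
    registry.register(Box::new(uptime.clone()))?;
    registry.register(Box::new(requests.clone()))?;
    registry.register(Box::new(latency.clone()))?;

    // Simulate some activity.
    queue_len.set(7);
    uptime.inc_by(60);
    requests.with_label_values(&["get_block", "200"]).inc();
    latency.with_label_values(&["get_block"]).observe(0.042);

    // Render the Prometheus text exposition format, i.e. what a /metrics endpoint would serve.
    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf)?;
    println!("{}", String::from_utf8(buf).unwrap());
    Ok(())
}
```

Events such as restarts or configuration reloads don't map directly onto these types; they are usually represented either as a counter incremented on each occurrence or as log/trace events picked up by the tracing pipeline.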

okcan commented Feb 16, 2024

Thanks, got it. What kind of data ingestion do you expect, daily or instantaneous? For example, GB/hour or MB/hour? And what is your expectation for the total size: TB?

hackaugusto commented

> Thanks, got it. What kind of data ingestion do you expect, daily or instantaneous? And what is your expectation for the total size?

Near real time. The metrics should support automatic alerting and incident detection. To capture state transitions, the sampling interval would have to be relative to our batch timeout of Duration::from_secs(2), which would mean roughly one sample every 0.5s. I think that is too high, so instead of state transitions we can just use the metrics to collect trends, and a scrape interval of 10s will do.

As for the size of the data, it depends on the number of metrics, their encoding, the data retention policy, and the number of nodes. For a back-of-the-envelope estimate, let's assume 5 nodes, no compression, and that each data point takes 4 bytes (u32/f32 should be enough). Let's also assume the histograms expose 6 series each (p99, p95, p50, average, sum, count).

A rough count on the number of metrics:

  • 35 metrics for tokio, per node (175 in total)
  • 28 metrics per task; let's estimate the number of tracked tasks as 3 per node (420 in total)
  • Two histograms per endpoint, plus at least 3 buckets for the response status. Let's say there are 10 endpoints per node (the real number is more like 3 for the time being) (750 in total)
  • System metrics, e.g. memory, disk, network, CPU, loadavg. Using node_exporter as a reference on my machine with the defaults, these are 438 metrics (curl http://localhost:9100/metrics | grep -v '^#' | wc); this will change depending on the number of devices/partitions, etc.
  • Store metrics (12 in total)
  • Block producer metrics (25 in total)

The above is about 2k metrics in total, with about 8640 points per metric per day at 4 bytes each; that is less than 100MB a day. I think we can round it up to 1GB a day for good measure, and assume the retention will be one month at full precision, so 30GB a month would be sufficient.
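
For reference, here is a small program reproducing the back-of-the-envelope arithmetic above (the series counts and scrape interval are the assumptions from this comment, not measured values):

```rust
fn main() {
    let nodes: u64 = 5;
    let series_per_node: u64 = 400; // ~2k series across the 5 nodes
    let scrape_interval_secs: u64 = 10;
    let bytes_per_sample: u64 = 4; // u32/f32, no compression

    let samples_per_day = 24 * 60 * 60 / scrape_interval_secs; // 8640
    let bytes_per_day = nodes * series_per_node * samples_per_day * bytes_per_sample;

    println!("samples per series per day: {samples_per_day}");
    println!(
        "total per day: {} bytes (~{} MB)",
        bytes_per_day,
        bytes_per_day / 1_000_000
    );
    // ~69 MB/day, i.e. well under 100MB; rounding up to 1GB/day gives ~30GB
    // for a month of retention at full precision.
}
```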

okcan commented Feb 25, 2024

Thank you @hackaugusto for the great explanation
