Expose metrics for each service #144
The above metrics allow an operator to monitor and troubleshoot a system. For example: Severe issues can be detected when:
Performance issues can be detected when:
Each of the above scenarios requires a different response. The first needs additional debugging, looking over the logs, and tracing. The latter requires increasing the number of provers, and maybe their sizes. These metrics can be collected via Prometheus, InfluxDB, or OpenTSDB; alerts can be created via Alertmanager or Opsgenie; the metrics can be inspected via Grafana, and so on.
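To make the collection path concrete, here is a minimal Python sketch of a service exposing a `/metrics` endpoint that Prometheus could scrape, using `prometheus_client`. The metric names, values, and port are hypothetical examples, not something defined in this issue:

```python
# Minimal sketch of a service exposing metrics for Prometheus to scrape.
# Metric names, label values, and the port below are hypothetical examples.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: monotonically increasing, e.g. number of batches processed.
BATCHES_PROCESSED = Counter("batches_processed_total", "Number of batches processed")

# Gauge: a value that can go up and down, e.g. transactions currently in flight.
IN_FLIGHT_TXS = Gauge("in_flight_transactions", "Transactions currently in flight")

# Histogram: distribution of an observed value, e.g. proving time in seconds.
PROVING_SECONDS = Histogram("proving_duration_seconds", "Time spent proving a batch")

if __name__ == "__main__":
    # Expose the /metrics endpoint; Prometheus scrapes it on a fixed interval.
    start_http_server(8000)
    while True:
        with PROVING_SECONDS.time():
            time.sleep(random.uniform(0.1, 0.5))  # stand-in for real work
        BATCHES_PROCESSED.inc()
        IN_FLIGHT_TXS.set(random.randint(0, 10))
```

Alertmanager and Grafana would then be configured against the Prometheus server that scrapes this endpoint; that configuration is outside the scope of the sketch.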
Can I get details on these metrics, such as data, size, and type, so I can recommend which of these tools to use?
There is a mix of gauges, counters, histograms, and events. For example:
Some of the metrics would benefit from tag metadata, especially the HTTP response status.
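For the tag metadata point, a labelled counter is probably enough. A minimal sketch with `prometheus_client`, assuming hypothetical metric and label names:

```python
# Sketch of a counter tagged with the HTTP response status.
# The metric name, label name, and status values are hypothetical examples.
from prometheus_client import Counter

HTTP_RESPONSES = Counter(
    "http_responses_total",
    "HTTP responses emitted by the service",
    ["status"],  # tag/label carrying the response status code
)

# At request handling time, increment the series for the observed status:
HTTP_RESPONSES.labels(status="200").inc()
HTTP_RESPONSES.labels(status="500").inc()
```

This keeps one time series per status code, which is what alerting rules would key on (e.g. the rate of 5xx responses).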
Thanks, got it. What kind of data entry do you expect, daily or instantaneous? Such as GB per hour or MB per hour? And what is your expectation for the total size? TB?
Near real time. The metrics should support automatic alerting and incident detection. To collect state transitions, timing would have to be relative to our batch timeouts. As for the size of the data, it depends on the number of metrics, their encoding, the data retention policy, and the number of nodes. For some back-of-the-envelope numbers, let's assume 5 nodes, no compression, and that each data point takes 4 bytes (u32/f32 should be enough). Let's also assume the histograms have 6 metrics each (p99, p95, p50, average, sum, count). A rough count of the number of metrics:
The above is about 2k metrics total, with about 8640 points per day each at 4 bytes, which is less than 100MB a day. I think we can round it to 1GB for good measure, and assume the log retention will be one month long at full precision, so 30GB a month would be sufficient.
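For reference, the arithmetic behind that estimate as a small Python sketch; the 10-second sampling interval is an assumption implied by the 8640 points per day:

```python
# Back-of-the-envelope storage estimate for the metrics above.
# 8640 points/day corresponds to one sample every 10 seconds (86400 / 10).
metrics = 2_000                  # rough total number of series
points_per_day = 86_400 // 10    # one sample every 10 seconds
bytes_per_point = 4              # u32 / f32, no compression

per_day = metrics * points_per_day * bytes_per_point
print(f"{per_day / 1e6:.0f} MB/day")         # ~69 MB/day, under 100 MB
print(f"{30 * per_day / 1e9:.1f} GB/month")  # ~2.1 GB/month at full precision
```

The 30GB figure comes from rounding the daily volume up to 1GB first, which leaves generous headroom over the ~2GB the raw numbers give.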
Thank you @hackaugusto for the great explanation
Add metrics for each component; this will change, but here is an initial list:
Node metrics (CPU, memory, disk usage, etc.) should not be exposed here; this should be done by an external agent (e.g. https://prometheus.io/docs/instrumenting/exporters/).