Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics/prometheus: added prometheus metrics #17077

Merged
merged 4 commits into from
Apr 11, 2019
Merged

Conversation

chapsuk
Copy link
Contributor

@chapsuk chapsuk commented Jun 26, 2018

Hi, guys!

  • added 2 flags:
  --metrics.prometheus             Enable prometheus HTTP server
  --metrics.prometheus.addr value  Prometheus server listening address (default: "localhost:9455")
  • when --metrics.prometheus flag provided, start http server which returns internal geth metrics by http in prometheus format; log output:
≻ ./build/bin/geth --metrics --metrics.prometheus
INFO [06-26|04:18:12.544] Enabling metrics collection
INFO [06-26|04:18:12.544] Starting prometheus http server          addr=localhost:9455
INFO [06-26|04:18:12.546] Maximum peer count                       ETH=25 LES=0 total=25

@chapsuk
Copy link
Contributor Author

chapsuk commented Jun 28, 2018

take a look please
@karalabe @nonsense

@karalabe
Copy link
Member

There's no reason to start up a new HTTP server for Prometheus. The metrics are already exposed on the pprof server, you can just add an additional endpoint there to expose further metrics.

@chapsuk
Copy link
Contributor Author

chapsuk commented Jun 28, 2018

Thanks @karalabe!

I'm looks to #1697, metrics flags put in a separate block. Looks like all metrics settings need to be there.

Several reasons to start up a new HTTP server:

  • enable/disable pprof does not affect collect internal metrics, this is set up by --metrics flag
  • different access levels to metrics and pprof ports
  • enable Prometheus handler only if needed, this can be done through --pprof.prometheus flag,
    but it confuse, because it not part of standard pprof handler;

In my case it would be great to enable the Prometheus handler when pprof is disabled. This is true for those environments where it is not possible to use pprof but need to collect metrics and monitor geth node.

Separate settings look more obvious for configuring geth according to own requirements.

@chapsuk chapsuk force-pushed the prom branch 2 times, most recently from 0a29635 to ed9ae6c Compare July 2, 2018 13:15
@chapsuk
Copy link
Contributor Author

chapsuk commented Jul 4, 2018

@karalabe what you think about it #17077 (comment)?

@2color
Copy link

2color commented Apr 4, 2019

This would be greatly appreciated!

@karalabe karalabe added this to the 1.9.0 milestone Apr 4, 2019
@karalabe karalabe self-requested a review April 4, 2019 09:11
@holiman
Copy link
Contributor

holiman commented Apr 4, 2019

This is true for those environments where it is not possible to use pprof but need to collect metrics and monitor geth node.

Why would that not be possible?

We're planning to include this into 1.9, but it needs to be reworked to use the same endpoint as pprof (not a separate http server) and also needs licensing headers. @chapsuk are you willing to work more on this? Otherwise we'll put it into the backlog to try and finish it ourselves.

@chapsuk
Copy link
Contributor Author

chapsuk commented Apr 4, 2019

Thanks for replay @holiman.
Not a problem, i will change to use pprof endpoints in 2-3 days.
See you later

@chapsuk
Copy link
Contributor Author

chapsuk commented Apr 4, 2019

Done, PTAL

@karalabe
Copy link
Member

karalabe commented Apr 5, 2019

Hey @chapsuk, thanks a lot for this. I've been playing around with it a bit and have a few questions though.

Prometheus generally has each metric as its separate instance:

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

However, you grouped all everything under 4 metrics (counters, gauges, meters, timers):

timers{name="chain/account/reads",quantile="0.50"} 0
timers{name="chain/account/reads",quantile="0.75"} 0
timers{name="chain/account/reads",quantile="0.95"} 0
timers{name="chain/account/reads",quantile="0.99"} 0
timers{name="chain/account/reads",quantile="0.999"} 0
timers{name="chain/account/reads",quantile="0.9999"} 0
timers_count{name="chain/storage/commits"} 22094
timers{name="chain/storage/commits",quantile="0.50"} 0
timers{name="chain/storage/commits",quantile="0.75"} 0
timers{name="chain/storage/commits",quantile="0.95"} 0
timers{name="chain/storage/commits",quantile="0.99"} 0
timers{name="chain/storage/commits",quantile="0.999"} 0
timers{name="chain/storage/commits",quantile="0.9999"} 0
timers_count{name="chain/execution"} 22094

I don't have much experience with Prometheus, but this seems the wrong way to approach it, since then all these data points will be pushed into the same table of the time series database, instead of separate ones per metric.

It also seems to make Grafana charting more complex, since you need to fetch timers first, and then filter for chain/storage/commits.

Instead of creating these grouping, which imho only add complexity but don't provide any value, couldn't er simply convert / to _ and report every metric on it's own? Seems like that's the official way. Eg.

chain_storage_commits_count 22094
chain_storage_commits{quantile="0.50"} 0
chain_storage_commits{quantile="0.75"} 0
chain_storage_commits{quantile="0.95"} 0
chain_storage_commits{quantile="0.99"} 0
chain_storage_commits{quantile="0.999"} 0
chain_storage_commits{quantile="0.9999"} 0

This would also simplify your code a lot, since it completely removes the need for the collector type, which just facilitates the groupings. Instead you can just use a singe byte buffer in the Handler and push every meter in the order of appearance, no need to change it.


A second issue is that you seem to convert go-metrics counters into Prometheus gauges. I think we can retain the counter type in Prometheus too.

@2color
Copy link

2color commented Apr 5, 2019

@karalabe

You're right about naming/grouping metrics based on their semantic value rather than unit type.

Some good pointers can be found:


Regarding Prometheus counters, they can only increment unlike go-metric counters which can also decrement.

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

@chapsuk
Copy link
Contributor Author

chapsuk commented Apr 5, 2019

@karalabe named metrics better then typed groups 👍

This would also simplify your code a lot, since it completely removes the need for the collector type, which just facilitates the groupings. Instead you can just use a singe byte buffer in the Handler and push every meter in the order of appearance, no need to change it.

Not sure about that, i need to send type of each metrics to Prometheus. But i will try


@2color fully right about second question.

@karalabe
Copy link
Member

karalabe commented Apr 5, 2019

Not sure about that, i need to send type of each metrics to Prometheus. But i will try

Based on the Prometheus sample, you don't need to explicitly specify the type. But even if you want to, you can still do with the type switch you currently have, you just don't need to "group" all counters together for example, you can simply output as the come along: Counter1, Meter1, Counter2, Counter3, Meter2, etc.

@chapsuk
Copy link
Contributor Author

chapsuk commented Apr 5, 2019

Based on the Prometheus sample, you don't need to explicitly specify the type.

Yep, it works without TYPE , but I left him

Fixed names of metrics, PTAL

@chapsuk
Copy link
Contributor Author

chapsuk commented Apr 11, 2019

@karalabe
Need something else to do?

@karalabe
Copy link
Member

Nope, just pushes a tiny typo fix on top. LGTM, will merge when CI is green.

Copy link
Member

@karalabe karalabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@karalabe
Copy link
Member

karalabe commented Apr 11, 2019

Oh @chapsuk, would you be willing to share your Grafana dashboard setup? I'd like to write up a blog post about metrics for the 1.9 release and would be nice to have a Grafana dashboard included there too.

@chapsuk
Copy link
Contributor Author

chapsuk commented Apr 11, 2019

This repo contains dev environment and Grafana dashboard. I'm updated dashboard and test last changes with it, but README screenshot not actual. I will try to fix it tonight.

@karalabe karalabe merged commit 31bc2a2 into ethereum:master Apr 11, 2019
@karalabe
Copy link
Member

Thanks a lot for this PR! Btw, we'll probably simplify the flags a bit even further so not even --pprof should be needed. I'll ping you on the followup PR when I open it.

@chapsuk chapsuk deleted the prom branch April 11, 2019 12:09
@2color
Copy link

2color commented Apr 11, 2019

Thanks @chapsuk! Much appreciated

gzliudan added a commit to gzliudan/XDPoSChain that referenced this pull request Dec 10, 2024
gzliudan added a commit to gzliudan/XDPoSChain that referenced this pull request Dec 10, 2024
gzliudan added a commit to gzliudan/XDPoSChain that referenced this pull request Dec 13, 2024
JukLee0ira pushed a commit to JukLee0ira/XDPoSChain that referenced this pull request Dec 16, 2024
JukLee0ira pushed a commit to JukLee0ira/XDPoSChain that referenced this pull request Dec 16, 2024
JukLee0ira pushed a commit to JukLee0ira/XDPoSChain that referenced this pull request Dec 20, 2024
JukLee0ira pushed a commit to JukLee0ira/XDPoSChain that referenced this pull request Dec 20, 2024
JukLee0ira pushed a commit to JukLee0ira/XDPoSChain that referenced this pull request Dec 22, 2024
JukLee0ira pushed a commit to JukLee0ira/XDPoSChain that referenced this pull request Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants