-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add status metrics on prometheus #4251
Conversation
b49dd48
to
02ff4f6
Compare
Sorry for the radio silence on this. I'm going to have a look today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for contributing. Added a few comments, please bear with me 🙃
|
||
var ( | ||
pluginStatus = prometheus.NewGaugeVec( | ||
prometheus.GaugeOpts{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not an expert with prometheus, so please bear with me on the naive question: Is it common to use gauges like that? From the docs,
Gauge is a Metric that represents a single numerical value that can arbitrarily go up and down.
A Gauge is typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of running goroutines.
Now, I'm not sure how "status" and the unix nano metrics below fit into this. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is common, as the Gauge is an int64 and can be set as any value it is perfect for timestamp.
For example:
https://github.com/kubernetes/kube-state-metrics/blob/master/docs/cronjob-metrics.md the metric kube_cronjob_next_schedule_time
https://github.com/prometheus/node_exporter the metric node_time_seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose ultimately, it depends on the types of metrics you'd like to put on a dashboard...
I think I can follow the argument wrt having the timestamps in guages, but what is plugin_status_gauge
about? From what I can tell, we're setting the value to 1 and use the label to carry the status... is that useful? I guess I don't see what this adds over the other gauges...? should this be a last_status_change
timestamp? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @srenatus , I haven’t made this clear what I was willing to do with each metric, my bad sorry.
The table below has how I use each metric (The queries are just examples).
Metric name | Metric type | Description | What questions it answers (dashboard) | alert |
---|---|---|---|---|
plugin_status_gauge | gauge | Number of plugins by name and status. | - How many plugins I have? 'sum by(instance)(plugin_status_gauge{})'. - which instance has the plugin *? 'plugin_status_gauge{name="x"}' | -Has any plugin unhealthy? 'plugin_status_gauge{status!~"OK,NOT_READY"} > 0' |
bundle_loaded_counter | counter | Number of bundles loaded with success. | How Many successful pooling I Do Per minute? 'sum by(name)(rate(bundle_loaded_counter{}[1m]))' | |
bundle_failed_load_counter | counter | Number of bundles that failed to load. | How Many unsuccessfull pooling pooling I Do Per minute ?'sum by(instance, name, code, message)(rate(bundle_failed_load_counter{}[1m]))' | The Erro-rate is > 10% ? '((sum by(instance, name)(rate(bundle_failed_load_counter{}[1m]))+sum by(instance, name)(rate(bundle_loaded_counter{}[1m])))/sum by(instance, name)(rate(bundle_failed_load_counter{}[1m]))>10' |
last_bundle_request | gauge | Last bundle request in UNIX nanoseconds. | How old was the last bundle request '(time()-((max by(name)(last_bundle_request{instance= |
|
last_success_bundle_activation | gauge | Last successfully bundle activation in UNIX nanoseconds. | How old was the last successfully bundle activation '(time()-((max by(name)(last_success_bundle_activation{instance= |
|
last_success_bundle_download | gauge | Last bundle request in UNIX nanoseconds. | How old was the last successfully bundle download '(time()-((max by(name)(last_success_bundle_download{instance= |
|
last_success_bundle_request | gauge | Last successfully bundle request in UNIX nanoseconds. | How old was the last successfully bundle request '(time()-((max by(name)(last_success_bundle_request{instance= |
|
bundle_loading_duration_ns | histogram | A histogram of duration for bundle loading. | What is my 95% bundle loading duration? 'histogram_quantile(0.95, sum(rate(bundle_loading_duration_ns_bucket{instance="$instance",name=~"$bundle"}[1m])) by (le,name))' |
Thinking about which questions The status metrics should answer, what questions would you do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a18001c
to
3f3a092
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bearing with me and thanks for adding tests. I've got a question inline on your opinion about making the metrics registration part of the plugin manager.
return p.registry.Register(c) | ||
} | ||
|
||
// MustRegister register the collectors on OPA prometheus registry and panics when an error occurs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this? I don't think it's used, let's just keep this minimal. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have implemented these methods here only to implement the interface prometheus.Registerer, If Prometheus already has an interface for registering, I prefer to use that instead of creating a specific for OPA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Thanks, that makes sense.
p.registry.MustRegister(cs...) | ||
} | ||
|
||
// Unregister unregister the collectors on OPA prometheus registry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here, do you think we'll need the ability to unregister a collector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same explication above.
plugins/discovery/discovery.go
Outdated
metrics metrics.Metrics | ||
readyOnce sync.Once | ||
logger logging.Logger | ||
prometheusRegister prometheus.Registerer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the back and forth, but in my head, it would perhaps look kind of nice if the prometheus.Registerer mechanism was part of the plugin manager: every plugin could then register its metrics with the plugin manager. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks nice, I've seen an issue about that #2348. I will try to implement the interface prometheus.Registerer on the manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @srenatus , in this commit 3cce236 I've moved the Prometheus.Register to the plugin manager. with that, all plugins can register metrics.
In my point of view as the plugin manager can be used in places without Prometheus (as an example the SDK)I prefer to expose the prometheus.Registerer as struct variable than register methods inside the plugin manager struct.
What do you think about that?
vendor/github.com/prometheus/client_golang/prometheus/testutil/promlint/promlint.go
Show resolved
Hide resolved
Hey @srenatus , could you review it again? Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some copy-edits on the docs, but this looks good to go. Thanks for taking the time!
Hey @srenatus I've just finished the last requests and updated the branch with the main ;) Please review it again. |
@rafaelreinert thanks for bearing with me. As a final step, could you please squash your commits into one, with a brief description what's going on here? We'll just merge it then. |
…e monitoring Signed-off-by: rafael otero reinert <rafaelreinert@gmail.com>
e49b2c4
to
9f4187d
Compare
@srenatus done |
This PR is adding the status metrics on Prometheus register with these metrics:
In order to give control to the user if the metrics should be exported or not I've created a new config on status:
And if Prometheus be true, the status plugin will register these metrics on start and will update the metrics using the UpdateRequestV1 at every oneShot function call.
These two issues are related of this pr. #1472 and #1506