Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add status metrics on prometheus #4251

Merged
merged 1 commit into from
Feb 8, 2022

Conversation

rafaelreinert
Copy link
Contributor

This PR is adding the status metrics on Prometheus register with these metrics:

Metric name Metric type Description
plugin_status_gauge gauge Number of plugins by name and status.
bundle_loaded_counter counter Number of bundles loaded with success.
bundle_failed_load_counter counter Number of bundles that failed to load.
last_bundle_request gauge Last bundle request in UNIX nanoseconds.
last_success_bundle_activation gauge Last successfully bundle activation in UNIX nanoseconds.
last_success_bundle_download gauge Last successfully bundle download in UNIX nanoseconds.
last_success_bundle_request gauge Last successfully bundle request in UNIX nanoseconds.
bundle_loading_duration_ns histogram A histogram of duration for bundle loading.

In order to give control to the user if the metrics should be exported or not I've created a new config on status:

status:
  prometheus: true

And if Prometheus be true, the status plugin will register these metrics on start and will update the metrics using the UpdateRequestV1 at every oneShot function call.

These two issues are related of this pr. #1472 and #1506

@srenatus
Copy link
Contributor

Sorry for the radio silence on this. I'm going to have a look today.

Copy link
Contributor

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing. Added a few comments, please bear with me 🙃


var (
pluginStatus = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert with prometheus, so please bear with me on the naive question: Is it common to use gauges like that? From the docs,

Gauge is a Metric that represents a single numerical value that can arbitrarily go up and down.

A Gauge is typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of running goroutines.

Now, I'm not sure how "status" and the unix nano metrics below fit into this. 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is common, as the Gauge is an int64 and can be set as any value it is perfect for timestamp.
For example:
https://github.com/kubernetes/kube-state-metrics/blob/master/docs/cronjob-metrics.md the metric kube_cronjob_next_schedule_time
https://github.com/prometheus/node_exporter the metric node_time_seconds

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose ultimately, it depends on the types of metrics you'd like to put on a dashboard...

I think I can follow the argument wrt having the timestamps in guages, but what is plugin_status_gauge about? From what I can tell, we're setting the value to 1 and use the label to carry the status... is that useful? I guess I don't see what this adds over the other gauges...? should this be a last_status_change timestamp? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @srenatus , I haven’t made this clear what I was willing to do with each metric, my bad sorry.
The table below has how I use each metric (The queries are just examples).

Metric name Metric type Description What questions it answers (dashboard) alert
plugin_status_gauge gauge Number of plugins by name and status. - How many plugins I have? 'sum by(instance)(plugin_status_gauge{})'. - which instance has the plugin *? 'plugin_status_gauge{name="x"}' -Has any plugin unhealthy? 'plugin_status_gauge{status!~"OK,NOT_READY"} > 0'
bundle_loaded_counter counter Number of bundles loaded with success. How Many successful pooling I Do Per minute? 'sum by(name)(rate(bundle_loaded_counter{}[1m]))'
bundle_failed_load_counter counter Number of bundles that failed to load. How Many unsuccessfull pooling pooling I Do Per minute ?'sum by(instance, name, code, message)(rate(bundle_failed_load_counter{}[1m]))' The Erro-rate is > 10% ? '((sum by(instance, name)(rate(bundle_failed_load_counter{}[1m]))+sum by(instance, name)(rate(bundle_loaded_counter{}[1m])))/sum by(instance, name)(rate(bundle_failed_load_counter{}[1m]))>10'
last_bundle_request gauge Last bundle request in UNIX nanoseconds. How old was the last bundle request '(time()-((max by(name)(last_bundle_request{instance="$instance", name="$bundle"})/1e+9)))'
last_success_bundle_activation gauge Last successfully bundle activation in UNIX nanoseconds. How old was the last successfully bundle activation '(time()-((max by(name)(last_success_bundle_activation{instance="$instance", name="$bundle"})/1e+9)))'
last_success_bundle_download gauge Last bundle request in UNIX nanoseconds. How old was the last successfully bundle download '(time()-((max by(name)(last_success_bundle_download{instance="$instance", name="$bundle"})/1e+9)))'
last_success_bundle_request gauge Last successfully bundle request in UNIX nanoseconds. How old was the last successfully bundle request '(time()-((max by(name)(last_success_bundle_request{instance="$instance", name="$bundle"})/1e+9)))'
bundle_loading_duration_ns histogram A histogram of duration for bundle loading. What is my 95% bundle loading duration? 'histogram_quantile(0.95, sum(rate(bundle_loading_duration_ns_bucket{instance="$instance",name=~"$bundle"}[1m])) by (le,name))'

Thinking about which questions The status metrics should answer, what questions would you do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For exemple part of my dash : image

plugins/status/plugin.go Outdated Show resolved Hide resolved
plugins/status/plugin.go Outdated Show resolved Hide resolved
internal/prometheus/prometheus.go Outdated Show resolved Hide resolved
plugins/status/plugin.go Show resolved Hide resolved
docs/content/monitoring.md Outdated Show resolved Hide resolved
Copy link
Contributor

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bearing with me and thanks for adding tests. I've got a question inline on your opinion about making the metrics registration part of the plugin manager.

return p.registry.Register(c)
}

// MustRegister register the collectors on OPA prometheus registry and panics when an error occurs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? I don't think it's used, let's just keep this minimal. 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented these methods here only to implement the interface prometheus.Registerer, If Prometheus already has an interface for registering, I prefer to use that instead of creating a specific for OPA.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Thanks, that makes sense.

p.registry.MustRegister(cs...)
}

// Unregister unregister the collectors on OPA prometheus registry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, do you think we'll need the ability to unregister a collector?

Copy link
Contributor Author

@rafaelreinert rafaelreinert Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same explication above.

metrics metrics.Metrics
readyOnce sync.Once
logger logging.Logger
prometheusRegister prometheus.Registerer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the back and forth, but in my head, it would perhaps look kind of nice if the prometheus.Registerer mechanism was part of the plugin manager: every plugin could then register its metrics with the plugin manager. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks nice, I've seen an issue about that #2348. I will try to implement the interface prometheus.Registerer on the manager.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @srenatus , in this commit 3cce236 I've moved the Prometheus.Register to the plugin manager. with that, all plugins can register metrics.
In my point of view as the plugin manager can be used in places without Prometheus (as an example the SDK)I prefer to expose the prometheus.Registerer as struct variable than register methods inside the plugin manager struct.
What do you think about that?

@rafaelreinert
Copy link
Contributor Author

Hey @srenatus , could you review it again? Thanks

srenatus
srenatus previously approved these changes Feb 7, 2022
Copy link
Contributor

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some copy-edits on the docs, but this looks good to go. Thanks for taking the time!

plugins/status/plugin.go Outdated Show resolved Hide resolved
docs/content/management-status.md Show resolved Hide resolved
docs/content/management-status.md Outdated Show resolved Hide resolved
docs/content/monitoring.md Outdated Show resolved Hide resolved
@rafaelreinert
Copy link
Contributor Author

Hey @srenatus I've just finished the last requests and updated the branch with the main ;) Please review it again.
After the approval what do I need to do? just merge it to main?
Thanks a lot for the review and I hope do other contributions soon ;)

@srenatus
Copy link
Contributor

srenatus commented Feb 8, 2022

@rafaelreinert thanks for bearing with me. As a final step, could you please squash your commits into one, with a brief description what's going on here? We'll just merge it then.

…e monitoring

Signed-off-by: rafael otero reinert <rafaelreinert@gmail.com>
@rafaelreinert
Copy link
Contributor Author

@srenatus done

@srenatus srenatus merged commit 8569551 into open-policy-agent:main Feb 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants