Metrics for configuration policy errors #84

willkutler · 2022-12-05T18:33:53Z

https://issues.redhat.com/browse/ACM-2216?filter=-1

Adds two metrics: user-error and system-error gauges that increment when certain errors are found. Each is labeled with the parent policy and the policy template that caused the error, as well as the reason for the error.

Also adds a check to ensure the spec of a config policy is not nil. When testing the metrics, I found that if a config policy did not have the spec field, the error message would be incorrect and the parent policy would be compliant.

gparvin

Would you be able to add a test that creates a policy referencing a CRD that doesn't exist and then checks that the error shows up as expected? I don't have an idea for validating the system error path right now.

dhaiducek

If these metrics are only being added to, should they be Counters rather than Gauges?

https://prometheus.io/docs/concepts/metric_types/
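To illustrate the distinction behind this question, here is a stdlib-only model of the two metric types. It is a sketch of the semantics described in the Prometheus documentation linked above, not the Prometheus client library itself; in particular, this model returns an error on a negative delta, which is a simplification chosen for the example.

```go
package main

import "fmt"

// Counter models a Prometheus-style counter: its value only ever increases.
type Counter struct{ value float64 }

// Add increases the counter; negative deltas are rejected in this model.
func (c *Counter) Add(delta float64) error {
	if delta < 0 {
		return fmt.Errorf("counter cannot decrease (delta %v)", delta)
	}
	c.value += delta
	return nil
}

// Gauge models a Prometheus-style gauge: its value can go up, go down,
// or be set to an arbitrary number.
type Gauge struct{ value float64 }

func (g *Gauge) Add(delta float64) { g.value += delta }
func (g *Gauge) Set(v float64)     { g.value = v }

func main() {
	var errorsSeen Counter // total errors observed so far
	_ = errorsSeen.Add(1)
	_ = errorsSeen.Add(1)
	fmt.Println("counter:", errorsSeen.value)

	var currentErrors Gauge // errors present right now
	currentErrors.Add(2)
	currentErrors.Set(0) // a gauge may be reset once the errors are fixed
	fmt.Println("gauge:", currentErrors.value)
}
```

A metric that is only ever added to fits the counter model; a metric meant to reflect the current state needs the gauge's ability to decrease.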

This PR creates new metrics for user and system errors and sends them:

- when a config policy is invalid (no spec or remediationAction) [user]
- when the object template CRD cannot be found [user]
- when the API call to update status fails [system]

It also adds the check for config policies not having a spec, which didn't exist before and caused some weird undefined behavior in policies without specs.

ref: https://issues.redhat.com/browse/ACM-2216?filter=-1

Signed-off-by: Will Kutler <wkutler@redhat.com>
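The user/system split in the commit message can be sketched as a small classification function. The `Condition` constants and the `classify` helper are hypothetical names introduced for illustration; only the mapping itself comes from the commit message.

```go
package main

import "fmt"

// ErrorClass says which of the two metrics an error should be counted under.
type ErrorClass string

const (
	UserError   ErrorClass = "user"
	SystemError ErrorClass = "system"
)

// Condition enumerates the cases named in the commit message.
// The constant names are assumptions made for this sketch.
type Condition int

const (
	MissingSpec Condition = iota
	MissingRemediationAction
	CRDNotFound
	StatusUpdateFailed
)

// classify maps a condition to its error class. The boolean is false for
// conditions the commit message does not cover.
func classify(c Condition) (ErrorClass, bool) {
	switch c {
	case MissingSpec, MissingRemediationAction, CRDNotFound:
		return UserError, true
	case StatusUpdateFailed:
		return SystemError, true
	default:
		return "", false
	}
}

func main() {
	for _, c := range []Condition{MissingSpec, CRDNotFound, StatusUpdateFailed} {
		class, _ := classify(c)
		fmt.Printf("condition %d -> %s error\n", c, class)
	}
}
```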

gparvin · 2022-12-07T15:58:40Z

@dhaiducek Does the .Add(1) increment the value of the metric or does it add a new entry for the metric at the current time? To me it seems like the metric should be a gauge that reflects the number of errors in the current policy. If the policy gets fixed and there are no more errors, I think a gauge that indicates there are no errors makes sense instead of showing a counter that reflects errors in previous versions of the policy.

Edit: Thinking about this more... I guess after a fix there will just no longer be entries created for the metric. Right?
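On the question of what `.Add(1)` does: for a labeled counter, each distinct combination of label values identifies one series, and adding to it increases that series' value rather than creating a new entry per call. The sketch below models this with a plain map instead of the Prometheus client library; the `labels` struct and `CounterVec` type are hypothetical, and the label names follow the PR description (parent policy, policy template, reason).

```go
package main

import "fmt"

// labels identifies one series, mirroring the labels in the PR description.
type labels struct {
	Policy   string
	Template string
	Reason   string
}

// CounterVec is a stdlib-only model of a labeled counter.
type CounterVec struct {
	series map[labels]float64
}

func NewCounterVec() *CounterVec {
	return &CounterVec{series: map[labels]float64{}}
}

// Add increments the series for l, creating it on first use.
func (v *CounterVec) Add(l labels, delta float64) { v.series[l] += delta }

// Value returns the current value of the series for l (0 if absent).
func (v *CounterVec) Value(l labels) float64 { return v.series[l] }

// Len returns the number of distinct series.
func (v *CounterVec) Len() int { return len(v.series) }

func main() {
	userErrors := NewCounterVec()
	l := labels{Policy: "parent-policy", Template: "config-policy", Reason: "crd-not-found"}

	// Two evaluations hit the same error: one series, incremented twice.
	userErrors.Add(l, 1)
	userErrors.Add(l, 1)

	// After the policy is fixed nothing calls Add again, so the series
	// simply stops growing; its last value remains visible.
	fmt.Println(userErrors.Value(l), userErrors.Len())
}
```

Under this model a fixed policy shows up as a flat line rather than a drop to zero, which is the practical difference from the gauge behavior discussed in this comment.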

gparvin

Thanks for adding the test! This looks good even though I have some fuzziness on this being a counter. Look forward to trying it out.

openshift-ci · 2022-12-07T17:28:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gparvin, willkutler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [gparvin,willkutler]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the dco-signoff: no label Dec 5, 2022

openshift-ci bot requested review from ChunxiAlexLuo and gparvin December 5, 2022 18:33

openshift-ci bot added the approved label Dec 5, 2022

willkutler requested review from dhaiducek and removed request for ChunxiAlexLuo December 5, 2022 18:35

willkutler force-pushed the error-metrics branch from 3130758 to 6d3ee40 Compare December 5, 2022 18:43

openshift-ci bot added dco-signoff: yes and removed dco-signoff: no labels Dec 5, 2022

gparvin requested changes Dec 5, 2022

openshift-ci bot assigned gparvin Dec 5, 2022

willkutler force-pushed the error-metrics branch from 096619a to 1b8207e Compare December 6, 2022 16:03

openshift-ci bot added dco-signoff: no and removed dco-signoff: yes labels Dec 6, 2022

willkutler force-pushed the error-metrics branch from 1b8207e to 635201e Compare December 6, 2022 16:09

openshift-ci bot added dco-signoff: yes and removed dco-signoff: no labels Dec 6, 2022

dhaiducek reviewed Dec 6, 2022

willkutler force-pushed the error-metrics branch from 26039a8 to 49723a1 Compare December 6, 2022 23:07

gparvin approved these changes Dec 7, 2022

openshift-ci bot added the lgtm label Dec 7, 2022

openshift-merge-robot merged commit f5af0a2 into open-cluster-management-io:main Dec 7, 2022

magic-mirror-bot bot mentioned this pull request Dec 7, 2022

🤖 Sync from open-cluster-management-io/config-policy-controller: #84 stolostron/config-policy-controller#383

Merged

dhaiducek mentioned this pull request Dec 12, 2022

Add template-sync metrics open-cluster-management-io/governance-policy-framework-addon#23

Merged
