Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics for configuration policy errors #84

Conversation

willkutler
Copy link
Contributor

https://issues.redhat.com/browse/ACM-2216?filter=-1

Adds 2 metrics - user error and system error gauges that increment when certain errors are found. They are labeled with the parent policy and policy template that caused the error, as well as the reason for the error.

Also adds a check to ensure the spec of a config policy is not nil - when testing the metrics, I found that if a config policy did not have the spec field, the error message would be incorrect and the parent policy would be compliant

Copy link
Member

@gparvin gparvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be able to add a test that creates a policy referencing a CRD that doesn't exist and then making sure the error shows up as expected? I don't have an idea for validating the system error path right now.

Copy link
Member

@dhaiducek dhaiducek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these metrics are only being added to, should they be Counters rather than Gauges?

https://prometheus.io/docs/concepts/metric_types/

This PR creates new metrics for user and system errors and sends them when a config policy is invalid (no spec or remediationAction) [user], when the object template CRD cannot be found [user] and when the API call to update status fails [system]. It also adds the check for config policies not having a spec, which didn't exist before and caused some weird undefined behavior in policies without specs.

- ref: https://issues.redhat.com/browse/ACM-2216?filter=-1

Signed-off-by: Will Kutler <wkutler@redhat.com>
@gparvin
Copy link
Member

gparvin commented Dec 7, 2022

@dhaiducek Does the .Add(1) increment the value of the metric or does it add a new entry for the metric at the current time? To me it seems like the metric should be a gauge that reflects the number of errors in the current policy. If the policy gets fixed and there are no more errors, I think a gauge that indicates there are no errors makes sense instead of showing a counter that reflects errors in previous versions of the policy.

Edit: Thinking about this more... I guess after a fix there will just no longer be entries created for the metric. Right?

Copy link
Member

@gparvin gparvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the test! This looks good even though I have some fuzziness on this being a counter. Look forward to trying it out.

@openshift-ci openshift-ci bot added the lgtm label Dec 7, 2022
@openshift-ci
Copy link

openshift-ci bot commented Dec 7, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gparvin, willkutler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants