OCPBUGS-1803: Remove compliance_operator_compliance_scan_error_total … #223
@rhmdnd: This pull request references Jira Issue OCPBUGS-1803, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Maybe we just need to remove the error message labels from the metric, instead of completely removing the error metric? Just read Matt's comments, and maybe removing the error metric makes sense here.
What exactly does this metric provide? Can you please elaborate a bit more about it? Before we remove it we should notify the users: ideally set a deprecation note first and then remove it in the next release. If we decide to go the latter route, we can issue an internal KCS for Red Hat customers and send a T3 blog to share with the TAMs/CSM to spread the knowledge that way and move on.
Uhm... is this really the solution? I'd say this really is an issue with the metric's cardinality, and we should remove the error message from the metric's labels instead. That would reduce the cardinality greatly. IMO, this is a better solution than removing the metric entirely, as it still gives operators a per-metric view of the error rate in the deployments, as opposed to leaving them in the dark.
I'm not even sure if adding the scan name as a metric label is useful. We could probably just have a global error counter.
Prometheus developer here, I agree that keeping the metric but removing the high-cardinality labels is probably better than removing it.
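To make the cardinality argument above concrete, here is a minimal sketch in plain Python. It is not the operator's Go code and does not use the Prometheus client library; it only models a counter as a dictionary keyed by label values, where each distinct key stands for one time series. The label names (`name`, `error`), the scan names, and the error strings are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical scan failures as (scan name, error message) pairs.
FAILURES = [
    ("ocp4-cis", "timeout after 301s waiting for pod a1"),
    ("ocp4-cis", "timeout after 302s waiting for pod b2"),
    ("ocp4-cis", "timeout after 303s waiting for pod c3"),
    ("rhcos4-e8", "failed to parse result 0001.xml"),
    ("rhcos4-e8", "failed to parse result 0002.xml"),
]


def count_series(failures, label_fn):
    """Return a Counter keyed by label tuple; each key models one time series."""
    series = Counter()
    for name, error in failures:
        series[label_fn(name, error)] += 1
    return series


# Labels (name, error): every distinct message creates a new series.
with_error = count_series(FAILURES, lambda name, error: (name, error))
# Label (name,): one series per scan, regardless of the message text.
name_only = count_series(FAILURES, lambda name, error: (name,))
# No labels: a single global error counter.
global_only = count_series(FAILURES, lambda name, error: ())

print(len(with_error), len(name_only), len(global_only))  # 5 2 1
```

With the error text as a label, the number of series grows with the number of distinct messages; with only the scan name it is bounded by the number of scans, and with no labels it is always one.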
Thanks for the feedback - I'll respin this.
/hold for test
This looks better IMO. Thanks!
/retest
@rhmdnd seems you need to update the tests:
…rror_total metric
This metric contained the scan error, which can exceed lengths of 2k (sometimes 11k) and causes resource issues with Prometheus and when integrating metrics into different storage backends. This commit removes the error to reduce the cardinality of the metric and follow Prometheus best practices: https://prometheus.io/docs/practices/naming/#labels
Fixed, and I have a clean run locally. Need to update the metric to actually remove the error label.
The metric provided the scan name and the scan error. The error could be any number of different things, which increases the cardinality of the metric (potentially bloating Prometheus and going against Prometheus best practices).
We decided to keep the metric but remove the error from its labels, which reduces cardinality and makes the metric more useful.
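As a rough illustration of the size problem described in the commit message, the sketch below renders one sample line in the Prometheus text exposition format with and without an error label. It is plain Python, not the operator's code; the label names `name` and `error`, the scan name `example-scan`, and the 11,000-character stand-in error are assumptions for illustration only.

```python
METRIC = "compliance_operator_compliance_scan_error_total"


def escape_label_value(value):
    """Escape backslash, double quote, and newline for a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(metric, labels, value):
    """Render one sample line in the Prometheus text exposition format."""
    if not labels:
        return f"{metric} {value}"
    pairs = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels.items())
    return f"{metric}{{{pairs}}} {value}"


long_error = "E" * 11000  # stand-in for an 11k-character scan error
before = render_sample(METRIC, {"name": "example-scan", "error": long_error}, 1)
after = render_sample(METRIC, {"name": "example-scan"}, 1)

print(after)
print(len(before) - len(after))  # extra characters carried by the error label
```

Every scrape and every stored series would carry that extra label text, which is why dropping the label helps both Prometheus and downstream storage backends.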
Verification passed with 4.13.0-0.nightly-2023-03-07-081835 + code in the PR:
/label qe-approved
/jira refresh
@xiaojiey: This pull request references Jira Issue OCPBUGS-1803, which is invalid:
/jira refresh
@xiaojiey: This pull request references Jira Issue OCPBUGS-1803, which is invalid:
/jira refresh
@xiaojiey: This pull request references Jira Issue OCPBUGS-1803, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
I think we have an issue in some of our cleanup code. I've seen the failure pop up a couple of times, but none of the tests fail directly. Opened #258 to track a fix.
/retest
@jhrozek @Vincent056 should be ready for another review from dev.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: JAORMX, jhrozek, rhmdnd The full list of commands accepted by this bot can be found here. The pull request process is described here
Removing the hold label since this was verified.
/retest-required DNS issues in CI should be resolved now.
@rhmdnd: Jira Issue OCPBUGS-1803: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-1803 has been moved to the MODIFIED state. In response to this:
…metric
This metric contained the scan error, which can exceed lengths of 2k (sometimes 11k) and causes resource issues with Prometheus and when integrating metrics into different storage backends.
This commit removes the metric since it goes against Prometheus best practices:
https://prometheus.io/docs/practices/naming/#labels