Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 'ceph_disk_occupation' query expressions #2812

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

aruniiird
Copy link
Contributor

@aruniiird aruniiird commented Sep 19, 2024

Need to address changes in 'ceph_disk_occupation' metric labels.

What is the change in 'ceph_disk_occupation' metric?
'ceph_disk_occupation' result no longer has 'exported_instance' label, instead it has 'instance' label.

What is the issue we are facing because of it?
We are hitting 'PrometheusRuleFailures' due to this new label change in our alerts / rules.
Second issue is that we are not seeing any results for some of the query expressions.

What is the solution?
Update the query expressions, change 'exported_instance' to 'instance'. Any 'label_replace' action which changes 'exported_instance' label to 'instance' label is no longer required (as the 'instance' label is directly available now)

Copy link
Contributor

openshift-ci bot commented Sep 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aruniiird
Once this PR has been reviewed and has the lgtm label, please assign iamniting for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aruniiird
Copy link
Contributor Author

@weirdwiz , @jmolmo , @umangachapagain please take a look.
Tested all the changed expressions in a live cluster with expected results (where as previous queries were either not giving any result or giving prometheus-exprn-evaluation-errors).

@jmolmo
Copy link
Member

jmolmo commented Sep 24, 2024

I think that the change is ok.
Just to comment that the metric "'ceph_disk_occupation' comes from the "disk_occupation" metric generated by the prometheus manager module.

As you can see, this metric never had the label "exported_instance", So the change in the label name probably comes from the ODF side. Probably you will need to check and understand when and why this label changed. And after that review that it does not impact in another metrics.

Need to address changes in 'ceph_disk_occupation' metric labels.

What is the change in 'ceph_disk_occupation' metric?
'ceph_disk_occupation' result no longer has 'exported_instance' label,
instead it has 'instance' label.

What is the issue we are facing because of it?
We are hitting 'PrometheusRuleFailures' due to this new label changes
in our alerts / rules, where this metric is used.
Second issue is that we are not seeing any results for some of the
query expressions.

What is the solution?
Update the query expressions, change 'exported_instance' to 'instance'.
Any 'label_replace' action which changes 'exported_instance' label to
'instance' label is no longer required (as the 'instance' label is
directly available now)

Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
Copy link
Contributor

openshift-ci bot commented Sep 24, 2024

@aruniiird: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ocs-operator-bundle-e2e-aws 2d544db link true /test ocs-operator-bundle-e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@malayparida2000
Copy link
Contributor

@aruniiird Would this be a blocker in 4.17?

@aruniiird
Copy link
Contributor Author

I think that the change is ok. Just to comment that the metric "'ceph_disk_occupation' comes from the "disk_occupation" metric generated by the prometheus manager module.

As you can see, this metric never had the label "exported_instance", So the change in the label name probably comes from the ODF side. Probably you will need to check and understand when and why this label changed. And after that review that it does not impact in another metrics.

Correct @jmolmo . Checked in the ODF / OCS side, couldn't find much. There might be a chance that this records/alerts where not working for a long time. Current changes are working (with this PR) thus enabling those named records and alerts from now on wards.

@aruniiird
Copy link
Contributor Author

@aruniiird Would this be a blocker in 4.17?

@malayparida2000 , this won't be a blocker (as the query may not have worked for some time), but this is a good candidate for a 4.17 z-stream release and for newer (4.18) releases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants