
Bug 1952576: csv_succeeded metric not present #2213

Closed

Conversation

josefkarasek
Contributor

csv_succeeded metric is lost between pod restarts.
This is because this metric is only emitted when CSV.Status is changed.

Description of the change:
Emit csv_succeeded/csv_abnormal metric during every CSV sync loop.

Motivation for the change:
csv_succeeded metric is lost between pod restarts.
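
A minimal sketch of the shape of the change, with illustrative stand-in types and names (only the metrics.EmitCSVMetric call itself appears in this PR's diff; the real ClusterServiceVersion type, metrics helper, and sync handler live in OLM's packages):

package main

import "fmt"

// Illustrative stand-ins for OLM's ClusterServiceVersion type and metrics helper.
type csvStatus struct{ Phase string }

type clusterServiceVersion struct {
	Name      string
	Namespace string
	Status    csvStatus
}

func emitCSVMetric(oldCSV, newCSV *clusterServiceVersion) {
	fmt.Printf("csv_succeeded{name=%q,namespace=%q} phase=%s\n",
		newCSV.Name, newCSV.Namespace, newCSV.Status.Phase)
}

// syncCSV sketches the change: previously the metric was emitted only when the
// CSV status changed, so a freshly restarted pod that never observed a status
// transition never repopulated csv_succeeded. Emitting on every sync loop keeps
// the metric present across restarts.
func syncCSV(in, out *clusterServiceVersion) {
	// ... reconcile logic that may or may not mutate out.Status ...

	// always emit csv metrics
	emitCSVMetric(in, out)
}

func main() {
	csv := &clusterServiceVersion{Name: "etcdoperator.v0.9.4", Namespace: "olm", Status: csvStatus{Phase: "Succeeded"}}
	syncCSV(csv, csv)
}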

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

@openshift-ci

openshift-ci bot commented Jun 23, 2021

@josefkarasek: This pull request references Bugzilla bug 1952576, which is invalid:

  • expected the bug to target the "4.9.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1952576: csv_succeeded metric not present

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/severity-medium (Referenced Bugzilla bug's severity is medium for the branch this PR is targeting.) and bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.) labels Jun 23, 2021
@openshift-ci

openshift-ci bot commented Jun 23, 2021

Hi @josefkarasek. Thanks for your PR.

I'm waiting for an operator-framework member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) label Jun 23, 2021
@openshift-ci openshift-ci bot requested a review from hasbro17 June 23, 2021 13:16
@openshift-ci

openshift-ci bot commented Jun 23, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: josefkarasek
To complete the pull request process, please assign kevinrizza after the PR has been reviewed.
You can assign the PR to them by writing /assign @kevinrizza in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested a review from timflannagan June 23, 2021 13:16
`csv_succeeded` metric is lost between pod restarts.
This is because this metric is only emitted when CSV.Status is changed.

Signed-off-by: Josef Karasek <jkarasek@redhat.com>
}
}

// always emit csv metrics
metrics.EmitCSVMetric(clusterServiceVersion, outCSV)
Contributor

Are there any cardinality concerns with always emitting CSV metrics?

Contributor Author

I think the metric uses a good, qualified name, which is always unique for one CSV:

csv_succeeded{name="etcdoperator.v0.9.4",namespace="olm",version="0.9.4"} 1
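
For reference, a gauge with that label set could be declared like this with the Prometheus Go client (the metric name and labels follow the sample above; the variable name and help text are assumptions, not OLM's actual metrics code):

package main

import "github.com/prometheus/client_golang/prometheus"

// csvSucceeded is keyed by the CSV's name, namespace, and version, which
// together identify a single CSV, so each CSV contributes at most one series.
var csvSucceeded = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "csv_succeeded",
		Help: "1 if the named CSV reached the Succeeded phase, 0 otherwise.",
	},
	[]string{"name", "namespace", "version"},
)

func main() {
	prometheus.MustRegister(csvSucceeded)

	// Setting the same label triple again overwrites the existing sample rather
	// than creating a new series, so cardinality stays bounded by the number of
	// CSVs on the cluster.
	csvSucceeded.WithLabelValues("etcdoperator.v0.9.4", "olm", "0.9.4").Set(1)
}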

Contributor

Agreed, that doesn't look crazy to me. But while poking around that metrics package I noticed we're emitting other metrics besides csv_succeeded, and I'm trying to wrap my head around whether emitting all of those CSV-related metrics on a per-step basis can lead to problems for the core monitoring stack.

Contributor

Ah, I found #1099. @awgreene, any idea whether we can safely emit metrics on a per-step basis now?

Contributor Author

Another approach to fixing this bug would be to emit the metric for all CSVs during pod startup and then update it only when a change happens.

Contributor

@timflannagan I don't think there's any cardinality concern here. csv_succeeded is a Prometheus gauge, and within EmitCSVMetrics we always first delete the old metric for the CSV being synced and then emit a new one, setting the gauge value to 1 or 0 (succeeded/did not succeed); see the sketch at the end of this comment. Even if we were not deleting the old metric, IIRC metric points for a unique set of label values are only emitted once, i.e. they're always unique data points in the set of emitted metrics, which is what @josefkarasek clarified in the first comment.

However, I'm not convinced this actually solves the problem. @josefkarasek the original problem was that we were only edge triggering this metric, i.e. emitting it whenever the controller syncs a ClusterServiceVersion (syncClusterServiceVersion holds the logic for when that happens), and a sync only happens when there's a change to the CSV object on the cluster. But we need some way to level drive this metric too, which is what the first part of your last comment describes:

"update it only when a change happens" → edge triggered

"emit the metric for all CSVs during pod startup" → level driven
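
A minimal sketch of the delete-then-set behaviour described above, using the Prometheus Go client (illustrative names; this is not the actual EmitCSVMetric implementation):

package main

import "github.com/prometheus/client_golang/prometheus"

var csvSucceeded = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{Name: "csv_succeeded", Help: "1 if the CSV is in phase Succeeded, 0 otherwise."},
	[]string{"name", "namespace", "version"},
)

// emitCSVMetric drops whatever series was recorded for the previous version of
// the CSV, then records exactly one fresh series for the current version, so
// repeated calls never grow cardinality.
func emitCSVMetric(oldName, oldNS, oldVersion, name, ns, version string, succeeded bool) {
	csvSucceeded.DeleteLabelValues(oldName, oldNS, oldVersion)

	value := 0.0
	if succeeded {
		value = 1.0
	}
	csvSucceeded.WithLabelValues(name, ns, version).Set(value)
}

func main() {
	prometheus.MustRegister(csvSucceeded)
	emitCSVMetric("etcdoperator.v0.9.2", "olm", "0.9.2",
		"etcdoperator.v0.9.4", "olm", "0.9.4", true)
}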

Contributor Author

I'm assuming that all CSVs are queued up during pod start and reconciled, so my assumption was that this approach is edge and level driven at the same time. From what you're saying, it sounds like that assumption doesn't hold.

Contributor

Although CSVs are queued up during pod start, that is still edge triggering, the trigger here being the queuing of the CSV. True level-driven behavior is when you query the state of the cluster and reconcile it with the desired state. There's always a chance with edge triggers that we'll miss an event, so querying for the existing CSVs and emitting metrics for them on pod restart is the most foolproof way to solve this problem.
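
A rough sketch of that level-driven startup pass, with a hypothetical lister interface and a minimal CSV struct standing in for OLM's real ClusterServiceVersion type and generated listers:

package main

import "github.com/prometheus/client_golang/prometheus"

// Stand-ins for illustration only.
type csv struct {
	Name, Namespace, Version string
	Succeeded                bool
}

type csvLister interface {
	ListAll() ([]csv, error)
}

var csvSucceeded = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{Name: "csv_succeeded", Help: "1 if the CSV is in phase Succeeded, 0 otherwise."},
	[]string{"name", "namespace", "version"},
)

// emitAllCSVMetrics queries the current state of the cluster once (level driven)
// and records a gauge sample for every CSV that already exists, so a restarted
// pod does not have to wait for the next edge (a CSV update) to repopulate the
// metric.
func emitAllCSVMetrics(lister csvLister) error {
	csvs, err := lister.ListAll()
	if err != nil {
		return err
	}
	for _, c := range csvs {
		value := 0.0
		if c.Succeeded {
			value = 1.0
		}
		csvSucceeded.WithLabelValues(c.Name, c.Namespace, c.Version).Set(value)
	}
	return nil
}

func main() {
	// Would be called once during operator startup, before the informers start
	// delivering individual CSV events.
	prometheus.MustRegister(csvSucceeded)
}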

@timflannagan
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.) label and removed the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) label Jun 23, 2021
@josefkarasek
Contributor Author

How can I fix the bugzilla/invalid-bug label on the PR?

@timflannagan
Contributor

@josefkarasek The bug bot is complaining about the current state of the BZ: #2213 (comment):

expected the bug to target the "4.9.0" release, but it targets "---" instead

In order to fix this, update the BZ's "Target Release" dropdown to 4.9.0 instead of the default value (empty release, "---"), save the BZ, and then comment /bugzilla refresh here.

@josefkarasek
Contributor Author

/bugzilla refresh

@openshift-ci openshift-ci bot added the bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) label and removed the bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.) label Jun 23, 2021
@openshift-ci

openshift-ci bot commented Jun 23, 2021

@josefkarasek: This pull request references Bugzilla bug 1952576, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

Requesting review from QA contact:
/cc @jianzhangbjz

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci

openshift-ci bot commented Jun 23, 2021

@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: jianzhangbjz.

Note that only operator-framework members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@josefkarasek: This pull request references Bugzilla bug 1952576, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

Requesting review from QA contact:
/cc @jianzhangbjz

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@anik120 anik120 left a comment

We'd also want to add a test for this as proof of concept.


@anik120
Contributor

anik120 commented Jun 23, 2021

Also, @timflannagan, we don't need the bug number in the PR title, right? We'll only need it when we downstream the PR?

@josefkarasek josefkarasek mentioned this pull request Jun 28, 2021
@josefkarasek
Contributor Author

Closing in favor of #2216

@openshift-ci

openshift-ci bot commented Jul 12, 2021

@josefkarasek: This pull request references Bugzilla bug 1952576. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

Bug 1952576: csv_succeeded metric not present

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
