Bug 1952576: csv_succeeded metric not present #2213

Closed
5 changes (3 additions, 2 deletions) in pkg/controller/operators/olm/operator.go

@@ -1115,11 +1115,12 @@ func (a *Operator) syncClusterServiceVersion(obj interface{}) (syncError error)
             } else {
                 syncError = fmt.Errorf("error transitioning ClusterServiceVersion: %s and error updating CSV status: %s", syncError, updateErr)
             }
-        } else {
-            metrics.EmitCSVMetric(clusterServiceVersion, outCSV)
         }
     }
 
+    // always emit csv metrics
+    metrics.EmitCSVMetric(clusterServiceVersion, outCSV)
Contributor:

Are there any cardinality concerns with always emitting CSV metrics?

Contributor Author:

I think the metric is using a good, qualified name and label set, which is always unique for one CSV:

csv_succeeded{name="etcdoperator.v0.9.4",namespace="olm",version="0.9.4"} 1
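For illustration, a minimal Go sketch of a gauge keyed by those labels, assuming the standard client_golang API (the variable name and Help text here are hypothetical, not OLM's actual metrics package):

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: a gauge keyed by the same labels as csv_succeeded.
var csvSucceeded = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "csv_succeeded",
        Help: "Successful CSV install",
    },
    []string{"name", "version", "namespace"},
)

// Setting the gauge writes to exactly one series per (name, version, namespace)
// tuple, so repeated emits overwrite the same data point rather than growing cardinality.
func example() {
    csvSucceeded.WithLabelValues("etcdoperator.v0.9.4", "0.9.4", "olm").Set(1)
}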

Contributor:

Agreed, that doesn't look crazy to me, but when I was poking around that metrics package it looked like we're emitting other metrics besides csv_succeeded. I'm trying to wrap my head around how all of those CSV-related metrics, emitted on a per-step basis, could lead to problems affecting the core monitoring stack.

Contributor:

Ah, I found #1099 - @awgreene any idea on whether we can safely emit metrics on a per-step basis now?

Contributor Author:

Another approach to fixing this bug could be to emit the metric for all CSVs during pod startup and then update it only when a change happens.

Contributor:

@timflannagan I don't think there's any cardinality concern here. csv_succeeded is a Prometheus gauge, and within EmitCSVMetric we always first delete the old metric for the CSV being synced, then emit a new metric and set the gauge value to 1 or 0 (succeeded/did not succeed); see the sketch at the end of this comment. Even if we were not deleting the old metric, IIRC metric points for a unique set of label values are only emitted once, i.e. they're always unique data points in the set of emitted metrics, which is what @josefkarasek clarified in the first comment.

However, I'm not convinced this actually solves the problem. @josefkarasek, the original problem was that we were only edge-triggering this metric, i.e. emitting it whenever the controller syncs a ClusterServiceVersion (syncClusterServiceVersion holds the logic for when that happens), and that only happens when there's a change to the CSV object on the cluster. But we also need some way to level-drive this metric, which is what the first part of your last comment describes:

"update it only when a change happens": edge triggered

"emit the metric for all CSVs during pod startup": level driven
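
For reference, a minimal sketch of that delete-then-set pattern, reusing the csvSucceeded gauge sketched earlier (the helper name, the import path, and field access such as Spec.Version and Status.Phase are assumptions based on the OLM v1alpha1 API, not the exact implementation of EmitCSVMetric):

// assumes: v1alpha1 "github.com/operator-framework/api/pkg/operators/v1alpha1" (import path assumed)
// emitCSVMetric deletes any stale series for the old copy of the CSV and records the new phase.
func emitCSVMetric(oldCSV, newCSV *v1alpha1.ClusterServiceVersion) {
    // Drop the series recorded for the previous copy of this CSV, if any.
    csvSucceeded.DeleteLabelValues(oldCSV.GetName(), oldCSV.Spec.Version.String(), oldCSV.GetNamespace())

    // Re-emit with the current value: 1 if the CSV reached the Succeeded phase, 0 otherwise.
    value := 0.0
    if newCSV.Status.Phase == v1alpha1.CSVPhaseSucceeded {
        value = 1.0
    }
    csvSucceeded.WithLabelValues(newCSV.GetName(), newCSV.Spec.Version.String(), newCSV.GetNamespace()).Set(value)
}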

Contributor Author:

I was assuming that all CSVs are queued up during pod start and reconciled, which would make this approach edge- and level-driven at the same time. From what you're saying, it sounds like that assumption doesn't hold.

Contributor:

Although CSVs are queued up during pod start, that is still edge-triggered; the trigger here is the queuing of the CSV. True level-driven behavior is when you query the state of the cluster and attempt to reconcile it with the desired state. With edge triggers there's always a chance that we'll miss an event, so querying for existing CSVs and emitting metrics for them on pod restart is the most foolproof way to solve this problem.
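
A hedged sketch of that level-driven startup pass, assuming a generated ClusterServiceVersion lister is available and reusing the emitCSVMetric helper sketched above (names and wiring are illustrative, not the actual fix):

// assumes: labels "k8s.io/apimachinery/pkg/labels" and the generated CSV lister package
// emitMetricsForExistingCSVs re-emits csv_succeeded for every CSV already on the cluster,
// so the gauge is populated even if no CSV event fires after a pod restart.
func emitMetricsForExistingCSVs(lister listersv1alpha1.ClusterServiceVersionLister) error {
    csvs, err := lister.List(labels.Everything())
    if err != nil {
        return err
    }
    for _, csv := range csvs {
        // No stale series to clean up at startup, so pass the same object twice.
        emitCSVMetric(csv, csv)
    }
    return nil
}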

+
     operatorGroup := a.operatorGroupFromAnnotations(logger, clusterServiceVersion)
     if operatorGroup == nil {
         logger.WithField("reason", "no operatorgroup found for active CSV").Debug("skipping potential RBAC creation in target namespaces")