-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ACM-11453 Fix flaky subscription constraints not satisfiable condition #258
ACM-11453 Fix flaky subscription constraints not satisfiable condition #258
Conversation
b242866
to
ce197ba
Compare
ce197ba
to
aebb208
Compare
Weird flakes? 😭
Attempt 1
/home/runner/work/config-policy-controller/config-policy-controller/test/e2e/case23_invalid_field_test.go:76 |
@mprahl this now has the change you requested in Slack, to only update a smaller part of the Subscription status. |
@@ -216,14 +219,14 @@ func (r *OperatorPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Reque | |||
errs = append(errs, err) | |||
} | |||
|
|||
if err := r.Status().Update(ctx, policy); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we avoid always updating the status if nothing changed? This could increase API server hits a lot. Maybe a reflect.DeepEqual comparing before and after would suffice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add the DeepEqual, I'm surprised controller-runtime (or whatever) wouldn't do that for us...
For extra context, the cause of this was the intervention timestamp not being in a condition. I felt like we were dealing with confusing cases where the Subscription status wasn't always updated correctly by OLM, I wanted to be more paranoid about making sure our status was accurate.
return mergedSub, nil, changed, nil | ||
} | ||
|
||
csvIdx := 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this always be 0 because of if len(relatedCSVs) != 1 {
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RelatedObjsOfKind
gives back a map 😅 I thought it would be a good idea because then it's easy to update the condition in-place... I mostly stand by that, but it does make this usage look odd.
Usually index 0 is the CatalogSource.
|
||
updateStatus(policy, updatedCond("Subscription"), updatedObj(mergedSub)) | ||
|
||
return mergedSub, nil, true, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should policy.Status.SubscriptionInterventionTime
be set to empty after the update is successful and before returning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally I had this just apply the intervention once, then wait another 30 seconds again before trying again. What I found in testing was that about 5% of the time, OLM would then add a different condition, remove our update, and I think confuse itself... I still don't know what exactly was happening there, but the new implementation will now immediately re-apply the intervention in that case, as long as 10 seconds haven't passed. And it requires not clearing the timestamp until those 10 seconds are up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The weirdest thing looking back at it is that after it got into that weird situation of confusing itself, the controller would re-apply the intervention 30 seconds later, but OLM seemed to do the same weird thing consistently this time (not a 5% chance like I'd expect). It wasn't just a bad sync of timing, because I tried a different grace period and it would still happen.
aebb208
to
d0db499
Compare
After the changes suggested in the review, the failure was a repeat of one of the previous "flakes":
/home/runner/work/config-policy-controller/config-policy-controller/test/e2e/case38_install_operator_test.go:3232 |
First 2 runs passed completely. The third run had this during hosted mode tests, which is odd:
/home/runner/work/config-policy-controller/config-policy-controller/test/e2e/case38_install_operator_test.go:2663 |
2e51945
to
5e795c6
Compare
@@ -216,14 +221,16 @@ func (r *OperatorPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Reque | |||
errs = append(errs, err) | |||
} | |||
|
|||
if !reflect.DeepEqual(policy.Status, originalStatus) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional optimization:
if !reflect.DeepEqual(policy.Status, originalStatus) { | |
if conditionChanged || !reflect.DeepEqual(policy.Status, originalStatus) { |
@JustinKuli looks like you need to rebase. I'll approve it again afterwards. |
Previously, the selector used to find OLM resources like InstallPlans and CSVs was not exactly what OLM would use in situations where the name and namespace of the operator was very long. It should now match the behavior of OLM. Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>
Some places were not using the logger provided in the Reconcile context, which has some potentially helpful information. Logs were added to record API calls like create/update/delete, and some debug logs were added for potentially complex methods. Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>
Previously, the controller needed to parse and sort the message associated with the ConstraintsNotSatisfiable condition on the Subscription, because the order of certain parts ot the message were inconsistent. The intent was to not update the OperatorPolicy condition when the actual situation on the cluster was unchanged. In some situations, particularly when the subscription is deleted and quickly recreated, the condition on the Subscription does not just have parts in an inconsistent order: it can alternate between two different clauses. To prevent constant updates to the OperatorPolicy status, it now just reports a generic "conditions not satisfiable" condition, and directs the user to the Subscription for more information. This also reduces noise in the compliance events be removing the lists of operator versions. Refs: - https://issues.redhat.com/browse/ACM-11453 Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>
In some situations when actions are happening quickly, OLM can lose the connection between the Subscription and the ClusterServiceVersion, which leads to a "constraints not satisfiable" condition on the Subscription. It can be hard to reproduce the exact situation we've seen rarely in our tests, but manually deleting and immediately recreating the subscription causes a similar situation. In this change, in those situations, the controller will intervene after 30 seconds by updating the Subscription status directly. The implementation allows for a 10 second window for the intervention, during which the controller may update the status multiple times, to address a case where OLM immediately overwrote the update. If the window is missed, the controller may schedule another time. This is intended to give time to OLM to potentially resolve the situation on its own. Refs: - https://issues.redhat.com/browse/ACM-11453 Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>
5e795c6
to
88c37db
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: JustinKuli, mprahl The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
4b0fd44
into
open-cluster-management-io:main
Syncs changes from open-cluster-management-io/config-policy-controller#258 Refs: - https://issues.redhat.com/browse/ACM-11453 Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>
Syncs changes from open-cluster-management-io/config-policy-controller#258 Refs: - https://issues.redhat.com/browse/ACM-11453 Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>
Syncs changes from open-cluster-management-io/config-policy-controller#258 Refs: - https://issues.redhat.com/browse/ACM-11453 Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com> (cherry picked from commit d89fa6dad5f88886ef1b602980fd27ecb6a71170)
Syncs changes from open-cluster-management-io/config-policy-controller#258 Refs: - https://issues.redhat.com/browse/ACM-11453 Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com> (cherry picked from commit d89fa6dad5f88886ef1b602980fd27ecb6a71170)
See each commit for details