reconciler/managed: make more resilient to error conditions #651
…tx canceled Signed-off-by: Dr. Stefan Schimanski <stefan.schimanski@upbound.io>
Signed-off-by: Dr. Stefan Schimanski <stefan.schimanski@upbound.io>
Change seems sensible.
However, while reading the code I noticed that we don't propagate a possible error from UpdateCriticalAnnotations to the managed resource's status; we only emit an event:
// We handle annotations specially here because it's
// critical that they are persisted to the API server.
// If we don't add the external-create-failed annotation
// the reconciler will refuse to proceed, because it
// won't know whether or not it created an external
// resource.
meta.SetExternalCreateFailed(managed, time.Now())
if err := r.managed.UpdateCriticalAnnotations(ctx, managed); err != nil {
log.Debug(errUpdateManagedAnnotations, "error", err)
record.Event(managed, event.Warning(reasonCannotUpdateManaged, errors.Wrap(err, errUpdateManagedAnnotations)))
// We only log and emit an event here rather
// than setting a status condition and returning
// early because presumably it's more useful to
// set our status condition to the reason the
// create failed.
}
managed.SetConditions(xpv1.Creating(), xpv1.ReconcileError(errors.Wrap(err, errReconcileCreate)))
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
The statement if err := r.managed.UpdateCriticalAnnotations(ctx, managed); err != nil { declares a new err variable scoped to the if block, so it does not overwrite the create error (https://go.dev/play/p/DXtx63QBA3v). As a result, xpv1.ReconcileError(errors.Wrap(err, errReconcileCreate)) will only contain the information that the create failed. Maybe that is what we want, but if these annotations are so critical that failing to add them can leave dead objects behind, maybe that failure is worth propagating to status as well?
@@ -212,7 +212,9 @@ func NewRetryingCriticalAnnotationUpdater(c client.Client) *RetryingCriticalAnno
 // case of a conflict error.
 func (u *RetryingCriticalAnnotationUpdater) UpdateCriticalAnnotations(ctx context.Context, o client.Object) error {
 	a := o.GetAnnotations()
-	err := retry.OnError(retry.DefaultRetry, resource.IsAPIError, func() error {
+	err := retry.OnError(retry.DefaultRetry, func(err error) bool {
+		return !errors.Is(err, context.Canceled)
This seems common enough that it could be encapsulated somewhere, perhaps in our own RetryOnError func?
Or is there a case where we would want to retry even though the context was cancelled?
Isn't there any solution to retry without the annotations? What does it take to do a reverse-lookup of the external resource to see if it indeed was created before the controller crashed, and then delete or update it on a new reconcile?
Successfully created backport PR for
Successfully created backport PR for
@luxas as far as I know, there's no way to do this. Especially no way that will work for every external system we want to reconcile. I opened crossplane/docs#688 to (finally) document what these annotations are doing and why.
Description of your changes
Two cases where the managed reconciler bricked an object forever:
external.Create returned a conflict error. This was an oversight in "reconciler/managed: only debug log transient conflict errors PT.2" (#540) when introducing less noisy conflict handling.
There is still a risk that an object becomes bricked, e.g.:
I have:
make reviewable test to ensure this PR is ready for review.
How has this code been tested
Careful code reading and review. This is hard to test other than under real-life conditions and heavy load.