
reconciler/managed: make more resilient to error conditions #651

Merged

Conversation

sttts
Contributor

@sttts sttts commented Jan 27, 2024

Description of your changes

Two cases where the managed reconciler bricked an object forever:

  1. when external.Create returned a conflict error. This was an oversight in #540 (reconciler/managed: only debug log transient conflict errors PT.2), which introduced less noisy conflict handling.
  2. on non-API errors while the critical failed/succeeded annotations were being applied. Think of network issues: these are not API errors, and hence they stopped the retry loop (see the sketch below).
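
A minimal sketch of the second case (illustration only, not code from this PR; the isAPIError predicate below is a stand-in for resource.IsAPIError): with k8s.io/client-go's retry.OnError, a predicate that only accepts API errors gives up on the first network error, while one that merely excludes context.Canceled keeps retrying.

    package main

    import (
        "context"
        "errors"
        "fmt"

        "k8s.io/client-go/util/retry"
    )

    func main() {
        // Stand-in for a transport failure such as a network timeout.
        // It is not a Kubernetes API error.
        netErr := errors.New("dial tcp 10.0.0.1:443: i/o timeout")

        // Old behaviour: retry only what the predicate accepts. This
        // stand-in for resource.IsAPIError rejects netErr, so OnError
        // returns after a single attempt.
        isAPIError := func(err error) bool { return false }
        err := retry.OnError(retry.DefaultRetry, isAPIError, func() error { return netErr })
        fmt.Println(err) // fails immediately, no retries

        // New behaviour in this PR: retry any error except a cancelled
        // context, so transient network issues no longer stop the loop.
        err = retry.OnError(retry.DefaultRetry, func(err error) bool {
            return !errors.Is(err, context.Canceled)
        }, func() error { return netErr })
        fmt.Println(err) // fails only after the backoff is exhausted
    }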

There is still a risk that an object becomes bricked, e.g.:

  • when the ctx is cancelled after some timeout (30s or so per reconcile?), i.e. if the error persists for long enough.
  • when the controller crashes before applying the critical annotations.

I have:

  • Read and followed Crossplane's [contribution process].
  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

Careful code reading and review. This is hard to test other than under real-life conditions under heavy load.

sttts added 2 commits January 27, 2024 15:59
…tx canceled

Signed-off-by: Dr. Stefan Schimanski <stefan.schimanski@upbound.io>
Signed-off-by: Dr. Stefan Schimanski <stefan.schimanski@upbound.io>
@sttts sttts requested review from a team as code owners January 27, 2024 15:06
@sttts sttts requested review from turkenh and MisterMX January 27, 2024 15:06
@sttts sttts changed the title reconciler/managed: make more resilient to API errors reconciler/managed: make more resilient to error conditions Jan 27, 2024
@luxas luxas left a comment


Change seems sensible.

However, when I read the code, I noticed that we don't propagate the possible error of UpdateCriticalAnnotations to the managed resource's status; we only emit an event:

			// We handle annotations specially here because it's
			// critical that they are persisted to the API server.
			// If we don't add the external-create-failed annotation
			// the reconciler will refuse to proceed, because it
			// won't know whether or not it created an external
			// resource.
			meta.SetExternalCreateFailed(managed, time.Now())
			if err := r.managed.UpdateCriticalAnnotations(ctx, managed); err != nil {
				log.Debug(errUpdateManagedAnnotations, "error", err)
				record.Event(managed, event.Warning(reasonCannotUpdateManaged, errors.Wrap(err, errUpdateManagedAnnotations)))

				// We only log and emit an event here rather
				// than setting a status condition and returning
				// early because presumably it's more useful to
				// set our status condition to the reason the
				// create failed.
			}

			managed.SetConditions(xpv1.Creating(), xpv1.ReconcileError(errors.Wrap(err, errReconcileCreate)))
			return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)

The if err := r.managed.UpdateCriticalAnnotations(ctx, managed); err != nil { creates a new err variable and does not overwrite the create error (https://go.dev/play/p/DXtx63QBA3v), so xpv1.ReconcileError(errors.Wrap(err, errReconcileCreate)) will only contain the information that the create failed. Maybe that is what we want, but if these annotations are so critical to add in order to avoid dead objects, maybe the update error is worth propagating to status as well?
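
The shadowing is easy to reproduce in isolation; a self-contained example along the lines of the linked playground:

    package main

    import (
        "errors"
        "fmt"
    )

    func main() {
        // Outer error, standing in for the failed external.Create.
        err := errors.New("create failed")

        // The := inside the if statement declares a new err that
        // shadows the outer one; the outer err is untouched once the
        // block ends.
        if err := errors.New("annotation update failed"); err != nil {
            fmt.Println("inner:", err) // inner: annotation update failed
        }

        fmt.Println("outer:", err) // outer: create failed
    }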

@@ -212,7 +212,9 @@ func NewRetryingCriticalAnnotationUpdater(c client.Client) *RetryingCriticalAnnotationUpdater {
 // case of a conflict error.
 func (u *RetryingCriticalAnnotationUpdater) UpdateCriticalAnnotations(ctx context.Context, o client.Object) error {
 	a := o.GetAnnotations()
-	err := retry.OnError(retry.DefaultRetry, resource.IsAPIError, func() error {
+	err := retry.OnError(retry.DefaultRetry, func(err error) bool {
+		return !errors.Is(err, context.Canceled)
+	}, func() error {

This seems common enough that it could be encapsulated somewhere, e.g. into our own RetryOnError func?
Or when would we ever want to retry although the context was cancelled?
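
For illustration, such a helper might look like this (RetryUnlessCanceled is a hypothetical name and placement, not an existing crossplane-runtime function):

    package resource

    import (
        "context"
        "errors"

        "k8s.io/client-go/util/retry"
    )

    // RetryUnlessCanceled retries fn with the default backoff on any
    // error except a cancelled context. A sketch of the encapsulation
    // suggested above, not actual crossplane-runtime API.
    func RetryUnlessCanceled(fn func() error) error {
        return retry.OnError(retry.DefaultRetry, func(err error) bool {
            return !errors.Is(err, context.Canceled)
        }, fn)
    }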

@luxas

luxas commented Jan 29, 2024

There is still a risk that an object becomes bricked, e.g.:
  • when the ctx is cancelled after some timeout (30s or so per reconcile?), i.e. if the error persists for long enough.
  • when the controller crashes before applying the critical annotations.

Isn't there any solution to retry without the annotations? What does it take to do a reverse-lookup of the external resource to see if it indeed was created before the controller crashed, and then delete or update it on a new reconcile?

@phisco phisco merged commit 93bd0c2 into crossplane:master Jan 29, 2024
8 of 9 checks passed

Successfully created backport PR for release-1.14:


Successfully created backport PR for release-1.15:

@negz
Member

negz commented Jan 30, 2024

What does it take to do a reverse-lookup of the external resource to see if it indeed was created before the controller crashed, and then delete or update it on a new reconcile?

@luxas as far as I know, there's no way to do this. Especially no way that will work for every external system we want to reconcile. I opened crossplane/docs#688 to (finally) document what these annotations are doing and why.
