This repository has been archived by the owner on Oct 7, 2020. It is now read-only.

Remove finalizer in controller #656

Open
wants to merge 1 commit into
base: master
@@ -111,8 +111,6 @@ func (r *ReconcileIstioControlPlane) Reconcile(request reconcile.Request) (recon
		return reconcile.Result{}, err
	}
	deleted := u.GetDeletionTimestamp() != nil
	finalizers := u.GetFinalizers()
	finalizerIndex := indexOf(finalizers, finalizer)

	// declare read-only icp instance to create the reconciler
	icp := &v1alpha2.IstioControlPlane{}
@@ -132,48 +130,13 @@ func (r *ReconcileIstioControlPlane) Reconcile(request reconcile.Request) (recon
	log.Infof("Got IstioControlPlaneSpec: \n\n%s\n", string(os))

	if deleted {
		if finalizerIndex < 0 {
			log.Info("IstioControlPlane deleted")
			return reconcile.Result{}, nil
		}
		log.Info("Deleting IstioControlPlane")

		reconciler, err := r.factory.New(icp, r.client)
		if err == nil {
			err = reconciler.Delete()
		} else {
			log.Errorf("failed to create reconciler: %s", err)
		}
Contributor

By removing the finalizer there's no guarantee that this code will be executed. If it is truly unnecessary, it should be removed, too. If the controller adds the ownerReference to every object it creates, this code is indeed unnecessary, as the garbage collector will delete everything that this code is supposed to delete. If that is the case, this code will just result in race conditions between the GC and the controller and might cause conflicts to be logged in the controller's log.

Contributor

On second thought, this code might not be safe to remove. It also prunes cluster-scoped resources, which should not have an ownerReference added to them, since a namespaced object should not be the owner of a cluster-scoped object.
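
The scoping rule behind this objection can be sketched in a few lines. This is only a model under stated assumptions: the `object` type and the helpers `canOwn` and `needsExplicitPrune` are hypothetical, not part of the operator or of any Kubernetes client library. An empty namespace stands for a cluster-scoped object.

```go
package main

// object is a hypothetical, minimal stand-in for a Kubernetes object's
// metadata. An empty Namespace means the object is cluster-scoped.
type object struct {
	Kind      string
	Name      string
	Namespace string
}

// canOwn reports whether owner may be referenced from dependent's
// ownerReferences. The rule modeled here: a cluster-scoped owner can own
// anything, while a namespaced owner can only own objects that live in
// its own namespace.
func canOwn(owner, dependent object) bool {
	if owner.Namespace == "" {
		return true
	}
	return dependent.Namespace == owner.Namespace
}

// needsExplicitPrune reports whether the controller itself must delete
// dependent, because the garbage collector cannot be relied on for it.
func needsExplicitPrune(owner, dependent object) bool {
	return !canOwn(owner, dependent)
}
```

Under this model a namespaced IstioControlPlane can own a Deployment in its own namespace, but a ClusterRole or a webhook configuration would still need explicit pruning by the controller.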

Contributor Author

I see that you have a fix for the finalizer but I think we'd still have the race problem. If you have some time to spend looking into it we can put this PR on hold to see if a fix is possible.

Contributor

Yes, my finalizer fix is for a different problem. Regarding the race itself, I don't see any good solutions.

I don't even think we should be fixing this problem. Users install an operator if they want to automate things. By removing the operator, they are saying they no longer want the automation and would instead like to manage things manually. Removing an operator doesn't mean the operator should remove everything it has deployed, as the user might want to deploy the operator in a different namespace / outside the cluster / deploy a different version of the operator / etc. and the user might want the mesh to be running in the meantime.

Contributor Author

Right, agreed, but this is a different scenario. Here, the user is deleting the CR, in which case I think it's reasonable for Istio to be deleted and the operator to be left untouched.
If the operator is deleted, nothing should happen to either Istio or the CR.
What the finalizer was doing was waiting for Istio to be deleted before the CR was deleted. Unfortunately, a common scenario is that users delete the operator at the same time, which leaves the CR frozen with the finalizer never removed.
In that case, is it reasonable to proceed with this PR? The practical effect is that the CR will be deleted before all of Istio is. Hopefully users will read the instructions to remove everything cleanly, but at least we won't have this "stuck because of finalizer" situation.

Contributor

But the linked issue (istio/istio#18815) says that the operator was deleted, too.

If the user had only deleted the ICP, the operator would have undeployed Istio just fine. Well, maybe not, since the operator and the k8s GC are both racing to delete the resources that have ownerReference set. When deleting a resource, the operator treats any error returned from client.Delete() as an actual error. But it should really treat a NotFound error as a successful deletion, as the end-state is what the operator wants - regardless of whether it was the one that deleted the object or whether it was deleted by someone else.

This PR would remove that race, but since there would be no guarantee that reconciler.Delete() will always get called, it would cause resources that the k8s GC doesn't delete to remain in the cluster.

Contributor Author

We may still need ownerrefs from ICP CR to Istio resources (we should think this through since there may be subtle problems). But delete is not guaranteed to be called here regardless of whether we have the finalizer or not, since we cannot prevent the user from deleting the controller.
So I'm not claiming this solves all the problems, just improves a bad situation. The stuck finalizer is very hard for users to get out of, whereas it's much simpler to delete any leftover orphaned resources.

Member

> By removing the finalizer there's no guarantee that this code will be executed. If it is truly unnecessary, it should be removed, too. If the controller adds the ownerReference to every object it creates, this code is indeed unnecessary, as the garbage collector will delete everything that this code is supposed to delete. If that is the case, this code will just result in race conditions between the GC and the controller and might cause conflicts to be logged in the controller's log.

OwnerRefs are not set for all objects created by Istio's deployments (e.g. citadel creates certs, galley creates the validatingwebhookconfiguration, etc.). The project needs to take a longer look at ownerrefs across the project.

See here for an example of a dangling object that has been causing severe trouble in the operator controller: istio/istio#19164 (comment)

There are other dangling objects not GCed by K8s...

Contributor

The bottom line is:

If you remove the finalizer-managing code from the controller, resources won't get cleaned up properly even if the controller is running.

If you keep the finalizer code, the controller will clean up everything properly, but things will get stuck if you delete the controller.

The point of using a controller is to have fully automated management of the control plane. If the controller doesn't clean up everything it should, it's useless. If a user changes their mind and removes the controller before letting it finish cleaning up, they should be prepared to manually clean things up.

Contributor Author

Could you explain why the resources won't get cleaned up properly? I may be missing something...
The flow we have here without the finalizer is:

  • check if deletion timestamp exists in ICP
  • if it does, call reconciler.Delete which synchronously loops through all resources with our label and deletes each one
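
The two steps above can be sketched as follows. This is a simplified model, not the operator's code: `cluster`, `labeled`, `deleteByLabel`, and `reconcile` are hypothetical names, and the label key and value are passed in rather than assumed.

```go
package main

// labeled is a hypothetical stand-in for a resource that may carry the
// operator's label.
type labeled struct {
	Name   string
	Labels map[string]string
}

// cluster is a hypothetical in-memory stand-in for the set of resources
// visible to the controller.
type cluster struct {
	resources []labeled
}

// deleteByLabel synchronously removes every resource whose label key has
// the given value and returns how many were removed.
func (c *cluster) deleteByLabel(key, value string) int {
	kept := c.resources[:0]
	deleted := 0
	for _, r := range c.resources {
		if r.Labels[key] == value {
			deleted++
			continue
		}
		kept = append(kept, r)
	}
	c.resources = kept
	return deleted
}

// reconcile models the flow without a finalizer: if the deletion
// timestamp is set, delete all labeled resources and stop. It returns the
// number of deleted resources and whether the deletion path was taken.
func reconcile(c *cluster, deletionTimestampSet bool, key, value string) (int, bool) {
	if !deletionTimestampSet {
		return 0, false
	}
	return c.deleteByLabel(key, value), true
}
```

The sketch takes the deletion timestamp as an input. It does not model whether the controller ever gets to observe that timestamp once the finalizer is gone, which is the point under discussion in this thread.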

		// TODO: for now, nuke the resources, regardless of errors
		finalizers = append(finalizers[:finalizerIndex], finalizers[finalizerIndex+1:]...)
		u.SetFinalizers(finalizers)
		finalizerError := r.client.Update(context.TODO(), u)
		for retryCount := 0; errors.IsConflict(finalizerError) && retryCount < finalizerMaxRetries; retryCount++ {
			// workaround for https://github.com/kubernetes/kubernetes/issues/73098 for k8s < 1.14
			// TODO: make this error message more meaningful.
			log.Info("conflict during finalizer removal, retrying")
			_ = r.client.Get(context.TODO(), request.NamespacedName, u)
			finalizers = u.GetFinalizers()
			finalizerIndex = indexOf(finalizers, finalizer)
			finalizers = append(finalizers[:finalizerIndex], finalizers[finalizerIndex+1:]...)
			u.SetFinalizers(finalizers)
			finalizerError = r.client.Update(context.TODO(), u)
		}
		if finalizerError != nil {
			log.Errorf("error removing finalizer: %s", finalizerError)
			return reconcile.Result{}, finalizerError
		}
		return reconcile.Result{}, err
	} else if finalizerIndex < 0 {
		// TODO: make this error message more meaningful.
		log.Infof("Adding finalizer %v", finalizer)
		finalizers = append(finalizers, finalizer)
		u.SetFinalizers(finalizers)
		err = r.client.Update(context.TODO(), u)
		if err != nil {
			log.Errorf("Failed to update IstioControlPlane with finalizer, %v", err)
			return reconcile.Result{}, err
		}
	}

	log.Info("Updating IstioControlPlane")