-
Notifications
You must be signed in to change notification settings - Fork 81
Remove finalizer in controller #656
base: master
Are you sure you want to change the base?
Conversation
@ostromart: The following test failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix: istio/istio#18815
return reconcile.Result{}, nil | ||
} | ||
log.Info("Deleting IstioControlPlane") | ||
|
||
reconciler, err := r.factory.New(icp, r.client) | ||
if err == nil { | ||
err = reconciler.Delete() | ||
} else { | ||
log.Errorf("failed to create reconciler: %s", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By removing the finalizer there's no guarantee that this code will be executed. If it is truly unnecessary, it should be removed, too. If the controller adds the ownerReference to every object it creates, this code is indeed unnecessary, as the garbage collector will delete everything that this code is supposed to delete. If that is the case, this code will just result in race conditions between the GC and the controller and might cause conflicts to be logged in the controller's log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought. This code might not be safe to remove, as it also prunes cluster-scoped resources, which should not have an ownerReference added to them, since a namespaced object should not be the owner of a cluster-scoped object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you have a fix for the finalizer but I think we'd still have the race problem. If you have some time to spend looking into it we can put this PR on hold to see if a fix is possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my finalizer fix is for a different problem. Regarding the race itself, I don't see any good solutions.
I don't even think we should be fixing this problem. Users install an operator if they want to automate things. By removing the operator, they are saying they no longer want the automation and would instead like to manage things manually. Removing an operator doesn't mean the operator should remove everything it has deployed, as the user might want to deploy the operator in a different namespace / outside the cluster / deploy a different version of the operator / etc. and the user might want the mesh to be running in the mean time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, agreed, but this is a different scenario. Here, the user is deleting the CR, in which case I think it's reasonable for Istio to be deleted and operator to be untouched.
If operator is deleted, nothing should happen to either Istio or the CR.
What the finalizer was doing was waiting for Istio to be deleted before the CR was deleted. Unfortunately, a common scenario is where users also delete the operator at the same time, in which case the CR is in a frozen state with the finalizer not removed.
In that case, is it reasonable to proceed with this PR? The practical effect is that the CR will be deleted before all of Istio is. Hopefully users will read the instructions to remove everything cleanly but at least we won't have this "stuck because of finalizer" situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the linked issue (istio/istio#18815) says that the operator was deleted, too.
If the user had only deleted the ICP, the operator would have undeployed Istio just fine. Well, maybe not, since the operator and the k8s GC are both racing to delete the resources that have ownerReference set. When deleting a resource, the operator treats any error returned from {{client.Delete()}} as an actual error. But it should really treat a NotFound error as a successful deletion, as the end-state is what the operator wants - regardless if it was the one that deleted the object or if it was deleted by anyone else.
This PR would remove that race, but since there would be no guarantee that reconciler.Delete() will always get called, it would cause resources that the k8s GC doesn't delete to remain in the cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may still need ownerrefs from ICP CR to Istio resources (we should think this through since there may be subtle problems). But delete is not guaranteed to be called here regardless of whether we have the finalizer or not, since we cannot prevent the user from deleting the controller.
So I'm not claiming this solves all the problems, just improves a bad situation. The stuck finalizer is very hard for users to get out of, whereas it's much simpler to delete any leftover orphaned resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By removing the finalizer there's no guarantee that this code will be executed. If it is truly unnecessary, it should be removed, too. If the controller adds the ownerReference to every object it creates, this code is indeed unnecessary, as the garbage collector will delete everything that this code is supposed to delete. If that is the case, this code will just result in race conditions between the GC and the controller and might cause conflicts to be logged in the controller's log.
OwnerRefs are not set for all objects created by Istio's deployments (e.g. citiadel creates certs, galley creates validatngwebhookconfiguration, etc. The project needs to take a longer look at ownerrefs across the project.
See here for an example of dangling object that has been causing severe trouble in the operator controller: istio/istio#19164 (comment)
There are other dangling objects not GCed by K8s...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bottom line is:
If you remove the finalizer-managing code from the controller, resources won't get cleaned up properly even if the controller is running.
If you keep the finalizer code, the controller will clean up everything properly, but things will get stuck if you delete the controller.
The point of using a controller is to have fully automated management of the control plane. If the controller doesn't clean up everything it should, it's useless. If a user changes their mind and removes the controller before letting it finish cleaning up, they should be prepared to manually clean things up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain why the resources won't get cleaned up properly? I may be missing something...
The flow we have here without the finalizer is:
- check if deletion timestamp exists in ICP
- if it does, call reconciler.Delete which synchronously loops through all resources with our label and deletes each one
@ostromart can you fix the linting so we can get this work merged for master/1.4.3? Cheers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline with @ostromart. We should let kube-apiserver GC (via ownerRef) unless there's a reason to handle it ourselves.
If I understand correctly, the only cases where a finalizer would be useful is to protect against re-creating a ICP which was previously deleted but the underlying resources haven't been deleted yet. The controller mode should converge. The CLI mode would partially fail. Given this is an edge case and the CLI user would need to retry the command anyway, it seems reasonable to pick robustness in the common case.
Cluster scoped resources and resources in other namespaces cannot have owner references, which is why the finalizer would be necessary. |
Folks, thanks for your input and patience with this PR. We've spent a lot of cycles debating and I think in the process our understanding of the situation has improved a great deal (at least mine has). |
Could you elaborate on the "stuck finalizer" problem? In the original comment, you say that the resource remains when the operator is deleted. The simple answer is: don't delete the operator. I also wonder if this is mainly an issue when the ICP resource is collocated with the operator and the operator namespace itself is deleted (the only workaround for this use case is to manually remove the finalizer from the resource). The solution here is: don't install a control plane in the operator namespace. This is something that shouldn't be done anyway, as the operator namespace should not be accessible to regular users, because the operator is running with elevated privileges. To clarify the operator usage model: The operator performs tasks that a cluster-admin (or some other user with elevated privileges) would normally perform. For us, this is installing and configuring a control plane based on an ICP resource. The operator is installed by cluster-admin and should not be accessible to other users. When using something like operator lifecycle manager (OLM), cluster-admin can approve which operators can be installed by users, allowing individual users to install and update operators. This allows a regular user to install the operator, without having cluster-admin privileges (OLM is actually performing the install/management of the operator). The operator's lifecycle extends beyond the lifecycle of any given control plane. Currently, this may seem a little confusing, as we only support installation of a single control plane, but once multi-tenancy is supported it will make more sense. To wrap up:
|
We don't have that option. Users tend to not follow instructions which is clearly illustrated by the fact that this issue has happened several times leaving users very frustrated. Users will tolerate not everything being cleaned up and having to remove some stuff manually, but they will get very upset about being left with a namespace they can't delete. It's particularly bad because it's close to impossible for an Istio user to know that a finalizer on ICP is blocking their namespace deletion.
The same thing happens with ICP in istio-system. The issue is that if the operator is deleted before it's removed the finalizer, there's nothing left to remove that finalizer and the namespace is stuck.
In many cases the user is the cluster admin. Multi-tenant is an important use case but not the main Istio use case.
There are no plans to recommend OLM for Istio OSS project. If someone wants to take on implementing a fix using preStop hook or propose some other way to fix this issue I'm completely open to it. However I strongly disagree that doing nothing and leaving things as they are leads to an acceptable user experience. |
Don't delete the operator is a requirement if you're using the operator. If you're talking about a single user, who is doing everything as cluster-admin, with a single install, no multi-tenancy, and they only care about install/delete, then there's not really any reason to use the operator. However, once the operator becomes more functional (e.g. managing certs used by webhooks), the operator is a requirement for a control plane to function normally. Deleting the operator would then require a human operator to take over the tasks that were being managed by the automated operator. That means updating the keys/certs used by the webhooks, cleaning up the control plane resources, removing the finalizer, etc. This is true whether OLM is used or not. If this is an issue, then it should be highlighted in the documentation. The documentation should also state that the operator does more than facilitate installs (once it does more). JFYI, the reason I mentioned cluster-admin at all is that I think it's important to clarify the scope required to use various components. For a dev testing this out with minikube, they'll have no problems. However, if they're using a managed cluster, whoever is managing that cluster will care, so it's important to specify what permissions are required for the various aspects (cluster-admin to install/manage the operator, while general user can install and manage the control plane). FWIW, this issue, moving from full access dev env to production was mentioned in a talk at KubeCon. Everything worked fine on dev machines, but the higher level security in production prevented installation of their control plane, let alone the applications in the mesh. One of the disadvantages of doing all your dev with cluster-admin. |
@ostromart: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
All the finalizer does is keep ICP CR around until other resources have been deleted. This is nice to have, unfortunately it's prone to races when the operator itself is deleted, since then there's nothing to remove the finalizer. Then we are stuck with a namespace that can't be easily deleted.
Removing this for now since this is more of a nice to have than essential feature.