This repository has been archived by the owner on Oct 7, 2020. It is now read-only.

Remove finalizer in controller #656

Open
wants to merge 1 commit into base: master

Conversation

ostromart
Contributor

All the finalizer does is keep the ICP CR around until other resources have been deleted. This is nice to have; unfortunately, it's prone to races when the operator itself is deleted, since then there's nothing left to remove the finalizer. We are then stuck with a namespace that can't easily be deleted.
Removing this for now since it's more of a nice-to-have than an essential feature.
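For readers unfamiliar with the mechanics, here is a minimal, self-contained sketch of why a finalizer gets "stuck": the API server keeps a deleted object around until its finalizer list is empty, and only the operator clears that list. All types and the finalizer name below are illustrative, not the operator's real ones.

```go
package main

import "fmt"

// object is a simplified, hypothetical model of a Kubernetes custom resource:
// an object with a deletion timestamp is removed from the store only once its
// finalizer list is empty. If the controller that owns the finalizer is gone,
// nothing ever clears the list and the object (and its namespace) is stuck
// in "Terminating".
type object struct {
	deletionTimestampSet bool
	finalizers           []string
}

// markDeleted models `kubectl delete`: the API server sets the deletion
// timestamp but keeps the object while finalizers remain.
func (o *object) markDeleted() { o.deletionTimestampSet = true }

// gcEligible reports whether the API server would actually remove the object.
func (o *object) gcEligible() bool {
	return o.deletionTimestampSet && len(o.finalizers) == 0
}

// removeFinalizer is what the operator does on its clean shutdown path; if
// the operator pod is deleted first, this never runs and gcEligible stays false.
func (o *object) removeFinalizer(name string) {
	kept := o.finalizers[:0]
	for _, f := range o.finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	o.finalizers = kept
}

func main() {
	icp := &object{finalizers: []string{"example-finalizer"}} // illustrative name
	icp.markDeleted()
	fmt.Println("deletable before finalizer removal:", icp.gcEligible()) // false: stuck
	icp.removeFinalizer("example-finalizer")
	fmt.Println("deletable after finalizer removal: ", icp.gcEligible()) // true
}
```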

@ostromart ostromart requested a review from a team as a code owner December 2, 2019 20:00
@googlebot googlebot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Dec 2, 2019
@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 2, 2019
@ostromart ostromart added cherrypick/release-1.4 Set this label on a PR to auto-merge it to the release-1.4 branch and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 2, 2019
@istio-testing

@ostromart: The following test failed; say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| lint_operator | 63c7821 | link | /test lint_operator |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Member

@morvencao morvencao left a comment

```go
	return reconcile.Result{}, nil
}
log.Info("Deleting IstioControlPlane")

reconciler, err := r.factory.New(icp, r.client)
if err == nil {
	err = reconciler.Delete()
} else {
	log.Errorf("failed to create reconciler: %s", err)
}
```
Contributor

By removing the finalizer there's no guarantee that this code will be executed. If it is truly unnecessary, it should be removed, too. If the controller adds the ownerReference to every object it creates, this code is indeed unnecessary, as the garbage collector will delete everything that this code is supposed to delete. If that is the case, this code will just result in race conditions between the GC and the controller and might cause conflicts to be logged in the controller's log.
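The garbage-collection behaviour being described can be sketched as follows. Everything here is an illustrative stand-in (not client-go types): objects carrying an ownerReference to the ICP CR are removed by the GC once the CR is gone, which is what makes an explicit cleanup pass redundant, and racy if both run.

```go
package main

import "fmt"

// obj is a hypothetical, simplified resource: owner names the owning object
// via an ownerReference; "" means no ownerReference is set.
type obj struct {
	name  string
	owner string
}

// collect models one GC pass: any object whose owner no longer exists is
// removed. A controller deleting the same objects at the same time would
// race with this pass, as described in the comment above.
func collect(objs []obj, exists func(string) bool) []obj {
	var kept []obj
	for _, o := range objs {
		if o.owner != "" && !exists(o.owner) {
			continue // GC deletes it
		}
		kept = append(kept, o)
	}
	return kept
}

func main() {
	objs := []obj{
		{"istiod-deployment", "icp"}, // owned by the (now deleted) CR
		{"cluster-role", ""},         // no ownerReference: GC never touches it
	}
	icpDeleted := func(string) bool { return false } // the CR is gone
	fmt.Println(len(collect(objs, icpDeleted))) // only the un-owned object survives
}
```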

Contributor

On second thought. This code might not be safe to remove, as it also prunes cluster-scoped resources, which should not have an ownerReference added to them, since a namespaced object should not be the owner of a cluster-scoped object.

Contributor Author

I see that you have a fix for the finalizer but I think we'd still have the race problem. If you have some time to spend looking into it we can put this PR on hold to see if a fix is possible.

Contributor

Yes, my finalizer fix is for a different problem. Regarding the race itself, I don't see any good solutions.

I don't even think we should be fixing this problem. Users install an operator if they want to automate things. By removing the operator, they are saying they no longer want the automation and would instead like to manage things manually. Removing an operator doesn't mean the operator should remove everything it has deployed; the user might want to deploy the operator in a different namespace / outside the cluster / deploy a different version of the operator / etc., and might want the mesh to keep running in the meantime.

Contributor Author

Right, agreed, but this is a different scenario. Here, the user is deleting the CR, in which case I think it's reasonable for Istio to be deleted and the operator to be untouched.
If the operator is deleted, nothing should happen to either Istio or the CR.
What the finalizer was doing was waiting for Istio to be deleted before the CR was deleted. Unfortunately, a common scenario is where users also delete the operator at the same time, in which case the CR is in a frozen state with the finalizer not removed.
In that case, is it reasonable to proceed with this PR? The practical effect is that the CR will be deleted before all of Istio is. Hopefully users will read the instructions to remove everything cleanly but at least we won't have this "stuck because of finalizer" situation.

Contributor

But the linked issue (istio/istio#18815) says that the operator was deleted, too.

If the user had only deleted the ICP, the operator would have undeployed Istio just fine. Well, maybe not, since the operator and the k8s GC are both racing to delete the resources that have an ownerReference set. When deleting a resource, the operator treats any error returned from `client.Delete()` as an actual error. But it should really treat a NotFound error as a successful deletion, since the end-state is what the operator wants, regardless of whether it deleted the object itself or someone else did.

This PR would remove that race, but since there would be no guarantee that reconciler.Delete() will always get called, it would cause resources that the k8s GC doesn't delete to remain in the cluster.

Contributor Author

We may still need ownerrefs from ICP CR to Istio resources (we should think this through since there may be subtle problems). But delete is not guaranteed to be called here regardless of whether we have the finalizer or not, since we cannot prevent the user from deleting the controller.
So I'm not claiming this solves all the problems, just improves a bad situation. The stuck finalizer is very hard for users to get out of, whereas it's much simpler to delete any leftover orphaned resources.

Member

> By removing the finalizer there's no guarantee that this code will be executed. If it is truly unnecessary, it should be removed, too. If the controller adds the ownerReference to every object it creates, this code is indeed unnecessary, as the garbage collector will delete everything that this code is supposed to delete. If that is the case, this code will just result in race conditions between the GC and the controller and might cause conflicts to be logged in the controller's log.

OwnerRefs are not set for all objects created by Istio's deployments (e.g. Citadel creates certs, Galley creates a ValidatingWebhookConfiguration, etc.). The project needs to take a longer look at ownerRefs across the board.

See here for an example of a dangling object that has been causing severe trouble in the operator controller: istio/istio#19164 (comment)

There are other dangling objects not GCed by K8s...

Contributor

The bottom line is:

If you remove the finalizer-managing code from the controller, resources won't get cleaned up properly even if the controller is running.

If you keep the finalizer code, the controller will clean up everything properly, but things will get stuck if you delete the controller.

The point of using a controller is to have fully automated management of the control plane. If the controller doesn't clean up everything it should, it's useless. If a user changes their mind and removes the controller before letting it finish cleaning up, they should be prepared to manually clean things up.

Contributor Author

Could you explain why the resources won't get cleaned up properly? I may be missing something...
The flow we have here without the finalizer is:

  • check if deletion timestamp exists in ICP
  • if it does, call reconciler.Delete which synchronously loops through all resources with our label and deletes each one
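The two steps above can be sketched as follows. The types, the label key, and the function names are illustrative stand-ins for the operator's real implementation, just to make the control flow concrete.

```go
package main

import "fmt"

// ownerLabel is an illustrative label key marking resources the operator created.
const ownerLabel = "operator-managed"

type resource struct {
	name   string
	labels map[string]string
}

type store struct{ items []resource }

// deleteLabeled synchronously removes every resource carrying the owner
// label, mirroring what reconciler.Delete does in the flow above.
func (s *store) deleteLabeled() (deleted int) {
	kept := s.items[:0]
	for _, r := range s.items {
		if r.labels[ownerLabel] == "true" {
			deleted++
			continue
		}
		kept = append(kept, r)
	}
	s.items = kept
	return deleted
}

// reconcile runs the flow: if the CR carries a deletion timestamp, prune
// everything the operator labeled; otherwise do a normal sync (elided here).
func reconcile(deletionTimestampSet bool, s *store) int {
	if !deletionTimestampSet {
		return 0
	}
	return s.deleteLabeled()
}

func main() {
	s := &store{items: []resource{
		{"istiod", map[string]string{ownerLabel: "true"}},
		{"unrelated", nil},
	}}
	fmt.Println(reconcile(true, s), len(s.items)) // 1 labeled resource pruned, 1 left
}
```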

@sdake
Member

sdake commented Dec 9, 2019

@ostromart can you fix the linting so we can get this work merged for master/1.4.3?

Cheers
-steve

Contributor

@ayj ayj left a comment

Discussed offline with @ostromart. We should let kube-apiserver GC (via ownerRef) unless there's a reason to handle it ourselves.

If I understand correctly, the only case where a finalizer would be useful is to protect against re-creating an ICP that was previously deleted while the underlying resources haven't been deleted yet. The controller mode should converge. The CLI mode would partially fail. Given this is an edge case and the CLI user would need to retry the command anyway, it seems reasonable to pick robustness in the common case.

@rcernich
Contributor

Cluster scoped resources and resources in other namespaces cannot have owner references, which is why the finalizer would be necessary.

@ostromart
Contributor Author

Folks, thanks for your input and patience with this PR. We've spent a lot of cycles debating and I think in the process our understanding of the situation has improved a great deal (at least mine has).
I had some additional discussion with @ayj to see if there was some other way we could work around the stuck-finalizer problem. He suggested a preStop hook, which I think could indeed work, but at the cost of additional complexity.
One thing that's clear to me now is that the PR will not work in its current state, because without a finalizer the controller never observes a deletion timestamp on the CR. So at least a change is needed to test for the non-existence of any CR in a watched namespace and treat that as a delete event, rather than using the deletion timestamp. But with that change added, I think this is a simpler solution than adding a preStop hook. wdyt?
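The suggested change (detecting deletion by the CR's absence rather than its deletion timestamp) could look roughly like this sketch; the lister type, function names, and namespace are all hypothetical.

```go
package main

import "fmt"

// lister stands in for listing ICP CRs in a namespace; it returns the
// names of the CRs currently present.
type lister func(namespace string) []string

// shouldTearDown reports whether the controller should prune Istio
// resources: it previously saw a CR, and the watched namespace now has
// none. Without a finalizer there is no deletionTimestamp to observe, so
// absence of the CR is the delete signal.
func shouldTearDown(list lister, namespace string, sawCR bool) bool {
	return sawCR && len(list(namespace)) == 0
}

func main() {
	none := func(string) []string { return nil }
	fmt.Println(shouldTearDown(none, "istio-system", true)) // true: CR vanished
}
```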

@ostromart ostromart added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Dec 12, 2019
@rcernich
Contributor

Could you elaborate on the "stuck finalizer" problem? In the original comment, you say that the resource remains when the operator is deleted. The simple answer is: don't delete the operator.

I also wonder if this is mainly an issue when the ICP resource is collocated with the operator and the operator namespace itself is deleted (the only workaround for this use case is to manually remove the finalizer from the resource). The solution here is: don't install a control plane in the operator namespace. This is something that shouldn't be done anyway, as the operator namespace should not be accessible to regular users, because the operator is running with elevated privileges.

To clarify the operator usage model:

The operator performs tasks that a cluster-admin (or some other user with elevated privileges) would normally perform. For us, this is installing and configuring a control plane based on an ICP resource.

The operator is installed by cluster-admin and should not be accessible to other users. When using something like operator lifecycle manager (OLM), cluster-admin can approve which operators can be installed by users, allowing individual users to install and update operators. This allows a regular user to install the operator, without having cluster-admin privileges (OLM is actually performing the install/management of the operator).

The operator's lifecycle extends beyond the lifecycle of any given control plane. Currently, this may seem a little confusing, as we only support installation of a single control plane, but once multi-tenancy is supported it will make more sense.

To wrap up:

  • The operator should not be deleted while any ICP resources exist.
  • ICP resources should not be created in the same namespace as the operator.
  • Only cluster-admin should have access to the operator namespace.
  • Using OLM simplifies this entire process.

@ostromart
Contributor Author

> The simple answer is: don't delete the operator.

We don't have that option. Users tend not to follow instructions, which is clearly illustrated by the fact that this issue has happened several times, leaving users very frustrated. Users will tolerate not everything being cleaned up and having to remove some things manually, but they will get very upset about being left with a namespace they can't delete. It's particularly bad because it's close to impossible for an Istio user to know that a finalizer on the ICP is blocking their namespace deletion.
This issue is a must fix for 1.5.

> I also wonder if this is mainly an issue when the ICP resource is collocated with the operator and the operator namespace itself is deleted.

The same thing happens with the ICP in istio-system. The issue is that if the operator is deleted before it has removed the finalizer, there's nothing left to remove that finalizer and the namespace is stuck.

> Only cluster-admin should have access to the operator namespace.

In many cases the user is the cluster admin. Multi-tenant is an important use case but not the main Istio use case.

> Using OLM simplifies this entire process.

There are no plans to recommend OLM for Istio OSS project.

If someone wants to take on implementing a fix using preStop hook or propose some other way to fix this issue I'm completely open to it. However I strongly disagree that doing nothing and leaving things as they are leads to an acceptable user experience.

@rcernich
Contributor

"Don't delete the operator" is a requirement if you're using the operator.

If you're talking about a single user, who is doing everything as cluster-admin, with a single install, no multi-tenancy, and they only care about install/delete, then there's not really any reason to use the operator.

However, once the operator becomes more functional (e.g. managing certs used by webhooks), the operator is a requirement for a control plane to function normally. Deleting the operator would then require a human operator to take over the tasks that were being managed by the automated operator. That means updating the keys/certs used by the webhooks, cleaning up the control plane resources, removing the finalizer, etc. This is true whether OLM is used or not.

If this is an issue, then it should be highlighted in the documentation. The documentation should also state that the operator does more than facilitate installs (once it does more).

JFYI, the reason I mentioned cluster-admin at all is that I think it's important to clarify the scope required to use various components. A dev testing this out with minikube will have no problems. However, if they're using a managed cluster, whoever manages that cluster will care, so it's important to specify what permissions are required for the various aspects (cluster-admin to install/manage the operator, while a general user can install and manage the control plane). FWIW, this issue of moving from a full-access dev env to production was mentioned in a talk at KubeCon: everything worked fine on dev machines, but the stricter security in production prevented installation of their control plane, let alone the applications in the mesh. One of the disadvantages of doing all your dev as cluster-admin.

@istio-testing

@ostromart: PR needs rebase.


@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Feb 12, 2020

8 participants