Deleting the namespace holding a CR usually orphans cluster-scoped resources "owned" by this CR #137
Comments
With this option, what would happen if a user just went in and manually tried to delete the helm release secret? It seems like the following steps would occur:

- The deletion of the helm release secret would hang until the finalizer was deleted, and we would be unable to process any further helm upgrades (you can't update an object that has a deletion timestamp).

It seems like the only option would be to have the reconciler remove the finalizer, let the deletion proceed, and then re-install the release to get the secret re-created. But that seems like it would make handling the "delete the namespace" case problematic, since you can't re-create anything in a terminating namespace (see the sketch at the end of this comment).

I'm not a huge fan of the alternative solution, for the reason you mention (there's no guarantee the secret matches the CR).

In general (and unfortunately) there are lots of issues with deleting namespaces without cleaning up inside them first. Off the top of my head:
My point being, perhaps a simpler immediate solution is to document that namespaces containing CRs managed by helm-based operators should not be deleted. Nonetheless, I'm definitely open to ideas for a way to solve this and avoid these races altogether.
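Regarding the "terminating namespace" point above, here is a minimal sketch (not code from this repo) of how a reconciler could check whether the CR's namespace is already being deleted before trying to re-create anything in it. The function name and wiring are illustrative only.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// namespaceTerminating reports whether the given namespace is being deleted,
// i.e. whether re-creating objects inside it would be rejected.
func namespaceTerminating(ctx context.Context, c client.Client, name string) (bool, error) {
	var ns corev1.Namespace
	if err := c.Get(ctx, types.NamespacedName{Name: name}, &ns); err != nil {
		return false, err
	}
	// Either signal works: a deletion timestamp on the namespace, or the Terminating phase.
	return ns.DeletionTimestamp != nil || ns.Status.Phase == corev1.NamespaceTerminating, nil
}
```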
Currently I think the only safe option would be to:
This would solve the "namespace deletion" case. As for the "user decided to delete just the release secret" case, this should at least guide them to what they did wrong, by following the breadcrumbs of secret -> ownerRef -> CR -> status (see the sketch below).

In addition to conflicting with namespace deletion, I don't see how letting the secret deletion proceed and re-installing the release could be made reliable. AFAICT the release secret is the only state storage mechanism we have. Once we let it disappear, we lose track of what resources should be deleted - the set of resources derived from the current CR spec may no longer match what was actually installed.

Eventually, I think this is a perfect use case for the KEP mentioned below (kubernetes/enhancements#2840).
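To illustrate the breadcrumbs idea, here is a rough sketch that lists the Helm v3 release secrets for a release and prints their owner references. It assumes the standard Helm v3 secret labels (`owner=helm`, `name=<release>`), that the release name matches the CR name, and that the release secret carries an owner reference back to the CR, as suggested above; the namespace and names come from the example in this issue.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	ctx := context.Background()
	cs := kubernetes.NewForConfigOrDie(ctrl.GetConfigOrDie())

	// Helm v3 stores release state in secrets labelled owner=helm,name=<release>.
	// "foohome" and "myfoo" are the example namespace and CR/release name from this issue.
	secrets, err := cs.CoreV1().Secrets("foohome").List(ctx, metav1.ListOptions{
		LabelSelector: "owner=helm,name=myfoo",
	})
	if err != nil {
		panic(err)
	}
	for _, s := range secrets.Items {
		for _, ref := range s.OwnerReferences {
			// Each owner reference should lead back to the Foo CR that drives the release.
			fmt.Printf("%s -> owned by %s %s\n", s.Name, ref.Kind, ref.Name)
		}
	}
}
```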
Oh TIL about that KEP. Big 👍 from me. That feature would be a big win for operators.

One of the root causes of this problem is that namespace-scoped objects (the CR controlling a helm release) can't be owners of cluster-scoped objects or objects in other namespaces. If the CR was cluster-scoped:

So I'm wondering if we should investigate support for cluster-scoped CRs in the helm operator. It would be much easier to say: "Any helm chart that deploys cluster-scoped objects should be managed by a cluster-scoped CR." It seems like this would be fairly straightforward to implement, and the only big hurdle to overcome would be where to store the release secret (and that could likely be solved by a CLI flag, environment variable, or maybe even an annotation on the custom object).

In fact, I was experimenting with a completely unrelated project recently and decided to try using this library to make a cluster-scoped object own and reconcile a helm release, and it generally seemed to work. This is all I had to do. That's clearly begging for a better interface, but it's clear to me that it's pretty doable.
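The exact snippet referenced by "This is all I had to do" is not reproduced here; the following is only a rough sketch of what such wiring might look like, assuming the `reconciler.New` / `WithChart` / `WithGroupVersionKind` options from this repo's `pkg/reconciler` package. The group, version, kind, and chart path are invented, and the open question from this comment (where the release secret for a cluster-scoped CR would live) is not addressed.

```go
package main

import (
	"helm.sh/helm/v3/pkg/chart/loader"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"

	"github.com/operator-framework/helm-operator-plugins/pkg/reconciler"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// Load the chart that renders the cluster-scoped objects (path is illustrative).
	chrt, err := loader.Load("helm-charts/foo")
	if err != nil {
		panic(err)
	}

	// A cluster-scoped kind (made up here) whose instances drive the release.
	gvk := schema.GroupVersionKind{Group: "example.com", Version: "v1alpha1", Kind: "ClusterFoo"}

	r, err := reconciler.New(
		reconciler.WithChart(*chrt),
		reconciler.WithGroupVersionKind(gvk),
	)
	if err != nil {
		panic(err)
	}
	if err := r.SetupWithManager(mgr); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```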
While adding support for cluster-scoped CRs might be a useful feature, I don't think we should recommend that operator developers make their CR cluster-scoped just because their helm chart happens to contain a couple of cluster-scoped resources. There are valid reasons to keep things in namespaces, access control and resource quotas being primary ones. So I don't think it counts as a solution to this issue.

Incidentally, when designing the API of the project we use helm-operator-plugins for, one of the hardest parts was deciding whether the CRs should be cluster- or namespace-scoped. One of our CRs represents a typical web app, which is perfectly suited to being namespace-scoped. The hard part was what to do with a handful of cluster-scoped resources that the associated helm chart happened to create:

In the end we decided to go namespace-scoped and rely on a finalizer to help us clean up resources that k8s GC would not handle. At that time we did not even know that the helm operator did not support cluster-scoped CRs, but it would have been another hint in that direction. After some time, we changed to using one of the stock SCCs, and now that PSPs are on their way to removal, most of the reasons to go cluster-scoped are gone. If we had decided to go cluster-scoped we would be in an awkward place, since the scope of a CR can never be changed, even across API versions.

In general, exactly because of the asymmetry you mentioned, it seems to me like going cluster-scoped is a slippery slope. Once you go down that route, you find that more and more things need to be cluster-scoped.
Since kubernetes/enhancements#2840 is still in the KEP phase, I'd like to restart the discussion assuming we might never get it after all.
I have a follow-up question: what happens currently, let's say in the worst possible corner case of the operator controller being down for a moment and the CR getting changed or deleted in the meantime? I'm assuming we do not lose track of the operand resources, thanks to the owner references. So what is the actual impact? If there is none, then why do we need this secret anyway? I must be confused 🤔
The issue
Let's say we have a helm-based operator for a `Foo` kind, whose backing helm chart renders some namespace-scoped resources, as well as some cluster-scoped resources, say a `ValidatingWebhookConfiguration` called `${foo-namespace-and-name}-foo-validation`.

Now take the following sequence of events:
Setup
1. The user creates the `foohome` namespace,
2. and a `myfoo` instance of `Foo` in it.
3. The operator adds the `uninstall-helm-release` finalizer to `myfoo`.
4. The operator creates the namespace-scoped resources in `foohome`, as well as the `foohome-myfoo-foo-validation` `ValidatingWebhookConfiguration` object; this also creates the helm "release" secret.

Now we have two deletion scenarios.
Deletion of CR alone
1. The user runs `kubectl delete foo myfoo -n foohome`.
2. The operator sees the deletion timestamp on `myfoo`, uninstalls the release (removing the namespace-scoped resources and the `ValidatingWebhookConfiguration`), and removes the finalizer.
3. `myfoo` vanishes.

So everything works as intended.
Deletion of the namespace
1. The user runs `kubectl delete ns foohome`.
2. Kubernetes starts deleting everything in the `foohome` namespace.
3. The operator sees the `myfoo` deletion. But by then the release secret is gone. So all that the operator seems to do is log `INFO controllers.Helm Release not found, removing finalizer` and remove the finalizer.
4. `myfoo` goes away, and eventually the `foohome` namespace as well.

In this scenario, the `foohome-myfoo-foo-validation` `ValidatingWebhookConfiguration` stays around indefinitely.

Moreover, if it was not for the disambiguating prefix on the cluster-scoped resource name, then when the user now retries steps (1) and (2) with a different name, the operator would fail to reconcile with an error such as:
I used "usually" in the title because this is a race, but a race that I found is usually lost by the helm operator.
Ideas for fixing this
Put a finalizer on the helm release secret as well, and only remove it when about to remove the finalizer on the CR.
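A minimal sketch of what that first idea could look like using controller-runtime's `controllerutil` helpers (a recent controller-runtime is assumed); the finalizer name and helper functions are hypothetical, not existing operator behavior.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Hypothetical finalizer name; not used by the operator today.
const releaseSecretFinalizer = "helm.operator-sdk/keep-release-secret"

// protectReleaseSecret adds the finalizer so the release secret survives namespace
// deletion long enough for uninstall to run.
func protectReleaseSecret(ctx context.Context, c client.Client, secret *corev1.Secret) error {
	if controllerutil.AddFinalizer(secret, releaseSecretFinalizer) {
		return c.Update(ctx, secret)
	}
	return nil
}

// unprotectReleaseSecret drops the finalizer right before the CR's own finalizer is removed.
func unprotectReleaseSecret(ctx context.Context, c client.Client, secret *corev1.Secret) error {
	if controllerutil.RemoveFinalizer(secret, releaseSecretFinalizer) {
		return c.Update(ctx, secret)
	}
	return nil
}
```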
Alternative: when the CR is being deleted, then instead of depending on the release secret, do a dry-run processing of the helm chart and delete everything it spits out. Note that this is not reliable, because the CR spec might have changed between the moment when the controller last acted on it and now (leading to a change in the list of resources produced).
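For reference, a sketch of what that dry-run rendering could look like with Helm's Go SDK (`helm.sh/helm/v3/pkg/action`); the chart path, release name, namespace, and values are illustrative, and, as noted above, the result may not match what was actually installed.

```go
package main

import (
	"fmt"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart/loader"
)

// renderManifests does a client-side dry-run install and returns the manifests
// the chart would produce for the given values.
func renderManifests(chartPath, releaseName, namespace string, values map[string]interface{}) (string, error) {
	chrt, err := loader.Load(chartPath)
	if err != nil {
		return "", err
	}

	cfg := new(action.Configuration)
	cfg.Log = func(string, ...interface{}) {} // no-op logger for this sketch

	// ClientOnly + DryRun render the templates without touching the cluster.
	install := action.NewInstall(cfg)
	install.DryRun = true
	install.ClientOnly = true
	install.ReleaseName = releaseName
	install.Namespace = namespace

	rel, err := install.Run(chrt, values)
	if err != nil {
		return "", err
	}
	// rel.Manifest is the rendered multi-document YAML; a real implementation
	// would parse it and issue a delete for each object it contains.
	return rel.Manifest, nil
}

func main() {
	manifest, err := renderManifests("helm-charts/foo", "myfoo", "foohome", map[string]interface{}{})
	if err != nil {
		panic(err)
	}
	fmt.Println(manifest)
}
```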