
[chore] finalizer removal & teardown logic cleanup #1522

Merged (19 commits) on Jul 12, 2021

Conversation

shaneutt
Contributor

@shaneutt shaneutt commented Jul 9, 2021

What this PR does / why we need it:

  • corrects issues with finalizers that have been holding over since the early railgun prototypes
  • improves consistency on object deletions
  • adds tests to prove new finalizer logic works appropriately

Which issue this PR fixes

fixes #1515

@shaneutt shaneutt added priority/medium area/maintenance Cleanup, refactoring, and other maintenance improvements that don't change functionality. labels Jul 9, 2021
@shaneutt shaneutt self-assigned this Jul 9, 2021
@shaneutt shaneutt temporarily deployed to Configure ci July 9, 2021 16:10 Inactive
@shaneutt shaneutt linked an issue Jul 9, 2021 that may be closed by this pull request
2 tasks
@codecov

codecov bot commented Jul 9, 2021

Codecov Report

Merging #1522 (7fdab82) into next (815a74c) will decrease coverage by 1.12%.
The diff coverage is 23.41%.


@@            Coverage Diff             @@
##             next    #1522      +/-   ##
==========================================
- Coverage   51.44%   50.31%   -1.13%     
==========================================
  Files          91       91              
  Lines        6316     6423     +107     
==========================================
- Hits         3249     3232      -17     
- Misses       2773     2884     +111     
- Partials      294      307      +13     
| Flag | Coverage Δ |
|------|------------|
| integration-test | 44.66% <23.45%> (-3.58%) ⬇️ |
| unit-test | 38.66% <2.77%> (-0.28%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| ...gun/hack/generators/controllers/networking/main.go | 0.00% <ø> (ø) |
| railgun/internal/ctrlutils/utils.go | 86.11% <ø> (+1.79%) ⬆️ |
| ...trollers/configuration/zz_generated_controllers.go | 30.13% <21.71%> (-17.71%) ⬇️ |
| pkg/store/store.go | 45.32% <23.07%> (+4.65%) ⬆️ |
| ...n/internal/proxy/clientgo_cached_proxy_resolver.go | 75.22% <100.00%> (+1.19%) ⬆️ |
| pkg/parser/parser.go | 83.26% <0.00%> (-1.26%) ⬇️ |

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 815a74c...7fdab82. Read the comment docs.

@shaneutt shaneutt temporarily deployed to Configure ci July 9, 2021 16:22 Inactive
@shaneutt shaneutt temporarily deployed to Configure ci July 9, 2021 16:22 Inactive
@shaneutt shaneutt force-pushed the finalizer-cleanup branch from efe4b84 to 0d14ba4 on July 9, 2021 18:24
@shaneutt shaneutt temporarily deployed to Configure ci July 9, 2021 18:24 Inactive
@shaneutt shaneutt temporarily deployed to Configure ci July 9, 2021 18:24 Inactive
@shaneutt shaneutt added the do not merge let the author merge this, don't merge for them. label Jul 9, 2021
@shaneutt
Contributor Author

shaneutt commented Jul 9, 2021

Flagging as do-not-merge because I want to do some rebasing before we merge this.

@shaneutt shaneutt marked this pull request as ready for review July 9, 2021 18:36
@shaneutt shaneutt requested a review from a team as a code owner July 9, 2021 18:36
@shaneutt shaneutt requested a review from rainest July 9, 2021 18:36
@shaneutt shaneutt temporarily deployed to Configure ci July 9, 2021 18:42 Inactive
@shaneutt shaneutt temporarily deployed to Configure ci July 9, 2021 18:42 Inactive
@shaneutt shaneutt requested review from mflendrich and ccfishk July 9, 2021 19:35
@rainest
Contributor

rainest commented Jul 9, 2021

tl;dr

  • This looks probably fine, and the changes other than the finalizer template code change itself look definitely fine.
  • While this fixes any issues with Secrets by simply not attaching finalizers to them, I can't see why they were affected before but other resources weren't. Would that issue not also affect Ingresses, which still have finalizers?
  • Do we think we need finalizers at all? After not really knowing about them, I've done research and discovered that they have ALL SORTS OF COMPLICATIONS. We didn't have them in 1.x and it doesn't appear that their use case is truly justified for us. Do we want to continue using them at all?

The commits not directly related to the finalizer stuff look fine, but I have many questions about the finalizers.

Do we know why this was a problem for Secrets specifically? While this does remove our finalizer from them, it remains on other resources (e.g. Ingresses), and I'd expect that those also block namespace deletion if not removed--AFAIK the problem is that K8S won't allow deletion of a namespace without deleting its contents, and while it cascades these deletes automatically, the cascaded deletes won't finish until all finalizers are cleared from those objects.

The original section of the reconciler code in the current tip of next did add our finalizer to everything, but it also apparently removed it from everything being deleted. Was there some reason that Secrets failed the timestamp condition there, or consistently resulted in an error when we called err := r.Proxy.DeleteObject(obj)? The latter seems more plausible (it's our code and it's more complex), but a brief review of that code indicates it's only deleting from cache, which should succeed.

That all said, this and the error Jimin encountered (which is caused by me not properly adding permissions in #1457) makes me question whether we need the finalizers at all. The kubebuilder example includes some logic we don't have to confirm the presence of the finalizer before doing things:

        if containsString(cronJob.GetFinalizers(), myFinalizerName) {
            // our finalizer is present, so let's handle any external dependency
            if err := r.deleteExternalResources(cronJob); err != nil {
                return ctrl.Result{}, err // retry later if the external cleanup fails
            }
        }

but we don't: we simply delete all deleting resources from our cache before stripping the finalizer, without any check to see if the finalizer is present. We still do that in the proposed version, though with additional logic to requeue.

We won't remove our finalizer on resources that still include it until the proxy delete actually succeeds (because the requeue return precedes the finalizer clear), but we'll requeue indefinitely until the proxy delete does in fact succeed whether or not the finalizer is set.

The class annotation distinction between resources that get finalizers and those that don't seems rather arbitrary, as anything we reconcile influences Kong proxy configuration. For example, Ingresses (which need class annotations and will get a finalizer) result in routes, whereas KongPlugins attached to a Service (which do not need class annotations and will not get a finalizer) result in a plugin on a Kong service. It's not clear why we'd block resource deletion on one but not the other.

Kubernetes guidance on when and why to use finalizers is unfortunately rather unclear, but my rough takeaway is that you use them when:

  • You really want to avoid deleting something until some other condition is met (e.g. deleting a PV with an active PVC is almost certainly a bad idea).
  • You really want to ensure some action elsewhere (e.g. failing to delete a cloud provider load balancer bound to a LoadBalancer means it sits around costing you money until you realize deletion failed and destroy it manually).

For us:

  • We don't do anything like the first case currently, though we could do something like add a finalizer to a KongPlugin after we see an annotation referencing it on an Ingress: you would not be able to remove KongPlugins until you remove references to them.
  • Our deletions are like the second case, though the justification is quite weak IMO. We do (now) try to eventually remove the object from the cache (which will in turn purge it from config) and in the worst case, can fix cases where retries never succeed by just restarting the controller, which will build a fresh cache from extant resources only. Furthermore, removing an object from the cache is no guarantee that the config is not still active, since it's quite possible to remove from cache, regenerate a config, and fail to apply config because the proxy rejects it.

In general, the UX for finalizers also seems poor:

  1. Finalizers basically just hang kubectl (since the delete never finishes) or leave the resource in place until the finalizer clears (if you get bored and terminate the hung kubectl). It doesn't appear that there's any indication that the finalizer is what blocks it--there are no messages in kubectl delete output, even at V6, and no events in kubectl describe that refer to them--beyond the presence of the finalizer in the YAML itself.
  2. Finalizers don't provide information about deletion failures on resources themselves. You have to know to go check the finalizer owner's logs to see why it's not being removed, which is what we'd instruct users to do anyway if config is stuck (which usually happens because config is rejected, not because resources are stuck in cache).
  3. It's furthermore not clear to me what should/does happen if you stop the controller and then want to remove something it finalized. It doesn't look like we remove the finalizer on shutdown, only on resource deletion. You're just stuck with those resources until you restart the controller.

@shaneutt
Contributor Author

> tl;dr
>
> * This looks probably fine, and the changes other than the finalizer template code change itself look definitely fine.
>
> * While this fixes any issues with Secrets by simply not attaching finalizers to them, I can't see why they were affected before but other resources weren't. Would that issue not also affect Ingresses, which still have finalizers?

This was happening due to a race condition: the tests weren't verifying that their namespaces were properly torn down, so the controller would be shut down before it could process all finalizers.

This is now fixed with the namespace teardown code.

> * Do we think we need finalizers at all? After not really knowing about them, I've done research and discovered that they have ALL SORTS OF COMPLICATIONS. We didn't have them in 1.x and it doesn't appear that their use case is truly justified for us. Do we want to continue using them at all?


I totally feel you on this.

Long story short: when this was originally written, the intention was to ensure the relevant configuration in the proxy was actually removed before letting the objects delete. Technical limitations in our integration with the Kong Admin API make that really hard to pull off right now, so it didn't actually get implemented as originally intended.

So ultimately, given that it's not feasible to make good use of them, I'll pull them all out; we can add them back later if we find solid use cases. 👍

shaneutt added 3 commits July 9, 2021 22:37
These were requested changes from a previous code review
which I accidentally left in my git stash; committing
them now so they are in.
shaneutt added 2 commits July 9, 2021 22:37
Originally we added finalizers with the intention of
holding back object deletion until we were certain that
we had removed the relevant configuration from the Kong
Admin API. At the time of writing it's not feasible to
get this done, so this removes the half-finished
implementation, which was causing us complexity without
any added value.
@shaneutt shaneutt force-pushed the finalizer-cleanup branch from f76ff39 to 1553b0f on July 10, 2021 03:15
@shaneutt shaneutt temporarily deployed to Configure ci July 10, 2021 03:15 Inactive
@shaneutt shaneutt temporarily deployed to Configure ci July 10, 2021 03:16 Inactive
@shaneutt
Contributor Author

shaneutt commented Jul 10, 2021

@rainest I'm totally with you on finalizers.

Originally when they were added in the railgun prototype the intention was to ensure consistency: the contents that result from objects should be removed from the Kong Proxy state before removing relevant objects from Kubernetes. However, given some of the limitations that exist today, the desired consistency is not yet feasible.

All that said, I'm in agreement with you, check 1553b0f and let me know what you think 👍

@shaneutt shaneutt requested a review from ccfishk July 10, 2021 03:17
@shaneutt shaneutt changed the title Finalizer cleanup [chore] finalizer removal & teardown logic cleanup Jul 10, 2021
@shaneutt
Contributor Author

gonna get things working and then re-open 👍

@shaneutt shaneutt closed this Jul 10, 2021
@shaneutt
Contributor Author

Fixed and working locally now 👍

@shaneutt shaneutt reopened this Jul 11, 2021
@shaneutt shaneutt temporarily deployed to Configure ci July 11, 2021 03:37 Inactive
@shaneutt shaneutt temporarily deployed to Configure ci July 11, 2021 03:37 Inactive
@shaneutt shaneutt temporarily deployed to Configure ci July 11, 2021 18:59 Inactive
@shaneutt shaneutt temporarily deployed to Configure ci July 11, 2021 20:06 Inactive
@shaneutt shaneutt temporarily deployed to Configure ci July 12, 2021 17:43 Inactive
@shaneutt shaneutt temporarily deployed to Configure ci July 12, 2021 17:43 Inactive
Labels
area/maintenance Cleanup, refactoring, and other maintenance improvements that don't change functionality. ci/license/unchanged do not merge let the author merge this, don't merge for them. priority/medium
Development

Successfully merging this pull request may close these issues.

KIC 2.0 - Secret Finalizers Stuck
3 participants