
🐛 Fix deadlock #1579

Merged
merged 1 commit into main from fix_deadlock
Jun 9, 2023

Conversation

spjmurray
Contributor

What this PR does / why we need it:

There's a race where the infrastructure can be deleted before the machines, but deletion of the machines depends on the infrastructure, so we get stuck deleting forever (unless you manually delete the machines from Nova and remove the finalizer). The simple fix is to defer deletion of the infrastructure until the machines have been purged.
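For illustration only, a minimal Go sketch of that idea (not the actual patch; the function name, API import path and requeue interval are assumptions): the cluster controller's delete path lists the cluster's OpenStackMachines and requeues until none remain before tearing anything down.

// Sketch only, not the code in this PR. Assumes controller-runtime and the
// CAPO API package are available; the v1alpha6 import path is illustrative.
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	infrav1 "sigs.k8s.io/cluster-api-provider-openstack/api/v1alpha6"
)

func waitForMachinesGone(ctx context.Context, c client.Client, cluster *infrav1.OpenStackCluster) (ctrl.Result, error) {
	// Find all OpenStackMachines that belong to this cluster via the standard
	// CAPI cluster-name label.
	machines := &infrav1.OpenStackMachineList{}
	if err := c.List(ctx, machines,
		client.InNamespace(cluster.Namespace),
		client.MatchingLabels{"cluster.x-k8s.io/cluster-name": cluster.Name},
	); err != nil {
		return ctrl.Result{}, err
	}

	// Machines still exist: do not touch networks, routers or load balancers
	// yet, otherwise machine deletion can never complete and we deadlock.
	if len(machines.Items) > 0 {
		return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
	}

	// All machines are gone; infrastructure teardown is now safe.
	return ctrl.Result{}, nil
}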

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1578

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • if necessary:
    • includes documentation
    • adds unit tests

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 6, 2023
@netlify

netlify bot commented Jun 6, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

🔨 Latest commit: 85e51f1
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/6482dfd217f356000806310a
😎 Deploy Preview: https://deploy-preview-1579--kubernetes-sigs-cluster-api-openstack.netlify.app

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 6, 2023
@k8s-ci-robot
Contributor

Hi @spjmurray. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 6, 2023
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 6, 2023
@spjmurray spjmurray closed this Jun 6, 2023
@spjmurray spjmurray reopened this Jun 6, 2023
@spjmurray spjmurray changed the base branch from main to release-0.7 June 6, 2023 09:42
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 6, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 6, 2023
@lentzi90
Contributor

lentzi90 commented Jun 6, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 6, 2023
@spjmurray
Contributor Author

Reviewers please note this is based on 0.7, so I'm expecting a 0.7.4 (fingers crossed)... If you want it against main, please speak up!

@lentzi90
Contributor

lentzi90 commented Jun 6, 2023

I was just about to ask why this PR is for release-0.7. The normal process is to make the PR against main; then, if it makes sense to backport, we create a cherry pick for the release branch and merge that once the original PR has been merged.

@spjmurray
Contributor Author

Noted, I'll rebase...

@lentzi90
Contributor

lentzi90 commented Jun 6, 2023

Otherwise the next minor release (v0.8.0) would still have the bug 😅

@spjmurray
Contributor Author

Yeah yeah 😸 I'm totally gonna blame the necessity to spin up a 0.7.3-hack1, not me, oh no

@spjmurray spjmurray changed the base branch from release-0.7 to main June 6, 2023 10:14
@spjmurray
Contributor Author

/retest

@lentzi90
Contributor

lentzi90 commented Jun 6, 2023

Thanks for the effort so far!

I have a feeling this could be solved in a nicer way with finalizers. It seems the machine controller really needs the OpenStackCluster to do its job, so perhaps it should mark the OpenStackCluster with a finalizer to make that dependency official? We would then need to detect when the last machine is deleted and remove the finalizer at that point.

Not sure if it is a good idea or not but I would love to hear opinions 😄
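For concreteness, a rough sketch of that suggestion (purely illustrative and not what the PR ended up doing; the finalizer name is made up, and imports match the earlier sketch plus sigs.k8s.io/controller-runtime/pkg/controller/controllerutil): the machine controller pins the OpenStackCluster with an extra finalizer and only releases it once the last machine is gone.

// Sketch only: the machine controller holds a finalizer on the cluster.
const clusterInUseFinalizer = "openstackmachine.infrastructure.cluster.x-k8s.io/cluster-in-use" // made-up name

// Called from the machine controller while machines exist.
func pinCluster(ctx context.Context, c client.Client, osc *infrav1.OpenStackCluster) error {
	// Only issue an update if the finalizer was actually missing.
	if controllerutil.AddFinalizer(osc, clusterInUseFinalizer) {
		return c.Update(ctx, osc)
	}
	return nil
}

// On machine deletion, the hard part is deciding that this was the last
// machine before calling controllerutil.RemoveFinalizer(osc, clusterInUseFinalizer),
// which is exactly the "am I the last machine?" problem discussed below.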

@spjmurray
Contributor Author

Believe me, this is far less effort than having to manually clean up clusters N times a day 😄

So while I agree that finalizers would be a good solution, there are caveats. This reminded me of when I attempted, and failed, to fix the identity deletion problem (https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1393/files#diff-5caa59e6bfbef79676beb32f8373e1fd6631055a73a592042c5912fcc06e8817R130): you have to be careful about race conditions to do with resource generations etc. Just a viewpoint from having tried to mess with them before.

@spjmurray
Contributor Author

Also, detecting "I am the last machine" seems error-prone to me. The ones attached to MDs should go first, and then it looks like the kubeadm CP controller deletes the CP machines in order, so it shouldn't be too risky... but there are a lot of assumptions in there.

@mdbooth
Contributor

mdbooth commented Jun 7, 2023

My initial thoughts are:

  • I agree with @lentzi90 that this looks a bit complicated
  • I agree with @spjmurray that this is way better than doing it manually

I would personally be ok landing this as long as we've determined there's no simpler existing solution.

We could use a finalizer, but as you say I think that gets complicated. You'd presumably need a separate finalizer for each individual machine, which doesn't feel right either.

Can we add a second ownerReference with blockOwnerDeletion? Would that work?
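For reference, roughly what that could look like, as a sketch only (helper name is illustrative; uses k8s.io/apimachinery/pkg/apis/meta/v1 and k8s.io/utils/pointer in addition to the earlier imports):

// Sketch only: give each OpenStackMachine a second ownerReference pointing at
// the OpenStackCluster, so garbage collection treats the cluster as an owner
// of every machine.
func clusterOwnerRef(osc *infrav1.OpenStackCluster) metav1.OwnerReference {
	return metav1.OwnerReference{
		APIVersion: infrav1.GroupVersion.String(),
		Kind:       "OpenStackCluster",
		Name:       osc.Name,
		UID:        osc.UID,
		// Only blocks removal of the owner when it is deleted with foreground
		// cascading deletion; otherwise the field is informational.
		BlockOwnerDeletion: pointer.Bool(true),
	}
}

// Usage sketch: append the reference when reconciling a machine, e.g. with
// CAPI's util.EnsureOwnerRef helper, then patch the machine. As noted in the
// next comment, this also means deleting the OpenStackCluster cascades down
// to the machines.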

@spjmurray
Contributor Author

So, if I understand this right, every OSM would be "owned" by the OSC, which would prevent deletion of the OSC until all dependent OSMs are terminated 🤖. The only side effect is that deletion of the OSC would propagate down to the OSMs, outside of the control of the KCP; dunno how that'd react! That said, it has legs, so I'll give it a spin on the flipside. Gotta enjoy that sun before the UK reverts to torrential rain.

@spjmurray
Contributor Author

Meh, I found time @mdbooth 😸 It appears to work from a quick test, but I'll let you know more tomorrow. Not quite sure how it's going to play out with conversion webhooks and GVKs changing all the time. I'll let you muse on that...

@lentzi90
Contributor

lentzi90 commented Jun 8, 2023

I have done some digging! We could probably take inspiration from CAPI for this. They use a finalizer, but not in the way I suggested. The control plane controller just checks that all machines are gone before removing its own finalizer.
What do you think about this approach?

Ref:
https://github.com/kubernetes-sigs/cluster-api/blob/6c6543448bee351430e48f39883265f409c44662/controlplane/kubeadm/internal/controllers/controller.go#L434-L445
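Transplanted to the OpenStackCluster controller, that pattern would look roughly like this (a sketch under the same assumptions as the earlier one, plus sigs.k8s.io/cluster-api/api/v1beta1 as clusterv1 and the controllerutil helpers; the finalizer constant and teardown step are placeholders, not the merged code):

// Sketch of the CAPI-style guard: keep our own finalizer on the
// OpenStackCluster until no descendant CAPI Machines remain.
func reconcileDelete(ctx context.Context, c client.Client, cluster *infrav1.OpenStackCluster) (ctrl.Result, error) {
	machines := &clusterv1.MachineList{}
	if err := c.List(ctx, machines,
		client.InNamespace(cluster.Namespace),
		client.MatchingLabels{"cluster.x-k8s.io/cluster-name": cluster.Name},
	); err != nil {
		return ctrl.Result{}, err
	}

	if len(machines.Items) > 0 {
		// Machines still need the infrastructure; check again later instead
		// of removing the finalizer.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}

	// All machines are gone: tear down OpenStack resources here, then release
	// the object. infrav1.ClusterFinalizer stands in for whatever finalizer
	// the controller actually owns.
	controllerutil.RemoveFinalizer(cluster, infrav1.ClusterFinalizer)
	return ctrl.Result{}, c.Update(ctx, cluster)
}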

@jichenjc
Contributor

jichenjc commented Jun 8, 2023

There are some comments in that function:

// The implementation does not take non-control plane workloads into consideration. This may or may not change in the future.
// Please see https://github.com/kubernetes-sigs/cluster-api/issues/2064.

and I agree, checking the number of machines seems like the right way to go.

@spjmurray
Contributor Author

I've updated to that methodology and updated the commit. Gotta say it's ostensibly the same as my initial commit 😄 I dunno if adding the context back in and making the delete function a receiver is a good thing or not... I think that's why I initially had the check at the top level 🤷🏻 Well, you have 3 commits to consider, so pick whichever is best. It tests just fine again.

Contributor

@lentzi90 lentzi90 left a comment

Looks good! Just squash the commits and I think this is good to go in 🙂

@spjmurray
Contributor Author

You know, I'm not so sure; I've seen a couple of hangs now with the CAPI kubeadm control plane controller. Perhaps there's still a race where the load balancer/API vanishes too early...

{"ts":1686232434559.7458,"caller":"cache/reflector.go:424","msg":"k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169: failed to list *v1.ClusterRole: Get \"https://185.47.227.202:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?resourceVersion=5519&timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\n","v":0}
{"ts":1686232434559.7974,"caller":"trace/trace.go:205","msg":"Trace[1708403150]: \"Reflector ListAndWatch\" name:k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169 (08-Jun-2023 13:53:44.559) (total time: 10000ms):\nTrace[1708403150]: ---\"Objects listed\" error:Get \"https://185.47.227.202:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?resourceVersion=5519&timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) 10000ms (13:53:54.559)\nTrace[1708403150]: [10.000743907s] [10.000743907s] END\n","v":0}
{"ts":1686232434559.815,"caller":"cache/reflector.go:140","msg":"k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: Get \"https://185.47.227.202:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?resourceVersion=5519&timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\n"}

Restarting the service kicks it back into life. Which ain't perfect. Perhaps blocking deletion was the right approach, rather than blocking removal of the finalizer. I shall be back once I've tested some more.

@spjmurray
Contributor Author

Right, I am definitely happy that this is hang free, for now!

Fix Deletion Deadlock

There's a race where the infrastructure can be deleted before the machines, but deletion of the machines depends on the infrastructure, and we get stuck deleting forever (unless you manually delete the machines from Nova and remove the finalizer). The simple fix is to defer deletion of the infrastructure until the machines have been purged.
Contributor

@lentzi90 lentzi90 left a comment

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2023
@tobiasgiese
Member

tobiasgiese commented Jun 9, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 9, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lentzi90, spjmurray, tobiasgiese

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [lentzi90,tobiasgiese]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lentzi90
Contributor

lentzi90 commented Jun 9, 2023

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 9, 2023
@k8s-ci-robot k8s-ci-robot merged commit 45523c8 into kubernetes-sigs:main Jun 9, 2023
@tobiasgiese
Member

Do we want to cherry-pick this fix?

@lentzi90
Contributor

lentzi90 commented Jun 9, 2023

I think so
/cherry-pick release-0.7

@k8s-infra-cherrypick-robot

@lentzi90: #1579 failed to apply on top of branch "release-0.7":

Applying: Fix Deletion Deadlock
Using index info to reconstruct a base tree...
M	controllers/openstackcluster_controller.go
Falling back to patching base and 3-way merge...
Auto-merging controllers/openstackcluster_controller.go
CONFLICT (content): Merge conflict in controllers/openstackcluster_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Fix Deletion Deadlock
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

I think so
/cherry-pick release-0.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lentzi90
Contributor

lentzi90 commented Jun 9, 2023

@spjmurray would you mind creating a PR for release-0.7 manually?

@spjmurray
Contributor Author

I knew this would happen 😄 10 minutes...

@spjmurray spjmurray deleted the fix_deadlock branch June 9, 2023 13:46