
🐛 Fix deadlock #1579

Merged
merged 1 commit into main from fix_deadlock
Jun 9, 2023

Conversation

spjmurray
Contributor

What this PR does / why we need it:

There's a race where the infrastructure can be deleted before the machines, but deletion of the machines depends on the infrastructure, so we get stuck deleting forever (unless you manually delete the machines from Nova and remove the finalizer). The simple fix is to defer deletion of the infrastructure until the machines have been purged.
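For illustration only, a minimal Go sketch of that idea (not the actual patch; the function name, API import path and requeue interval are assumptions): the cluster controller's delete path lists the cluster's OpenStackMachines and requeues until none remain before tearing anything down.

// Sketch only, not the code in this PR. Assumes controller-runtime and the
// CAPO API package are available; the v1alpha6 import path is illustrative.
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	infrav1 "sigs.k8s.io/cluster-api-provider-openstack/api/v1alpha6"
)

func waitForMachinesGone(ctx context.Context, c client.Client, cluster *infrav1.OpenStackCluster) (ctrl.Result, error) {
	// Find all OpenStackMachines that belong to this cluster via the standard
	// CAPI cluster-name label.
	machines := &infrav1.OpenStackMachineList{}
	if err := c.List(ctx, machines,
		client.InNamespace(cluster.Namespace),
		client.MatchingLabels{"cluster.x-k8s.io/cluster-name": cluster.Name},
	); err != nil {
		return ctrl.Result{}, err
	}

	// Machines still exist: do not touch networks, routers or load balancers
	// yet, otherwise machine deletion can never complete and we deadlock.
	if len(machines.Items) > 0 {
		return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
	}

	// All machines are gone; infrastructure teardown is now safe.
	return ctrl.Result{}, nil
}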

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1578

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • if necessary:
    • includes documentation
    • adds unit tests

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 6, 2023
@netlify

netlify bot commented Jun 6, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

🔨 Latest commit: 85e51f1
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/6482dfd217f356000806310a
😎 Deploy Preview: https://deploy-preview-1579--kubernetes-sigs-cluster-api-openstack.netlify.app

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 6, 2023
@k8s-ci-robot
Contributor

Hi @spjmurray. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 6, 2023
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 6, 2023
@spjmurray spjmurray closed this Jun 6, 2023
@spjmurray spjmurray reopened this Jun 6, 2023
@spjmurray spjmurray changed the base branch from main to release-0.7 June 6, 2023 09:42
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 6, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 6, 2023
@lentzi90
Contributor

lentzi90 commented Jun 6, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 6, 2023
@spjmurray
Contributor Author

Reviewers please note this is based on 0.7, so I'm expecting a 0.7.4 (fingers crossed)... If you want it against main, please speak up!

@lentzi90
Contributor

lentzi90 commented Jun 6, 2023

I was just about to ask why this PR is for release-0.7. The normal process is to make the PR against main; then, if it makes sense to backport, we create a cherry pick for the release branch and merge that once the original PR has been merged.

@spjmurray
Contributor Author

Noted, I'll rebase...

@lentzi90
Contributor

lentzi90 commented Jun 6, 2023

Otherwise the next minor release (v0.8.0) would still have the bug 😅

@spjmurray
Contributor Author

Yeah yeah 😸 I'm totally gonna blame the necessity to spin up a 0.7.3-hack1, not me, oh no

@spjmurray spjmurray changed the base branch from release-0.7 to main June 6, 2023 10:14
@spjmurray
Contributor Author

/retest

@lentzi90
Contributor

lentzi90 commented Jun 6, 2023

Thanks for the effort so far!

I have a feeling this could be solved in a nicer way with finalizers. It seems the machine controller really needs the OpenStackCluster to do its job, so perhaps it should mark the OpenStackCluster with a finalizer to make that dependency official? We would then need to detect when the last machine is deleted and remove the finalizer at that point.

Not sure if it is a good idea or not but I would love to hear opinions 😄
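For concreteness, a rough sketch of that suggestion (purely illustrative and not what the PR ended up doing; the finalizer name is made up, and imports match the earlier sketch plus sigs.k8s.io/controller-runtime/pkg/controller/controllerutil): the machine controller pins the OpenStackCluster with an extra finalizer and only releases it once the last machine is gone.

// Sketch only: the machine controller holds a finalizer on the cluster.
const clusterInUseFinalizer = "openstackmachine.infrastructure.cluster.x-k8s.io/cluster-in-use" // made-up name

// Called from the machine controller while machines exist.
func pinCluster(ctx context.Context, c client.Client, osc *infrav1.OpenStackCluster) error {
	// Only issue an update if the finalizer was actually missing.
	if controllerutil.AddFinalizer(osc, clusterInUseFinalizer) {
		return c.Update(ctx, osc)
	}
	return nil
}

// On machine deletion, the hard part is deciding that this was the last
// machine before calling controllerutil.RemoveFinalizer(osc, clusterInUseFinalizer),
// which is exactly the "am I the last machine?" problem discussed below.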

@spjmurray
Contributor Author

Believe me, this is far less effort than having to manually clean up clusters N times a day 😄

So while I agree that finalizers would be a good solution, there are caveats. This reminded me of when I attempted, and failed, to fix the identity deletion problem (https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1393/files#diff-5caa59e6bfbef79676beb32f8373e1fd6631055a73a592042c5912fcc06e8817R130): you have to be careful about race conditions to do with resource generations etc. Just a viewpoint from having tried to mess with them before.

@spjmurray
Contributor Author

Also, detecting "I am the last machine" seems error-prone to me. The ones attached to MDs should go first, and then it looks like the kubeadm CP controller deletes the CP machines in order, so it shouldn't be too risky... but there are a lot of assumptions in there.

@mdbooth
Contributor

mdbooth commented Jun 7, 2023

My initial thoughts are:

  • I agree with @lentzi90 that this looks a bit complicated
  • I agree with @spjmurray that this is way better than doing it manually

I would personally be ok landing this as long as we've determined there's no simpler existing solution.

We could use a finalizer, but as you say I think that gets complicated. You'd presumably need a separate finalizer for each individual machine, which doesn't feel right either.

Can we add a second ownerReference with blockOwnerDeletion? Would that work?
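For reference, roughly what that could look like, as a sketch only (helper name is illustrative; uses k8s.io/apimachinery/pkg/apis/meta/v1 and k8s.io/utils/pointer in addition to the earlier imports):

// Sketch only: give each OpenStackMachine a second ownerReference pointing at
// the OpenStackCluster, so garbage collection treats the cluster as an owner
// of every machine.
func clusterOwnerRef(osc *infrav1.OpenStackCluster) metav1.OwnerReference {
	return metav1.OwnerReference{
		APIVersion: infrav1.GroupVersion.String(),
		Kind:       "OpenStackCluster",
		Name:       osc.Name,
		UID:        osc.UID,
		// Only blocks removal of the owner when it is deleted with foreground
		// cascading deletion; otherwise the field is informational.
		BlockOwnerDeletion: pointer.Bool(true),
	}
}

// Usage sketch: append the reference when reconciling a machine, e.g. with
// CAPI's util.EnsureOwnerRef helper, then patch the machine. As noted in the
// next comment, this also means deleting the OpenStackCluster cascades down
// to the machines.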

@spjmurray
Contributor Author

So, if I understand this right, every OSM would be "owned" by the OSC, which would prevent deletion of the OSC until all dependent OSMs are terminated 🤖. The only side effect is that deletion of the OSC would propagate down to the OSMs, outside of the control of the KCP; dunno how that'd react! That said, it has legs, so I'll give it a spin on the flipside. Gotta enjoy that sun before the UK reverts to torrential rain.

@spjmurray
Contributor Author

Meh, I found time @mdbooth 😸 It appears to work from a quick test, but I'll let you know more tomorrow. Not quite sure how it's going to play out with conversion webhooks and GVKs changing all the time. I'll let you muse on that...

@lentzi90
Contributor

lentzi90 commented Jun 8, 2023

I have done some digging! We could probably take inspiration from CAPI for this. They use a finalizer, but not in the way I suggested. The control plane controller just checks that all machines are gone before removing its own finalizer.
What do you think about this approach?

Ref:
https://github.com/kubernetes-sigs/cluster-api/blob/6c6543448bee351430e48f39883265f409c44662/controlplane/kubeadm/internal/controllers/controller.go#L434-L445
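Transplanted to the OpenStackCluster controller, that pattern would look roughly like this (a sketch under the same assumptions as the earlier one, plus sigs.k8s.io/cluster-api/api/v1beta1 as clusterv1 and the controllerutil helpers; the finalizer constant and teardown step are placeholders, not the merged code):

// Sketch of the CAPI-style guard: keep our own finalizer on the
// OpenStackCluster until no descendant CAPI Machines remain.
func reconcileDelete(ctx context.Context, c client.Client, cluster *infrav1.OpenStackCluster) (ctrl.Result, error) {
	machines := &clusterv1.MachineList{}
	if err := c.List(ctx, machines,
		client.InNamespace(cluster.Namespace),
		client.MatchingLabels{"cluster.x-k8s.io/cluster-name": cluster.Name},
	); err != nil {
		return ctrl.Result{}, err
	}

	if len(machines.Items) > 0 {
		// Machines still need the infrastructure; check again later instead
		// of removing the finalizer.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}

	// All machines are gone: tear down OpenStack resources here, then release
	// the object. infrav1.ClusterFinalizer stands in for whatever finalizer
	// the controller actually owns.
	controllerutil.RemoveFinalizer(cluster, infrav1.ClusterFinalizer)
	return ctrl.Result{}, c.Update(ctx, cluster)
}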

@jichenjc
Contributor

jichenjc commented Jun 8, 2023

There are some comments in that function:

// The implementation does not take non-control plane workloads into consideration. This may or may not change in the future.
// Please see https://github.com/kubernetes-sigs/cluster-api/issues/2064.

and I agree, checking the number of machines seems like the right way to go.

@spjmurray
Contributor Author

I've updated to that methodology and updated the commit. Gotta say it's ostensibly the same as my initial commit 😄 I dunno if adding the context back in and making the delete function a receiver is a good thing or not... I think that's why I initially had the check at the top level 🤷🏻 Well, you have 3 commits to consider, so pick whichever is best. It tests just fine again.

Contributor

@lentzi90 lentzi90 left a comment

Looks good! Just squash the commits and I think this is good to go in 🙂

@spjmurray
Contributor Author

You know, I'm not so sure; I've seen a couple of hangs now with the CAPI kubeadm control plane controller. Perhaps there's still a race where the load balancer/API vanishes too early...

{"ts":1686232434559.7458,"caller":"cache/reflector.go:424","msg":"k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169: failed to list *v1.ClusterRole: Get \"https://185.47.227.202:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?resourceVersion=5519&timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\n","v":0}
{"ts":1686232434559.7974,"caller":"trace/trace.go:205","msg":"Trace[1708403150]: \"Reflector ListAndWatch\" name:k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169 (08-Jun-2023 13:53:44.559) (total time: 10000ms):\nTrace[1708403150]: ---\"Objects listed\" error:Get \"https://185.47.227.202:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?resourceVersion=5519&timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) 10000ms (13:53:54.559)\nTrace[1708403150]: [10.000743907s] [10.000743907s] END\n","v":0}
{"ts":1686232434559.815,"caller":"cache/reflector.go:140","msg":"k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: Get \"https://185.47.227.202:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?resourceVersion=5519&timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\n"}

Restarting the service kicks it back into life. Which ain't perfect. Perhaps blocking deletion was the right approach, rather than blocking removal of the finalizer. I shall be back once I've tested some more.

@spjmurray
Contributor Author

Right, I am definitely happy that this is hang free, for now!

Fix Deletion Deadlock

There's a race where the infrastructure can be deleted before the machines, but deletion of the machines depends on the infrastructure, and we get stuck deleting forever (unless you manually delete the machines from Nova and remove the finalizer). The simple fix is to defer deletion of the infrastructure until the machines have been purged.
Contributor

@lentzi90 lentzi90 left a comment

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2023
@tobiasgiese
Member

tobiasgiese commented Jun 9, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 9, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lentzi90, spjmurray, tobiasgiese

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [lentzi90,tobiasgiese]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lentzi90
Contributor

lentzi90 commented Jun 9, 2023

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 9, 2023
@k8s-ci-robot k8s-ci-robot merged commit 45523c8 into kubernetes-sigs:main Jun 9, 2023
@tobiasgiese
Member

Do we want to cherry-pick this fix?

@lentzi90
Contributor

lentzi90 commented Jun 9, 2023

I think so
/cherry-pick release-0.7

@k8s-infra-cherrypick-robot

@lentzi90: #1579 failed to apply on top of branch "release-0.7":

Applying: Fix Deletion Deadlock
Using index info to reconstruct a base tree...
M	controllers/openstackcluster_controller.go
Falling back to patching base and 3-way merge...
Auto-merging controllers/openstackcluster_controller.go
CONFLICT (content): Merge conflict in controllers/openstackcluster_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Fix Deletion Deadlock
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

I think so
/cherry-pick release-0.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lentzi90
Contributor

lentzi90 commented Jun 9, 2023

@spjmurray would you mind creating a PR for release-0.7 manually?

@spjmurray
Contributor Author

I knew this would happen 😄 10 minutes...

@spjmurray spjmurray deleted the fix_deadlock branch June 9, 2023 13:46