🐛 Fix deadlock #1579
Conversation
Hi @spjmurray. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
Reviewers, please note this is based on 0.7, so I'm expecting a 0.7.4 (fingers crossed)... If you want it against main, please speak up!
I was just about to ask why this PR is for release-0.7 rather than main.
Noted, I'll rebase...
Otherwise the next minor release (v0.8.0) would still have the bug 😅
Yeah yeah 😸 I'm totally gonna blame the necessity to spin up a 0.7.3-hack1, not me, oh no
/retest
Thanks for the effort so far! I have a feeling this may be solved in a nicer way with finalizers. It seems the machine controller really needs the OpenStackCluster to do its job, so perhaps it should mark it with a finalizer to make things official? We would then need to detect when the last machine is deleted and remove the finalizer at that time. Not sure if it is a good idea or not but I would love to hear opinions 😄
Believe me, this is far less effort than having to manually clean up clusters N times a day 😄 So while I agree that finalizers would be a good solution, there are caveats. This reminded me of when I attempted, and failed, to fix the identity deletion problem https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1393/files#diff-5caa59e6bfbef79676beb32f8373e1fd6631055a73a592042c5912fcc06e8817R130: you have to be careful of race conditions to do with resource generations etc. Just a viewpoint from having tried to mess with them before.
Also, detecting "I am the last machine" seems error-prone to me. The ones attached to MDs should go first, then it looks like the kubeadm CP controller deletes the CP machines in order, so it shouldn't be too risky... but there are a lot of assumptions in there.
My initial thoughts are: I would personally be ok landing this as long as we've determined there's no simpler existing solution. We could use a finalizer, but as you say I think that gets complicated. You'd presumably need a separate finalizer for each individual machine, which doesn't feel right either. Can we add a second ownerReference with blockOwnerDeletion? Would that work?
So, if I understand this right, every OSM would be "owned" by the OSC, which would prevent deletion of the OSC until all dependent OSMs are terminated 🤖. The only side effect is that deletion of the OSC would propagate down to the OSMs, outside of the control of the KCP; dunno how that'd react! That said, it has legs, so I'll give it a spin on the flipside. Gotta enjoy that sun before the UK resorts back to torrential rain.
Meh, I found time @mdbooth 😸 It appears to work from a quick test, but I'll let you know more tomorrow. Not quite sure how it's going to play out with conversion webhooks and GVKs changing all the time. I'll let you muse on that...
I have done some digging! We could probably take inspiration from CAPI for this. They use a finalizer, but not in the way I suggested. The control plane controller just checks that all machines are gone before removing its own finalizer.
There are some comments in that function, and I agree that checking the number of machines is the right way to go.
I've updated to that methodology and updated the commit. Gotta say it's ostensibly the same as my initial commit 😄 I dunno if adding the context back in, and making the delete function a receiver, is a good thing or not... think that's why I initially had the check at the top level 🤷🏻 Well, you have 3 commits to consider which is the best. Tests just fine again.
Looks good! Just squash the commits and I think this is good to go in 🙂
You know, I'm not so sure, I've seen a couple hangs now with the CAPI kubeadm control plane controller. Perhaps there's a race still where the load balancer/API vanish too early...
Restarting the service kicks it back into life, which ain't perfect. Perhaps blocking deletion was the right approach, rather than blocking removal of the finalizer. I shall be back once I've tested some more.
Right, I am definitely happy that this is hang-free, for now!
There's a race where the infrastructure can be deleted before the machines, but the deletion of machines is dependent on the infrastructure, and we get stuck deleting forever (unless you manually delete the machines from Nova and remove the finalizer). The simple fix is to defer deletion of the infrastructure until the machines have been purged.
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: lentzi90, spjmurray, tobiasgiese. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/hold cancel
Do we want to cherry-pick this fix?
I think so
@lentzi90: #1579 failed to apply on top of branch "release-0.7":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@spjmurray would you mind creating a PR for release-0.7 manually?
I knew this would happen 😄 10 minutes...
What this PR does / why we need it:
There's a race where the infrastructure can be deleted before the machines, but the deletion of machines is dependent on the infrastructure, and we get stuck deleting forever (unless you manually delete the machines from Nova and remove the finalizer). The simple fix is to defer deletion of the infrastructure until the machines have been purged.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes #1578
Special notes for your reviewer:
TODOs:
/hold