If kubelet is unavailable, AttachDetachController fails to force detach on pod deletion #65392
Maybe I hit a similar case, but not exactly the same. The general sequence is as below:
The attach/detach controller works as expected. I use version
@orainxiong is the terminationGracePeriod set inside the StatefulSet?
@saad-ali can we have an update on whether someone is working on this issue?
I can take a look at this. /assign
Hi @saad-ali, @verult and @gnufied. I have a concern about how to fix this issue. It's good not to rely on podStatus.ContainerStatuses, because in a crashed-worker scenario the kubelet is not responsive and the ContainerStatuses will remain Running forever (which prevents detaching the pod's volumes). So adding DeletionGracePeriodSeconds as another trigger to force detach the PVCs of a non-responding pod is a good idea. BUT I think we should be very careful about force detaching PVCs, because in a split-brain scenario (the worker node was not really crashed, it was just a network issue) the pod may still be running and writing IO to the PVC, and detaching the PVC may lead to data corruption. I would like to propose a safer implementation that, on the one hand, detaches volumes for pods on a genuinely crashed node, but on the other hand is also safe enough for the split-brain scenario. Thoughts?
@gnufied and I discussed this offline. Please correct me if I'm wrong, Hemant, but your argument was that setting the annotation to true is equivalent to setting TerminationGracePeriod very high, which also "disables" this detach behavior by delaying it indefinitely. On the other hand, because the default TerminationGracePeriod is 30s, we can't safely cherry-pick this PR back to older versions. One option is to use an annotation + DeletionGracePeriod for all versions prior to 1.12, and only DeletionGracePeriod for 1.12 and later.
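(For context on what "setting TerminationGracePeriod very high" means in practice, here is a rough sketch of how a user would inspect or raise terminationGracePeriodSeconds; the StatefulSet name "web" and pod name "web-0" are hypothetical.)
# check the effective grace period on a running pod
kubectl get pod web-0 -o jsonpath='{.spec.terminationGracePeriodSeconds}'
# raise it on a StatefulSet's pod template; pods pick up the new value when recreated
kubectl patch statefulset web --type merge \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":3600}}}}'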
One of the other things we could do is to only detach volumes from a node when it is known that it has been shut down, i.e. make use of the newly introduced taint to detach the volume. This does mean that volumes never get detached from unresponsive or crashed kubelet nodes, but that is probably fine. If the user really wants to force the detach, they can always do that via:
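(The exact command was not captured above. A typical manual force-detach, in line with the workaround script posted later in this thread, looks roughly like the following; the pod, namespace and node names are placeholders.)
# force-delete the pod stuck in Terminating, bypassing the unresponsive kubelet
kubectl delete pod <pod-name> --namespace <namespace> --grace-period=0 --force
# then delete the stale VolumeAttachment objects for the dead node so the volume can attach elsewhere
kubectl get volumeattachments -o json \
  | jq -r '.items[] | select(.spec.nodeName == "<node-name>") | .metadata.name' \
  | xargs -r kubectl delete volumeattachment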
This will mean we will have to roll this PR into #67977, and we obviously can't backport it. Older versions of AWS, OpenStack and vSphere are not affected by this bug, by the way, because in those cloud providers a node used to be deleted when shut down. It is only recently that this behaviour was changed in those cloud providers; the node object is no longer deleted, hence pods remain in the "Unknown" phase and the volume never gets detached. This bug affects versions as far back as 1.9 (or even before that), but the cloud provider behaviour is a recent change. Another problem is that using shutdown to detect volume detach has been feature gated as an alpha feature. I do not like the annotation option that much, tbh. It is hard to document and easy to miss, and since it is an API change, it also has to go through alpha and beta phases. Agreed that #67977 is currently feature gated too, but I do not see a strong reason why that should be feature gated. cc @yastij
@gnufied - some production clusters rely on quick reboots (i.e. reboot and still have disks attached). We might want to give them some bandwidth for the transition.
@yastij shouldn't the pod eviction timeout (5 minute default) plus the expiry window before a node is considered lost and stops sending heartbeats be enough? There is also the grace period, obviously. The reason I do not particularly like the annotation is that detaching volumes from shutdown nodes should be a default behavior and should not require user intervention. If we can differentiate between shutdown nodes and unresponsive kubelets, then we should use that information for detaching volumes, and it should be the default rather than something the user has to configure.
I agree with @gnufied on annotations. The plan is to promote this to beta in 1.13. From what I've seen, the eviction timeout impacts all pods, which might be tricky: for aggressive clusters this doesn't work, as the pod timeout and node grace period are set lower than usual. This window (1 release) ensures that everything goes smoothly.
Good arguments on both sides. On the one hand, we have to be very careful not to cause data corruption. On the other hand, we don't want to get into a state where users' data is stuck and requires manual intervention to access from new workloads. I agree with @gnufied; we should avoid adding another annotation. Whenever there is a question like this, exposing the option to the end user is the easiest way out, but not necessarily the best. We should try to be smart and do the right thing automatically on behalf of the user. The right thing in this case is tricky. If a node is not coming back, we should move the disk even if it is not safely unmounted. But how do we differentiate that from a split brain? A timer-based algorithm is pretty crude, but it's the best approximation we currently have. And I'm not aware of any way to detect a split brain vs a dead node, so I think we need to weigh the likelihood of the different scenarios happening, the worst case for each option, and how willing we are to tolerate that. I'll leave it up to you guys to weigh those tradeoffs.
@gnufied actually in OpenStack nodes are not deleted anymore (#59931). It basically means that in OpenStack, if you shut down an instance which had a pod with a PVC, it cannot fail over to another instance. However, #67977 will fix this issue. I just talked to @yastij; we should use the feature gate only for the detach delay. Then it can detach volumes in OpenStack again: if the feature gate is turned on, the delay is something like 1 min, and if not, the detach delay is something like 8-10 minutes. You are correct that in AWS this problem does not exist, but if #59930 is approved (as it should be, because other clouds behave this way and the spec says nodes should not be deleted), then AWS will get this problem as well if #67977 is not approved.
Some notes I took during the conference meeting today (@verult please add your notes as well):
Do we have any notion of the timeframe for resolving this issue? Code freeze is next week, at which point only critical/urgent bugs will be permitted into the milestone. Thank you! :)
@shay-berman PR #67977 should still be on hold, because without actual handling of pods from shutdown nodes, PR #67977 has no effect whatsoever. The volumes still will not be detached from shutdown nodes.
Can you be more specific about what kinds of "various reasons" there are? It's not obvious to me and I want to understand this basic use case.
@msau42 I can only speak for my own experience here, but I've seen behavior like this when the underlying hardware for a VM running a node fails. The cluster admin never has the opportunity to cordon/drain the node first in that scenario. This prevents the volume from being reattached to another node even if the cloud provider knows that the instance is powered down (and therefore that the volume definitely isn't in active use on that node), as to my knowledge there isn't a node taint that lets us signal this status to k8s. It's less of an issue if you have node autoscaling available (as the volume can be attached to another node as soon as the failed node is deleted), but when that's not an option, as far as I know any pods with attached PVs can't be rescheduled until the node comes back online or is removed from the cluster entirely. (It's been a while since I've run into this myself and looked into it, though; if that's no longer the case I'd be glad to be wrong about this!)
There's ongoing work on a new solution to this problem: kubernetes/enhancements#1116. However, I'm still unsure how it will work with non-cloud clusters. @msau42 Machines sometimes fail...
@adampl thanks for the link. Resolving the issue one way or another, even if limited to cloud provider nodes only, would be great. Do you know what k8s release it's targeted for? cc @smarterclayton
@alena1108 - hopefully 1.16
@NickrenREN Hi, any info about the proposal for node fencing on cloud and BM?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
@jfrederickson @saad-ali @verult
Hi all, is this issue still valid?
It's still a problem on 1.22; I didn't test on 1.23 yet. I made a very rough draft of a workaround script. It can cause data loss, so be sure to test it before using it.
#!/bin/bash
trap "exit 0" SIGINT SIGTERM
while true; do
echo "sleeping 5 sec..."
sleep 5
# automatically force remove pods in terminating state
kubectl get pods -A | awk '$4 ~/Terminating/' | awk '{print "kubectl --namespace "$1" delete pod "$2" --grace-period=0 --force"}' | xargs -I {} sh -c '{};' &
# next step is to delete volumeattachments to speed up recovery of pods instead of waiting 6 minutes
# https://github.com/ceph/ceph-csi/issues/740
node_offline=$(kubectl get node | awk '$2 ~/NotReady/' | awk '{print $1}' | head -n 1)
kubectl get volumeattachments.storage.k8s.io -o json | jq -r '.items[] | select (.spec.nodeName == "'$node_offline'") | .metadata.name' | xargs -I {} sh -c 'kubectl delete volumeattachments.storage.k8s.io {};' &
done |
We now have a new alpha feature, "Non-graceful node shutdown", which can mostly help with this situation. Please check out the KEP for details: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown
/assign |
The non-graceful node shutdown feature is beta now: https://kubernetes.io/blog/2022/12/16/kubernetes-1-26-non-graceful-node-shutdown-beta/ You can use this feature to resolve the problem. |
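(For reference, the non-graceful shutdown flow is triggered by applying the out-of-service taint to a node that has been confirmed to be powered off; a rough sketch, with the node name as a placeholder and the taint value as documented in the KEP/blog post:)
# after confirming the node is really shut down, mark it out of service so pods and volume attachments get cleaned up
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
# remove the taint once cleanup is done and the node is recovered
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-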
/close
Re-open if this is still an issue.
@xing-yang: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind bug
What you expected to happen:
When a pod using an attached volume is deleted (gracefully) but kubelet in the corresponding node is down, the AttachDetachController should assume the node is unrecoverable after a timeout (currently 6min) and forcefully detach the volume.
What happened:
The volume is never detached.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
This doesn't happen if the pod is force-deleted.
It's likely due to the last condition checked in this line. Once kubelet is down, the container status is no longer reported correctly. Inside the AttachDetachController, this function is called by the pod update informer handler, which sets whether the volume should be attached in the desired state of the world. On pod force deletion, the pod object is deleted immediately, and this triggers the pod delete informer handler, which doesn't call this function.
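(For illustration, a minimal reproduction along the lines described above might look like the following; the manifest and node/pod names are placeholders, and SSH access to the worker node is assumed.)
# 1. create a pod that mounts a PVC (any pod or StatefulSet with a persistentVolumeClaim will do)
kubectl apply -f pod-with-pvc.yaml
# 2. stop kubelet on the node where the pod is running
ssh <node-name> sudo systemctl stop kubelet
# 3. delete the pod gracefully (no --force)
kubectl delete pod <pod-name>
# 4. watch the VolumeAttachment objects; with this bug, the attachment is never removed
kubectl get volumeattachments -w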
/sig storage
/cc @saad-ali @gnufied @jingxu97 @NickrenREN
/assign