CSI node unpublish/unstage should avoid requiring server #10833
Comments
I don't know much about the cloud storage providers/plugins, but these are my thoughts coming from Linux HA Pacemaker. A lost node must be presumed to be in an undefined state. It could still be running the workload (writing to the volumes) and we merely lost the connection, or it could have caught fire and be gone forever. As far as I understand the CSI spec, only the node that published the volume can unpublish it, so it is impossible for the cluster to guarantee that the volume is unpublished.

In Pacemaker the solution is to fence the node by forcefully cutting its connection to the storage (e.g. LUN masking) or by outright killing it (hard poweroff, STONITH). Once fencing succeeds, the cluster knows the state of the node (it's dead) and can reuse the volume (depending on the storage, the volume could be forcefully unpublished with a special command).

Because Nomad does not do fencing, the only reasonable thing to do is to repeatedly try to unpublish the volume with the help of the controller (if that is gone too, we are out of luck). The responsibility then lies with the storage provider (via the CSI plugin) to do the fencing itself. After the storage provider has fenced the lost node, the unpublish on the controller will succeed and Nomad can start the lost tasks somewhere else. Until then the tasks would not run, which is the only safe option. In this case the last workflow would eventually succeed. As long as the server only acts after an OK from the controller, it should be safe.
Hi @NomAnor! Yes, you've hit on my major criticism of the CSI specification, which is that the spec requires that node unpublish happen before controller unpublish (ref).
But as you've noted, this doesn't actually work. We have to call the node unpublish RPC from the client where the node plugin is running, but there's no way to guarantee that the client that makes that RPC call to the node plugin can inform us of success. Which means we're going to violate the specification, and I think you're on the right track here as to how to do it:
So effectively what we'll do is call the node unpublish from the client side without coordinating with the server. If the client is still connected and can tell us that the node unpublish succeeded, that's great. If not, the server will retry the controller unpublish until it succeeds. I've got the start of this work going in #11892
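To make the retry idea concrete, here is a minimal Go sketch of a server-side loop that keeps calling the controller unpublish until it succeeds, with capped exponential backoff. The `controllerDetach` function type and `retryControllerUnpublish` helper are illustrative stand-ins, not Nomad's actual internals.

```go
package unpublish

import (
	"context"
	"fmt"
	"time"
)

// controllerDetach stands in for the ClientCSI.ControllerDetachVolume RPC
// mentioned in the issue body; the signature is an assumption for this sketch.
type controllerDetach func(ctx context.Context, volID, nodeID string) error

// retryControllerUnpublish keeps calling the controller unpublish until it
// succeeds or the context is cancelled, backing off between attempts.
func retryControllerUnpublish(ctx context.Context, detach controllerDetach, volID, nodeID string) error {
	backoff := time.Second
	const maxBackoff = time.Minute

	for {
		err := detach(ctx, volID, nodeID)
		if err == nil {
			return nil // the storage provider has released (or fenced) the volume
		}

		select {
		case <-ctx.Done():
			return fmt.Errorf("giving up on controller unpublish for %s: %w", volID, err)
		case <-time.After(backoff):
		}

		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```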
The order of calls makes sense, because the client has to do some work to unpublish the volume (probably unmount it). You can't safely unpublish it on the provider side (essentially cutting access) before the client is finished. But some explanation of what to do when the client can no longer unpublish would be helpful. To be faithful to the spec, if the client is still connected, the server should wait for the unpublish call to succeed. Otherwise the only choice is to give up, or to repeatedly try the unpublish on the controller until it succeeds. As a user I would expect the latter, as that has the potential to resolve the issue automatically.

With filesystems that allow shared access, a lost client might not be fatal and the volume can be mounted somewhere else without a proper unpublish. But my use case would be shared raw storage with a normal (e.g. ext4) filesystem, where shared access is a death sentence for the data. Without an unpublish, a publish somewhere else will fail. Fencing is then the last-resort forceful unpublish (done internally by the storage provider, not a concern for the scheduler).

A retry period or backoff factor as a configuration option might be useful, and even an option to disable that behaviour per plugin. Does anyone know how Kubernetes, or other CSI orchestrators in general, handle a lost client as far as unpublishing volumes is concerned?
That does assume that the controller unpublish can be done safely even if the node unpublish hasn't quite finished yet, but honestly at this point if a plugin doesn't tolerate that then it's just broken by design even if it meets the spec precisely. 😀
Yeah, we provided `stop_on_client_disconnect` for that case. That being said, I really like the idea of having a per-plugin configuration. We often think of the "cluster operator" and the "job submitter" as two different users in most orgs, and I think the vision was that plugins would be owned by the cluster operator (because they're privileged). Having the configuration on the plugin would let the operator own the correctness of stopping on disconnect.
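As a rough illustration of what a per-plugin configuration might look like, here is a hypothetical Go struct. None of these fields exist in Nomad; they simply name the knobs discussed above (disable-on-disconnect, retry period, backoff factor).

```go
package csiconfig

import "time"

// CSIPluginConfig is a hypothetical per-plugin configuration block; these
// fields do not exist in Nomad and only make the discussion above concrete.
type CSIPluginConfig struct {
	// UnpublishOnDisconnect controls whether the server keeps retrying the
	// controller unpublish after the client node that mounted the volume is lost.
	UnpublishOnDisconnect bool

	// UnpublishRetryInterval is the initial delay between retry attempts.
	UnpublishRetryInterval time.Duration

	// UnpublishRetryBackoff multiplies the delay after each failed attempt.
	UnpublishRetryBackoff float64

	// UnpublishRetryMaxInterval caps the delay between attempts.
	UnpublishRetryMaxInterval time.Duration
}
```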
The plugins generally can't handle a lost client because they don't know the client is lost. K8s has an internal attachdetach controller. The PVC has a status, which the attachdetach controller uses to checkpoint the detachment process, similar to the checkpointing we do on the Nomad server.
This issue has been addressed in #11892, although there are still some open items around concurrency in #8609 to work out. We'll be issuing a patch release shortly for that fix. I'm going to close this issue. Follow #10927 (comment) for more information.
When an allocation that claims a CSI volume stops, the `csi_hook.go` postrun logic fires to tell the server to unpublish the claim via `CSIVolume.Unpublish`. The server makes an RPC call back to the client to tell the node plugins on the client to unpublish and unstage the volume, and only once that RPC returns ok does it return ok to the `csi_hook`. The advantage of this is that we can checkpoint the state of the unpublish workflow on the server and resume it if it fails at any point. [Diagram of this workflow omitted; see the source link in the original issue.]
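To illustrate the checkpointing idea, here is a hedged Go sketch of a claim state machine the server could persist and resume from. The state names and `resumeUnpublish` helper are invented for this example and are not taken from Nomad's source.

```go
package claims

import "fmt"

// claimState models the checkpoints of the unpublish workflow described
// above; the names are invented for this sketch, not taken from Nomad.
type claimState int

const (
	claimPublished          claimState = iota // allocation still holds the claim
	claimNodeDetached                         // node unpublish/unstage has completed
	claimControllerDetached                   // controller unpublish has completed
	claimFreed                                // claim removed from server state
)

// resumeUnpublish resumes the workflow from the last checkpoint the server
// persisted, so a crash or failed RPC never repeats steps that already finished.
func resumeUnpublish(s claimState) {
	switch s {
	case claimPublished:
		fmt.Println("RPC the client: node unpublish + unstage, then checkpoint")
		fallthrough
	case claimNodeDetached:
		fmt.Println("RPC the controller plugin: controller unpublish, then checkpoint")
		fallthrough
	case claimControllerDetached:
		fmt.Println("free the claim in server state")
	case claimFreed:
		// nothing left to do
	}
}
```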
A problem arises when the client node becomes lost. In this case, the client can't reach the server and the `CSIVolume.Unpublish` call fails. Hypothetically, we could have the server simply declare the node lost and attempt to send a `ClientCSI.ControllerDetachVolume` to the controller plugin (assuming it's not also lost!). But in practice this will always fail because the volume has not been unmounted from the client first. This appears to be the underlying issue in #10052. [Diagram of this workflow omitted; see the source link in the original issue.]
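For reference, this is the teardown ordering the CSI specification expects, sketched against the spec's Go bindings. The wiring (connections, paths, helper name) is assumed for illustration; the point is that the controller-side detach in step 3 comes only after steps 1 and 2 have run on the (possibly lost) client.

```go
package csiteardown

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

// teardownOrder sketches the sequence the CSI spec expects when a volume is
// released: node unpublish and unstage on the client that mounted it, and only
// then the controller-side unpublish. If the client is lost, the first two
// calls can never be confirmed, which is exactly the gap described above.
func teardownOrder(ctx context.Context, nodeConn, ctrlConn *grpc.ClientConn,
	volID, targetPath, stagingPath, nodeID string) error {

	node := csi.NewNodeClient(nodeConn)
	ctrl := csi.NewControllerClient(ctrlConn)

	// 1. Unmount the per-allocation target path on the client.
	if _, err := node.NodeUnpublishVolume(ctx, &csi.NodeUnpublishVolumeRequest{
		VolumeId:   volID,
		TargetPath: targetPath,
	}); err != nil {
		return err
	}

	// 2. Unstage the volume from the client's staging path.
	if _, err := node.NodeUnstageVolume(ctx, &csi.NodeUnstageVolumeRequest{
		VolumeId:          volID,
		StagingTargetPath: stagingPath,
	}); err != nil {
		return err
	}

	// 3. Only now does the spec allow detaching on the controller side.
	_, err := ctrl.ControllerUnpublishVolume(ctx, &csi.ControllerUnpublishVolumeRequest{
		VolumeId: volID,
		NodeId:   nodeID,
	})
	return err
}
```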
The original design of Nomad's CSI support had the node plugin unpublish workflow happen client-side, without having to communicate with the server. This allowed `stop_on_client_disconnect` to work usefully with CSI volumes. It was changed in #7596 (and revised in #8580) because we couldn't come up with a way of safely handling the case where a node plugin stopped on the client concurrently with a volume-claiming allocation. That workflow also made it extremely hard to reason about the server state, and the change made it comprehensible and checkpointed. However, it left the bug in the workflow described above.

A workflow like the one below would fix the bug but would create new challenges in ensuring correct behavior in the server. I don't have an answer for that complexity, but wanted to make sure the problem space was well documented here. [Diagram of the proposed workflow omitted; see the source link in the original issue.]
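A hedged Go sketch of what that client-side workflow could look like: the node-plugin teardown runs locally in the allocation's postrun hook, and notifying the server is best-effort, with the server expected to keep retrying the controller unpublish on its own if the notification never arrives. The interfaces and `Postrun` function here are assumptions for illustration, not the actual implementation in #11892.

```go
package csihook

import (
	"context"
	"log"
)

// nodeUnpublisher and claimNotifier are stand-ins for the client's node-plugin
// RPC wrapper and the CSIVolume.Unpublish server RPC; both names are assumptions.
type nodeUnpublisher interface {
	NodeUnpublish(ctx context.Context, volID, allocID string) error
	NodeUnstage(ctx context.Context, volID string) error
}

type claimNotifier interface {
	Unpublish(ctx context.Context, volID, allocID string) error
}

// Postrun sketches the client-side workflow proposed in this issue: do the
// node-plugin teardown locally, then tell the server on a best-effort basis.
func Postrun(ctx context.Context, node nodeUnpublisher, server claimNotifier, volID, allocID string) error {
	if err := node.NodeUnpublish(ctx, volID, allocID); err != nil {
		return err
	}
	if err := node.NodeUnstage(ctx, volID); err != nil {
		return err
	}

	// Best effort: a failure here must not block stopping the allocation;
	// the server retries the controller unpublish until it succeeds.
	if err := server.Unpublish(ctx, volID, allocID); err != nil {
		log.Printf("could not notify server of unpublish for %s: %v", volID, err)
	}
	return nil
}
```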