CSI node unpublish/unstage should avoid requiring server #10833

Closed
tgross opened this issue Jun 30, 2021 · 6 comments

tgross (Member) commented Jun 30, 2021

When an allocation that claims a CSI volume stops, the csi_hook.go fires to tell the server to unpublish the claim via CSIVolume.Unpublish. The server makes an RPC call back to the client to tell the Node plugins on the client to unpublish and unstage the volume, and only once that RPC returns ok does it return ok to the csi_hook. The advantage of this is that we can checkpoint the state of the unpublish workflow on the server and resume it if it fails at any point. This workflow is shown below:

[Sequence diagram: wsd1-normal-stop]
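
To make the moving parts concrete, here is a minimal Go sketch of the call ordering described above. The interfaces, function names, and signatures are stand-ins invented for illustration; only the components named in the text (the csi_hook, CSIVolume.Unpublish, the node plugin's unpublish/unstage RPCs, and the controller detach) correspond to real pieces of Nomad.

```go
// Illustrative sketch only: the types and signatures below are invented
// stand-ins that show the ordering of the server-driven unpublish workflow,
// not Nomad's actual internal APIs.
package main

import "fmt"

// nodePlugin stands in for the CSI node plugin RPCs the server invokes back
// on the client that ran the allocation.
type nodePlugin interface {
	NodeUnpublishVolume(volID, allocID string) error
	NodeUnstageVolume(volID string) error
}

// controllerPlugin stands in for the controller detach RPC
// (ClientCSI.ControllerDetachVolume in Nomad's case).
type controllerPlugin interface {
	ControllerUnpublishVolume(volID, nodeID string) error
}

// serverUnpublish models what the server does when csi_hook calls
// CSIVolume.Unpublish: node-side cleanup first, then the controller detach,
// checkpointing between steps so a failed workflow can be resumed.
func serverUnpublish(node nodePlugin, ctrl controllerPlugin,
	volID, allocID, nodeID string, checkpoint func(step string)) error {

	// If the client is unreachable, this is the step that fails and
	// blocks the rest of the workflow.
	if err := node.NodeUnpublishVolume(volID, allocID); err != nil {
		return fmt.Errorf("node unpublish: %w", err)
	}
	checkpoint("node-unpublished")

	if err := node.NodeUnstageVolume(volID); err != nil {
		return fmt.Errorf("node unstage: %w", err)
	}
	checkpoint("node-unstaged")

	if err := ctrl.ControllerUnpublishVolume(volID, nodeID); err != nil {
		return fmt.Errorf("controller unpublish: %w", err)
	}
	checkpoint("claim-released")
	return nil
}

// Trivial stubs so the sketch runs end to end.
type okNode struct{}

func (okNode) NodeUnpublishVolume(volID, allocID string) error { return nil }
func (okNode) NodeUnstageVolume(volID string) error            { return nil }

type okController struct{}

func (okController) ControllerUnpublishVolume(volID, nodeID string) error { return nil }

func main() {
	err := serverUnpublish(okNode{}, okController{}, "vol1", "alloc1", "node1",
		func(step string) { fmt.Println("checkpoint:", step) })
	fmt.Println("result:", err)
}
```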

A problem arises in cases where the client node becomes lost. In this case, the client can't reach the server and the CSIVolume.Unpublish fails. Hypothetically, we could have the server simply declare the node lost and attempt to send a ClientCSI.ControllerDetachVolume to the controller plugin (assuming it's not also lost!). But in practice this will always fail because the volume has not been unmounted from the client first. This appears to be the underlying issue in #10052. This workflow is shown below:

[Sequence diagram: wsd2-failed-node]

The original design of Nomad's CSI support had the node plugin unpublish workflow happen client-side, without having to communicate with the server. This allows stop_after_client_disconnect to work usefully with CSI volumes. This was changed in #7596 (and revised in #8580) because we couldn't come up with a way of safely handling the case where a node plugin stopped on the client concurrently with a volume-claiming allocation. That workflow also made it extremely hard to reason about the server state, and the change made it comprehensible and checkpointed. However, it left a bug in the workflow described above.

A workflow like the one below would fix the bug but would create new challenges in ensuring correct behavior in the server. I don't have an answer for that complexity but wanted to make sure the problem space was well documented here.

[Sequence diagram: wsd3-proposed-fail]

NomAnor (Contributor) commented Jan 21, 2022

I don't know much about the cloud storage providers/plugins, but these are my thoughts, coming from Linux HA Pacemaker.

A lost node must be presumed to be in an undefined state. It could still be running the workload (writing to the volumes) and we just lost the connection, or it could have caught fire and be gone forever.

As far as I understand the CSI spec, only the node that published the volume can unpublish it, so it is impossible for the cluster to guarantee that the volume is unpublished.

In Pacemaker the solution is to fence the node by forcefully cutting its connection to the storage (e.g. LUN masking) or outright killing it (hard poweroff, STONITH). After the fencing is successful, the cluster knows the state of the node (it's dead) and it can reuse the volume (depending on the storage, the volume could be forcefully unpublished with a special command).

Because Nomad does not do fencing, the only reasonable thing to do would be to repeatedly try to unpublish the volume with the help of the controller (if that is gone too, we are out of luck). The responsibility then lies with the storage provider (via the CSI plugin) to do the fencing itself. After the storage provider has fenced the lost node, the unpublish on the controller would succeed and Nomad can start the lost tasks somewhere else. Until then the tasks would not run, which is the only safe option.

In this case the last workflow would eventually succeed. As long as the server only acts after an ok from the controller it should be safe.
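
A rough Go sketch of that "retry the controller until the provider has fenced the node" idea is below. It is purely illustrative: the function names, signatures, and backoff values are invented for this thread, not taken from Nomad.

```go
// Hypothetical sketch of server-side behavior for a lost node: keep retrying
// the controller unpublish with backoff until the storage provider has
// fenced the node and the detach succeeds. Names and values are invented.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// controllerDetach stands in for the controller-plugin detach RPC.
type controllerDetach func(ctx context.Context, volID, nodeID string) error

// retryControllerUnpublish retries until the controller reports success,
// backing off exponentially up to a cap. The claim stays in place (and the
// replacement allocation stays blocked) until this returns nil.
func retryControllerUnpublish(ctx context.Context, detach controllerDetach, volID, nodeID string) error {
	backoff := time.Second
	const maxBackoff = time.Minute

	for {
		err := detach(ctx, volID, nodeID)
		if err == nil {
			return nil // provider has fenced the node; safe to reschedule
		}

		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up waiting for controller unpublish: %w", err)
		case <-time.After(backoff):
		}

		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}

func main() {
	// Demo: fail twice, then succeed, as a fenced provider eventually would.
	attempts := 0
	detach := func(ctx context.Context, volID, nodeID string) error {
		attempts++
		if attempts < 3 {
			return errors.New("volume still attached to lost node")
		}
		return nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	fmt.Println(retryControllerUnpublish(ctx, detach, "vol1", "lost-node"))
}
```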

tgross (Member, Author) commented Jan 24, 2022

Hi @NomAnor! Yes you've hit on my major criticism of the CSI specification, which is that the spec requires that node unpublish happens before controller unpublish. ref ControllerUnpublishVolume:

> It MUST be called after all NodeUnstageVolume and NodeUnpublishVolume on the volume are called and succeed.

But as you've noted, this doesn't actually work. We have to call the node unpublish RPC from the client where the node plugin is running, but there's no way to guarantee that the client that makes that RPC call to the node plugin can inform us of success. Which means we're going to violate the specification, and I think you're on the right track here as to how to do it:

> Because Nomad does not do fencing, the only reasonable thing to do would be to repeatedly try to unpublish the volume with the help of the controller (if that is gone too, we are out of luck). The responsibility then lies with the storage provider (via the CSI plugin) to do the fencing itself. After the storage provider has fenced the lost node, the unpublish on the controller would succeed and Nomad can start the lost tasks somewhere else. Until then the tasks would not run, which is the only safe option.

So effectively what we'll do is call the node unpublish from the client side without coordinating with the server. If the client is still connected and can tell us that the node unpublish succeeded, that's great. If not, the server will retry the controller unpublish until it succeeds. I've got the start of this work going in #11892
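
Sketched in Go, that revised division of labor might look roughly like this: the client performs its node-plugin cleanup locally and only reports back to the server on a best-effort basis, while the server's controller-unpublish retry covers the case where that report never arrives. All of the names below are invented for illustration; this is not the #11892 implementation.

```go
// Rough illustration of the revised split: the client unpublishes/unstages
// locally without needing the server, and confirms success only on a
// best-effort basis; the server retries the controller detach regardless.
// Names and signatures are invented for this sketch.
package main

import (
	"context"
	"fmt"
	"log"
)

type nodePlugin interface {
	NodeUnpublishVolume(ctx context.Context, volID, allocID string) error
	NodeUnstageVolume(ctx context.Context, volID string) error
}

// clientSideUnpublish runs in the allocation stop hook on the client. It
// never blocks on the server: it does local cleanup and then tries, best
// effort, to tell the server the node side is done.
func clientSideUnpublish(ctx context.Context, plugin nodePlugin,
	notifyServer func(volID string) error, volID, allocID string) {

	if err := plugin.NodeUnpublishVolume(ctx, volID, allocID); err != nil {
		log.Printf("node unpublish failed (relying on server-side retry): %v", err)
	}
	if err := plugin.NodeUnstageVolume(ctx, volID); err != nil {
		log.Printf("node unstage failed (relying on server-side retry): %v", err)
	}
	// Best effort only: if the server is unreachable, the server's own
	// controller-unpublish retry loop eventually reconciles the claim.
	if err := notifyServer(volID); err != nil {
		log.Printf("could not reach server to confirm unpublish: %v", err)
	}
}

// Trivial stubs so the sketch runs end to end.
type noopPlugin struct{}

func (noopPlugin) NodeUnpublishVolume(ctx context.Context, volID, allocID string) error { return nil }
func (noopPlugin) NodeUnstageVolume(ctx context.Context, volID string) error            { return nil }

func main() {
	clientSideUnpublish(context.Background(), noopPlugin{},
		func(volID string) error { return fmt.Errorf("server unreachable") },
		"vol1", "alloc1")
}
```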

NomAnor (Contributor) commented Jan 24, 2022

The order of calls makes sense because the client has to do some work to unpublish the volume (probably unmounting it). You can't safely unpublish it on the provider side (essentially cutting access) before the client is finished.

But some explanation of what to do if the client can't unpublish any more would be helpful.

To be faithful to the spec, if the client is still connected, the server should wait for the unpublish call to succeed. But otherwise the only choice is to give up or repeatedly try to unpublish on the controller until it succeeds. As a user I would expect the latter, as that has the potential to automatically resolve the issue.

With filesystems that allow shared access, a lost client might not be fatal and the volume can be mounted somewhere else without a proper unpublish. But my use case would be shared raw storage with a normal (e.g. ext4) filesystem, where shared access is a death sentence for the data. Without an unpublish, a publish somewhere else will fail. Fencing is then the last-resort forceful unpublish (done internally by the storage provider, not a concern for the scheduler).

A period or backoff factor as a configuration option might be useful, and maybe even an option to disable that behaviour per plugin.

Does someone know how Kubernetes, or other CSI plugins in general, handle a lost client as far as unpublishing the volumes is concerned?

tgross (Member, Author) commented Jan 26, 2022

> To be faithful to the spec, if the client is still connected, the server should wait for the unpublish call to succeed. But otherwise the only choice is to give up or repeatedly try to unpublish on the controller until it succeeds. As a user I would expect the latter, as that has the potential to automatically resolve the issue.

That does assume that the controller unpublish can be done safely even if the node unpublish hasn't quite finished yet, but honestly at this point if a plugin doesn't tolerate that then it's just broken by design even if it meets the spec precisely. 😀

> With filesystems that allow shared access, a lost client might not be fatal and the volume can be mounted somewhere else without a proper unpublish. But my use case would be shared raw storage with a normal (e.g. ext4) filesystem, where shared access is a death sentence for the data. Without an unpublish, a publish somewhere else will fail. Fencing is then the last-resort forceful unpublish (done internally by the storage provider, not a concern for the scheduler).

> A period or backoff factor as a configuration option might be useful, and maybe even an option to disable that behaviour per plugin.

Yeah, we provided the stop_after_client_disconnect option to help with this scenario, but the current Unpublish workflow actually breaks its usefulness for CSI. #11892 will fix that.

That being said, I really like the idea of having a per-plugin configuration. We often think of the "cluster operator" and the "job submitter" as two different users in most orgs, and I think the vision was that plugins would be owned by the cluster operator (because they're privileged). Having the configuration on the plugin would let the operator own the correctness of stopping on disconnect.
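
As a purely hypothetical illustration of what a per-plugin knob could look like, the sketch below defines plugin-scoped settings for the retry behavior. None of these fields exist in Nomad; they only show the shape such operator-owned configuration might take.

```go
// Purely hypothetical sketch of plugin-scoped settings for the unpublish
// retry behavior discussed above. None of these fields exist in Nomad; they
// only illustrate where such knobs could live, owned by the cluster
// operator alongside the plugin registration.
package main

import (
	"fmt"
	"time"
)

// CSIPluginUnpublishConfig is an imaginary per-plugin configuration block.
type CSIPluginUnpublishConfig struct {
	// RetryOnLostNode, if false, disables server-side retries of the
	// controller unpublish when the client node is lost.
	RetryOnLostNode bool

	// RetryBaseInterval is the initial delay between retries.
	RetryBaseInterval time.Duration

	// RetryMaxInterval caps the exponential backoff.
	RetryMaxInterval time.Duration
}

func main() {
	cfg := CSIPluginUnpublishConfig{
		RetryOnLostNode:   true,
		RetryBaseInterval: 5 * time.Second,
		RetryMaxInterval:  2 * time.Minute,
	}
	fmt.Printf("%+v\n", cfg)
}
```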

> Does someone know how Kubernetes, or other CSI plugins in general, handle a lost client as far as unpublishing the volumes is concerned?

The plugins generally can't handle a lost client because they don't know the client is lost. K8s has an internal attachdetach controller loop that watches for updates to pods, nodes, and persistent volume claims, and ultimately calls its reconcile method. It looks like K8s triggers the node detach from the server and eventually gives up if it fails too many times. I did a little digging and found a blog post suggesting the typical troubleshooting at that point is to have the operator manually detach the volume in the cloud provider, but that completely ignores that you typically can't do that if the volume is still mounted and in use on the client.

The PVC has a status, which the attachdetach controller must use to checkpoint the detachment process similar to what we do in CSIVolume.Unpublish (although I can't find it), because otherwise there's no obvious way for K8s to coordinate with the controller RPC workflow. In K8s the controller RPCs are all currently handled with out-of-tree sidecars (ex. https://github.com/kubernetes-csi/external-provisioner) because their implementation isn't ready to be moved in-tree (I'm not sure if this is a political or technical constraint though).

tgross self-assigned this Jan 26, 2022
tgross added this to the 1.2.5 milestone Jan 28, 2022
tgross (Member, Author) commented Jan 28, 2022

This issue has been addressed in #11892, although there are still some open items around concurrency in #8609 to work out. We'll be issuing a patch release shortly for that fix. I'm going to close this issue.

Follow #10927 (comment) for more information.
