
CSI: multi-node-multi-writer fails with an operation with the given Volume ID already exists #15197

Open
JohnKiller opened this issue Nov 10, 2022 · 1 comment
Labels: stage/accepted, theme/storage, type/bug

Nomad version

Nomad v1.4.2

Operating system and Environment details

Linux 5.10.0-18-amd64 SMP Debian 5.10.140-1 (2022-09-02) x86_64 GNU/Linux
Ceph 17.2.4
ceph-csi plugin 3.7.2

Issue

I'm using ceph-csi with CephFS since it supports multi-node-multi-writer. When allocations are placed on the same node and started too close together, the plugin fails with the error "an operation with the given Volume ID already exists".

I've opened an issue with the plugin authors: ceph/ceph-csi#3511

They say:

As per the CSI specification, NodeStage should make sure the volume is mounted on the given node only once; if it is already mounted we should return success (https://github.com/container-storage-interface/spec/blob/master/spec.md#nodestagevolume). To avoid the consequences of mounting the same volume twice, we have a lock per volume at the CSI driver. You will see the "operation already exists" error until the ongoing first request completes.
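
To make the failure mode concrete, here is a minimal Go sketch of the pattern the plugin authors describe (illustrative only, not ceph-csi's actual code; the nodeServer type and its fields are made up): a per-volume lock that rejects a second concurrent NodeStageVolume call with gRPC ABORTED, plus an idempotent success path for a volume that is already staged.

package main

import (
	"fmt"
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type nodeServer struct {
	mu       sync.Mutex
	inFlight map[string]bool // volume IDs with an operation in progress
	staged   map[string]bool // volume IDs already staged on this node
}

func (ns *nodeServer) NodeStageVolume(volumeID string) error {
	ns.mu.Lock()
	if ns.inFlight[volumeID] {
		ns.mu.Unlock()
		// This is the error Nomad sees when two allocations try to
		// stage the same volume at nearly the same time.
		return status.Errorf(codes.Aborted,
			"an operation with the given Volume ID %s already exists", volumeID)
	}
	if ns.staged[volumeID] {
		ns.mu.Unlock()
		return nil // already staged: idempotent success per the CSI spec
	}
	ns.inFlight[volumeID] = true
	ns.mu.Unlock()

	defer func() {
		ns.mu.Lock()
		delete(ns.inFlight, volumeID)
		ns.mu.Unlock()
	}()

	// ... perform the actual mount here ...

	ns.mu.Lock()
	ns.staged[volumeID] = true
	ns.mu.Unlock()
	return nil
}

func main() {
	ns := &nodeServer{inFlight: map[string]bool{}, staged: map[string]bool{}}
	fmt.Println(ns.NodeStageVolume("testvolume")) // <nil>
	fmt.Println(ns.NodeStageVolume("testvolume")) // <nil> again: idempotent
}

A second caller that arrives while the first mount is still in flight hits the inFlight branch and gets exactly the error reported above; once the first request completes, a retry would succeed via the staged branch.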

Reproduction steps

This error can be reproduced in two ways:

  • If multiple jobs use the same volume, a node drain will cause their allocations to be rescheduled. If they land on the same node, they start too close together and one of them fails.
  • Alternatively, run a single job with a count high enough that several allocations are placed on the same node (the job file below uses count = 10).

Job file

job "site-test" {
	datacenters = ["dc1"]
	type = "service"

	group "test" {
		count = 10

		volume "testvolume" {
			type = "csi"
			source = "testvolume"
			access_mode = "multi-node-multi-writer"
			attachment_mode = "file-system"
		}

		task "test" {
			driver = "docker"
			
			config {
				image = "omitted"

				ports = ["http"]
			}

			volume_mount {
				volume = "testvolume"
				destination = "/var/www/"
			}

			resources {
				memory = 128
			}
		}

		network {
			mbits = 10
			port "http" {
				to = 80
				host_network = "private"
			}
		}
	}
}
tgross (Member) commented Nov 23, 2022

Hi @JohnKiller! I took a quick look at this and we already have logic in our mounter to avoid re-staging a volume that's already been staged on a given node (see volume.go#L165-L168). But you're right that we're not currently serializing requests in our CSI client (see client.go#L752).

That said, Nomad is out-of-spec on CSI concurrency:

In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call “in-flight” per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The plugin SHOULD handle this as gracefully as possible. The error code ABORTED MAY be returned by the plugin in this case (see the Error Scheme section for details).

So we can detect an "already staged" condition but not a "staging at the same time" condition. And I guess Ceph has chosen to return ABORTED here, which means they're meeting the minimum spec for graceful handling (as in, at least they're not crashing). We'll need to fix this for sure, probably by beefing up our client to allow a per-volume queue of requests that collapses duplicate NodeStage calls but keeps multiple NodePublish calls waiting.
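
A hypothetical sketch of that direction in Go, using golang.org/x/sync/singleflight (this is not Nomad's actual client code, and the csiClient type and method names are made up): concurrent NodeStage calls for the same volume ID collapse onto a single in-flight plugin RPC, so later callers wait for the first result instead of tripping the driver's per-volume lock.

package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

type csiClient struct {
	sf singleflight.Group
}

// nodeStageVolume stands in for the real gRPC call to the plugin.
func (c *csiClient) nodeStageVolume(volumeID string) error {
	time.Sleep(100 * time.Millisecond) // pretend the mount takes a while
	fmt.Println("plugin staged", volumeID)
	return nil
}

// NodeStageVolume collapses concurrent callers for the same volume
// onto one in-flight request instead of letting them race each other.
func (c *csiClient) NodeStageVolume(volumeID string) error {
	_, err, _ := c.sf.Do(volumeID, func() (interface{}, error) {
		return nil, c.nodeStageVolume(volumeID)
	})
	return err
}

func main() {
	c := &csiClient{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // ten allocations, as in the repro job
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := c.NodeStageVolume("testvolume"); err != nil {
				fmt.Println("stage failed:", err)
			}
		}()
	}
	wg.Wait() // the plugin is called once; all ten callers succeed
}

With something like this in place, the ten allocations from the repro job would trigger a single staging RPC per node, and the remaining callers would simply share its result.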

Thanks for opening this issue @JohnKiller! I'll mark it for roadmapping.

tgross added the stage/accepted label Nov 23, 2022
matthdsm added a commit to nextflow-io/nf-nomad that referenced this issue Aug 28, 2024
This should fix a concurrency issue with the CSI driver
ceph/ceph-csi#3511
hashicorp/nomad#15197
abhi18av added a commit to nextflow-io/nf-nomad that referenced this issue Aug 28, 2024
* Allow 1 restart per task

This should fix a concurrency issue with the CSI driver
ceph/ceph-csi#3511
hashicorp/nomad#15197

* expose the reschedule and restart config vars

* remove unused import

---------

Co-authored-by: Jorge <jorge@edn.es>
Co-authored-by: Abhinav Sharma <abhi18av@outlook.com>