
CSI: multi-node-multi-writer fails with an operation with the given Volume ID already exists #15197

Open
JohnKiller opened this issue Nov 10, 2022 · 1 comment
Labels: stage/accepted, theme/storage, type/bug

Nomad version

Nomad v1.4.2

Operating system and Environment details

Linux 5.10.0-18-amd64 SMP Debian 5.10.140-1 (2022-09-02) x86_64 GNU/Linux
Ceph 17.2.4
ceph-csi plugin 3.7.2

Issue

I'm using ceph-csi with CephFS since it supports multi-node-multi-writer. When allocations are placed on the same node and started too close together, the plugin fails with the error "an operation with the given Volume ID already exists".

I've opened an issue with the plugin authors: ceph/ceph-csi#3511

They say:

As per the CSI specification, NodeStage should make sure the volume is mounted on the given node only once; if it is already mounted we should return success (https://github.com/container-storage-interface/spec/blob/master/spec.md#nodestagevolume). To avoid the consequences of mounting the same volume twice, we have a lock per volume at the CSI driver. You will see the "operation already exists" error until the ongoing first request completes.
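
To make the failure mode concrete, here is a minimal Go sketch of the pattern the plugin authors describe (illustrative only, not ceph-csi's actual code; the nodeServer type and its fields are made up): a per-volume lock that rejects a second concurrent NodeStageVolume call with gRPC ABORTED, plus an idempotent success path for a volume that is already staged.

package main

import (
	"fmt"
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type nodeServer struct {
	mu       sync.Mutex
	inFlight map[string]bool // volume IDs with an operation in progress
	staged   map[string]bool // volume IDs already staged on this node
}

func (ns *nodeServer) NodeStageVolume(volumeID string) error {
	ns.mu.Lock()
	if ns.inFlight[volumeID] {
		ns.mu.Unlock()
		// This is the error Nomad sees when two allocations try to
		// stage the same volume at nearly the same time.
		return status.Errorf(codes.Aborted,
			"an operation with the given Volume ID %s already exists", volumeID)
	}
	if ns.staged[volumeID] {
		ns.mu.Unlock()
		return nil // already staged: idempotent success per the CSI spec
	}
	ns.inFlight[volumeID] = true
	ns.mu.Unlock()

	defer func() {
		ns.mu.Lock()
		delete(ns.inFlight, volumeID)
		ns.mu.Unlock()
	}()

	// ... perform the actual mount here ...

	ns.mu.Lock()
	ns.staged[volumeID] = true
	ns.mu.Unlock()
	return nil
}

func main() {
	ns := &nodeServer{inFlight: map[string]bool{}, staged: map[string]bool{}}
	fmt.Println(ns.NodeStageVolume("testvolume")) // <nil>
	fmt.Println(ns.NodeStageVolume("testvolume")) // <nil> again: idempotent
}

A second caller that arrives while the first mount is still in flight hits the inFlight branch and gets exactly the error reported above; once the first request completes, a retry would succeed via the staged branch.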

Reproduction steps

This error can be reproduced in two ways:

  • If multiple jobs use the same volume, a node drain will cause their allocations to be rescheduled. If they land on the same node, they start too close together and one of them fails.
  • Alternatively, run a single job with a count high enough that several allocations are placed on the same node (the job file below uses count = 10).

Job file

job "site-test" {
	datacenters = ["dc1"]
	type = "service"

	group "test" {
		count = 10

		volume "testvolume" {
			type = "csi"
			source = "testvolume"
			access_mode = "multi-node-multi-writer"
			attachment_mode = "file-system"
		}

		task "test" {
			driver = "docker"
			
			config {
				image = "omitted"

				ports = ["http"]
			}

			volume_mount {
				volume = "testvolume"
				destination = "/var/www/"
			}

			resources {
				memory = 128
			}
		}

		network {
			mbits = 10
			port "http" {
				to = 80
				host_network = "private"
			}
		}
	}
}
tgross (Member) commented Nov 23, 2022

Hi @JohnKiller! I took a quick look at this and we already have logic in our mounter to avoid re-staging a volume that's already been staged on a given node (see volume.go#L165-L168). But you're right that we're not currently serializing requests in our CSI client (see client.go#L752).

That said, Nomad is out-of-spec on CSI concurrency:

In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call “in-flight” per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The plugin SHOULD handle this as gracefully as possible. The error code ABORTED MAY be returned by the plugin in this case (see the Error Scheme section for details).

So we can detect an "already staged" condition but not a "staging at the same time" condition. And I guess Ceph has chosen to return ABORTED here, which means they're meeting the minimum spec for graceful handling (as in, at least they're not crashing). We'll need to fix this for sure, probably by beefing up our client to allow a per-volume queue of requests that collapses duplicate NodeStage calls but keeps multiple NodePublish calls waiting.
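
A hypothetical sketch of that direction in Go, using golang.org/x/sync/singleflight (this is not Nomad's actual client code, and the csiClient type and method names are made up): concurrent NodeStage calls for the same volume ID collapse onto a single in-flight plugin RPC, so later callers wait for the first result instead of tripping the driver's per-volume lock.

package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

type csiClient struct {
	sf singleflight.Group
}

// nodeStageVolume stands in for the real gRPC call to the plugin.
func (c *csiClient) nodeStageVolume(volumeID string) error {
	time.Sleep(100 * time.Millisecond) // pretend the mount takes a while
	fmt.Println("plugin staged", volumeID)
	return nil
}

// NodeStageVolume collapses concurrent callers for the same volume
// onto one in-flight request instead of letting them race each other.
func (c *csiClient) NodeStageVolume(volumeID string) error {
	_, err, _ := c.sf.Do(volumeID, func() (interface{}, error) {
		return nil, c.nodeStageVolume(volumeID)
	})
	return err
}

func main() {
	c := &csiClient{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // ten allocations, as in the repro job
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := c.NodeStageVolume("testvolume"); err != nil {
				fmt.Println("stage failed:", err)
			}
		}()
	}
	wg.Wait() // the plugin is called once; all ten callers succeed
}

With something like this in place, the ten allocations from the repro job would trigger a single staging RPC per node, and the remaining callers would simply share its result.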

Thanks for opening this issue @JohnKiller! I'll mark it for roadmapping.

tgross added the stage/accepted label Nov 23, 2022
matthdsm added a commit to nextflow-io/nf-nomad that referenced this issue Aug 28, 2024
This should fix a concurrency issue with the CSI driver
ceph/ceph-csi#3511
hashicorp/nomad#15197
abhi18av added a commit to nextflow-io/nf-nomad that referenced this issue Aug 28, 2024
* Allow 1 restart per task

This should fix a concurrency issue with the CSI driver
ceph/ceph-csi#3511
hashicorp/nomad#15197

* expose the reschedule and restart config vars

* remove unused import

---------

Co-authored-by: Jorge <jorge@edn.es>
Co-authored-by: Abhinav Sharma <abhi18av@outlook.com>