Referential integrity issue with CSI volumes. #10052

Closed
the-maldridge opened this issue Feb 19, 2021 · 9 comments

@the-maldridge

Nomad version

Client: Nomad v1.0.1 (a480eed0815c54612856d9115a34bb1d1a773e8c)
Server: Nomad v1.0.0 (cfca6405ad9b5f66dffc8843e3d16f92f3bedb43)

Operating system and Environment details

Operating System: Resinstack, built at Terraform module version v0.0.1, with Nomad and Consul ACLs enabled in default-deny mode. Three-node server group and one worker node. Physical hardware.

Issue

While working with CSI volumes, I now have a volume that is convinced it is attached (nomad volume deregister refuses to remove it), yet Nomad does not show the volume as allocated in the web interface or the CLI:

$ nomad volume status minio_demo
ID                   = minio_demo
Name                 = miniodemo
External ID          = foo
Plugin ID            = rclone0
Provider             = csi-rclone
Version              = v1.2.8
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 1
Nodes Expected       = 1
Access Mode          = multi-node-multi-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

$ nomad volume deregister minio_demo
Error deregistering volume: Unexpected response code: 500 (volume in use: minio_demo)

I've attached a raft snapshot, as this appears to be corruption in the state Nomad maintains around volumes.

Reproduction steps

I'm not sure yet what exactly triggered this. I suspect that it was a combination of losing the client that had the disk mounted and nomad not GC'ing that when the node came back with a different node ID but the same hostname.

Job file (if appropriate)

I can provide the job files I was using, but I don't think they affected this.

See attached raft snapshot.
nomad-state-20210219-1613776106.snap.zip

This is a sandbox cluster, so I can try really destructive things in it if there's anything that would be worth looking into.

@tgross
Member

tgross commented Feb 22, 2021

Hi @the-maldridge!

I'm not sure yet what exactly triggered this. I suspect that it was a combination of losing the client that had the disk mounted and nomad not GC'ing that when the node came back with a different node ID but the same hostname.

That seems likely. We've definitely had some bugs around this before. Thanks for the raft snapshot; that's going to be super-helpful in hunting this down. I'll dig into this and get back to you.

There's a nomad volume detach command, added in 1.0.0, that may be able to release the volume claim. It would also help debugging if you could capture debug logs from the leader while that command runs.
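
A rough sketch of that workflow (the volume and node IDs are placeholders, and -server-id=leader assumes you can reach the leader with an appropriately scoped token):

# In one terminal, stream debug-level logs from the current leader.
$ nomad monitor -log-level=DEBUG -server-id=leader

# In another terminal, ask Nomad to detach the volume from the node
# that supposedly still holds the claim.
$ nomad volume detach <volume-id> <node-id>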

@the-maldridge
Author

@tgross Thanks for the speedy reply. I tried nomad volume detach, but since Nomad doesn't think the volume is attached to anything, it can't be detached. As this was a sandbox cluster it didn't matter and I could wipe the cluster state, but obviously in prod that wouldn't be acceptable.

@goedelsoup

I've hit this a few times relocating a volume to a different namespace since 1.x. Forcing the detachment worked for me, but wasn't the desired operation. This didn't arise until I moved our clusters to namespaces, so the open-sourcing of namespaces may have been a source of bugs here. Volumes attached to the default namespace definitely cause issues when upgrading through the latest series of releases.

@the-maldridge
Author

Interesting. I was doing this in my test cluster, which only uses the default namespace; I have not observed this in my production cluster, where the volume resides in another namespace. I think you might be onto something.

@tgross
Member

tgross commented Mar 5, 2021

I got a chance to dig into that snapshot. First I did the following to dump it out to JSON:

$ nomad operator snapshot restore nomad-state-20210219-1613776106.snap
$ nomad operator raft _state > debug.json

Then if I look at the volume's claims against the live allocations, I find that the volume has 3 claims in the "taken" state (ref CSIVolumeClaimState):

$ jq '.CSIVolumes[0].WriteClaims' < debug.json
{
  "25895740-3b71-1931-27a1-9481869ddd16": {
    "AllocationID": "25895740-3b71-1931-27a1-9481869ddd16",
    "NodeID": "8ec54170-6dd2-f4ee-07d0-f1d65a793b3e",
    "ExternalNodeID": "",
    "Mode": 1,
    "State": 0
  },
  "49ea1978-266f-fbe0-22e3-a1deb9f44454": {
    "AllocationID": "49ea1978-266f-fbe0-22e3-a1deb9f44454",
    "NodeID": "e002113b-4854-3e9e-2d5a-152409e9d858",
    "ExternalNodeID": "",
    "Mode": 1,
    "State": 0
  },
  "647bcf3a-c969-e80f-c1f9-70ff0dd4cad7": {
    "AllocationID": "647bcf3a-c969-e80f-c1f9-70ff0dd4cad7",
    "NodeID": "e002113b-4854-3e9e-2d5a-152409e9d858",
    "ExternalNodeID": "",
    "Mode": 1,
    "State": 0
  }
}

But if we look at the allocations, we can see none of these allocations exist anymore in raft:

$ jq '.Allocs[].ID' < debug.json
"0a12c855-2963-c66a-58ed-1cdb275aee4d"
"fd686668-0ed9-a4e8-3de0-f8a4ba9c4801"

This is actually expected in the case of garbage collection and something we've accounted for when we release claims. So why are we unable to remove the claims? Let's compare the node IDs that exist against the write claims:

$ jq '.Nodes[].ID' < debug.json
"04a3c346-132e-cba0-82c4-3222a50abd34"
"7d9c55f2-d263-1c52-a78b-0daceae95c31"
"9cc73dc5-9222-1933-346d-8f076db02ef8"
"e4446831-77bc-b843-0e88-a05c888b8943"
"e83b5018-b78e-0b11-4bb4-0341717ca1a2"

None of the nodes for the write claims exist in raft either, so Nomad can't even find the node to send an RPC to in order to release the claim on the volume during the Unpublish workflow. The "obvious" fix is to allow Nomad to give up on the claim and drop it if the allocation is gone and the node is also gone. But this will definitely cause a problem in the case of lost nodes. If the node is lost and never comes back, we don't actually know that we can unmount the volume from that node. For example, imagine the following scenario:

  • Nomad is running on AWS EC2 with a mounted EBS volume.
  • The Nomad service is stopped, and the server marks the client as down and reschedules the allocations.
  • At this point, the allocation's task is still running on the host (unless stop_after_client_disconnect is set), and the volume is still mounted to it.
  • When we try to reap the claim, we can't reach the node plugin.
  • If we ignore that problem and send the unpublish to the controller plugin anyway, the volume won't be detached from the instance. The AWS API will get stuck because the device is "busy", unless the user comes along and manually force-detaches the volume out-of-band (the CSI plugin doesn't support this, as far as I know); a sketch of that manual detach follows this list.
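
A sketch of that manual, out-of-band workaround with the AWS CLI (the volume ID is a placeholder, and this bypasses Nomad and the CSI plugin entirely):

# Force-detach the EBS volume from whatever instance it is attached to.
$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force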

It's probably worth me looking at whether using stop_after_client_disconnect on volume-consuming task groups is the right move here, in order to at least release the volume mount client-side. That then opens the door to making the unpublish workflow more tolerant of missing nodes.

@tgross tgross removed their assignment Jun 3, 2021
@ivoronin

ivoronin commented Jun 26, 2021

It seems I am having the same issue. I'm doing a PoC of a Nomad cluster running stateful workloads on preemptible, short-lived client instances. Unfortunately, after a few client node replacements, stateful jobs become stuck:

# nomad job eval mytask
==> Monitoring evaluation "47417b6a"
    Evaluation triggered by job "mytask"
==> Monitoring evaluation "47417b6a"
    Evaluation within deployment: "ff73dc0c"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "47417b6a" finished with status "complete" but failed to place all allocations:
    Task Group "mytask" (failed to place 1 allocation):
      * Class "worker": 3 nodes excluded by filter
      * Constraint "CSI volume mytask-data has exhausted its available writer claims": 3 nodes excluded by filter
    Evaluation "d63faaf4" waiting for additional capacity to place remainder

The only way to unstick them is to stop and purge the jobs and deregister (-force) the volumes.
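
Concretely, the workaround looks something like this (using the job and volume names from the output above; -force skips Nomad's in-use check):

# Stop the job and purge it from Nomad's state.
$ nomad job stop -purge mytask

# Force-deregister the volume even though Nomad still believes it is claimed.
$ nomad volume deregister -force mytask-data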

Is there a way to:

  1. Confirm that I'm facing this exact issue?
  2. Clean up volume allocations on permanently lost nodes manually to avoid recreating volume registrations and jobs?

PS:
The issue is easily reproducible in my environment; I can provide additional data if needed.

@m1keil

m1keil commented Jul 9, 2021

Hitting similar issues in our environment.
I don't know if it's the cause or not, but we usually rotate the EC2 instances by doing an instance refresh on the Auto Scaling group. We don't have any drain hooks in place, so the instance just dies. I can see that in some cases the job can never recover because the volume is marked as if it's still attached.

It will detach just fine, but when trying to deregister the volume, Nomad errors that it is still in use. I had to use the -force option.

@tgross
Member

tgross commented Jan 28, 2022

This issue has been debugged in #10927 and patched with #11890, #11932, #11931, and #11892. We'll be issuing a patch release shortly for that fix. I'm going to close this issue.

Follow #10927 (comment) for more information.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2022