Referential integrity issue with CSI volumes. #10052

Closed
the-maldridge opened this issue Feb 19, 2021 · 9 comments

@the-maldridge

Nomad version

Client: Nomad v1.0.1 (a480eed0815c54612856d9115a34bb1d1a773e8c)
Server: Nomad v1.0.0 (cfca6405ad9b5f66dffc8843e3d16f92f3bedb43)

Operating system and Environment details

Operating System: Resinstack, built at Terraform module version v0.0.1, with Nomad and Consul ACLs enabled in default-deny mode. Three-node server group and one worker node. Physical hardware.

Issue

While working with CSI volumes, I now have a volume that is convinced it is attached (nomad volume deregister refuses to remove it), yet Nomad does not show the volume as allocated in the web interface or the CLI:

$ nomad volume status minio_demo
ID                   = minio_demo
Name                 = miniodemo
External ID          = foo
Plugin ID            = rclone0
Provider             = csi-rclone
Version              = v1.2.8
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 1
Nodes Expected       = 1
Access Mode          = multi-node-multi-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

$ nomad volume deregister minio_demo
Error deregistering volume: Unexpected response code: 500 (volume in use: minio_demo)

I've attached a raft snapshot, as this appears to be corruption in the state Nomad maintains around volumes.

Reproduction steps

I'm not sure yet what exactly triggered this. I suspect that it was a combination of losing the client that had the disk mounted and nomad not GC'ing that when the node came back with a different node ID but the same hostname.

Job file (if appropriate)

I can provide the job files I was using, but I don't think they affected this.

See attached raft snapshot.
nomad-state-20210219-1613776106.snap.zip

This is a sandbox cluster, so I can try really destructive things in it if there's anything that would be worth looking into.

@tgross
Member

tgross commented Feb 22, 2021

Hi @the-maldridge!

I'm not sure yet what exactly triggered this. I suspect that it was a combination of losing the client that had the disk mounted and nomad not GC'ing that when the node came back with a different node ID but the same hostname.

That seems likely. We've definitely had some bugs around this before. Thanks for the raft snapshot; that's going to be super-helpful in hunting this down. I'll dig into this and get back to you.

There's a nomad volume detach command, added in 1.0.0, that may be able to release the volume claim. It would also help debugging if you could capture debug logs from the leader while that command runs.
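
A rough sketch of that workflow (the volume and node IDs are placeholders, and -server-id=leader assumes you can reach the leader with an appropriately scoped token):

# In one terminal, stream debug-level logs from the current leader.
$ nomad monitor -log-level=DEBUG -server-id=leader

# In another terminal, ask Nomad to detach the volume from the node
# that supposedly still holds the claim.
$ nomad volume detach <volume-id> <node-id>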

@the-maldridge
Author

@tgross Thanks for the speedy reply. I tried nomad volume detach, but since Nomad doesn't think the volume is attached to anything, it can't be detached. As this was a sandbox cluster it didn't matter and I could wipe the cluster state, but obviously in prod that wouldn't be acceptable.

@goedelsoup

I've hit this a few times relocating a volume to a different namespace since 1.x. Forcing the detachment worked for me, but wasn't the desired operation. This didn't arise until I moved our clusters to namespaces, so the open-sourcing of namespaces may have been a source of bugs here. Volumes attached to the default namespace definitely cause issues when upgrading through the latest series of releases.

@the-maldridge
Author

Interesting. I was doing this in my test cluster, which only uses the default namespace; I have not observed this in my production cluster, where the volume resides in another namespace. I think you might be onto something.

@tgross
Member

tgross commented Mar 5, 2021

I got a chance to dig into that snapshot. First I did the following to dump it out to JSON:

$ nomad operator snapshot restore nomad-state-20210219-1613776106.snap
$ nomad operator raft _state > debug.json

Then if I look at the volume's claims against the live allocations, I find that the volume has 3 claims in the "taken" state (ref CSIVolumeClaimState):

$ jq '.CSIVolumes[0].WriteClaims' < debug.json
{
  "25895740-3b71-1931-27a1-9481869ddd16": {
    "AllocationID": "25895740-3b71-1931-27a1-9481869ddd16",
    "NodeID": "8ec54170-6dd2-f4ee-07d0-f1d65a793b3e",
    "ExternalNodeID": "",
    "Mode": 1,
    "State": 0
  },
  "49ea1978-266f-fbe0-22e3-a1deb9f44454": {
    "AllocationID": "49ea1978-266f-fbe0-22e3-a1deb9f44454",
    "NodeID": "e002113b-4854-3e9e-2d5a-152409e9d858",
    "ExternalNodeID": "",
    "Mode": 1,
    "State": 0
  },
  "647bcf3a-c969-e80f-c1f9-70ff0dd4cad7": {
    "AllocationID": "647bcf3a-c969-e80f-c1f9-70ff0dd4cad7",
    "NodeID": "e002113b-4854-3e9e-2d5a-152409e9d858",
    "ExternalNodeID": "",
    "Mode": 1,
    "State": 0
  }
}

But if we look at the allocations, we can see none of these allocations exist anymore in raft:

$ jq '.Allocs[].ID' < debug.json
"0a12c855-2963-c66a-58ed-1cdb275aee4d"
"fd686668-0ed9-a4e8-3de0-f8a4ba9c4801"

This is actually expected in the case of garbage collection and something we've accounted for when we release claims. So why are we unable to remove the claims? Let's compare the node IDs that exist against the write claims:

$ jq '.Nodes[].ID' < debug.json
"04a3c346-132e-cba0-82c4-3222a50abd34"
"7d9c55f2-d263-1c52-a78b-0daceae95c31"
"9cc73dc5-9222-1933-346d-8f076db02ef8"
"e4446831-77bc-b843-0e88-a05c888b8943"
"e83b5018-b78e-0b11-4bb4-0341717ca1a2"

None of the nodes for the write claims exist in raft either, so Nomad can't even find the node to send an RPC to in order to release the claim on the volume during the Unpublish workflow. The "obvious" fix is to allow Nomad to give up on the claim and drop it if the allocation is gone and the node is also gone. But this will definitely cause a problem in the case of lost nodes. If the node is lost and never comes back, we don't actually know that we can unmount the volume from that node. For example, imagine the following scenario:

  • Nomad is running on AWS EC2 with a mounted EBS volume.
  • The Nomad service is stopped, and the server marks the client as down and reschedules the allocations.
  • At this point, the allocation's task is still running on the host (unless stop_after_client_disconnect is set), and the volume is still mounted to it.
  • When we try to reap the claim, we can't reach the node plugin.
  • If we ignore that problem and send the unpublish to the controller plugin anyway, the volume won't be detached from the instance. The AWS API will get stuck because the device is "busy", unless the user comes along and manually force-detaches the volume out-of-band (the CSI plugin doesn't support this, as far as I know); a sketch of that manual detach follows this list.
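
A sketch of that manual, out-of-band workaround with the AWS CLI (the volume ID is a placeholder, and this bypasses Nomad and the CSI plugin entirely):

# Force-detach the EBS volume from whatever instance it is attached to.
$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force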

It's probably worth me looking at whether using stop_after_client_disconnect on volume-consuming task groups is the right move here, in order to at least release the volume mount client-side. That then opens the door to making the unpublish workflow more tolerant of missing nodes.

@tgross tgross removed their assignment Jun 3, 2021
@ivoronin

ivoronin commented Jun 26, 2021

It seems I am having the same issue. I'm doing a PoC of a Nomad cluster running stateful workloads on preemptible, short-lived client instances. Unfortunately, after a few client node replacements, stateful jobs become stuck:

# nomad job eval mytask
==> Monitoring evaluation "47417b6a"
    Evaluation triggered by job "mytask"
==> Monitoring evaluation "47417b6a"
    Evaluation within deployment: "ff73dc0c"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "47417b6a" finished with status "complete" but failed to place all allocations:
    Task Group "mytask" (failed to place 1 allocation):
      * Class "worker": 3 nodes excluded by filter
      * Constraint "CSI volume mytask-data has exhausted its available writer claims": 3 nodes excluded by filter
    Evaluation "d63faaf4" waiting for additional capacity to place remainder

The only way to unstick them is to stop and purge the jobs and deregister (-force) the volumes.
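
Concretely, the workaround looks something like this (using the job and volume names from the output above; -force skips Nomad's in-use check):

# Stop the job and purge it from Nomad's state.
$ nomad job stop -purge mytask

# Force-deregister the volume even though Nomad still believes it is claimed.
$ nomad volume deregister -force mytask-data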

Is there a way to:

  1. Confirm that I'm facing this exact issue?
  2. Clean up volume allocations on permanently lost nodes manually to avoid recreating volume registrations and jobs?

PS:
The issue is easily reproducible in my environment; I can provide additional data if needed.

@m1keil

m1keil commented Jul 9, 2021

Hitting similar issues in our environment.
I don't know if it's the cause or not, but we usually rotate the EC2 instances by doing an instance refresh on the Auto Scaling group. We don't have any drain hooks in place, so the instance just dies. I can see that in some cases the job can never recover because the volume is marked as if it's still attached.

It will detach just fine, but when trying to deregister the volume, Nomad errors that it is still in use. I had to use the -force option.

@tgross
Member

tgross commented Jan 28, 2022

This issue has been debugged in #10927 and patched with #11890, #11932, #11931, and #11892. We'll be issuing a patch release shortly for that fix. I'm going to close this issue.

Follow #10927 (comment) for more information.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2022