Referential integrity issue with CSI volumes. #10052
Comments
Hi @the-maldridge!
That seems likely. We've definitely had some bugs around this before. Thanks for the raft snapshot; that's going to be super-helpful in hunting this down. I'll dig into this and get back to you. There's a …
@tgross Thanks for the speedy reply. I tried to use …
I've hit this a few times relocating a volume to a different namespace since 1.x. Forcing the detachment worked for me, but wasn't the desired operation. This didn't arise until I moved our clusters to namespaces, so the open sourcing of namespaces may have been a source of bugs here. Volumes attached to the default namespace definitely cause issues when upgrading through the latest series of releases.
Interesting. I was doing this in my test cluster, which only uses the default namespace; I have not observed this in my production cluster, where the volume resides in another namespace. I think you might be onto something.
I got a chance to dig into that snapshot. First I did the following to dump it out to JSON:
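Assuming the dump ends up in the debug.json file used in the queries below, a quick sanity check of which state tables it contains looks something like this (a sketch, not necessarily the exact command used):

```
# Show the top-level state tables present in the dumped state file:
$ jq 'keys' < debug.json
```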
Then if I look at the volume's claims against the live allocations, I find that the volume has 3 claims in the "taken" state:

$ jq '.CSIVolumes[0].WriteClaims' < debug.json
{
  "25895740-3b71-1931-27a1-9481869ddd16": {
    "AllocationID": "25895740-3b71-1931-27a1-9481869ddd16",
    "NodeID": "8ec54170-6dd2-f4ee-07d0-f1d65a793b3e",
    "ExternalNodeID": "",
    "Mode": 1,
    "State": 0
  },
  "49ea1978-266f-fbe0-22e3-a1deb9f44454": {
    "AllocationID": "49ea1978-266f-fbe0-22e3-a1deb9f44454",
    "NodeID": "e002113b-4854-3e9e-2d5a-152409e9d858",
    "ExternalNodeID": "",
    "Mode": 1,
    "State": 0
  },
  "647bcf3a-c969-e80f-c1f9-70ff0dd4cad7": {
    "AllocationID": "647bcf3a-c969-e80f-c1f9-70ff0dd4cad7",
    "NodeID": "e002113b-4854-3e9e-2d5a-152409e9d858",
    "ExternalNodeID": "",
    "Mode": 1,
    "State": 0
  }
}

But if we look at the allocations, we can see none of these allocations exist anymore in raft:
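The query for that check isn't shown above; a sketch of one way to do it, assuming the dumped state keeps allocations under a top-level `Allocs` array with an `ID` field (both names are assumptions about the dump format):

```
# List the allocation IDs still present in the state, and the alloc IDs
# referenced by the volume's write claims ("Allocs"/"ID" are assumed names):
$ jq -r '.Allocs[].ID' < debug.json | sort > live-allocs.txt
$ jq -r '.CSIVolumes[0].WriteClaims | keys[]' < debug.json | sort > claim-allocs.txt

# Claim alloc IDs with no matching live allocation:
$ comm -23 claim-allocs.txt live-allocs.txt
```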
This is actually expected in the case of garbage collection and something we've accounted for when we release claims. So why are we unable to remove the claims? Let's compare the node IDs that exist against the write claims:

$ jq '.Nodes[].ID' < debug.json
"04a3c346-132e-cba0-82c4-3222a50abd34"
"7d9c55f2-d263-1c52-a78b-0daceae95c31"
"9cc73dc5-9222-1933-346d-8f076db02ef8"
"e4446831-77bc-b843-0e88-a05c888b8943"
"e83b5018-b78e-0b11-4bb4-0341717ca1a2"

None of the nodes for the write claims exist in raft either, so Nomad can't even find the node to send an RPC to in order to release the claim on the volume during the …
It's probably worth me looking at whether using …
It seems I am having the same issue. I'm doing a PoC of a Nomad cluster running stateful workloads on preemptible, short-lived client instances. Unfortunately, after a few client node replacements, stateful jobs become stuck: …
The only way to unstick them is to stop and purge the jobs and deregister (…) the volumes. Is there a way to: …
Hitting similar issues in our environment. The volume will detach just fine, but when trying to deregister it, Nomad errors that it is still in use. I had to use the -force option.
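A sketch of the forced-cleanup workaround described in this thread; the volume ID, node ID, and volume.hcl file below are placeholders, and as noted above, forcing the deregistration works but isn't the desired operation since it discards the stale claim state rather than fixing it:

```
# Detach the volume from the node Nomad believes still holds it
# (volume ID and node ID are placeholders):
$ nomad volume detach <volume-id> <node-id>

# If deregistration is still refused because of stale claims, force it:
$ nomad volume deregister -force <volume-id>

# Re-register the volume so jobs can claim it again:
$ nomad volume register volume.hcl
```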
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Client:
Nomad v1.0.1 (a480eed0815c54612856d9115a34bb1d1a773e8c)
Server:
Nomad v1.0.0 (cfca6405ad9b5f66dffc8843e3d16f92f3bedb43)
Operating system and Environment details
Operating System: Resinstack, built at Terraform module version v0.0.1, with Nomad and Consul ACLs enabled in default-deny mode. Three-node server group and one worker node. Physical hardware.
Issue
While working with CSI volumes I now have a volume that is convinced it is attached (`nomad volume deregister` refuses to remove it), but Nomad does not show the volume as allocated in the web interface or the CLI. I've attached a raft snapshot, as this appears to be corruption in the state Nomad is maintaining around volumes.
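A minimal sketch of how the mismatch shows up from the CLI, using a placeholder volume ID (the real ID isn't included in the report):

```
# The CLI/UI report no allocations for the volume (placeholder ID):
$ nomad volume status <volume-id>

# ...yet deregistration is refused because the state store still holds claims:
$ nomad volume deregister <volume-id>
```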
Reproduction steps
I'm not sure yet what exactly triggered this. I suspect that it was a combination of losing the client that had the disk mounted and Nomad not GC'ing that when the node came back with a different node ID but the same hostname.
Job file (if appropriate)
I can provide the job files I was using, but I don't think they affected this.
See attached raft snapshot.
nomad-state-20210219-1613776106.snap.zip
This is in a sandbox cluster, so I can try really destructive things in it if there's anything worth looking into.