Constraint "CSI volume has exhausted its available writer claims": 1 nodes excluded by filter #10927
[Nomad 1.1.2] I have similar issues using the AWS EBS plugin, and I wholeheartedly share the sentiment that it's really hard to reproduce or debug, which is why I've been reluctant to open an issue. In my case the Nomad workers are spot instances which come and go. Non-CSI jobs get rescheduled fine, but those which have CSI volumes attached tend to go red on the first reschedule and then eventually succeed. It's almost like the CSI mechanism needs more time to do its thing before the job gets restarted...
I have been investigating it for a while. I don't know what Nomad does with unused volumes after the job is stopped. Does Nomad instruct the plugin to unmount the volumes from the nodes? And what happens if it fails to do so? Maybe a timeout, more log messages, or something similar should be added there.
We have the very same issue with Nomad 1.1.3 and AWS EBS CSI Plugin v1.2.0. The volume is currently unmounted in AWS and shows as available.
Well, this is bad, as I don't think the issue is solved by the Nomad garbage collector. When a volume hangs on one Nomad node, Nomad will just allocate the job to another node, and when it hangs there, it will allocate the job to yet another node. Imagine if we have 100 Nomad nodes and the volumes get stuck on some of them.
I am seeing the same issues as well.
We've faced this issue as well. It seems like Nomad fails to realise that there are no allocations present for the volume. From what we could make out, both AWS and the CSI plugin report that the volume is available to mount, and the AWS console reports the volume as "Available". Whereas in the Nomad UI, the "Storage" tab reports multiple allocations against the volume.
One thing I have noticed is that the metadata on a newly registered CSI volume looks different from the metadata on a volume that is "Schedulable" (not really, in terms of Nomad). Could be pointing to something, perhaps? The only solution that has worked for us so far is to force-deregister and re-register the volume.
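A quick way to cross-check a suspect claim from the CLI (the volume and allocation IDs below are placeholders, and the aws command assumes the EBS case):

# Show the volume, including the allocations Nomad believes are claiming it
nomad volume status -verbose <volume-id>

# Check whether a claimed allocation actually still exists
nomad alloc status <alloc-id>

# On the AWS side, confirm the EBS volume really is detached
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 --query 'Volumes[0].State'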
Thank you all for the detailed information (keep it coming if you have more!). It seems like this issue happens sporadically when allocations that use a volume are restarted. I will try to create a high-churn environment and see if I can reproduce it.
Hi everyone, just a quick update. I've been running a periodic job for a couple of days now, and so far I haven't seen any issues. In hindsight I should probably have used a different kind of job. For those who have seen this issue, do you have any kind of job file that reproduces it? Thanks!
Hey @lgfa29, you can try out the following job. It's a basic Prometheus setup:
job "prometheus" {
datacenters = ["dc1"]
type = "service"
group "monitoring" {
count = 2
constraint {
operator = "distinct_hosts"
value = "true"
}
volume "data" {
type = "csi"
source = "prometheus-disk"
attachment_mode = "file-system"
access_mode = "single-node-writer"
per_alloc = true
}
network {
port "http" {
static = 9090
}
}
service {
name = "prometheus2"
tags = ["prometheus2"]
task = "prometheus"
port = "http"
check {
type = "http"
port = "http"
path = "/-/ready"
interval = "10s"
timeout = "5s"
}
}
task "prometheus" {
driver = "docker"
user = "root"
volume_mount {
volume = "data"
destination = "/prometheus"
}
resources {
memory = 1024
cpu = 1024
}
template {
data = <<EOT
---
global:
scrape_interval: 10s
external_labels:
__replica__: "{{ env "NOMAD_ALLOC_ID" }}"
scrape_configs:
- job_name: "prometheus"
scrape_interval: 10s
consul_sd_configs:
- server: "{{ env "attr.unique.network.ip-address" }}:8500"
services:
- prometheus2
relabel_configs:
- source_labels: ["__meta_consul_node"]
regex: "(.*)"
target_label: "node"
replacement: "$1"
- source_labels: ["__meta_consul_service_id"]
regex: "(.*)"
target_label: "instance"
replacement: "$1"
- source_labels: ["__meta_consul_dc"]
regex: "(.*)"
target_label: "datacenter"
replacement: "$1"
# Nomad metrics
- job_name: "nomad_metrics"
consul_sd_configs:
- server: "{{ env "attr.unique.network.ip-address" }}:8500"
services: ["nomad-client", "nomad"]
relabel_configs:
- source_labels: ["__meta_consul_tags"]
regex: "(.*)http(.*)"
action: keep
- source_labels: ["__meta_consul_service"]
regex: "(.*)"
target_label: "job"
replacement: "$1"
- source_labels: ["__meta_consul_node"]
regex: "(.*)"
target_label: "node"
replacement: "$1"
- source_labels: ["__meta_consul_service_id"]
regex: "(.*)"
target_label: "instance"
replacement: "$1"
- source_labels: ["__meta_consul_dc"]
regex: "(.*)"
target_label: "datacenter"
replacement: "$1"
scrape_interval: 5s
metrics_path: /v1/metrics
params:
format: ["prometheus"]
# Consul metrics
- job_name: "consul_metrics"
consul_sd_configs:
- server: "{{ env "attr.unique.network.ip-address" }}:8500"
services: ["consul-agent"]
relabel_configs:
- source_labels: ["__meta_consul_tags"]
regex: "(.*)http(.*)"
action: keep
- source_labels: ["__meta_consul_service"]
regex: "(.*)"
target_label: "job"
replacement: "$1"
- source_labels: ["__meta_consul_node"]
regex: "(.*)"
target_label: "node"
replacement: "$1"
- source_labels: ["__meta_consul_service_id"]
regex: "(.*)"
target_label: "instance"
replacement: "$1"
- source_labels: ["__meta_consul_dc"]
regex: "(.*)"
target_label: "datacenter"
replacement: "$1"
scrape_interval: 5s
metrics_path: /v1/agent/metrics
params:
format: ["prometheus"]
EOT
destination = "local/prometheus.yml"
}
config {
image = "quay.io/prometheus/prometheus"
ports = ["http"]
args = [
"--config.file=${NOMAD_TASK_DIR}/prometheus.yml",
"--log.level=info",
"--storage.tsdb.retention.time=1d",
"--storage.tsdb.path=/prometheus",
"--web.console.libraries=/usr/share/prometheus/console_libraries",
"--web.console.templates=/usr/share/prometheus/consoles"
]
}
}
}
}
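For what it's worth, with per_alloc = true and count = 2 the group above resolves the volume source per allocation index, so Nomad will look for volumes registered as prometheus-disk[0] and prometheus-disk[1]. A rough sketch of creating them (the spec file names are made up):

nomad volume create prometheus-disk-0.volume.hcl   # spec sets id = "prometheus-disk[0]"
nomad volume create prometheus-disk-1.volume.hcl   # spec sets id = "prometheus-disk[1]"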
Mine has.
Also, you can try to fail the job, for example through an image pull failure or a run failure, to the point where the deployment exceeds its deadline. This is what really makes the error show up frequently. Also, I have two server (master) nodes; I don't know if this contributes, but does anyone else use two or more servers for this? Maybe a race condition between the servers? Split brain perhaps?
Also, here's my job, and the HCL file I used to create all the volumes with nomad volume create.
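A minimal sketch of what a ceph-csi volume specification for nomad volume create might look like (the IDs, capacities, pool, and secrets are placeholders, not the original values):

id        = "zookeeper1-data"
name      = "zookeeper1-data"
type      = "csi"
plugin_id = "ceph-csi"

capacity_min = "10GiB"
capacity_max = "10GiB"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
}

secrets {
  userID  = "<ceph-user>"
  userKey = "<ceph-key>"
}

parameters {
  clusterID = "<ceph-cluster-id>"
  pool      = "<rbd-pool>"
}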
Thanks for the sample jobs @JanMa and @gregory112. Here's the job I've been using while trying to reproduce the issue:
job "random-fail" {
datacenters = ["dc1"]
type = "service"
group "random-fail" {
volume "ebs-vol" {
type = "csi"
read_only = false
source = "ebs-vol"
attachment_mode = "file-system"
access_mode = "single-node-writer"
mount_options {
fs_type = "ext4"
mount_flags = ["noatime"]
}
}
task "random-fail" {
driver = "docker"
config {
image = "alpine:3.14"
command = "/bin/ash"
args = ["/local/script.sh"]
}
template {
data = <<EOF
#!/usr/bin/env bash
while true;
do
echo "Rolling the dice..."
n=$(($RANDOM % 10))
echo "Got ${n}!"
if [[ 0 -eq ${n} ]];
then
echo "Bye :wave:"
exit 1;
fi
echo "'Til the next round."
sleep 10;
done
EOF
destination = "local/script.sh"
}
volume_mount {
volume = "ebs-vol"
destination = "/volume"
read_only = false
}
}
}
}
For some reason Nomad thinks that the volume is still in use while it is not. nomad volume deregister returns "Error deregistering volume: Unexpected response code: 500 (rpc error: volume in use: nessus)". Running nomad volume deregister -force followed by nomad system gc, and then registering the volume again, seems to help.
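For reference, that workaround sequence looks roughly like this (the volume ID and spec file name are placeholders):

nomad volume deregister -force <volume-id>
nomad system gc
nomad volume register <volume-spec>.hcl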
This nasty workaround seems to be working for DigitalOcean. If your task restarts frequently, it will spam your cluster with jobs, so be careful with that.
# ...
task "reregister_volume" {
lifecycle {
hook = "poststop"
sidecar = false
}
driver = "docker"
config {
image = "alpine:3.14"
entrypoint = [
"/bin/sh",
"-eufc",
<<-EOF
apk add curl
curl --fail --data '@-' -X POST \
"http://$${attr.unique.network.ip-address}:4646/v1/job/reregister-volume/dispatch" <<-EndOfData
{
"Meta": {
"csi_id": "${csi_id_prefix}[$${NOMAD_ALLOC_INDEX}]",
"csi_id_uri_component": "${csi_id_prefix}%5B$${NOMAD_ALLOC_INDEX}%5D",
"volume_name": "${volume_name_prefix}$${NOMAD_ALLOC_INDEX}",
"volume_plugin_id": "${volume_plugin_id}"
}
}
EndOfData
EOF
]
}
}
# ...
job "reregister-volume" {
type = "batch"
parameterized {
payload = "forbidden"
meta_required = ["csi_id", "csi_id_uri_component", "volume_name", "volume_plugin_id"]
}
group "reregister" {
task "reregister" {
driver = "docker"
config {
image = "alpine:3.14"
entrypoint = [
"/bin/sh",
"-eufc",
<<-EOF
sleep 5 # Wait for job to stop
apk add jq curl
echo "CSI_ID=$${NOMAD_META_CSI_ID}"
echo "CSI_ID_URI_COMPONENT=$${NOMAD_META_CSI_ID_URI_COMPONENT}"
echo "VOLUME_NAME=$${NOMAD_META_VOLUME_NAME}"
echo "VOLUME_PLUGIN_ID=$${NOMAD_META_VOLUME_PLUGIN_ID}"
n=0
until [ "$n" -ge 15 ]; do
echo "> Checking if volume exists (attempt $n)"
curl --fail -X GET \
"http://$${attr.unique.network.ip-address}:4646/v1/volumes?type=csi" \
| jq -e '. | map(.ID == "$${NOMAD_META_CSI_ID}") | any | not' && break
n=$((n+1))
sleep 1
echo
echo '> Force detaching volume'
curl --fail -X DELETE \
"http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}?force=true" \
|| echo ' Detaching failed'
done
if [ "$n" -ge 15 ]; then
echo ' Deregister failed too many times, giving up'
exit 0
else
echo ' Deregister complete'
fi
echo
echo '> Fetching external volume ID'
VOLUME_JSON=$(
curl --fail -X GET \
-H "Authorization: Bearer ${digitalocean_token_persistent}" \
"https://api.digitalocean.com/v2/volumes?name=$${NOMAD_META_VOLUME_NAME}" \
| jq '.volumes[0]'
)
VOLUME_ID=$(
echo "$VOLUME_JSON" | jq -r '.id'
)
VOLUME_REGION=$(
echo "$VOLUME_JSON" | jq -r '.region.slug'
)
VOLUME_DROPLET_ID=$(
echo "$VOLUME_JSON" | jq -r '.droplet_ids[0] // empty'
)
echo "VOLUME_ID=$VOLUME_ID"
echo "VOLUME_REGION=$VOLUME_ID"
echo "VOLUME_DROPLET_ID=$VOLUME_DROPLET_ID"
if [ ! -z "$VOLUME_DROPLET_ID" ]; then
echo
echo '> Detaching volume on DigitalOcean'
curl --fail -X POST \
-H "Authorization: Bearer ${digitalocean_token_persistent}" \
-d "{\"type\": \"detach\", \"droplet_id\": \"$VOLUME_DROPLET_ID\", \"region\": \"$VOLUME_REGION\"}" \
"https://api.digitalocean.com/v2/volumes/$VOLUME_ID/actions"
fi
echo
echo '> Re-registering volume'
curl --fail --data '@-' -X PUT \
"http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}" <<-EndOfVolume
{
"Volumes": [
{
"ID": "$${NOMAD_META_CSI_ID}",
"Name": "$${NOMAD_META_VOLUME_NAME}",
"ExternalID": "$VOLUME_ID",
"PluginID": "$${NOMAD_META_VOLUME_PLUGIN_ID}",
"RequestedCapabilities": [{
"AccessMode": "single-node-writer",
"AttachmentMode": "file-system"
}]
}
]
}
EndOfVolume
echo
echo '> Reading volume'
curl --fail -X GET \
"http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}"
echo
echo 'Finished'
EOF
]
}
}
}
}
Same issue on my cluster with the EFS plugin (docker: amazon/aws-efs-csi-driver:v1.3.3). One thing worth mentioning: right after force-deregistering the volume, my Node Health recovered from (4/5) to (5/5). The EFS plugin could not be deployed to that node before deregistering.
[Nomad v1.1.3] We have the exact same issue with the Linstor CSI driver (https://linbit.com/linstor/). It works a few times and Nomad does a proper job, but after a random number of reschedules it fails with the same error message and the same behaviour as described in this issue. In the Nomad UI volume list the volume has an allocation set, but when we go into the detailed view there are no allocs present. Basically the same behaviour that @Thunderbottom reported.
Yep, I'm running a really basic setup with just one server/client node and one client node, and this happens to me too; I need to force-deregister and re-register the volume to recover.
Just a kind reminder for those who are using AWS EFS: double-check your mount options, especially the noresvport flag. Otherwise it can cause processes to hang in endless I/O wait (100% wa in top) once the TCP connection breaks. We didn't observe any more problems after correcting the mount options.
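As a hedged example, in a Nomad volume registration the usual EFS-recommended NFS flags would be passed through mount_options roughly like this (assuming the plain-NFS mount path; adjust for your driver configuration):

mount_options {
  fs_type     = "nfs4"
  mount_flags = [
    "nfsvers=4.1", "rsize=1048576", "wsize=1048576",
    "hard", "timeo=600", "retrans=2", "noresvport",
  ]
}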
Same issue on our v1.1.4 cluster using the Ceph RBD and CephFS CSI plugins.
Ok, the race described above turns out to be pretty straightforward, and it comes down to a code path on the client. Why does this code path get the mode from the volume and not the claim? Because of how the claim is written by the GC job.
I've done some testing of #11892 and that's in good shape as well, and I've walked some teammates through the rest of the PRs so that we can get them reviewed and merged one-by-one. Unfortunately I've also run into an unrelated non-CSI issue along the way.
The 4 patches for this issue (which also covers #10052, #10833, and #8734) have been merged and will ship in an upcoming release. Thanks again for your patience, folks.
We thank you, dude! I can finally recommend Nomad to people without a BUT; I've always wanted to be able to do that.
This is awesome work @tgross! I highly appreciate the transparency and the depth at which you kept the community updated. This thread should become the gold standard for organization <--> community interactions!
Nomad 1.2.5 has shipped with the patches described here. We still have a few important CSI issues to close out before we can call CSI GA-ready, but it should be safe to close this issue now. Please let us know if you run into the issue with the current versions of Nomad. Thanks!
@tgross I recently upgraded to v1.2.6 and I'm still hitting the issue. After a few days of the job running, attempting to deploy ends up with the dreaded "CSI volume has exhausted its available writer claims and is claimed by a garbage collected allocation" error.
@tgross I have still managed to reproduce this with v1.3.0-beta.1.
A bit more detail would be helpful. Which task allocation do you mean? The plugin or the allocation that mounted the volume? I would not have expected the allocation that mounted the volume to actually be gone, because we block for a very long time retrying the unmount unless the client node was shut down as well. Did the node plugin for that node get restored to service? If so, the expectation is that the volumewatcher loop will eventually be able to detach it. Or is the client host itself gone too?
Unfortunately many storage providers will return errors if we do this. For example, imagine an AWS EBS volume attached to an EC2 instance where the Nomad client is shut down. We can never physically detach the volume until the client is restored, so all attempts to attach it elsewhere will fail unrecoverably. That's why we provide a manual escape hatch for operators.
@tgross I mean the allocation that mounted the volume. It was gone because of a node restart, and the rescheduled task had remained in the stuck state since Thursday, so approximately 4 days. During the whole time the plugin state was shown as healthy in the UI, so Nomad was not detecting that there was no actual connection to the backend. I guess ceph-csi just handles all Probe and GetCapabilities calls locally. The network issue was eventually fixed, but the volume remained stuck.
I haven't yet had much success with detach, because for me it normally just blocks and doesn't give out any information. For a case such as the one I described, it is also hard to know which node the volume was previously attached to once the allocation has been garbage collected. Would it even work if the node has been restarted? But yes, if such a case can't be handled automatically, it would be good if a volume could be unconditionally detached by the operator even if the node has been restarted or is gone altogether. Also, the deregister/reregister dance is impossible if the volume has been created with Terraform, as the volume will be deleted automatically if the resource is removed.
Ok, so that would also have restarted the Node plugin tasks as well. When the node restarted, did the Node plugin tasks come back? The UI is maybe a little misleading here -- it shows the plugin as healthy if we can schedule volumes with it, not when 100% of plugins are healthy (otherwise we'd block volume operations during plugin deployments). As for being stuck for 4 days... the evaluation for the rescheduled task hits a retry limit fairly quickly, so I suspect it's not "stuck" so much as "gave up." Once we've got the claim taken care of, re-running the job (or forcing a new evaluation) should get it placed again.
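A hedged sketch of forcing that re-evaluation from the CLI (the job name is a placeholder):

# Ask the scheduler to create a new evaluation for the job, ignoring the reschedule back-off
nomad job eval -force-reschedule <job-id>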
It blocks because it's a synchronous operation; it has to make RPC calls to the Node plugin and to the Controller plugin. You should see logs on the Node plugin (first) and then the Controller plugin (second). If both the Node plugin and Controller plugin are live, that should work and you should be able to look at logs in both plugins (and likely the Nomad leader as well) to see what's going on there.
So long as it restarted and has a running Node plugin, yes, that should still work just fine. But fair point about knowing which node the volume was attached to previously.
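For anyone following along, the manual escape hatch being discussed looks roughly like this (volume and node IDs are placeholders):

# Ask Nomad to unmount and unpublish the volume from the node it was last attached to
nomad volume detach <volume-id> <node-id>

# As a last resort, drop Nomad's state for the volume and register it again
nomad volume deregister -force <volume-id>
nomad volume register <volume-spec>.hcl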
(hit submit too soon 😀 )
It's not possible for us to unconditionally detach, but in theory we could unconditionally free a claim.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v1.1.2 (60638a0)
Operating system and Environment details
Ubuntu 20.04 LTS
Issue
Cannot re-plan jobs due to CSI volumes being claimed.
I have seen many variations of this issue and I don't know how to debug it. I use the ceph-csi plugin deployed as a system job on my two Nomad nodes, which results in two controllers and two ceph-csi node plugins. I then create a few volumes using the nomad volume create command, and a job with three tasks that use those volumes. Sometimes, after a while, the job may fail and I stop it. After that, when I try to replan the exact same job, I get that error.
What confuses me is the warning: it differs every time I run job plan. First I saw one warning; then, running job plan again a few seconds later, I got a different one; then yet another.
I have three groups: zookeeper1, zookeeper2, and zookeeper3, each using two volumes (data and datalog). I will just assume from this log that all volumes are non-reclaimable.
This is the output of nomad volume status: it says that the volumes are schedulable. And this is the output of nomad volume status zookeeper1-datalog: it says there are no allocations placed.
Reproduction steps
This is unfortunately flaky, but it most likely happens when the job fails, is then stopped, and is then replanned. It persists even after I purge the job with nomad job stop -purge. No, doing nomad system gc, nomad system reconcile summary, or restarting Nomad does not work.
Expected Result
I should be able to reclaim the volume again without having to detach, or deregister -force and register again. I created the volumes using nomad volume create, so those volumes all have generated external IDs. There are 6 volumes and 2 nodes; I don't want to type detach 12 times every time this happens (and it happens very frequently).
Actual Result
See error logs above.
Job file (if appropriate)
I have three groups (zookeeper1, zookeeper2, zookeeper3), each having a volume stanza like this (each with their own volumes; this one is for zookeeper2):
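A sketch of what such a volume stanza might look like for the zookeeper2 group (volume names assumed from the naming used above):

volume "data" {
  type            = "csi"
  source          = "zookeeper2-data"
  attachment_mode = "file-system"
  access_mode     = "single-node-writer"
}

volume "datalog" {
  type            = "csi"
  source          = "zookeeper2-datalog"
  attachment_mode = "file-system"
  access_mode     = "single-node-writer"
}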
All groups have count = 1.