
Constraint "CSI volume has exhausted its available writer claims": 1 nodes excluded by filter #10927

Closed
gregory112 opened this issue Jul 22, 2021 · 74 comments · Fixed by #11932
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/storage, type/bug

Comments

@gregory112

gregory112 commented Jul 22, 2021

Nomad version

Nomad v1.1.2 (60638a0)

Operating system and Environment details

Ubuntu 20.04 LTS

Issue

Cannot re-plan jobs due to CSI volumes being claimed.
I have seen many variations of this issue. I don't know how to debug it. I use the ceph-csi plugin deployed as a system job on my two Nomad nodes. This results in two controllers and two ceph-csi nodes. I then create a few volumes using the nomad volume create command. I then create a job with three tasks that use three volumes. Sometimes, after a while, the job may fail and I stop it. After that, when I try to replan the exact same job, I get that error.
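
For context, the rough command sequence that leads to this looks like the following (the file names here are placeholders for my actual specs):

# Reproduction sketch; file names are placeholders
nomad job run ceph-csi-plugin.nomad       # system job: controller + node plugins
nomad volume create zookeeper1-data.hcl   # repeated for each of the volumes
nomad job run zookeeper.nomad             # job whose groups claim the volumes
# ... job fails after a while ...
nomad job stop mycluster-zookeeper
nomad job plan zookeeper.nomad            # now fails with the writer-claims error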

What confuses me is the warning. It differs every time I run job plan. First I saw

- WARNING: Failed to place all allocations.
  Task Group "zookeeper1" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper1-data has exhausted its available writer claims": 2 nodes excluded by filter

  Task Group "zookeeper2" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper2-data has exhausted its available writer claims": 2 nodes excluded by filter

Then, running job plan again a few seconds later, I got

- WARNING: Failed to place all allocations.
  Task Group "zookeeper1" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper1-datalog has exhausted its available writer claims": 2 nodes excluded by filter

  Task Group "zookeeper2" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper2-datalog has exhausted its available writer claims": 2 nodes excluded by filter

Then again,

- WARNING: Failed to place all allocations.
  Task Group "zookeeper1" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper1-data has exhausted its available writer claims": 1 nodes excluded by filter
    * Constraint "CSI volume zookeeper1-datalog has exhausted its available writer claims": 1 nodes excluded by filter

  Task Group "zookeeper2" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper2-datalog has exhausted its available writer claims": 2 nodes excluded by filter

I have three groups: zookeeper1, zookeeper2, and zookeeper3, each using two volumes (data and datalog). I will just assume from this log that all volumes are non-reclaimable.

This is the output of nomad volume status.

Container Storage Interface
ID                  Name                Plugin ID  Schedulable  Access Mode
zookeeper1-data     zookeeper1-data     ceph-csi   true         single-node-writer
zookeeper1-datalog  zookeeper1-datalog  ceph-csi   true         single-node-writer
zookeeper2-data     zookeeper2-data     ceph-csi   true         single-node-writer
zookeeper2-datalog  zookeeper2-datalog  ceph-csi   true         single-node-writer
zookeeper3-data     zookeeper3-data     ceph-csi   true         <none>
zookeeper3-datalog  zookeeper3-datalog  ceph-csi   true         <none>

It says that they are schedulable. This is the output of nomad volume status zookeeper1-datalog:

ID                   = zookeeper1-datalog
Name                 = zookeeper1-datalog
External ID          = 0001-0024-72f28a72-0434-4045-be3a-b5165287253f-0000000000000003-72ec315b-e9f5-11eb-8af7-0242ac110002
Plugin ID            = ceph-csi
Provider             = cephfs.nomad.example.com
Version              = v3.3.1
Schedulable          = true
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 2
Nodes Expected       = 2
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

It says there that no allocations are placed.

Reproduction steps

This is unfortunately flaky, but it most likely happens after a job fails, is stopped, and is then replanned. The problem persists even after I purge the job with nomad job stop -purge. Running nomad system gc or nomad system reconcile summaries, or restarting Nomad, does not help either.

Expected Result

I should be able to reclaim the volume again without having to detach, or deregister with -force and register again. I created the volumes using nomad volume create, so those volumes have their external IDs generated. There are 6 volumes and 2 nodes; I don't want to type detach 12 times every time this happens (and it happens very frequently).
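
For reference, the repeated manual cleanup amounts to a loop like this (the node IDs are placeholders for my two clients):

# Detach every volume from every node; node IDs are placeholders
for vol in zookeeper1-data zookeeper1-datalog zookeeper2-data \
           zookeeper2-datalog zookeeper3-data zookeeper3-datalog; do
  for node in node-id-1 node-id-2; do
    nomad volume detach "$vol" "$node"
  done
done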

Actual Result

See error logs above.

Job file (if appropriate)

I have three groups (zookeeper1, zookeeper2, zookeeper3), each having a volume stanza like this (each with its own volumes; this one is for zookeeper2):

    volume "data" {
      type = "csi"
      read_only = false
      source = "zookeeper2-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "zookeeper2-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

All groups have count = 1.

@bfqrst

bfqrst commented Jul 26, 2021

[Nomad 1.1.2]

I have similar issues using the AWS EBS plugin, and I wholeheartedly share the sentiment that it's really hard to reproduce or debug, which is why I've been reluctant to open an issue. In my case Nomad workers are spot instances which come and go. Non-CSI jobs get rescheduled fine, but those which have CSI volumes attached tend to go red on the first reschedule and then eventually succeed. It's almost like the CSI mechanism needs more time to do its thing before the job gets restarted...

@gregory112
Author

gregory112 commented Jul 26, 2021

I have been investigating it for a while.
A manual solution that works is to deregister the volume with -force, or at least to detach the volume with nomad volume detach. Deregistering means you need to register the volume again, and if you used nomad volume create as I did, you will have to look up the external ID and re-register the volume with that same external ID. So nomad volume detach is probably the most straightforward approach. I've learned that what it actually does is unmount the mount point of that particular volume on that particular node. And this sometimes hangs. When it hangs, I see an umount process on the node hanging indefinitely. Network connectivity is never reliable, so I think cases like this need attention.

I don't know what Nomad does with unused volumes after the job is stopped. Does Nomad instruct the plugin to unmount the volumes from the nodes? And what happens if it fails to do so? Maybe a timeout, more log messages, or something similar should be added to nomad alloc status.

umount hangs often, especially when mounting network file systems like this. And when umount hangs, nomad volume detach hangs too. The solution in my case is to kill the umount process on the node and retry with umount -f, but I don't know if Nomad can do this, as I think it might be managed by the CSI plugin. But given the variations across the many different CSI plugins, I think cases like this should be handled by Nomad somehow.
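
As a sketch, the manual recovery on an affected node looks roughly like this (the PID, mount path, and node ID are placeholders; the real mount path lives under the plugin's CSI staging/publish directory):

# On the affected client node; PID, path, and node ID are placeholders
pgrep -af umount                               # find the hung umount and note its PID
sudo kill -9 12345                             # kill the hung umount
sudo umount -f /path/to/stuck/csi/mountpoint   # force the unmount
nomad volume detach zookeeper1-datalog node-id-1   # then retry the detach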

@konsti

konsti commented Aug 14, 2021

We have the very same issue with Nomad 1.1.3 and AWS EBS CSI Plugin v1.2.0.
We need to revert to host_volumes as this issue threatens our cluster health when important stateful services like Prometheus can't restart correctly.

The volume is currently unmounted in AWS and shows state = "Available", so it's clearly an internal Nomad state issue.

@gregory112
Author

Well, this is bad, as I don't think the issue is solved by the Nomad garbage collector. When a volume hangs on one Nomad node, Nomad will just allocate the job to another node. And when it hangs there, it will allocate the job to yet another node. Imagine having 100 Nomad nodes with volumes stuck on some of them.

@JanMa
Contributor

JanMa commented Aug 16, 2021

I am seeing the same issues using the gcp-compute-persistent-disk-csi-driver plugin and Nomad v1.1.1. Interestingly, the Read Volume API endpoint shows the correct number of writers, but the List Volumes endpoint shows too many CurrentWriters. I guess this is what's blocking the reassignment of the volume.
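
A quick way to compare the two views is to query both endpoints directly (the address and volume ID are placeholders, and the field names are my reading of the volumes API):

# Per-volume detail: claims are listed under ReadAllocs/WriteAllocs
curl -s http://localhost:4646/v1/volume/csi/prometheus-disk%5B0%5D | jq '.WriteAllocs'
# List stubs: the summary CurrentReaders/CurrentWriters counters
curl -s 'http://localhost:4646/v1/volumes?type=csi' | \
  jq '.[] | {ID, CurrentReaders, CurrentWriters}'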

@Thunderbottom

We've faced this issue as well. It seems like Nomad fails to realise that there are no allocations present for the volume.

From what we could make out, both AWS and the CSI plugin report that the volume is available to mount:

$ nomad volume status -verbose -namespace=workspace workspace
ID                   = workspace
Name                 = workspace
External ID          = vol-xxxxxxxxxxxxx
Plugin ID            = aws-ebs0
Provider             = ebs.csi.aws.com
Version              = v0.10.1
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 2
Nodes Expected       = 2
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = workspace

And the AWS console reports the volume is "Available".

Whereas in the Nomad UI, the "Storage" tab reports multiple #Allocs for the aforementioned volume, and clicking on the volume shows that there are neither read nor write mounts. This supports my hunch that Nomad somehow isn't aware that the volume has been detached by the driver and is not in use by any allocation. Here's a gist of logs from the CSI controller for the same volume:

I0816 06:26:49.482851       1 controller.go:329] ControllerUnpublishVolume: called with args {VolumeId:vol-XXXXXXXXXX NodeId:i-YYYYYYYYYYY Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0816 06:26:56.194278       1 controller.go:346] ControllerUnpublishVolume: volume vol-XXXXXXXXXX detached from node i-YYYYYYYYYYY

One thing I have noticed is that on a newly registered CSI volume, the metadata shows up as:

Access Mode          = <none>
Attachment Mode      = <none>

And on the volume that is reported as "Schedulable" (not really, in terms of Nomad), it shows up as:

Access Mode          = single-node-writer
Attachment Mode      = file-system

Could be pointing to something, perhaps?

The only solution that has worked for us so far is to -force deregister the volume and re-register it. Would love to have this issue solved. Let me know if I can help with any more information.
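
For completeness, that workaround roughly amounts to the following (the volume ID, namespace, and spec file are placeholders):

# Force-deregister the stuck volume, then register it again from its spec
nomad volume deregister -force -namespace=workspace workspace
nomad system gc                       # optionally clear garbage-collected state
nomad volume register workspace.hcl   # spec must keep the same external_id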

@lgfa29
Contributor

lgfa29 commented Aug 17, 2021

Thank you all for the detailed information (keep them coming if you have more!).

It seems like this issue happens sporadically when allocations that use a volume are restarted. I will try to create a high churn environment and see if I can reproduce it.

@lgfa29
Contributor

lgfa29 commented Aug 19, 2021

Hi everyone, just a quick update. I've been running a periodic job for a couple of days now, and so far I haven't seen any issues.

In hindsight I should've probably used a service job, since they are likely more common and they have somewhat different scheduling logic compared to batch.

For those who have seen this issue, do you have any kind of update block set? This could affect how new allocations are created and I want to make sure I have a good reproduction environment.

Thanks!

@JanMa
Contributor

JanMa commented Aug 19, 2021

Hey @lgfa29, you can try out the following job. It's a basic Prometheus deployment without any special update block.
I regularly have issues when trying to update it, for example to add more resources.

job "prometheus" {
  datacenters = ["dc1"]
  type = "service"

  group "monitoring" {
    count = 2

    constraint {
      operator = "distinct_hosts"
      value = "true"
    }

    volume "data" {
      type            = "csi"
      source          = "prometheus-disk"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
      per_alloc       = true
    }

    network {
      port "http" {
        static = 9090
      }
    }

    service {
      name = "prometheus2"
      tags = ["prometheus2"]
      task = "prometheus"
      port = "http"

      check {
        type = "http"
        port = "http"
        path = "/-/ready"
        interval = "10s"
        timeout = "5s"
      }
    }

    task "prometheus" {
      driver = "docker"
      user = "root"

      volume_mount {
        volume      = "data"
        destination = "/prometheus"
      }

      resources {
        memory = 1024
        cpu = 1024
      }

      template {
        data = <<EOT
---
global:
  scrape_interval: 10s
  external_labels:
    __replica__: "{{ env "NOMAD_ALLOC_ID" }}"
scrape_configs:
  - job_name: "prometheus"
    scrape_interval: 10s
    consul_sd_configs:
      - server: "{{ env "attr.unique.network.ip-address" }}:8500"
        services:
          - prometheus2
    relabel_configs:
      - source_labels: ["__meta_consul_node"]
        regex: "(.*)"
        target_label: "node"
        replacement: "$1"
      - source_labels: ["__meta_consul_service_id"]
        regex: "(.*)"
        target_label: "instance"
        replacement: "$1"
      - source_labels: ["__meta_consul_dc"]
        regex: "(.*)"
        target_label: "datacenter"
        replacement: "$1"
  # Nomad metrics
  - job_name: "nomad_metrics"
    consul_sd_configs:
      - server: "{{ env "attr.unique.network.ip-address" }}:8500"
        services: ["nomad-client", "nomad"]
    relabel_configs:
      - source_labels: ["__meta_consul_tags"]
        regex: "(.*)http(.*)"
        action: keep
      - source_labels: ["__meta_consul_service"]
        regex: "(.*)"
        target_label: "job"
        replacement: "$1"
      - source_labels: ["__meta_consul_node"]
        regex: "(.*)"
        target_label: "node"
        replacement: "$1"
      - source_labels: ["__meta_consul_service_id"]
        regex: "(.*)"
        target_label: "instance"
        replacement: "$1"
      - source_labels: ["__meta_consul_dc"]
        regex: "(.*)"
        target_label: "datacenter"
        replacement: "$1"
    scrape_interval: 5s
    metrics_path: /v1/metrics
    params:
      format: ["prometheus"]
  # Consul metrics
  - job_name: "consul_metrics"
    consul_sd_configs:
      - server: "{{ env "attr.unique.network.ip-address" }}:8500"
        services: ["consul-agent"]
    relabel_configs:
      - source_labels: ["__meta_consul_tags"]
        regex: "(.*)http(.*)"
        action: keep
      - source_labels: ["__meta_consul_service"]
        regex: "(.*)"
        target_label: "job"
        replacement: "$1"
      - source_labels: ["__meta_consul_node"]
        regex: "(.*)"
        target_label: "node"
        replacement: "$1"
      - source_labels: ["__meta_consul_service_id"]
        regex: "(.*)"
        target_label: "instance"
        replacement: "$1"
      - source_labels: ["__meta_consul_dc"]
        regex: "(.*)"
        target_label: "datacenter"
        replacement: "$1"
    scrape_interval: 5s
    metrics_path: /v1/agent/metrics
    params:
      format: ["prometheus"]
EOT
        destination = "local/prometheus.yml"
      }
      config {
        image = "quay.io/prometheus/prometheus"
        ports = ["http"]
        args = [
          "--config.file=${NOMAD_TASK_DIR}/prometheus.yml",
          "--log.level=info",
          "--storage.tsdb.retention.time=1d",
          "--storage.tsdb.path=/prometheus",
          "--web.console.libraries=/usr/share/prometheus/console_libraries",
          "--web.console.templates=/usr/share/prometheus/consoles"
        ]
      }
    }
  }
}

@gregory112
Author

gregory112 commented Aug 19, 2021

For those who have seen this issue, do you have any kind of update block set? This could affect how new allocations are created and I want to make sure I have a good reproduction environment.

Mine has.

  update {
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "15m"
    progress_deadline = "20m"
  }

Also, you can try to fail the job, for example with an image pull failure or a run failure that makes the deployment exceed its deadline. This is what really makes the error show up frequently. I also have a service block set up to provide health checks.

Yes, please try with a service job. What made the error show up in my case is task failure, and stopping and replanning the job. Replanning without stopping can sometimes cause the issue to appear too, for example when updating the image version used by the job.

Also, I have two master nodes. I don't know if this contributes, but does anyone else seeing this also use two master nodes, or maybe more? Maybe a race condition between master nodes? Split brain, perhaps?

Also, here's mine

variables {
  mycluster_zookeeper_image = "zookeeper:3.6"
}

job "mycluster-zookeeper" {
  region = "global"
  datacenters = ["dc1"]
  type = "service"

  update {
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "15m"
    progress_deadline = "20m"
  }

  group "zookeeper1" {
    count = 1

    restart {
      interval = "10m"
      attempts = 2
      delay    = "10s"
      mode     = "fail"
    }

    network {
      mode = "cni/weave"
    }

    service {
      name = "mycluster-zookeeper1-peer"
      port = "2888"
      address_mode = "alloc"
    }

    service {
      name = "mycluster-zookeeper1-leader"
      port = "3888"
      address_mode = "alloc"
    }

    service {
      name = "mycluster-zookeeper1-client"
      port = "2181"
      address_mode = "alloc"
    }

    volume "data" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper1-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper1-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "zookeeper" {
      driver = "docker"

      volume_mount {
        volume = "data"
        destination = "/data"
        read_only = false
      }
      volume_mount {
        volume = "datalog"
        destination = "/datalog"
        read_only = false
      }

      config {
        image = var.mycluster_zookeeper_image
        image_pull_timeout = "10m"
      }

      template {
        destination = "local/zookeeper.env"
        data = <<EOH
ZOO_SERVERS=server.1:0.0.0.0:2888:3888;2181 server.2:{{ with service "mycluster-zookeeper2-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.3:{{ with service "mycluster-zookeeper3-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181
EOH
        env = true
      }

      env {
        ZOO_MY_ID = "1"
        ZOO_CFG_EXTRA = "minSessionTimeout=10000"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }
  }

  group "zookeeper2" {
    count = 1

    restart {
      interval = "10m"
      attempts = 2
      delay    = "10s"
      mode     = "fail"
    }

    network {
      mode = "cni/weave"
    }

    service {
      name = "mycluster-zookeeper2-peer"
      port = "2888"
      address_mode = "alloc"
    }

    service {
      name = "mycluster-zookeeper2-leader"
      port = "3888"
      address_mode = "alloc"
    }

    service {
      name = "mycluster-zookeeper2-client"
      port = "2181"
      address_mode = "alloc"
    }

    volume "data" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper2-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper2-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "zookeeper" {
      driver = "docker"

      volume_mount {
        volume = "data"
        destination = "/data"
        read_only = false
      }
      volume_mount {
        volume = "datalog"
        destination = "/datalog"
        read_only = false
      }

      config {
        image = var.mycluster_zookeeper_image
        image_pull_timeout = "10m"
      }

      template {
        destination = "local/zookeeper.env"
        data = <<EOH
ZOO_SERVERS=server.1:{{ with service "mycluster-zookeeper1-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.2:0.0.0.0:2888:3888;2181 server.3:{{ with service "mycluster-zookeeper3-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181
EOH
        env = true
      }

      env {
        ZOO_MY_ID = "2"
        ZOO_CFG_EXTRA = "minSessionTimeout=10000"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }
  }

  group "zookeeper3" {
    count = 1

    restart {
      interval = "10m"
      attempts = 2
      delay    = "10s"
      mode     = "fail"
    }

    network {
      mode = "cni/weave"
    }

    service {
      name = "mycluster-zookeeper3-peer"
      port = "2888"
      address_mode = "alloc"

      #connect {
      #  sidecar_service {
      #    disable_default_tcp_check = true
      #
      #    proxy {
      #      upstreams {
      #        destination_name = "mycluster-zookeeper1-peer"
      #        local_bind_port = "12888"
      #      }
      #      upstreams {
      #        destination_name = "mycluster-zookeeper2-peer"
      #        local_bind_port = "22888"
      #      }
      #    }
      #  }
      #}
    }

    service {
      name = "mycluster-zookeeper3-leader"
      port = "3888"
      address_mode = "alloc"

      #connect {
      #  sidecar_service {
      #    disable_default_tcp_check = true
      #
      #    proxy {
      #      upstreams {
      #        destination_name = "mycluster-zookeeper1-leader"
      #        local_bind_port = "13888"
      #      }
      #      upstreams {
      #        destination_name = "mycluster-zookeeper2-leader"
      #        local_bind_port = "23888"
      #      }
      #    }
      #  }
      #}
    }

    service {
      name = "mycluster-zookeeper3-client"
      port = "2181"
      address_mode = "alloc"

      #connect {
      #  sidecar_service {
      #    disable_default_tcp_check = true
      #    #proxy {
      #    #  upstreams {
      #    #    destination_name = "mycluster-zookeeper1-client"
      #    #    local_bind_port = "12181"
      #    #  }
      #    #  upstreams {
      #    #    destination_name = "mycluster-zookeeper2-client"
      #    #    local_bind_port = "22181"
      #    #  }
      #    #  upstreams {
      #    #    destination_name = "mycluster-zookeeper3-client"
      #    #    local_bind_port = "32181"
      #    #  }
      #    #}
      #  }
      #}
    }

    volume "data" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper3-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper3-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "zookeeper" {
      driver = "docker"

      volume_mount {
        volume = "data"
        destination = "/data"
        read_only = false
      }
      volume_mount {
        volume = "datalog"
        destination = "/datalog"
        read_only = false
      }

      config {
        image = var.mycluster_zookeeper_image
        image_pull_timeout = "10m"
      }

      template {
        destination = "local/zookeeper.env"
        data = <<EOH
ZOO_SERVERS=server.1:{{ with service "mycluster-zookeeper1-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.2:{{ with service "mycluster-zookeeper2-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.3:0.0.0.0:2888:3888;2181
EOH
        env = true
      }

      env {
        ZOO_MY_ID = "3"
        ZOO_CFG_EXTRA = "minSessionTimeout=10000"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }
  }
}

And here is the HCL file I used to create all the volumes with nomad volume create:

id = "mycluster-zookeeper1-data"
name = "zookeeper1-data"
type = "csi"
plugin_id = "ceph-csi"
capacity_min = "1G"
capacity_max = "10G"

capability {
  access_mode = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
  mount_flags = ["noatime"]
}

secrets {
  adminID = "redacted"
  adminKey = "redacted"
}

parameters {
  clusterID = "redacted"
  fsName = "myfsname"
  mounter = "kernel"
}

@lgfa29
Contributor

lgfa29 commented Aug 19, 2021

Thanks for the sample jobs @JanMa and @gregory112.

It seems like update is not part of the problem, so I created this job that has a 10% chance of failing. I will leave it running and see if the issue is triggered.

job "random-fail" {
  datacenters = ["dc1"]
  type        = "service"

  group "random-fail" {
    volume "ebs-vol" {
      type            = "csi"
      read_only       = false
      source          = "ebs-vol"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "random-fail" {
      driver = "docker"

      config {
        image   = "alpine:3.14"
        command = "/bin/ash"
        args    = ["/local/script.sh"]
      }

      template {
        data = <<EOF
#!/usr/bin/env bash

while true;
do
  echo "Rolling the dice..."

  n=$(($RANDOM % 10))
  echo "Got ${n}!"

  if  [[ 0 -eq ${n} ]];
  then
    echo "Bye :wave:"
    exit 1;
  fi

  echo "'Til the next round."
  sleep 10;
done
EOF

        destination = "local/script.sh"
      }

      volume_mount {
        volume      = "ebs-vol"
        destination = "/volume"
        read_only   = false
      }
    }
  }
}

@raven-oscar

For some reason Nomad thinks the volume is still in use while it is not. nomad volume deregister returns "Error deregistering volume: Unexpected response code: 500 (rpc error: volume in use: nessus)". nomad volume deregister -force followed by nomad system gc and then registering it again seems to help.

@lgfa29
Contributor

lgfa29 commented Aug 24, 2021

I think I was able to reproduce this with a self-updating job:

job "update" {
  datacenters = ["dc1"]
  type        = "service"

  group "update" {
    volume "ebs-vol" {
      type            = "csi"
      read_only       = false
      source          = "ebs-vol2"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "update" {
      driver = "docker"

      config {
        image   = "alpine:3.14"
        command = "/bin/sh"
        args    = ["/local/script.sh"]
      }

      template {
        data = <<EOF
#!/usr/bin/env bash

apk add curl jq

while true;
do
  sleep 10
  jobspec=$(curl http://172.17.0.1:4646/v1/job/${NOMAD_JOB_ID})
  cpu=$(echo $jobspec | jq '.TaskGroups[0].Tasks[0].Resources.CPU')
  if [ $cpu -eq 500 ]
  then
    cpu=600
  else
    cpu=500
  fi
  new_jobspec=$(echo $jobspec | jq ".TaskGroups[0].Tasks[0].Resources.CPU = ${cpu}")
  echo $new_jobspec | jq '{"Job":.}' | curl -H "Content-Type: application/json" -X POST --data @- http://172.17.0.1:4646/v1/job/${NOMAD_JOB_ID}
done
EOF

        destination = "local/script.sh"
      }

      volume_mount {
        volume      = "ebs-vol"
        destination = "/volume"
        read_only   = false
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

This job will usually fail first, and then work.
[screenshot: deployment fails first, then succeeds]

It's not exactly the same error message, but maybe it's related:

failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: CSI.ControllerAttachVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = Could not attach volume "vol-0fa6e88e4c3b55d29" to node "i-02396ab34ed318944": attachment of disk "vol-0fa6e88e4c3b55d29" failed, expected device to be attached but was detaching

Looking at the volume details page, I see the previous allocation still listed under Write Allocations for a while, even after it's marked as Complete. I don't know if this is related, since another alloc is able to start, but it may be a pointer in the right direction.
[screenshot: volume details page with the completed allocation still listed under Write Allocations]

@lgfa29 added the stage/accepted label and removed stage/needs-investigation on Aug 24, 2021
@NathanFlurry

This nasty workaround seems to be working for DigitalOcean. If your task restarts frequently, it will spam your cluster with jobs, so be careful with that.

your-job.nomad.tpl:

# ...
		task "reregister_volume" {
			lifecycle {
				hook = "poststop"
				sidecar = false
			}

			driver = "docker"

			config {
				image = "alpine:3.14"
				entrypoint = [
					"/bin/sh",
					"-eufc",
					<<-EOF
					apk add curl

					curl --fail --data '@-' -X POST \
						"http://$${attr.unique.network.ip-address}:4646/v1/job/reregister-volume/dispatch" <<-EndOfData
						{
							"Meta": {
								"csi_id": "${csi_id_prefix}[$${NOMAD_ALLOC_INDEX}]",
								"csi_id_uri_component": "${csi_id_prefix}%5B$${NOMAD_ALLOC_INDEX}%5D",
								"volume_name": "${volume_name_prefix}$${NOMAD_ALLOC_INDEX}",
								"volume_plugin_id": "${volume_plugin_id}"
							}
						}
						EndOfData
					EOF
				]
			}
		}
# ...

register-volume.nomad.tpl:

job "reregister-volume" {
	type = "batch"
	
	parameterized {
		payload = "forbidden"
		meta_required = ["csi_id", "csi_id_uri_component", "volume_name", "volume_plugin_id"]
	}

	group "reregister" {
		task "reregister" {
			driver = "docker"

			config {
				image = "alpine:3.14"
				entrypoint = [
					"/bin/sh",
					"-eufc",
					<<-EOF
					sleep 5 # Wait for job to stop

					apk add jq curl

					echo "CSI_ID=$${NOMAD_META_CSI_ID}"
					echo "CSI_ID_URI_COMPONENT=$${NOMAD_META_CSI_ID_URI_COMPONENT}"
					echo "VOLUME_NAME=$${NOMAD_META_VOLUME_NAME}"
					echo "VOLUME_PLUGIN_ID=$${NOMAD_META_VOLUME_PLUGIN_ID}"

					n=0
					until [ "$n" -ge 15 ]; do
						echo "> Checking if volume exists (attempt $n)"
						curl --fail -X GET \
							"http://$${attr.unique.network.ip-address}:4646/v1/volumes?type=csi" \
							| jq -e '. | map(.ID == "$${NOMAD_META_CSI_ID}") | any | not' && break
						n=$((n+1))

						sleep 1
						echo
						echo '> Force detaching volume'
						curl --fail -X DELETE \
							"http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}?force=true" \
							|| echo '  Detaching failed'
					done
					if [ "$n" -ge 15 ]; then
						echo '  Deregister failed too many times, giving up'
						exit 0
					else
						echo '  Deregister complete'
					fi

					echo
					echo '> Fetching external volume ID'
					VOLUME_JSON=$(
						curl --fail -X GET \
							-H "Authorization: Bearer ${digitalocean_token_persistent}" \
							"https://api.digitalocean.com/v2/volumes?name=$${NOMAD_META_VOLUME_NAME}" \
							| jq '.volumes[0]'
					)
					VOLUME_ID=$(
						echo "$VOLUME_JSON" | jq -r '.id'
					)
					VOLUME_REGION=$(
						echo "$VOLUME_JSON" | jq -r '.region.slug'
					)
					VOLUME_DROPLET_ID=$(
						echo "$VOLUME_JSON" | jq -r '.droplet_ids[0] // empty'
					)
					echo "VOLUME_ID=$VOLUME_ID"
					echo "VOLUME_REGION=$VOLUME_ID"
					echo "VOLUME_DROPLET_ID=$VOLUME_DROPLET_ID"

					if [ ! -z "$VOLUME_DROPLET_ID" ]; then
						echo
						echo '> Detaching volume on DigitalOcean'
						curl --fail -X POST \
							-H "Authorization: Bearer ${digitalocean_token_persistent}" \
							-d "{\"type\": \"detach\", \"droplet_id\": \"$VOLUME_DROPLET_ID\", \"region\": \"$VOLUME_REGION\"}" \
							"https://api.digitalocean.com/v2/volumes/$VOLUME_ID/actions"
					fi

					echo
					echo '> Re-registering volume'
					curl --fail --data '@-' -X PUT \
						"http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}" <<-EndOfVolume
						{
							"Volumes": [
								{
									"ID": "$${NOMAD_META_CSI_ID}",
									"Name": "$${NOMAD_META_VOLUME_NAME}",
									"ExternalID": "$VOLUME_ID",
									"PluginID": "$${NOMAD_META_VOLUME_PLUGIN_ID}",
									"RequestedCapabilities": [{
										"AccessMode": "single-node-writer",
										"AttachmentMode": "file-system"
									}]
								}
							]
						}
						EndOfVolume

					echo
					echo '> Reading volume'
					curl --fail -X GET \
						"http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}"

					echo
					echo 'Finished'
					EOF
				]
			}
		}
	}
}

@latifrons

Same issue on my cluster with the EFS plugin (docker: amazon/aws-efs-csi-driver:v1.3.3).
It usually happens after a restart of some node. No Write Allocations and no Read Allocations, yet it fails to mount because the CSI volume has exhausted its available writer claims.

One thing worth mentioning: right after force-deregistering the volume, my Node Health recovered from (4/5) to (5/5). The EFS plugin could not be deployed to that node before deregistering.

@aroundthfur

aroundthfur commented Sep 1, 2021

[Nomad v1.1.3]

We have the exact same issue with the Linstor CSI driver (https://linbit.com/linstor/).
Our use case is to be able to fail over a job when one server or a DC dies. We are still in the implementation phase, so we were intentionally killing servers where Nomad jobs are running to reproduce the fail-over use cases.

Now, it works a few times and Nomad does a proper job, but after a random number of re-schedules it fails with the same error message and the same behavior described in this issue. In the Nomad UI volume list the volume has an allocation set, but when we go into the detailed view there are no allocs present. Basically the same behavior that @Thunderbottom reported.

@MagicRB

MagicRB commented Sep 12, 2021

Yep, I'm running a really basic setup with just one server/client and one client, and this happens to me too. I need to -force deregister the volume and then re-register it to fix it.

@latifrons

Just a kind reminder for those who are using AWS EFS:
Make sure you are using the suggested mount options given by AWS.
https://docs.aws.amazon.com/efs/latest/ug/efs-mount-helper.html

especially the noresvport flag.

Otherwise it will cause indefinite I/O wait (100% wa in top) once the TCP connection breaks.

We didn't observe any more problems after correcting the mount options.

Same issue on my cluster with EFS plugin (docker: amazon/aws-efs-csi-driver:v1.3.3)
It usually happens after a restart of some node. No Write Allocations, No Read Allocations, however failed to mount because CSI volume has exhausted its available writer claims.

One thing needs to be mentioned is that, right after force deregistering the volume, my Node Health recovered from (4/5) to (5/5). EFS plugin cannot be deployed to that node before deregistering.

@tomiles

tomiles commented Sep 15, 2021

Same issue on our v1.1.4 cluster using the Ceph RBD and CephFS CSI plugins.

@tgross
Member

tgross commented Jan 25, 2022

Ok, the race described above turns out to be pretty straightforward. In the client's (*csiHook) Postrun() method, we make an unpublish RPC that includes a claim in the CSIVolumeClaimStateUnpublishing state and using the mode from the client. But then in the (*CSIVolume) Unpublish RPC handler, we query the volume from the state store (because we only get an ID from the client). And when we make the client RPC for the node unpublish step, we use the current volume's view of the mode (ref nodeUnpublishVolumeImpl). If the volume's mode has been changed before the old allocations can have their claims released, then we end up making a CSI RPC that will never succeed.

Why does this code path get the mode from the volume and not the claim? Because the claim written by the GC job in (*CoreScheduler) csiVolumeClaimGC doesn't have a mode. Instead it just writes a claim in the unpublishing state to ensure the volumewatcher detects a "past claim" change and reaps all the claims on the volumes.

  • With the patch in CSI: resolve invalid claim states #11890 we should be able to safely ignore the case of the GC claim because all the past claims will be correct, so it should be safe to update nodeUnpublishVolumeImpl to always use the claims mode (maybe falling back to the volume's mode if unset?).
  • We'll also fix most cases of this via CSI: node unmount from the client before unpublish RPC #11892 because we'll drive the node unpublish step from the client which has the "real" view of what mode the volume was actually mounted with.
  • I'll take a quick pass at auditing other uses of vol.AccessMode to ensure we're not making this mistake elsewhere.

@tgross
Member

tgross commented Jan 25, 2022

@tgross
Member

tgross commented Jan 26, 2022

I've done some testing of #11892 and that's in good shape as well, and I've walked some team mates through the rest of the PRs so that we can get them reviewed and merged one-by-one. Unfortunately I've run into a non-CSI issue around the stop_after_client_disconnect flag #11943 which is going to impact a lot of CSI use cases (we created that flag specifically for stateful workloads). But we should be able to ship an incremental patch release with a bunch of the work done so far. Stay tuned for updates on when that may land as the PRs get merged.

@tgross
Member

tgross commented Jan 28, 2022

The 4 patches for this issue (which also covers #10052 and #10833 and #8734) have been merged to main. We're working up a plan for a patch release shortly, at which point I'll close out this issue. Note this won't yet fix the issues in #8609 but it does make that an automatically recoverable situation. I'm going to be focusing heavily on knocking out the rest of the major CSI issues over the next month or two after we've shipped this.

Thanks again for your patience, folks.

@MagicRB

MagicRB commented Jan 28, 2022

We thank you, dude! I can finally recommend Nomad to people without a BUT; I've always wanted to be able to do that.

@RickyGrassmuck
Contributor

This is awesome work @tgross! Highly appreciate the transparency and the depth at which you kept the community updated, this thread should become the gold standard for Organization <--> Community interactions!

@tgross
Member

tgross commented Feb 1, 2022

Nomad 1.2.5 has shipped with the patches described here. We still have a few important CSI issues to close out before we can call CSI GA-ready, but it should be safe to close this issue now. Please let us know if you run into the issue with the current versions of Nomad. Thanks!

@m1keil

m1keil commented Mar 16, 2022

@tgross I recently upgraded to v1.2.6 and I'm still hitting the issue.

After a few days of the job running, attempting to deploy ends up with the dreaded "CSI volume has exhausted its available writer claims and is claimed by a garbage collected allocation" error, followed by the nomad volume deregister -force ritual.

@tgross
Member

tgross commented Mar 16, 2022

@m1keil the problem isn't entirely eliminated until 1.3.0 (which will get backported). #12113 should cover the remaining cases.

@iSchluff

iSchluff commented Apr 25, 2022

@tgross I have still managed to reproduce this with v1.3.0-beta.1.
I think this could be related to a firewall issue I had. The node plugins couldn't reach the storage backend anymore, so the mount was still around but the task allocation was long gone. What's the expected behavior here? Maybe the volume allocation should be released after a long timeout has expired?

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
registry    1       0         0        0       0         0     0

Placement Failure
Task Group "registry":
  * Constraint "CSI volume ceph-registry has exhausted its available writer claims and is claimed by a garbage collected allocation 45c884a0-e629-b180-eb44-c3c7a5d5e7b7; waiting for claim to be released": 3 nodes excluded by filter

Latest Deployment
ID          = 687ebec8
Status      = running
Description = Deployment is running
# nomad volume status ceph-registry
ID                   = ceph-registry
Name                 = ceph-registry
External ID          = 0001-0024-22b9524c-4a2d-4a56-9f3d-221ee05905a2-0000000000000001-af535f2b-9f9a-11ec-bf1c-0242ac110003
Plugin ID            = ceph-csi
Provider             = cephfs.csi.ceph.com
Version              = v3.6.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 4
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

@tgross
Member

tgross commented Apr 25, 2022

The node plugins couldn't reach the storage backend anymore and so the mount was still around but the task allocation was long gone. Whats the expected behavior here?

A bit more detail would be helpful. Which task allocation do you mean? The plugin or the allocation that mounted the volume? I would not have expected the allocation that mounted the volume to be actually gone because we block for a very long time retrying the unmount unless the client node was also shut down as well.

Did the node plugin for that node get restored to service? If so, the expectation is that the volumewatcher loop will eventually be able to detach it. Or is the client host itself gone too?

Maybe the volume allocation should be released after a long timeout has expired?

Unfortunately many storage providers will return errors if we do this. For example, imagine an AWS EBS volume attached to an EC2 instance where the Nomad client is shut down. We can never physically detach the volume until the client is restored, so all attempts to attach it elsewhere will fail unrecoverably. That's why we provide a nomad volume detach and nomad volume delete command, so that the operator can signal that the volume was detached out-of-band.
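
For example, something like (the volume and node IDs are placeholders):

# Operator signals that the volume was handled out-of-band
nomad volume detach my-volume my-node-id   # release the node claim
nomad volume delete my-volume              # or remove the volume and its backing storage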

@iSchluff

iSchluff commented Apr 25, 2022

@tgross I mean the allocation that mounted the volume. It was gone because of a node restart, and the rescheduled task had remained in the stuck state since Thursday, so approximately 4 days.

During the whole time the plugin state was shown as healthy in the UI, so Nomad was not detecting that there was no actual connection to the backend. I guess ceph-csi just handles all Probe and GetCapabilities calls locally.

The network issue was eventually fixed, but the volume remained stuck.

That's why we provide a nomad volume detach and nomad volume delete command, so that the operator can signal that the volume was detached out-of-band.

I haven't yet had much success with detach, because for me it normally just blocks and doesn't give out any information. For a case such as the one I described, it is also hard to know which node the volume was previously attached to after the allocation has been GCed. Would it even work if the node has been restarted?

But yes, if such a case can't be handled automatically, it would be good if a volume could be unconditionally detached by the operator, even if the node has been restarted or is gone altogether.
In my case I needed to deregister and re-register the volume again.

Also, the deregister/re-register dance is impossible if the volume has been created with Terraform, as the volume will be deleted automatically if the resource is removed.

@tgross
Member

tgross commented Apr 25, 2022

I mean the allocation that mounted the volume. It was gone because of a node restart and the rescheduled task remained in the stuck state since thursday, so approx. 4 days.

During the whole time the plugin state was shown as healthy in the ui, so nomad was not detecting that there was no actual connection to the backend. I guess ceph-csi just handles all Probe and GetCapabilities calls locally.

Ok, so that would also have restarted the Node plugin tasks as well. When the node restarted, did the Node plugin tasks come back? The UI is maybe a little misleading here -- it shows the plugin as healthy if we can schedule volumes with it, not when 100% of plugins are healthy (otherwise we'd block volume operations during plugin deployments).

As for being stuck for 4 days... the evaluation for the rescheduled task hits a retry limit fairly quickly, so I suspect it's not "stuck" so much as "gave up." Once we've got the claim taken care of, re-running the job (or using nomad job eval) should force a retry. But let's figure out the claim problem first. 😀
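
For example (assuming the job is named "registry", as in the status output above):

# Force a new evaluation once the claim has been released
nomad job eval -force-reschedule registry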

I haven't yet had much success with detach, because for me it normally just blocks and doesn't give out any information.

It blocks because it's a synchronous operation; it has to make RPC calls to the Node plugin and to the Controller plugin. You should see logs on the Node plugin (first) and then the Controller plugin (second). If both the Node plugin and Controller plugin are live, that should work and you should be able to look at logs in both plugins (and likely the Nomad leader as well) to see what's going on there.

For a case such as the one I described it is also hard to know which node the volume was attached to previously after the allocation has been gced. Would it even work if the node has been restarted?

So long as it restarted and has a running Node plugin, yes that should still work just fine. But fair point about knowing which node the volume was attached to previously.

@tgross
Member

tgross commented Apr 25, 2022

(hit submit too soon 😀 )

But yes, if such a case can't be handled automatically it would be good if a volume could be unconditionally detached by the operator even if the node has been restarted or is gone altogether.

It's not possible for us to unconditionally detach but in theory we could unconditionally free a claim. That's what deregister -force is designed to do, but yeah I realize that's not going to work out so well with Terraform.

@github-actions

github-actions bot commented Oct 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Oct 8, 2022