
volume-attach-limit argument doesn't work in 1.20 #1174

Closed
sultanovich opened this issue Feb 21, 2022 · 9 comments

@sultanovich

sultanovich commented Feb 21, 2022

/triage support

What happened?

Kubernetes keeps trying to attach volumes to a node even after it reaches the limit allowed by the AWS instance type, so I tried using the volume-attach-limit argument as a workaround while troubleshooting.

How to reproduce it (as minimally and precisely as possible)?

It can be reproduced by setting the argument and then creating more volumes than the configured maximum, as shown in the following example.

Name:                   ebs-csi-controller
Namespace:              kube-system
CreationTimestamp:      Mon, 07 Jun 2021 06:31:13 +0000
Labels:                 app.kubernetes.io/name=aws-ebs-csi-driver
Annotations:            deployment.kubernetes.io/revision: 8
Selector:               app=ebs-csi-controller,app.kubernetes.io/name=aws-ebs-csi-driver
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=ebs-csi-controller
                    app.kubernetes.io/name=aws-ebs-csi-driver
  Service Account:  ebs-csi-controller-sa
  Containers:
   ebs-plugin:
    Image:      k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.0.0
    Port:       9808/TCP
    Host Port:  0/TCP
    Args:
      --endpoint=$(CSI_ENDPOINT)
      --logtostderr
      --v=2
      --k8s-tag-cluster-id=cloud-dev-cluster-mix
      --volume-attach-limit=10
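
As an additional check (a rough sketch; <node-name> is a placeholder for one of the worker nodes), the limit the driver actually advertises to the scheduler can be read from the allocatable count on the node's CSINode object:

kubectl get csinode <node-name> -o yaml | grep -A 2 allocatable
# expected output when the flag is honored:
#   - allocatable:
#       count: 10
#     name: ebs.csi.aws.com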

Environment

Kubernetes Version:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.7", GitCommit:"bfb38f707bc4a8edfcd73472ec3d96b500b8b781", GitTreeState:"clean", BuildDate:"2020-08-12T20:27:48Z", GoVersion:"go1.13.14", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

CSI-EBS Driver Version:

[sulta@dev [~] 14:08:21 ~] $ kubectl --kubeconfig=/home/centos/dev.kube -n kube-system describe deployments ebs-csi-controller | grep Image:
    Image:      k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.0.0
    Image:      k8s.gcr.io/sig-storage/csi-provisioner:v2.1.1
    Image:      k8s.gcr.io/sig-storage/csi-attacher:v3.1.0
    Image:      k8s.gcr.io/sig-storage/csi-snapshotter:v3.0.3
    Image:      k8s.gcr.io/sig-storage/csi-resizer:v1.0.0
    Image:      k8s.gcr.io/sig-storage/livenessprobe:v2.2.0
[sulta@dev [~] 14:08:27 ~] $

I previously asked about this in the Slack channel without finding a solution:
https://kubernetes.slack.com/archives/C09NXKJKA/p1645218261237229

@k8s-ci-robot
Contributor

@sultanovich: The label(s) triage/support cannot be applied, because the repository doesn't have them.

In response to this:

/triage support


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sultanovich
Author

sultanovich commented Feb 23, 2022

/kind support

I kept checking the repository and verified that there is an example indicating that the volume-attach-limit argument should be placed in the ebs-plugin container section, which is what I tried to do in my tests.

      containers:
        - name: ebs-plugin
          securityContext:
            privileged: true
          image: {{ printf "%s:%s" .Values.image.repository (default (printf "v%s" .Chart.AppVersion) (toString .Values.image.tag)) }}
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          args:
            - node
            - --endpoint=$(CSI_ENDPOINT)
            {{- with .Values.node.volumeAttachLimit }}
            - --volume-attach-limit={{ . }}
            {{- end }}
            - --logtostderr
            - --v={{ .Values.node.logLevel }}
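
Going by that template, the value should be wired through the chart as node.volumeAttachLimit. A sketch of how it would be set (the release and repository names below are placeholders, and I have not verified this against every chart version):

# via a values file
node:
  volumeAttachLimit: 10

# or directly on the command line
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set node.volumeAttachLimit=10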

Any idea why this argument didn't work?

@sultanovich
Author

I continued testing and was able to validate that the volume-attach-limit argument now works correctly. The configuration must be applied to the ebs-csi-node DaemonSet, not to the ebs-csi-controller Deployment.


After editing the DaemonSet, the configuration works correctly:

[Sultanovich@Dev [~] 18:49:21 ~] $ ./count_vol.sh 
ip-10-2-123-153.us-east-2.compute.internal 0
ip-10-3-67-58.us-east-2.compute.internal 6
[Sultanovich@Dev [~] 18:49:38 ~] $ 



[Sultanovich@Dev [~] 18:51:10 ~] $ kubectl-d get daemonset ebs-csi-node -n kube-system -o yaml | egrep 'volume-attach-limit|name: ebs-plugin'
        - --volume-attach-limit=16
        name: ebs-plugin
[Sultanovich@Dev [~] 18:51:15 ~] $ 
[Sultanovich@Dev [~] 18:51:17 ~] $ kubectl-d -n vol-limit-test get pvc  | wc -l
No resources found in vol-limit-test namespace.
0
[Sultanovich@Dev [~] 18:51:50 ~] $  
[Sultanovich@Dev [~] 18:52:42 ~] $ kubectl-d -n vol-limit-test apply -f many_pods.yaml
service/wordpress-mysql created
persistentvolumeclaim/vol-limit-test created
deployment.apps/wordpress-mysql created
pod/mypod created
persistentvolumeclaim/vol-limit-test-1 created
pod/vol-limit-test-2 created
service/wordpress created
persistentvolumeclaim/wp-pv-claim created
deployment.apps/wordpress created
persistentvolumeclaim/vol-limit-test-2 created
pod/vol-limit-test-3 created
persistentvolumeclaim/vol-limit-test-3 created
pod/vol-limit-test-4 created
persistentvolumeclaim/vol-limit-test-4 created
pod/vol-limit-test-5 created
persistentvolumeclaim/vol-limit-test-5 created
pod/vol-limit-test-6 created
persistentvolumeclaim/vol-limit-test-6 created
pod/vol-limit-test-7 created
persistentvolumeclaim/vol-limit-test-7 created
pod/vol-limit-test-8 created
persistentvolumeclaim/vol-limit-test-8 created
pod/vol-limit-test-9 created
persistentvolumeclaim/vol-limit-test-9 created
pod/vol-limit-test-70 created
persistentvolumeclaim/vol-limit-test-70 created
pod/vol-limit-test-80 created
persistentvolumeclaim/vol-limit-test-80 created
pod/vol-limit-test-90 created
persistentvolumeclaim/vol-limit-test-90 created
pod/vol-limit-test-700 created
persistentvolumeclaim/vol-limit-test-700 created
pod/vol-limit-test-800 created
persistentvolumeclaim/vol-limit-test-800 created
pod/vol-limit-test-900 created
persistentvolumeclaim/vol-limit-test-900 created
pod/vol-limit-test-701 created
persistentvolumeclaim/vol-limit-test-701 created
pod/vol-limit-test-801 created
persistentvolumeclaim/vol-limit-test-801 created
pod/vol-limit-test-901 created
persistentvolumeclaim/vol-limit-test-901 created
pod/vol-limit-test-7000 created
persistentvolumeclaim/vol-limit-test-7000 created
pod/vol-limit-test-8000 created
persistentvolumeclaim/vol-limit-test-8000 created
pod/vol-limit-test-9000 created
persistentvolumeclaim/vol-limit-test-9000 created
pod/vol-limit-test-800000 created
persistentvolumeclaim/vol-limit-test-800000 created
pod/vol-limit-test-90000 created
persistentvolumeclaim/vol-limit-test-90000 created
pod/vol-limit-test-700000 created
persistentvolumeclaim/vol-limit-test-700000 created
[Sultanovich@Dev [~] 18:52:59 ~] $ 
[Sultanovich@Dev [~] 18:53:04 ~] $ kubectl-d -n vol-limit-test get pvc  | wc -l
27
[Sultanovich@Dev [~] 18:53:08 ~] $ 
[Sultanovich@Dev [~] 18:53:10 ~] $ 
[Sultanovich@Dev [~] 18:53:37 ~] $ ./count_vol.sh 
ip-10-2-123-153.us-east-2.compute.internal 16
ip-10-3-67-58.us-east-2.compute.internal 14

[Sultanovich@Dev [~] 18:56:02 ~] $ kubectl-d get daemonset ebs-csi-node -n kube-system -o yaml | egrep 'volume-attach-limit|name: ebs-plugin'
        - --volume-attach-limit=16
        name: ebs-plugin
[Sultanovich@Dev [~] 18:56:05 ~] $ 
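
For reference, count_vol.sh is only a small helper; something roughly equivalent (an approximation, since the script itself is not included here) would be:

#!/usr/bin/env bash
# Print the number of attached volumes reported in each node's status.
kubectl get nodes -o json \
  | jq -r '.items[] | "\(.metadata.name) \((.status.volumesAttached // []) | length)"'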

@aglees

aglees commented Feb 24, 2022

We're trying to reduce the --volume-attach-limit via the Helm chart without much success: we're setting a deliberately low number (17), so our setup is very similar to yours.

One thing we do have is pods with more than one volume; I wonder whether that is a factor?

aws-ebs-csi-driver:v1.5.0

I'm using a jq query to report results, and I'm seeing many nodes with far more volumes in use than the configured limit.

kubectl get nodes -o json | jq '.items[] | {"nodeName": .metadata.name, "zone": .metadata.labels."topology.kubernetes.io/zone", "volumesInUse": .status.volumesInUse | length, "volumesAttached": .status.volumesAttached | length }'

For example, this r5.2xlarge node:

{
  "nodeName": "ip-10-229-90-163.eu-west-1.compute.internal",
  "zone": "eu-west-1b",
  "volumesInUse": 26,
  "volumesAttached": 26,
}
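
A useful companion check (a sketch; it assumes the driver registers itself as ebs.csi.aws.com) is to compare those numbers against the limit each node's CSINode object advertises, which makes it easier to spot nodes like this one, where 26 volumes are attached against a configured limit of 17:

kubectl get csinodes -o json | jq '.items[] | {nodeName: .metadata.name, ebsAllocatable: (.spec.drivers[] | select(.name == "ebs.csi.aws.com") | .allocatable.count)}'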

From kubectl get pods -o yaml for the ebs-csi-node pod on that node:

containers:
  - args:
    - node
    - --endpoint=$(CSI_ENDPOINT)
    - --volume-attach-limit=17
    - --logtostderr
    - --v=2
    env:
    - name: CSI_ENDPOINT
      value: unix:/csi/csi.sock
    - name: CSI_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: {REDACTED}/k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.5.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /healthz
        port: healthz
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 3
    name: ebs-plugin
    ports:
    - containerPort: 9808
      name: healthz
      protocol: TCP
    resources:
      limits:
        cpu: 100m
        memory: 512Mi
      requests:
        cpu: 20m
        memory: 64Mi
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/kubelet
      mountPropagation: Bidirectional
      name: kubelet-dir
    - mountPath: /csi
      name: plugin-dir
    - mountPath: /dev
      name: device-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qvcx8
      readOnly: true
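
To see which workloads account for those attachments, one option (again just a sketch, reusing the node name from the example above) is to list the pods on that node that reference a PersistentVolumeClaim:

kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.spec.nodeName == "ip-10-229-90-163.eu-west-1.compute.internal")
      | select([.spec.volumes[]? | has("persistentVolumeClaim")] | any)
      | .metadata.namespace + "/" + .metadata.name'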

@torredil
Member

torredil commented Mar 4, 2022

@sultanovich @aglees Thanks for noting that the configuration must be applied to the ebs-csi-node DaemonSet. I'm reaching out to confirm that volume-attach-limit is working as intended. Please let me know if further support is needed from our end.

@sultanovich
Author

Hi @torredil, sorry for the delay; we continued the discussion in that thread after I created this issue.

I just have one question about the volume-attach-limit argument. If one or more nodes already have more volumes attached than the configured volume-attach-limit, does this setting move those pods to new nodes, or does it only apply to future pods? What would be the expected behavior?

@stevehipwell
Contributor

@sultanovich based on the code, how Kubernetes generally works, and #1163, I doubt changing this setting will have any impact on already-scheduled pods.

@torredil
Member

I agree with @stevehipwell's assessment. The volume-attach-limit argument will not cause existing pods to be rescheduled.

@sultanovich
Author

Excellent, thank you very much for the confirmation @stevehipwell / @torredil. I'm going to do some additional checking this afternoon and will post my findings if they are relevant to the issue, so you can close it afterwards if you wish.
