
volume-attach-limit argument doesn't work in 1.20 #1174

Closed
sultanovich opened this issue Feb 21, 2022 · 9 comments

@sultanovich

sultanovich commented Feb 21, 2022

/triage support

What happened?

Kubernetes keeps trying to attach volumes to a node even after it reaches the limit allowed by the AWS instance type, so I tried using the volume-attach-limit argument as a workaround while troubleshooting.

How to reproduce it (as minimally and precisely as possible)?

It can be reproduced by setting the argument and then creating more volumes than the configured maximum, as shown in the following example.

Name:                   ebs-csi-controller
Namespace:              kube-system
CreationTimestamp:      Mon, 07 Jun 2021 06:31:13 +0000
Labels:                 app.kubernetes.io/name=aws-ebs-csi-driver
Annotations:            deployment.kubernetes.io/revision: 8
Selector:               app=ebs-csi-controller,app.kubernetes.io/name=aws-ebs-csi-driver
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=ebs-csi-controller
                    app.kubernetes.io/name=aws-ebs-csi-driver
  Service Account:  ebs-csi-controller-sa
  Containers:
   ebs-plugin:
    Image:      k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.0.0
    Port:       9808/TCP
    Host Port:  0/TCP
    Args:
      --endpoint=$(CSI_ENDPOINT)
      --logtostderr
      --v=2
      --k8s-tag-cluster-id=cloud-dev-cluster-mix
      --volume-attach-limit=10
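
As an additional check (a rough sketch; <node-name> is a placeholder for one of the worker nodes), the limit the driver actually advertises to the scheduler can be read from the allocatable count on the node's CSINode object:

kubectl get csinode <node-name> -o yaml | grep -A 2 allocatable
# expected output when the flag is honored:
#   - allocatable:
#       count: 10
#     name: ebs.csi.aws.com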

Environment

Kubernetes Version:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.7", GitCommit:"bfb38f707bc4a8edfcd73472ec3d96b500b8b781", GitTreeState:"clean", BuildDate:"2020-08-12T20:27:48Z", GoVersion:"go1.13.14", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

CSI-EBS Driver Version:

[sulta@dev [~] 14:08:21 ~] $ kubectl --kubeconfig=/home/centos/dev.kube -n kube-system describe deployments ebs-csi-controller | grep Image:
    Image:      k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.0.0
    Image:      k8s.gcr.io/sig-storage/csi-provisioner:v2.1.1
    Image:      k8s.gcr.io/sig-storage/csi-attacher:v3.1.0
    Image:      k8s.gcr.io/sig-storage/csi-snapshotter:v3.0.3
    Image:      k8s.gcr.io/sig-storage/csi-resizer:v1.0.0
    Image:      k8s.gcr.io/sig-storage/livenessprobe:v2.2.0
[sulta@dev [~] 14:08:27 ~] $

I previously asked about this in the Slack channel without finding a solution:
https://kubernetes.slack.com/archives/C09NXKJKA/p1645218261237229

@k8s-ci-robot
Contributor

@sultanovich: The label(s) triage/support cannot be applied, because the repository doesn't have them.

In response to this:

/triage support


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sultanovich
Author

sultanovich commented Feb 23, 2022

/kind support

I kept checking the repository and verified that there is an example indicating that the volume-attach-limit argument should be placed in the ebs-plugin container section, which is what I tried to do in my tests.

      containers:
        - name: ebs-plugin
          securityContext:
            privileged: true
          image: {{ printf "%s:%s" .Values.image.repository (default (printf "v%s" .Chart.AppVersion) (toString .Values.image.tag)) }}
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          args:
            - node
            - --endpoint=$(CSI_ENDPOINT)
            {{- with .Values.node.volumeAttachLimit }}
            - --volume-attach-limit={{ . }}
            {{- end }}
            - --logtostderr
            - --v={{ .Values.node.logLevel }}
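
Going by that template, the value should be wired through the chart as node.volumeAttachLimit. A sketch of how it would be set (the release and repository names below are placeholders, and I have not verified this against every chart version):

# via a values file
node:
  volumeAttachLimit: 10

# or directly on the command line
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set node.volumeAttachLimit=10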

Any idea why this argument didn't work?

@sultanovich
Author

I continued testing and was able to validate that the volume-attach-limit argument now works correctly. The configuration must be applied to the ebs-csi-node DaemonSet, not to the ebs-csi-controller Deployment.


After editing the DaemonSet, the configuration works correctly:

[Sultanovich@Dev [~] 18:49:21 ~] $ ./count_vol.sh 
ip-10-2-123-153.us-east-2.compute.internal 0
ip-10-3-67-58.us-east-2.compute.internal 6
[Sultanovich@Dev [~] 18:49:38 ~] $ 



[Sultanovich@Dev [~] 18:51:10 ~] $ kubectl-d get daemonset ebs-csi-node -n kube-system -o yaml | egrep 'volume-attach-limit|name: ebs-plugin'
        - --volume-attach-limit=16
        name: ebs-plugin
[Sultanovich@Dev [~] 18:51:15 ~] $ 
[Sultanovich@Dev [~] 18:51:17 ~] $ kubectl-d -n vol-limit-test get pvc  | wc -l
No resources found in vol-limit-test namespace.
0
[Sultanovich@Dev [~] 18:51:50 ~] $  
[Sultanovich@Dev [~] 18:52:42 ~] $ kubectl-d -n vol-limit-test apply -f many_pods.yaml
service/wordpress-mysql created
persistentvolumeclaim/vol-limit-test created
deployment.apps/wordpress-mysql created
pod/mypod created
persistentvolumeclaim/vol-limit-test-1 created
pod/vol-limit-test-2 created
service/wordpress created
persistentvolumeclaim/wp-pv-claim created
deployment.apps/wordpress created
persistentvolumeclaim/vol-limit-test-2 created
pod/vol-limit-test-3 created
persistentvolumeclaim/vol-limit-test-3 created
pod/vol-limit-test-4 created
persistentvolumeclaim/vol-limit-test-4 created
pod/vol-limit-test-5 created
persistentvolumeclaim/vol-limit-test-5 created
pod/vol-limit-test-6 created
persistentvolumeclaim/vol-limit-test-6 created
pod/vol-limit-test-7 created
persistentvolumeclaim/vol-limit-test-7 created
pod/vol-limit-test-8 created
persistentvolumeclaim/vol-limit-test-8 created
pod/vol-limit-test-9 created
persistentvolumeclaim/vol-limit-test-9 created
pod/vol-limit-test-70 created
persistentvolumeclaim/vol-limit-test-70 created
pod/vol-limit-test-80 created
persistentvolumeclaim/vol-limit-test-80 created
pod/vol-limit-test-90 created
persistentvolumeclaim/vol-limit-test-90 created
pod/vol-limit-test-700 created
persistentvolumeclaim/vol-limit-test-700 created
pod/vol-limit-test-800 created
persistentvolumeclaim/vol-limit-test-800 created
pod/vol-limit-test-900 created
persistentvolumeclaim/vol-limit-test-900 created
pod/vol-limit-test-701 created
persistentvolumeclaim/vol-limit-test-701 created
pod/vol-limit-test-801 created
persistentvolumeclaim/vol-limit-test-801 created
pod/vol-limit-test-901 created
persistentvolumeclaim/vol-limit-test-901 created
pod/vol-limit-test-7000 created
persistentvolumeclaim/vol-limit-test-7000 created
pod/vol-limit-test-8000 created
persistentvolumeclaim/vol-limit-test-8000 created
pod/vol-limit-test-9000 created
persistentvolumeclaim/vol-limit-test-9000 created
pod/vol-limit-test-800000 created
persistentvolumeclaim/vol-limit-test-800000 created
pod/vol-limit-test-90000 created
persistentvolumeclaim/vol-limit-test-90000 created
pod/vol-limit-test-700000 created
persistentvolumeclaim/vol-limit-test-700000 created
[Sultanovich@Dev [~] 18:52:59 ~] $ 
[Sultanovich@Dev [~] 18:53:04 ~] $ kubectl-d -n vol-limit-test get pvc  | wc -l
27
[Sultanovich@Dev [~] 18:53:08 ~] $ 
[Sultanovich@Dev [~] 18:53:10 ~] $ 
[Sultanovich@Dev [~] 18:53:37 ~] $ ./count_vol.sh 
ip-10-2-123-153.us-east-2.compute.internal 16
ip-10-3-67-58.us-east-2.compute.internal 14

[Sultanovich@Dev [~] 18:56:02 ~] $ kubectl-d get daemonset ebs-csi-node -n kube-system -o yaml | egrep 'volume-attach-limit|name: ebs-plugin'
        - --volume-attach-limit=16
        name: ebs-plugin
[Sultanovich@Dev [~] 18:56:05 ~] $ 
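
For reference, count_vol.sh is only a small helper; something roughly equivalent (an approximation, since the script itself is not included here) would be:

#!/usr/bin/env bash
# Print the number of attached volumes reported in each node's status.
kubectl get nodes -o json \
  | jq -r '.items[] | "\(.metadata.name) \((.status.volumesAttached // []) | length)"'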

@aglees

aglees commented Feb 24, 2022

We're trying to reduce the --volume-attach-limit via the Helm chart without much success: we're setting a deliberately low number (17), so our setup is very similar to yours.

One thing we do have is pods with more than one volume; I wonder whether that is a factor?

aws-ebs-csi-driver:v1.5.0

I'm using a jq query to report results, and I'm seeing many nodes with far more volumes in use than the configured limit.

kubectl get nodes -o json | jq '.items[] | {"nodeName": .metadata.name, "zone": .metadata.labels."topology.kubernetes.io/zone", "volumesInUse": .status.volumesInUse | length, "volumesAttached": .status.volumesAttached | length }'

For example, this r5.2xlarge node:

{
  "nodeName": "ip-10-229-90-163.eu-west-1.compute.internal",
  "zone": "eu-west-1b",
  "volumesInUse": 26,
  "volumesAttached": 26,
}
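
A useful companion check (a sketch; it assumes the driver registers itself as ebs.csi.aws.com) is to compare those numbers against the limit each node's CSINode object advertises, which makes it easier to spot nodes like this one, where 26 volumes are attached against a configured limit of 17:

kubectl get csinodes -o json | jq '.items[] | {nodeName: .metadata.name, ebsAllocatable: (.spec.drivers[] | select(.name == "ebs.csi.aws.com") | .allocatable.count)}'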

From kubectl get pods -o yaml for the ebs-csi-node pod on that node:

containers:
  - args:
    - node
    - --endpoint=$(CSI_ENDPOINT)
    - --volume-attach-limit=17
    - --logtostderr
    - --v=2
    env:
    - name: CSI_ENDPOINT
      value: unix:/csi/csi.sock
    - name: CSI_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: {REDACTED}/k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.5.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /healthz
        port: healthz
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 3
    name: ebs-plugin
    ports:
    - containerPort: 9808
      name: healthz
      protocol: TCP
    resources:
      limits:
        cpu: 100m
        memory: 512Mi
      requests:
        cpu: 20m
        memory: 64Mi
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/kubelet
      mountPropagation: Bidirectional
      name: kubelet-dir
    - mountPath: /csi
      name: plugin-dir
    - mountPath: /dev
      name: device-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-qvcx8
      readOnly: true
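
To see which workloads account for those attachments, one option (again just a sketch, reusing the node name from the example above) is to list the pods on that node that reference a PersistentVolumeClaim:

kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.spec.nodeName == "ip-10-229-90-163.eu-west-1.compute.internal")
      | select([.spec.volumes[]? | has("persistentVolumeClaim")] | any)
      | .metadata.namespace + "/" + .metadata.name'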

@torredil
Member

torredil commented Mar 4, 2022

@sultanovich @aglees Thanks for noting that the configuration must be applied to the ebs-csi-node DaemonSet. I'm reaching out to confirm that volume-attach-limit is working as intended. Please let me know if further support is needed from our end.

@sultanovich
Author

Hi @torredil, sorry for the delay; we continued the discussion in that thread after I created this issue.

I just have one question about the volume-attach-limit argument. If one or more nodes already have more volumes attached than the configured volume-attach-limit, does this setting move those pods to new nodes, or does it only apply to future pods? What would be the expected behavior?

@stevehipwell
Contributor

@sultanovich based on the code, how Kubernetes generally works, and #1163, I doubt changing this setting will have any impact on already-scheduled pods.

@torredil
Member

I agree with @stevehipwell's assessment. The volume-attach-limit argument will not cause existing pods to be rescheduled.

@sultanovich
Author

Excellent, thank you very much for the confirmation @stevehipwell / @torredil. I'm going to do some additional checking this afternoon and will post my findings if they are relevant to the issue, so you can close it afterwards if you wish.
