Driver name s3.csi.aws.com not found in the list of registered CSI drivers #107

Closed
jmateusppay opened this issue Dec 12, 2023 · 19 comments
Labels: bug (Something isn't working)
@jmateusppay

jmateusppay commented Dec 12, 2023

/kind bug

What happened?

When mounting the volume on the pod, it cannot locate the driver.

Warning FailedMount 12s (x8 over 76s) kubelet MountVolume.MountDevice failed for volume "s3-pv" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name s3.csi.aws.com not found in the list of registered CSI drivers

What you expected to happen?

The volume should mount normally without failure.

How to reproduce it (as minimally and precisely as possible)?

Apply the example yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-pv
spec:
  capacity:
    storage: 120Gi # ignored, required
  accessModes:
    - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
  mountOptions:
    - allow-delete
    - region us-east-1
  csi:
    driver: s3.csi.aws.com # required
    volumeHandle: s3-csi-driver-volume
    volumeAttributes:
      bucketName: s3-csi-driver-private
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3-claim
spec:
  accessModes:
    - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
  storageClassName: "" # required for static provisioning
  resources:
    requests:
      storage: 120Gi # ignored, required
  volumeName: s3-pv
---
apiVersion: v1
kind: Pod
metadata:
  name: s3-app
spec:
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "echo 'Hello from the container!' >> /data/$(date -u).txt; tail -f /dev/null"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: s3-claim
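
Applied with (the filename is just whatever the manifest above is saved as locally):

kubectl apply -f static_provisioning.yaml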

Anything else we need to know?:

 kc get pvc
NAME       STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
s3-claim   Bound    s3-pv    120Gi      RWX                           4m47s
kc get pv
s3-pv       120Gi      RWX            Retain          Bound    kube-system/s3-claim                                                                                                                    
 kubectl get csidriver
NAME                         ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
ebs.csi.aws.com              true             false            false             <unset>         false               Persistent   48d
efs.csi.aws.com              false            false            false             <unset>         false               Persistent   48d
s3.csi.aws.com               false            false            false             <unset>         false               Persistent   135m
 kc get storageClass
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2             kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  203d
gp3 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  203d

Is it necessary to create a new storage class?

kc logs pod/s3-csi-node-r2w75
Defaulted container "s3-plugin" out of: s3-plugin, node-driver-registrar, liveness-probe, install-mountpoint (init)
I1212 16:35:21.276393       1 driver.go:61] Driver version: 1.1.0, Git commit: c681ab1f19ccba5976e3263f0e3df65718750369, build date: 2023-12-05T19:47:03Z, nodeID: ip-0-00-0-00.ec2.internal, mount-s3 version: 1.3.1
I1212 16:35:21.282921       1 mount_linux.go:285] 'umount /tmp/kubelet-detect-safe-umount3132235530' failed with: exit status 32, output: umount: /tmp/kubelet-detect-safe-umount3132235530: must be superuser to unmount.
I1212 16:35:21.282946       1 mount_linux.go:287] Detected umount with unsafe 'not mounted' behavior
I1212 16:35:21.289423       1 driver.go:83] Found AWS_WEB_IDENTITY_TOKEN_FILE, syncing token
I1212 16:35:21.289599       1 driver.go:113] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I1212 16:35:22.113470       1 node.go:204] NodeGetInfo: called with args
kc describe sa s3-csi-driver-sa
Name:                s3-csi-driver-sa
Namespace:           kube-system
Labels:              app.kubernetes.io/component=csi-driver
                     app.kubernetes.io/instance=aws-mountpoint-s3-csi-driver
                     app.kubernetes.io/managed-by=EKS
                     app.kubernetes.io/name=aws-mountpoint-s3-csi-driver
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::0000000:role/TMP_AmazonEKS_S3_CSI_DriverRole
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>

Environment

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:38Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.15-eks-4f4795d", GitCommit:"9587e521d190ecb7ce201993ceea41955ed4a556", GitTreeState:"clean", BuildDate:"2023-10-20T23:22:38Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.25) exceeds the supported minor version skew of +/-1
  • Driver version: v1.1.0-eksbuild.1
@marcheyer

marcheyer commented Dec 13, 2023

Did you install this as the EKS add-on? Then the CSIDriver object is not installed. But this works:

apiVersion: storage.k8s.io/v1 
kind: CSIDriver
metadata:
  name: s3.csi.aws.com
spec:
  attachRequired: false

@jmateusppay
Author

Did you install this as the EKS add-on? Then the CSIDriver object is not installed. But this works:

apiVersion: storage.k8s.io/v1 
kind: CSIDriver
metadata:
  name: s3.csi.aws.com
spec:
  attachRequired: false

Yes, I installed it via the EKS add-ons; the driver was already installed.

The strange thing is just that: even with it installed, I get the error:

... driver name s3.csi.aws.com not found in the list of registered CSI drivers

static_provisioning git:(main) ✗ kc get CSIDriver
NAME                         ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
csi.oneagent.dynatrace.com   false            true             false             <unset>         false               Ephemeral    175d
ebs.csi.aws.com              true             false            false             <unset>         false               Persistent   48d
efs.csi.aws.com              false            false            false             <unset>         false               Persistent   48d
s3.csi.aws.com               false            false            false             <unset>         false               Persistent   20h

➜  static_provisioning git:(main) ✗ kc get CSIDriver s3.csi.aws.com -o yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  creationTimestamp: "2023-12-12T16:35:19Z"
  labels:
    app.kubernetes.io/component: csi-driver
    app.kubernetes.io/instance: aws-mountpoint-s3-csi-driver
    app.kubernetes.io/managed-by: EKS
    app.kubernetes.io/name: aws-mountpoint-s3-csi-driver
  name: s3.csi.aws.com
  resourceVersion: "222744296"
  uid: 9703b843-5d47-4545-b5e2-00387ec1c2d0
spec:
  attachRequired: false
  fsGroupPolicy: ReadWriteOnceWithFSType
  podInfoOnMount: false
  requiresRepublish: false
  storageCapacity: false
  volumeLifecycleModes:

@marcheyer

If I understand it correctly, the CSIDriver is in place but the resource can't find it.
Then I think it is rather a problem with your api-server than with the CSI driver itself.
But I don't have a clue how to solve this. Sorry.

@patrickpa

Hey, I have noticed this issue as well. It usually happens when the pod with the mounted volume is scheduled before the s3-csi pod on a recently created node. But, usually, after a while my pod resumes normally.

This is an error I have had before with other CSI drivers, such as the FSx CSI driver, and they have a mechanism with startup taints to prevent pods from starting before the CSI pod is present.

This is the reference for the FSx CSI taint:

Hope this helps somehow! 😃
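
The general startup-taint pattern looks roughly like this (a minimal sketch; the taint key below is hypothetical, not the documented key of the FSx or S3 driver): new nodes come up with a taint that ordinary pods don't tolerate, and the taint is removed once the CSI node pod on that node has registered.

kubectl taint nodes <node-name> example.com/csi-agent-not-ready=true:NoSchedule
# once the CSI node pod on <node-name> is Running and registered, remove the taint:
kubectl taint nodes <node-name> example.com/csi-agent-not-ready=true:NoSchedule-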

@chuanwen-wu

I found the same error on some nodes, but not all of them:

Normal Scheduled 18s default-scheduler Successfully assigned default/app-s3-84c5c995cf-k9v8g to ip-10-0-50-37.us-west-2.compute.internal
Warning FailedMount 2s (x6 over 18s) kubelet MountVolume.MountDevice failed for volume "s3-pv" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name s3.csi.aws.com not found in the list of registered CSI drivers

I found that by running:

kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-mountpoint-s3-csi-driver

NAME READY STATUS RESTARTS AGE
s3-csi-node-4t6kt 3/3 Running 0 76m
s3-csi-node-586lj 3/3 Running 0 76m
s3-csi-node-5hfzr 3/3 Running 0 57m

But I have 5 nodes, not only 3. I found that the error only occurs on the nodes with some custom taints, so I deleted these taints from the nodes, and then it worked.

@dlakhaws
Contributor

Closing the issue for now, feel free to re-open if this issue persists.

@vara-bonthu

I have encountered a similar issue to what's been described in the thread. It seems to be a timing problem when communicating with the Mountpoint S3 CSI driver. I'm attempting to mount multiple pods to the same static PVC, which is linked to an S3 bucket. This setup is for running Spark Driver and Executor jobs, essentially using the S3 bucket as a shuffle disk in place of an EBS volume. Here's the configuration I'm using:

Initially, my driver pods only mount successfully after a few attempts and restarting the pod. The error encountered is:

MountVolume.MountDevice failed for volume s3-pv : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name s3.csi.aws.com not found in the list of registered CSI drivers

The Spark driver pod works on the second attempt of running the same job. However, the executor pods error out with the same issue.

Error in First Attempt

[screenshot: MountVolume.MountDevice "not found in the list of registered CSI drivers" error on the first attempt]

On the second attempt, both driver and executor pods run successfully.

[screenshot: driver and executor pods running successfully on the second attempt]

@patrickpa suggested a possible solution using node startup taint, which I haven't tried yet but plan to explore later and will update accordingly.

However, I've run into a different issue related to file renaming. All the Spark executor pods failed with the following error, causing the job to terminate:

24/02/08 01:56:39 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (100.64.52.5 executor 1): java.io.IOException: fail to rename file /data/blockmgr-60c7d3cc-049c-4cf3-949d-89a9bbbc46e2/32/shuffle_0_2_0.index.24900016-5dda-41ed-8601-2b4c6cb575d6 to /data/blockmgr-60c7d3cc-049c-4cf3-949d-89a9bbbc46e2/32/shuffle_0_2_0.index
at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeMetadataFile(IndexShuffleBlockResolver.scala:467)

This leads me to question the viability of using an S3 bucket with mountpoint-s3 for Spark jobs as a replacement for EBS storage.

Given that Spark Driver and Executors need to create, update, and delete files from the S3 bucket, are there any limitations or considerations I might be missing here?

Any insights would be greatly appreciated.

@jmateusppay
Author

I redid the installation using the deploy files, and I was successful.
https://github.com/awslabs/mountpoint-s3-csi-driver/tree/main/deploy/kubernetes

For some reason the AWS add-on was not replicating the driver pods on all nodes.

@surya9teja

surya9teja commented Mar 18, 2024

Has anyone found a way to resolve this problem? In my case I use auto-scaling in my cluster; whenever a new node is provisioned via the auto-scaler, the S3 CSI driver add-on isn't configured on the new node, which is annoying. I also have the EFS driver add-on, which works fine.

Update:

When I check kubectl get daemonset -n kube-system:

❯ kubectl get daemonset -n kube-system
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
aws-node                         2         2         2       2            2           <none>                   52d
ebs-csi-node                     2         2         2       2            2           kubernetes.io/os=linux   49d
efs-csi-node                     2         2         2       2            2           kubernetes.io/os=linux   41d
kube-proxy                       2         2         2       2            2           <none>                   52d
nvidia-device-plugin-daemonset   2         2         2       2            2           <none>                   52d
s3-csi-node                      1         1         1       1            1           kubernetes.io/os=linux   3m52s

s3-csi-node should be two, one on each node, but the desired count is only one.

@wcw84

wcw84 commented Mar 18, 2024

Has anyone found a way to resolve this problem? In my case I use auto-scaling in my cluster; whenever a new node is provisioned via the auto-scaler, the S3 CSI driver add-on isn't configured on the new node, which is annoying. I also have the EFS driver add-on, which works fine.

Update:

When I check kubectl get daemonset -n kube-system:

❯ kubectl get daemonset -n kube-system
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
aws-node                         2         2         2       2            2           <none>                   52d
ebs-csi-node                     2         2         2       2            2           kubernetes.io/os=linux   49d
efs-csi-node                     2         2         2       2            2           kubernetes.io/os=linux   41d
kube-proxy                       2         2         2       2            2           <none>                   52d
nvidia-device-plugin-daemonset   2         2         2       2            2           <none>                   52d
s3-csi-node                      1         1         1       1            1           kubernetes.io/os=linux   3m52s

s3-csi-node should be two, one on each node, but the desired count is only one.

Check the taints of your nodes and the tolerations of your s3-csi-node and ebs-csi-node, which may be different.
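
For example, one way to compare them (a suggested check using standard kubectl output options):

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
kubectl -n kube-system get daemonset s3-csi-node -o jsonpath='{.spec.template.spec.tolerations}'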

@surya9teja

surya9teja commented Mar 18, 2024

Has anyone found a way to resolve this problem? In my case I use auto-scaling in my cluster; whenever a new node is provisioned via the auto-scaler, the S3 CSI driver add-on isn't configured on the new node, which is annoying. I also have the EFS driver add-on, which works fine.
Update:
When I check kubectl get daemonset -n kube-system:

❯ kubectl get daemonset -n kube-system
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
aws-node                         2         2         2       2            2           <none>                   52d
ebs-csi-node                     2         2         2       2            2           kubernetes.io/os=linux   49d
efs-csi-node                     2         2         2       2            2           kubernetes.io/os=linux   41d
kube-proxy                       2         2         2       2            2           <none>                   52d
nvidia-device-plugin-daemonset   2         2         2       2            2           <none>                   52d
s3-csi-node                      1         1         1       1            1           kubernetes.io/os=linux   3m52s

s3-csi-node should be two, one on each node, but the desired count is only one.

Check the taints of your nodes and the tolerations of your s3-csi-node and ebs-csi-node, which may be different.

@wcw84
I have two groups of nodes, CPU and GPU. At least one CPU node is running 24x7, but the GPU nodes scale down to zero. The CPU nodes have no taints, while the GPU nodes have the taint nvidia.com/gpu=present:NoSchedule.

When I check the tolerations and node selector for s3-csi-node, I see:

Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 :NoExecute op=Exists for 300s
                             CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists

And for ebs-csi-node

Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists

So I am not sure what stops the s3-csi-node from running on the GPU node.

Note: I have installed both from the EKS web interface in AWS.

@surya9teja

I found the issue: my GPU node has the following taint, and I removed it:

taints:
    - key: nvidia.com/gpu
      value: "present"
      effect: "NoSchedule"

This taint prevents the s3-csi-driver from scheduling, so I am using labels and node affinity to deploy the apps instead of relying on tolerations (see the sketch below). For now the s3-csi-driver won't be scheduled onto a node that has any taints, even though other EKS add-ons work fine, so this is the workaround.
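
A minimal sketch of that node-affinity approach, reusing the pod from the issue description (the cpu-only node label is hypothetical, something you would add to the untainted CPU nodes yourself):

apiVersion: v1
kind: Pod
metadata:
  name: s3-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: cpu-only   # hypothetical label on the untainted CPU nodes
                operator: In
                values: ["true"]
  containers:
    - name: app
      image: centos
      command: ["/bin/sh", "-c", "tail -f /dev/null"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: s3-claim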

@spolloni

@dlakhaws any chance we can reopen this given the activity on the issue? I am experiencing similar issues to @surya9teja; it seems problematic that the driver won't run on nodes when some taints are used.

@unexge unexge reopened this Jul 25, 2024
@unexge
Contributor

unexge commented Jul 25, 2024

Reopened. @spolloni are you also using EKS add-on?

@surya9teja

@spolloni If you install the s3-csi-driver from the EKS management portal, the issue still persists. So I used the source code from the repo and deployed the add-on manually, adding the following tolerations to the file node-daemonset.yaml. It works fine for now; if you have any other tolerations, you can add them and deploy.

tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
  - key: nvidia.com/gpu
    operator: Exists

@spolloni

spolloni commented Jul 25, 2024

@surya9teja ok, thanks for the tip!

@spolloni are you also using EKS add-on?

@unexge yes I am. I just updated to the latest version (1.7.0) to make sure the issue persisted. I am completely ignorant about how this driver works but do you think this issue is "generally" fixable in the add-on install without the workaround suggested above?

@unexge
Contributor

unexge commented Jul 25, 2024

Hey @spolloni, our EKS add-on doesn't allow configuring tolerations at the moment. We plan to support that; it's tracked by #109.

Meanwhile, I think the only workaround is using our Helm chart / Kustomization manifest to configure tolerations as @surya9teja suggested.

@spolloni

sounds good. thanks for the help @unexge!

@unexge unexge added the bug Something isn't working label Jul 30, 2024
@unexge
Contributor

unexge commented Aug 30, 2024

v1.8.0 of our EKS add-on has been released with node.tolerateAllTaints and node.tolerations configuration values:

$ aws eks describe-addon-configuration --addon-name aws-mountpoint-s3-csi-driver --addon-version v1.8.0-eksbuild.1
{
    "addonName": "aws-mountpoint-s3-csi-driver",
    "addonVersion": "v1.8.0-eksbuild.1",
    "configurationSchema": "{\"$schema\":\"https://json-schema.org/draft/2019-09/schema\",\"additionalProperties\":false,\"description\":\"Configurable param
eters for Mountpoint for S3 CSI Driver\",\"properties\":{\"node\":{\"additionalProperties\":false,\"properties\":{\"tolerateAllTaints\":{\"default\":false,\"
description\":\"Mountpoint for S3 CSI Driver Pods will tolerate all taints and will be scheduled in all nodes\",\"type\":\"boolean\"},\"tolerations\":{\"defa
ult\":[],\"items\":{\"type\":\"object\"},\"title\":\"Tolerations for Mountpoint for S3 CSI Driver Pods\",\"type\":\"array\"}},\"type\":\"object\"}},\"type\":
\"object\"}",
    "podIdentityConfiguration": []
}

You can set node.tolerateAllTaints to true if you want the CSI driver's Pods to be scheduled on all nodes in the cluster, or you can configure the node.tolerations array if you need more granularity.

For example:

$ aws eks create-addon --cluster-name ... \
    --addon-name aws-mountpoint-s3-csi-driver \
    --service-account-role-arn ... \
    --configuration-values '{"node":{"tolerateAllTaints":true}}' 
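
If you need more granularity, node.tolerations takes a list of standard Kubernetes toleration objects. For example, to tolerate only the nvidia.com/gpu taint discussed above (a sketch; adjust the key and effect to your own taints):

$ aws eks update-addon --cluster-name ... \
    --addon-name aws-mountpoint-s3-csi-driver \
    --configuration-values '{"node":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}]}}'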

Closing the issue now. Could you please try upgrading to v1.8.0 with a toleration config to see if that solves the problem? Please let us know if the issue persists.
