Driver name s3.csi.aws.com not found in the list of registered CSI drivers #107

Closed
jmateusppay opened this issue Dec 12, 2023 · 19 comments
Labels: bug (Something isn't working)
@jmateusppay

jmateusppay commented Dec 12, 2023

/kind bug

What happened?

When mounting the volume on the pod, it cannot locate the driver.

Warning FailedMount 12s (x8 over 76s) kubelet MountVolume.MountDevice failed for volume "s3-pv" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name s3.csi.aws.com not found in the list of registered CSI drivers

What you expected to happen?

The volume should mount normally without failure.

How to reproduce it (as minimally and precisely as possible)?

Apply the example yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-pv
spec:
  capacity:
    storage: 120Gi # ignored, required
  accessModes:
    - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
  mountOptions:
    - allow-delete
    - region us-east-1
  csi:
    driver: s3.csi.aws.com # required
    volumeHandle: s3-csi-driver-volume
    volumeAttributes:
      bucketName: s3-csi-driver-private
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3-claim
spec:
  accessModes:
    - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
  storageClassName: "" # required for static provisioning
  resources:
    requests:
      storage: 120Gi # ignored, required
  volumeName: s3-pv
---
apiVersion: v1
kind: Pod
metadata:
  name: s3-app
spec:
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "echo 'Hello from the container!' >> /data/$(date -u).txt; tail -f /dev/null"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: s3-claim
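
Applied with (the filename is just whatever the manifest above is saved as locally):

kubectl apply -f static_provisioning.yaml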

Anything else we need to know?:

 kc get pvc
NAME       STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
s3-claim   Bound    s3-pv    120Gi      RWX                           4m47s
kc get pv
s3-pv       120Gi      RWX            Retain          Bound    kube-system/s3-claim                                                                                                                    
 kubectl get csidriver
NAME                         ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
ebs.csi.aws.com              true             false            false             <unset>         false               Persistent   48d
efs.csi.aws.com              false            false            false             <unset>         false               Persistent   48d
s3.csi.aws.com               false            false            false             <unset>         false               Persistent   135m
 kc get storageClass
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2             kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  203d
gp3 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  203d

Is it necessary to create a new storage class?

kc logs pod/s3-csi-node-r2w75
Defaulted container "s3-plugin" out of: s3-plugin, node-driver-registrar, liveness-probe, install-mountpoint (init)
I1212 16:35:21.276393       1 driver.go:61] Driver version: 1.1.0, Git commit: c681ab1f19ccba5976e3263f0e3df65718750369, build date: 2023-12-05T19:47:03Z, nodeID: ip-0-00-0-00.ec2.internal, mount-s3 version: 1.3.1
I1212 16:35:21.282921       1 mount_linux.go:285] 'umount /tmp/kubelet-detect-safe-umount3132235530' failed with: exit status 32, output: umount: /tmp/kubelet-detect-safe-umount3132235530: must be superuser to unmount.
I1212 16:35:21.282946       1 mount_linux.go:287] Detected umount with unsafe 'not mounted' behavior
I1212 16:35:21.289423       1 driver.go:83] Found AWS_WEB_IDENTITY_TOKEN_FILE, syncing token
I1212 16:35:21.289599       1 driver.go:113] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I1212 16:35:22.113470       1 node.go:204] NodeGetInfo: called with args
kc describe sa s3-csi-driver-sa
Name:                s3-csi-driver-sa
Namespace:           kube-system
Labels:              app.kubernetes.io/component=csi-driver
                     app.kubernetes.io/instance=aws-mountpoint-s3-csi-driver
                     app.kubernetes.io/managed-by=EKS
                     app.kubernetes.io/name=aws-mountpoint-s3-csi-driver
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::0000000:role/TMP_AmazonEKS_S3_CSI_DriverRole
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>

Environment

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:38Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.15-eks-4f4795d", GitCommit:"9587e521d190ecb7ce201993ceea41955ed4a556", GitTreeState:"clean", BuildDate:"2023-10-20T23:22:38Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.25) exceeds the supported minor version skew of +/-1
  • Driver version: v1.1.0-eksbuild.1
@marcheyer

marcheyer commented Dec 13, 2023

Did you install this as the EKS add-on? Then the CSIDriver object is not installed. But this works:

apiVersion: storage.k8s.io/v1 
kind: CSIDriver
metadata:
  name: s3.csi.aws.com
spec:
  attachRequired: false

@jmateusppay
Author

Did you install this as the EKS add-on? Then the CSIDriver object is not installed. But this works:

apiVersion: storage.k8s.io/v1 
kind: CSIDriver
metadata:
  name: s3.csi.aws.com
spec:
  attachRequired: false

Yes, I installed it via the EKS add-ons; the driver was already installed.

The strange thing is just that: even with it installed, I get the error:

... driver name s3.csi.aws.com not found in the list of registered CSI drivers

static_provisioning git:(main) ✗ kc get CSIDriver
NAME                         ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
csi.oneagent.dynatrace.com   false            true             false             <unset>         false               Ephemeral    175d
ebs.csi.aws.com              true             false            false             <unset>         false               Persistent   48d
efs.csi.aws.com              false            false            false             <unset>         false               Persistent   48d
s3.csi.aws.com               false            false            false             <unset>         false               Persistent   20h

➜  static_provisioning git:(main) ✗ kc get CSIDriver s3.csi.aws.com -o yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  creationTimestamp: "2023-12-12T16:35:19Z"
  labels:
    app.kubernetes.io/component: csi-driver
    app.kubernetes.io/instance: aws-mountpoint-s3-csi-driver
    app.kubernetes.io/managed-by: EKS
    app.kubernetes.io/name: aws-mountpoint-s3-csi-driver
  name: s3.csi.aws.com
  resourceVersion: "222744296"
  uid: 9703b843-5d47-4545-b5e2-00387ec1c2d0
spec:
  attachRequired: false
  fsGroupPolicy: ReadWriteOnceWithFSType
  podInfoOnMount: false
  requiresRepublish: false
  storageCapacity: false
  volumeLifecycleModes:

@marcheyer

If I understand it correctly, the CSIDriver is in place but the resource can't find it.
Then I think it is rather a problem with your api-server than with the CSI driver itself.
But I don't have a clue how to solve this. Sorry.

@patrickpa

Hey, I have noticed this issue as well. It usually happens when the pod with the mounted volume is scheduled before the s3-csi pod on a recently created node. But, usually, after a while my pod resumes normally.

This is an error I have had before with other CSI drivers, such as the FSx CSI driver, and they have a mechanism with startup taints to prevent pods from starting before the CSI pod is present.

This is the reference for the FSx CSI taint:

Hope this helps somehow! 😃
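
The general startup-taint pattern looks roughly like this (a minimal sketch; the taint key below is hypothetical, not the documented key of the FSx or S3 driver): new nodes come up with a taint that ordinary pods don't tolerate, and the taint is removed once the CSI node pod on that node has registered.

kubectl taint nodes <node-name> example.com/csi-agent-not-ready=true:NoSchedule
# once the CSI node pod on <node-name> is Running and registered, remove the taint:
kubectl taint nodes <node-name> example.com/csi-agent-not-ready=true:NoSchedule-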

@chuanwen-wu

I found the same error on some nodes, but not all of them:

Normal Scheduled 18s default-scheduler Successfully assigned default/app-s3-84c5c995cf-k9v8g to ip-10-0-50-37.us-west-2.compute.internal
Warning FailedMount 2s (x6 over 18s) kubelet MountVolume.MountDevice failed for volume "s3-pv" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name s3.csi.aws.com not found in the list of registered CSI drivers

I found that by running:

kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-mountpoint-s3-csi-driver

NAME READY STATUS RESTARTS AGE
s3-csi-node-4t6kt 3/3 Running 0 76m
s3-csi-node-586lj 3/3 Running 0 76m
s3-csi-node-5hfzr 3/3 Running 0 57m

But I have 5 nodes, not only 3. I found that the error only occurs on the nodes with some custom taints, so I deleted these taints from the nodes, and then it worked.

@dlakhaws
Contributor

Closing the issue for now, feel free to re-open if this issue persists.

@vara-bonthu

I have encountered a similar issue to what's been described in the thread. It seems to be a timing problem when communicating with the Mountpoint S3 CSI driver. I'm attempting to mount multiple pods to the same static PVC, which is linked to an S3 bucket. This setup is for running Spark Driver and Executor jobs, essentially using the S3 bucket as a shuffle disk in place of an EBS volume. Here's the configuration I'm using:

Initially, my driver pods only mount successfully after a few attempts and restarting the pod. The error encountered is:

MountVolume.MountDevice failed for volume s3-pv : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name s3.csi.aws.com not found in the list of registered CSI drivers

The Spark driver pod works on the second attempt of running the same job. However, the executor pods error out with the same issue.

Error in First Attempt

[screenshot: MountVolume.MountDevice "not found in the list of registered CSI drivers" error on the first attempt]

On the second attempt, both driver and executor pods run successfully.

[screenshot: driver and executor pods running successfully on the second attempt]

@patrickpa suggested a possible solution using node startup taint, which I haven't tried yet but plan to explore later and will update accordingly.

However, I've run into a different issue related to file renaming. All the Spark executor pods failed with the following error, causing the job to terminate:

24/02/08 01:56:39 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (100.64.52.5 executor 1): java.io.IOException: fail to rename file /data/blockmgr-60c7d3cc-049c-4cf3-949d-89a9bbbc46e2/32/shuffle_0_2_0.index.24900016-5dda-41ed-8601-2b4c6cb575d6 to /data/blockmgr-60c7d3cc-049c-4cf3-949d-89a9bbbc46e2/32/shuffle_0_2_0.index
at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeMetadataFile(IndexShuffleBlockResolver.scala:467)

This leads me to question the viability of using an S3 bucket with mountpoint-s3 for Spark jobs as a replacement for EBS storage.

Given that Spark Driver and Executors need to create, update, and delete files from the S3 bucket, are there any limitations or considerations I might be missing here?

Any insights would be greatly appreciated.

@jmateusppay
Author

I redid the installation using the deploy files, and I was successful.
https://github.com/awslabs/mountpoint-s3-csi-driver/tree/main/deploy/kubernetes

For some reason the AWS add-on was not replicating the driver pods on all nodes.

@surya9teja

surya9teja commented Mar 18, 2024

Has anyone found a way to resolve this problem? In my case I use auto-scaling in my cluster; whenever a new node is provisioned via the auto-scaler, the S3 CSI driver add-on isn't configured on the new node, which is annoying. I also have the EFS driver add-on, which works fine.

Update:

When I check kubectl get daemonset -n kube-system:

❯ kubectl get daemonset -n kube-system
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
aws-node                         2         2         2       2            2           <none>                   52d
ebs-csi-node                     2         2         2       2            2           kubernetes.io/os=linux   49d
efs-csi-node                     2         2         2       2            2           kubernetes.io/os=linux   41d
kube-proxy                       2         2         2       2            2           <none>                   52d
nvidia-device-plugin-daemonset   2         2         2       2            2           <none>                   52d
s3-csi-node                      1         1         1       1            1           kubernetes.io/os=linux   3m52s

s3-csi-node should be two, one on each node, but the desired count is only one.

@wcw84

wcw84 commented Mar 18, 2024

Has anyone found a way to resolve this problem? In my case I use auto-scaling in my cluster; whenever a new node is provisioned via the auto-scaler, the S3 CSI driver add-on isn't configured on the new node, which is annoying. I also have the EFS driver add-on, which works fine.

Update:

When I check kubectl get daemonset -n kube-system:

❯ kubectl get daemonset -n kube-system
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
aws-node                         2         2         2       2            2           <none>                   52d
ebs-csi-node                     2         2         2       2            2           kubernetes.io/os=linux   49d
efs-csi-node                     2         2         2       2            2           kubernetes.io/os=linux   41d
kube-proxy                       2         2         2       2            2           <none>                   52d
nvidia-device-plugin-daemonset   2         2         2       2            2           <none>                   52d
s3-csi-node                      1         1         1       1            1           kubernetes.io/os=linux   3m52s

s3-csi-node should be two, one on each node, but the desired count is only one.

Check the taints of your nodes and the tolerations of your s3-csi-node and ebs-csi-node, which may be different.
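
For example, one way to compare them (a suggested check using standard kubectl output options):

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
kubectl -n kube-system get daemonset s3-csi-node -o jsonpath='{.spec.template.spec.tolerations}'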

@surya9teja

surya9teja commented Mar 18, 2024

Has anyone found a way to resolve this problem? In my case I use auto-scaling in my cluster; whenever a new node is provisioned via the auto-scaler, the S3 CSI driver add-on isn't configured on the new node, which is annoying. I also have the EFS driver add-on, which works fine.
Update:
When I check kubectl get daemonset -n kube-system:

❯ kubectl get daemonset -n kube-system
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
aws-node                         2         2         2       2            2           <none>                   52d
ebs-csi-node                     2         2         2       2            2           kubernetes.io/os=linux   49d
efs-csi-node                     2         2         2       2            2           kubernetes.io/os=linux   41d
kube-proxy                       2         2         2       2            2           <none>                   52d
nvidia-device-plugin-daemonset   2         2         2       2            2           <none>                   52d
s3-csi-node                      1         1         1       1            1           kubernetes.io/os=linux   3m52s

s3-csi-node should be two, one on each node, but the desired count is only one.

Check the taints of your nodes and the tolerations of your s3-csi-node and ebs-csi-node, which may be different.

@wcw84
I have two groups of nodes, CPU and GPU. At least one CPU node is running 24x7, but the GPU nodes scale down to zero. The CPU nodes have no taints, while the GPU nodes have the taint nvidia.com/gpu=present:NoSchedule.

When I check the tolerations and node selector for s3-csi-node, I see:

Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 :NoExecute op=Exists for 300s
                             CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists

And for ebs-csi-node

Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists

So I am not sure what stops the s3-csi-node from running on the GPU node.

Note: I have installed both from the EKS web interface in AWS.

@surya9teja

I found the issue: my GPU node has the following taint, and I removed it:

taints:
    - key: nvidia.com/gpu
      value: "present"
      effect: "NoSchedule"

This taint prevents the s3-csi-driver from scheduling, so I am using labels and node affinity to deploy the apps instead of relying on tolerations (see the sketch below). For now the s3-csi-driver won't be scheduled onto a node that has any taints, even though other EKS add-ons work fine, so this is the workaround.
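
A minimal sketch of that node-affinity approach, reusing the pod from the issue description (the cpu-only node label is hypothetical, something you would add to the untainted CPU nodes yourself):

apiVersion: v1
kind: Pod
metadata:
  name: s3-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: cpu-only   # hypothetical label on the untainted CPU nodes
                operator: In
                values: ["true"]
  containers:
    - name: app
      image: centos
      command: ["/bin/sh", "-c", "tail -f /dev/null"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: s3-claim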

@spolloni

@dlakhaws any chance we can reopen this given the activity on the issue? I am experiencing similar issues to @surya9teja; it seems problematic that the driver won't run on nodes when some taints are used.

@unexge unexge reopened this Jul 25, 2024
@unexge
Contributor

unexge commented Jul 25, 2024

Reopened. @spolloni are you also using EKS add-on?

@surya9teja

@spolloni If you install the s3-csi-driver from the EKS management portal, the issue still persists. So I used the source code from the repo and deployed the add-on manually, adding the following tolerations to the file node-daemonset.yaml. It works fine for now; if you have any other tolerations, you can add them and deploy.

tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
  - key: nvidia.com/gpu
    operator: Exists

@spolloni

spolloni commented Jul 25, 2024

@surya9teja ok, thanks for the tip!

@spolloni are you also using EKS add-on?

@unexge yes I am. I just updated to the latest version (1.7.0) to make sure the issue persisted. I am completely ignorant about how this driver works but do you think this issue is "generally" fixable in the add-on install without the workaround suggested above?

@unexge
Contributor

unexge commented Jul 25, 2024

Hey @spolloni, our EKS add-on doesn't allow configuring tolerations at the moment. We plan to support that; it's tracked by #109.

Meanwhile, I think the only workaround is using our Helm chart / Kustomization manifest to configure tolerations as @surya9teja suggested.

@spolloni

sounds good. thanks for the help @unexge!

@unexge unexge added the bug Something isn't working label Jul 30, 2024
@unexge
Contributor

unexge commented Aug 30, 2024

v1.8.0 of our EKS add-on has been released with node.tolerateAllTaints and node.tolerations configuration values:

$ aws eks describe-addon-configuration --addon-name aws-mountpoint-s3-csi-driver --addon-version v1.8.0-eksbuild.1
{
    "addonName": "aws-mountpoint-s3-csi-driver",
    "addonVersion": "v1.8.0-eksbuild.1",
    "configurationSchema": "{\"$schema\":\"https://json-schema.org/draft/2019-09/schema\",\"additionalProperties\":false,\"description\":\"Configurable param
eters for Mountpoint for S3 CSI Driver\",\"properties\":{\"node\":{\"additionalProperties\":false,\"properties\":{\"tolerateAllTaints\":{\"default\":false,\"
description\":\"Mountpoint for S3 CSI Driver Pods will tolerate all taints and will be scheduled in all nodes\",\"type\":\"boolean\"},\"tolerations\":{\"defa
ult\":[],\"items\":{\"type\":\"object\"},\"title\":\"Tolerations for Mountpoint for S3 CSI Driver Pods\",\"type\":\"array\"}},\"type\":\"object\"}},\"type\":
\"object\"}",
    "podIdentityConfiguration": []
}

You can set node.tolerateAllTaints to true if you want the CSI driver's Pods to be scheduled on all nodes in the cluster, or you can configure the node.tolerations array if you need more granularity.

For example:

$ aws eks create-addon --cluster-name ... \
    --addon-name aws-mountpoint-s3-csi-driver \
    --service-account-role-arn ... \
    --configuration-values '{"node":{"tolerateAllTaints":true}}' 
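
If you need more granularity, node.tolerations takes a list of standard Kubernetes toleration objects. For example, to tolerate only the nvidia.com/gpu taint discussed above (a sketch; adjust the key and effect to your own taints):

$ aws eks update-addon --cluster-name ... \
    --addon-name aws-mountpoint-s3-csi-driver \
    --configuration-values '{"node":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}]}}'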

Closing the issue now. Could you please try upgrading to v1.8.0 with a toleration config to see if that solves the problem? Please let us know if the issue persists.
