
Pod "Sometimes" cannot mount PVC in CSI version 1.4.0 #174

Closed
zhushendhh opened this issue Mar 31, 2024 · 30 comments
Labels: bug (Something isn't working), pending customer info

Comments

@zhushendhh

zhushendhh commented Mar 31, 2024

Hello Team,

I am trying to test CSI driver 1.4.0 in K8s 1.27, but I found that "sometimes" the Pod cannot mount the PVC and the CSI driver Pod reports the error below:

I0331 13:53:37.687569       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume" target_path:"/var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-east-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-outputs-835894076989-us-east-2" > 
I0331 13:53:37.687634       1 node.go:112] NodePublishVolume: mounting comfyui-outputs-835894076989-us-east-2 at /var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount with options [--allow-delete --region=us-east-2]
E0331 13:53:37.687730       1 driver.go:96] GRPC error: rpc error: code = Internal desc = Could not mount "comfyui-outputs-835894076989-us-east-2" at "/var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount": Could not check if "/var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument

Even in the case of a normal mount, the CSI driver Pod still posts logs like the above. It seems like the CSI driver keeps trying to mount the same PVC to the same Pod and failing. I am not sure whether anything is misconfigured.

These problems only happen when Karpenter scales up worker nodes with a new deployment/pod in EKS v1.27.
All mount operations are normal for static k8s worker nodes.

Workaround:

  1. Delete the s3-csi-xxxxx Pod running on the Karpenter worker node
  2. Use S3 CSI driver 1.0.0

Looking forward to your support, thanks.

@jjkr
Contributor

jjkr commented Apr 2, 2024

What underlying operating system are your node hosts running? The driver uses a host mount to read /proc/mounts from the host operating system to determine what is mounted on the system, and that error suggests there was a problem reading that file. This has been known to cause compatibility issues in the past, though it is odd that the behavior is intermittent.
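
For reference, here is a minimal sketch of that pattern (not the driver's actual code; the /host/proc/mounts path and helper names are illustrative assumptions): the plugin reads the host's mount table through a bind mount and scans it for the target path.

```go
// Illustrative sketch only: reading the host's /proc/mounts from inside a
// container to decide whether a path is already a mount point. The
// /host/proc/mounts path and function names are assumptions, not the
// driver's actual implementation.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

const hostProcMounts = "/host/proc/mounts" // host /proc bind-mounted into the container (assumed path)

// isMountPoint reports whether target appears in the host's mount table.
func isMountPoint(target string) (bool, error) {
	f, err := os.Open(hostProcMounts)
	if err != nil {
		// This is where an "invalid argument" error would surface.
		return false, fmt.Errorf("failed to read %s: %w", hostProcMounts, err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line: <device> <mount point> <fstype> <options> <dump> <pass>
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[1] == target {
			return true, nil
		}
	}
	return false, scanner.Err()
}

func main() {
	mounted, err := isMountPoint("/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv>/mount")
	fmt.Println(mounted, err)
}
```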

@psavva

psavva commented Apr 21, 2024

I'm facing the same issue on version 1.4.0.

E0421 17:49:31.700733       1 driver.go:97] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/1223b3bd-75c1-4ce8-ad56-f61ed72707e4/volumes/kubernetes.io~csi/s3-pv-appdata/mount": Failed to cat /proc/mounts: Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument
I0421 17:49:31.804114       1 node.go:188] NodeGetCapabilities: called with args
I0421 17:49:31.804755       1 node.go:188] NodeGetCapabilities: called with args
I0421 17:49:31.805458       1 node.go:188] NodeGetCapabilities: called with args
I0421 17:49:31.806119       1 node.go:49] NodePublishVolume: called with args volume_id:"s3-csi-driver-volume" target_path:"/var/lib/kubelet/pods/acc27aed-9202-4e11-995c-b5f0fadb70a5/volumes/kubernetes.io~csi/s3-pv-appdata/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"allow-overwrite" mount_flags:"region eu-central-1" mount_flags:"cache /tmp" mount_flags:"metadata-ttl 1200" mount_flags:"max-cache-size 2000" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"groovit-app-data" >
I0421 17:49:31.806171       1 node.go:81] NodePublishVolume: creating dir /var/lib/kubelet/pods/acc27aed-9202-4e11-995c-b5f0fadb70a5/volumes/kubernetes.io~csi/s3-pv-appdata/mount
I0421 17:49:31.806248       1 node.go:108] NodePublishVolume: mounting groovit-app-data at /var/lib/kubelet/pods/acc27aed-9202-4e11-995c-b5f0fadb70a5/volumes/kubernetes.io~csi/s3-pv-appdata/mount with options [--allow-delete --allow-overwrite --cache=/tmp --max-cache-size=2000 --metadata-ttl=1200 --region=eu-central-1]
E0421 17:49:31.806445       1 driver.go:97] GRPC error: rpc error: code = Internal desc = Could not mount "groovit-app-data" at "/var/lib/kubelet/pods/acc27aed-9202-4e11-995c-b5f0fadb70a5/volumes/kubernetes.io~csi/s3-pv-appdata/mount": Mount failed: Failed to start transient systemd service: Failed StartTransientUnit with Call.Err: dbus: connection closed by user output:
I0421 17:49:31.906146       1 node.go:144] NodeUnpublishVolume: called with args volume_id:"s3-csi-driver-volume" target_path:"/var/lib/kubelet/pods/1223b3bd-75c1-4ce8-ad56-f61ed72707e4/volumes/kubernetes.io~csi/s3-pv-appdata/mount"

@lekev

lekev commented May 28, 2024

Same problem with version 1.6.0.

@twellckhpx

Can confirm this happened on version 1.6.0 on some Karpenter-managed nodes. Restarting the s3-csi pods did solve it, but it would be good to find the root cause of this (especially if it happens when a cluster scales nodes).
I can't be 100% sure, but so far it seems to have happened on newly provisioned nodes.

@dannycjones added the bug label Jun 6, 2024
@adesito

adesito commented Jun 17, 2024


Same here with 1.6.0 too, and the same workaround of restarting the s3 pod after a GPU EC2 machine was created using Karpenter.

@augustkang

I am experiencing the same issue with Kubernetes version 1.28 on EKS with p4d.24xlarge instances.

@muddyfish
Contributor

Are you still experiencing this on 1.7.0?

If so, we need logs to investigate this issue; please see https://github.com/awslabs/mountpoint-s3-csi-driver/blob/main/docs/LOGGING.md for how to collect them.

@augustkang

Thank you for your response. I just upgraded from version 1.6.0 to 1.7.0. I will need some time to observe the system for any reoccurrence of the issue. If the problem persists, I will collect and share the necessary logs as per your provided guidelines.

@twellckhpx

Thanks for taking some time to look into this. Unfortunately, I just updated to v1.7.0 and I still face the same issue.

Here are my CSI Driver logs with the default logLevel of 4:

I0708 09:01:01.678376       1 driver.go:59] Driver version: 1.7.0, Git commit: 53b62cb27036138b46e51f34ddef454fd0f89c6c, build date: 2024-06-18T11:10:59Z, nodeID: ip-10-0-92-227.ap-southeast-2.compute.internal, mount-s3 version: 1.7.2
I0708 09:01:01.679523       1 driver.go:79] Found AWS_WEB_IDENTITY_TOKEN_FILE, syncing token
I0708 09:01:01.679708       1 driver.go:109] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0708 09:01:06.417348       1 node.go:222] NodeGetInfo: called with args 
I0708 09:01:35.412398       1 node.go:222] NodeGetInfo: called with args 
I0708 09:01:36.511799       1 node.go:206] NodeGetCapabilities: called with args 
I0708 09:01:36.513141       1 node.go:206] NodeGetCapabilities: called with args 
I0708 09:01:36.514163       1 node.go:206] NodeGetCapabilities: called with args 
I0708 09:01:36.515191       1 node.go:206] NodeGetCapabilities: called with args 
I0708 09:01:36.518005       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume" target_path:"/var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount" volume_capability:<mount:<mount_flags:"region ap-southeast-2" mount_flags:"uid=1000" mount_flags:"allow-other" > access_mode:<mode:MULTI_NODE_READER_ONLY > > volume_context:<key:"bucketName" value:"mdr3ivviap-hpx-ai" > 
I0708 09:01:36.518078       1 node.go:112] NodePublishVolume: mounting mdr3ivviap-hpx-ai at /var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount with options [--allow-other --read-only --region=ap-southeast-2 --uid=1000]
E0708 09:01:36.518202       1 driver.go:96] GRPC error: rpc error: code = Internal desc = Could not mount "mdr3ivviap-hpx-ai" at "/var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount": Could not check if "/var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount" is a mount point: stat /var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument

I have also tried to retrieve the Mountpoint logs but have not been able to retrieve any.
I was not able to retrieve the MOUNT_PID as per the instructions you shared, and journalctl --boot -t mount-s3 did not return any logs (see below).

-- No entries --

Please feel free to let me know if I missed a step, or if you would like me to follow more specific instructions to get you any more relevant information.

@dannycjones
Contributor

Thanks for sharing this, @twellckhpx!

It's really unclear right now why the driver cannot read /proc/mounts. It is understandable that there are no Mountpoint logs, since we don't get as far as launching Mountpoint.

If you still have access to that node or are able to reproduce, please can you check dmesg on the node to see if there's any log related to opening /proc/mounts. I'm hoping that will contain information that can give us a clue into what's going wrong with more granularity than "invalid argument".

Please can you also share what operating system you're using for your K8s nodes, and any other OS configurations (like SELinux) that may be interacting with the CSI driver.

@Shellmode
Contributor

Shellmode commented Jul 21, 2024

I can reproduce the issue 100% of the time: when scaling out nodes with Karpenter, the pod on the newly provisioned node cannot mount the S3 bucket.

The node OS is Amazon Linux 2, AMI ID amazon-eks-gpu-node-1.30-v20240703 (older versions of amazon-eks-gpu-node-xxx also have the same issue).

Here are some logs for your reference

Failed log

Defaulted container "s3-plugin" out of: s3-plugin, node-driver-registrar, liveness-probe, install-mountpoint (init)
I0718 16:22:01.176122       1 driver.go:59] Driver version: 1.7.0, Git commit: 53b62cb27036138b46e51f34ddef454fd0f89c6c, build date: 2024-06-18T11:10:59Z, nodeID: ip-10-2-126-165.us-west-2.compute.internal, mount-s3 version: 1.7.2                                                                                                                                                    
I0718 16:22:01.177147       1 driver.go:79] Found AWS_WEB_IDENTITY_TOKEN_FILE, syncing token
I0718 16:22:01.177329       1 driver.go:109] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}                                                           
I0718 16:22:01.962501       1 node.go:222] NodeGetInfo: called with args
I0718 16:22:39.128126       1 node.go:222] NodeGetInfo: called with args                                                                                                                     
I0718 16:22:59.729019       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.729166       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.732021       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:22:59.732073       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.732799       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.732898       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:22:59.733592       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.733592       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:22:59.734762       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume-outputs" target_path:"/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-west-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-outputs-930179054915-us-west-2" >
I0718 16:22:59.734824       1 node.go:112] NodePublishVolume: mounting comfyui-outputs-930179054915-us-west-2 at /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount with options [--allow-delete --region=us-west-2]
I0718 16:22:59.734840       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume-inputs" target_path:"/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-west-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-inputs-930179054915-us-west-2" >
I0718 16:22:59.734891       1 node.go:112] NodePublishVolume: mounting comfyui-inputs-930179054915-us-west-2 at /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount with options [--allow-delete --region=us-west-2]
E0718 16:22:59.734961       1 driver.go:96] GRPC error: rpc error: code = Internal desc = Could not mount "comfyui-outputs-930179054915-us-west-2" at "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount": Could not check if "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument
E0718 16:22:59.735036       1 driver.go:96] GRPC error: rpc error: code = Internal desc = Could not mount "comfyui-inputs-930179054915-us-west-2" at "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount": Could not check if "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument
I0718 16:23:00.333023       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.333022       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.333794       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.333796       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.334829       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.334832       1 node.go:206] NodeGetCapabilities: called with args

Restart and successfully mounted log

Defaulted container "s3-plugin" out of: s3-plugin, node-driver-registrar, liveness-probe, install-mountpoint (init)
I0718 16:00:54.819571       1 driver.go:59] Driver version: 1.7.0, Git commit: 53b62cb27036138b46e51f34ddef454fd0f89c6c, build date: 2024-06-18T11:10:59Z, nodeID: ip-10-2-135-107.us-west-2.compute.internal, mount-s3 version: 1.7.2
I0718 16:00:54.820723       1 driver.go:79] Found AWS_WEB_IDENTITY_TOKEN_FILE, syncing token
I0718 16:00:54.821032       1 driver.go:109] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0718 16:00:55.267048       1 node.go:222] NodeGetInfo: called with args
I0718 16:02:30.532895       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.532984       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.533770       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.533824       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.534440       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.534787       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.535870       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume-outputs" target_path:"/var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-west-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-outputs-930179054915-us-west-2" >
I0718 16:02:30.535929       1 node.go:112] NodePublishVolume: mounting comfyui-outputs-930179054915-us-west-2 at /var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount with options [--allow-delete --region=us-west-2]
I0718 16:02:30.535927       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume-inputs" target_path:"/var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-west-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-inputs-930179054915-us-west-2" >
I0718 16:02:30.535966       1 node.go:112] NodePublishVolume: mounting comfyui-inputs-930179054915-us-west-2 at /var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount with options [--allow-delete --region=us-west-2]
I0718 16:02:30.663455       1 node.go:132] NodePublishVolume: /var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount was mounted
I0718 16:02:30.663558       1 node.go:132] NodePublishVolume: /var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount was mounted
I0718 16:04:25.236179       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:04:25.237064       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:05:35.113343       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:05:35.114167       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:06:53.403359       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:06:53.404562       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:08:02.679829       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:08:02.680536       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:09:26.008572       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:09:26.009406       1 node.go:206] NodeGetCapabilities: called with args

I can also access the newly booted node, which cannot mount the S3 bucket.

The error log shows:

0s          Warning   FailedMount   pod/comfyui-54698dcb57-tzkp5   MountVolume.SetUp failed for volume "comfyui-outputs-pv" : rpc error: code = Internal desc = Could not mount "comfyui-outputs-930179054915-us-west-2" at "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount": Could not check if "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument

It seems there is no such file (dir) /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount on the node.

After I restart the s3-csi-node-xxx pod, the dir /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount is created.

I am still investigating; if you want to reproduce the bug, I'm willing to help.

@Shellmode
Contributor

It seems that the created directory is removed by this line.

I added some debug messages to the mountpoint-s3-csi-driver v1.7.0 source code and rebuilt the image to replace the current daemonsets/s3-csi-node.

Added some debug messages around the suspected code (Screenshot 2024-07-24 at 00 57 14).

Here are my findings:

When s3-csi-node-xxx runs for the first time, cleanupDir = true, and the created directory is cleaned up (Screenshot 2024-07-24 at 00 55 47).

But after restarting s3-csi-node-xxx, cleanupDir = false, and the created directory is not cleaned up (Screenshot 2024-07-24 at 00 58 54).
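
To make the suspected flow easier to follow, here is a simplified, hypothetical sketch of the pattern described above (create the target dir, check the host mount table, and clean the dir up again on failure). It is not the driver's real code; the function names, permissions, and paths are assumptions.

```go
// Hypothetical sketch of the suspected NodePublishVolume flow, not the
// driver's real implementation. If reading /host/proc/mounts fails, the
// freshly created target directory is removed again, which matches the
// "no such file or directory" stat errors seen until the pod is restarted.
package main

import (
	"fmt"
	"os"
	"strings"
)

// isMounted scans the host mount table for target (assumed path and logic).
func isMounted(target string) (bool, error) {
	data, err := os.ReadFile("/host/proc/mounts")
	if err != nil {
		return false, fmt.Errorf("failed to read /host/proc/mounts: %w", err)
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[1] == target {
			return true, nil
		}
	}
	return false, nil
}

func publishVolume(target string) error {
	cleanupDir := false
	if _, err := os.Stat(target); os.IsNotExist(err) {
		if err := os.MkdirAll(target, 0750); err != nil {
			return err
		}
		cleanupDir = true // we created the dir, so remove it again if anything below fails
	}

	if mounted, err := isMounted(target); err != nil {
		if cleanupDir {
			os.Remove(target) // the cleanup observed on the first run of s3-csi-node-xxx
		}
		return fmt.Errorf("could not check if %q is a mount point: %w", target, err)
	} else if mounted {
		return nil // already mounted, nothing to do
	}
	// ...otherwise launch Mountpoint against target and return its result...
	return nil
}

func main() {
	fmt.Println(publishVolume("/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv>/mount"))
}
```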

@Shellmode
Contributor

Because the driver cannot read /host/proc/mounts, the directory it had just created is then removed.

@Shellmode
Contributor

Retrying the read of /host/proc/mounts succeeds. I have submitted a pull request; please kindly review.

It is necessary to keep investigating the root cause, but this issue currently affects many Karpenter (and possibly other) users. To prevent users from having to manually restart s3-csi-node-xxx every time, it is better to solve the issue by retrying the read in the code.
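
Roughly, the idea looks like the sketch below; the attempt count, delay, and function name are assumptions, not the exact code in the pull request.

```go
// Rough illustration of retrying the /host/proc/mounts read before giving up;
// the attempt count, delay, and naming are assumptions, not the exact change
// submitted in the pull request.
package main

import (
	"fmt"
	"os"
	"time"
)

func readProcMountsWithRetry(path string, attempts int, delay time.Duration) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		data, err := os.ReadFile(path)
		if err == nil {
			return data, nil
		}
		lastErr = err
		time.Sleep(delay) // the transient "invalid argument" error has been seen to clear on retry
	}
	return nil, fmt.Errorf("failed to read %s after %d attempts: %w", path, attempts, lastErr)
}

func main() {
	data, err := readProcMountsWithRetry("/host/proc/mounts", 3, 100*time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("read %d bytes from the host mount table\n", len(data))
}
```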

@unexge
Contributor

unexge commented Jul 24, 2024

Thanks a lot for the deep-dive and the pull request @Shellmode!

I'm trying to reproduce the issue to understand the root cause. I've tried using karpenter.k8s.aws/instance-category IN ["p"] in my Karpenter node pool configuration to get a GPU node, and I was able to get a node with amazon-eks-gpu-node-1.30-* AMI and AL2 OS as you mentioned.

I tried spawning new nodes a couple of times but couldn't reproduce the issue. I didn't set up https://github.com/NVIDIA/k8s-device-plugin properly though, and that might be the issue. I'll try to properly set up the NVIDIA plugin and reproduce the issue.

@Shellmode
Contributor

You can reproduce the issue by building this solution. But it may take some time.

@longzhihun

Same issue here; just wondering whether there is any update on this.

@unexge
Contributor

unexge commented Jul 29, 2024

Hey @longzhihun, we haven't been able to find the root cause yet, but meanwhile we'll merge @Shellmode's fix as a workaround.

unexge added a commit that referenced this issue Jul 30, 2024
* Fix cannot mount PVC in driver (#174) (#229)

* Add logs to retry on reading `/proc/mounts`

* Update logging message on retry

Co-authored-by: Daniel Carl Jones <danny@danielcarl.info>

* Fix formatting of log message

---------

Co-authored-by: Array <shellmode13@gmail.com>
Co-authored-by: Daniel Carl Jones <danny@danielcarl.info>
@Shellmode
Contributor

When will the workaround be merged into the current releases or a new release like 1.8.0?
Our customer is still being impacted.

@unexge
Contributor

unexge commented Aug 12, 2024

We plan to make a new release this or next week and the fix will be included in that release.

@lgb861213

I encountered the same problem. I checked and found that gpu-feature-discovery could not start normally; I tried to restart it, but it still could not run normally. Later, I installed k8s-device-plugin from https://github.com/NVIDIA/k8s-device-plugin and that fixed it.

@twellckhpx

twellckhpx commented Aug 20, 2024

I've had the NVIDIA/k8s-device-plugin since I first faced this issue, so I'm not sure this is fully related or a solution.

I'll update to their latest version and see if there is any impact on this particular issue.

Edit: Still facing the same issue with the latest k8s-device-plugin v0.16.2; obviously that's without the workaround.

@lgb861213

lgb861213 commented Aug 20, 2024

I synced s3-csi, downgraded its version to v1.6.0-eksbuild.1, and then installed k8s-device-plugin. I then used the following to restart s3-csi:
kubectl get pods -A|grep s3-csi|awk '{print $2}'|xargs -n1 kubectl delete pod -n kube-system
That resolved it; you can try it.

@Shellmode
Contributor

Simply restarting the s3-csi-xxx pods will fix it.

@unexge
Contributor

unexge commented Aug 30, 2024

v1.8.0 has been released with @Shellmode's potential fix for this issue. Could you please try upgrading to 1.8.0 to see if that fixes the problem for you?

@Shellmode
Contributor

The recently released v1.8 added a retry in the ListMounts() function; however, I tried the new release, got the same error message, and still cannot mount S3. I found that if the ListMounts() function ever returns `nil, error`, it won't work.

Just leave the error handling in the parseProcMounts() function and retry reading /proc/mounts by calling the ListMounts() function from the other function, which will work.

The "retry" may be somewhat confusing; it may be that some other function/module refreshes or restarts, which fixes the issue (just like restarting the pod).

@dienhartd

I experienced the same failure to mount a PVC with driver v1.9.0 and k8s 1.30. After noticing the original poster's downgrade workaround, I decremented the minor version to v1.8.0, which resolved it for me:

eksctl update addon --name aws-mountpoint-s3-csi-driver --version v1.8.0-eksbuild.1 --cluster <my-cluster> --region <region>

@John-Funcity

Is there any update on this issue? Has anyone solved this problem?

@John-Funcity

Have you run into the "FailedMount" error? aws-samples/comfyui-on-eks#11

Any update?

@muddyfish
Contributor

Hi @John-Funcity and @dienhartd, thanks for reporting that you're having a similar issue to this. Given our changes in v1.8.0, we're interested in root causing the issue you're having.

Please could you open a new bug report on this repository. It would be helpful for us if you included the following:

  1. The version of the CSI Driver you're using
  2. If you're installing via helm, karpenter, the EKS plugin, or something else
  3. Mountpoint and CSI Driver logs following this runbook: https://github.com/awslabs/mountpoint-s3-csi-driver/blob/main/docs/LOGGING.md

I'm closing this issue. Anyone else who has similar symptoms, please open a new issue so we can track it better.
