
Pod "Sometimes" cannot mount PVC in CSI version 1.4.0 #174

Closed
zhushendhh opened this issue Mar 31, 2024 · 30 comments
Labels: bug (Something isn't working), pending customer info

Comments

@zhushendhh

zhushendhh commented Mar 31, 2024

Hello Team,

I am trying to test CSI driver 1.4.0 in K8s 1.27, but I found that "sometimes" the Pod cannot mount the PVC and the CSI driver Pod reports the error below:

I0331 13:53:37.687569       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume" target_path:"/var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-east-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-outputs-835894076989-us-east-2" > 
I0331 13:53:37.687634       1 node.go:112] NodePublishVolume: mounting comfyui-outputs-835894076989-us-east-2 at /var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount with options [--allow-delete --region=us-east-2]
E0331 13:53:37.687730       1 driver.go:96] GRPC error: rpc error: code = Internal desc = Could not mount "comfyui-outputs-835894076989-us-east-2" at "/var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount": Could not check if "/var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument

Even in the case of a normal mount, the CSI driver Pod still posts logs like the above. It seems like the CSI driver keeps trying to mount the same PVC to the same Pod and failing. I am not sure whether anything is misconfigured.

These problems only happen when Karpenter scales up worker nodes with a new deployment/pod in EKS v1.27.
All mount operations are normal for static k8s worker nodes.

Workaround:

  1. Delete the s3-csi-xxxxx Pod running on the Karpenter worker node
  2. Use S3 CSI driver 1.0.0

Looking forward to your support, thanks.

@jjkr
Contributor

jjkr commented Apr 2, 2024

What underlying operating system are your node hosts running? The driver uses a host mount to read /proc/mounts from the host operating system to determine what is mounted on the system, and that error suggests there was a problem reading that file. This has been known to cause compatibility issues in the past, though it is odd that the behavior is intermittent.
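
For reference, here is a minimal sketch of that pattern (not the driver's actual code; the /host/proc/mounts path and helper names are illustrative assumptions): the plugin reads the host's mount table through a bind mount and scans it for the target path.

```go
// Illustrative sketch only: reading the host's /proc/mounts from inside a
// container to decide whether a path is already a mount point. The
// /host/proc/mounts path and function names are assumptions, not the
// driver's actual implementation.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

const hostProcMounts = "/host/proc/mounts" // host /proc bind-mounted into the container (assumed path)

// isMountPoint reports whether target appears in the host's mount table.
func isMountPoint(target string) (bool, error) {
	f, err := os.Open(hostProcMounts)
	if err != nil {
		// This is where an "invalid argument" error would surface.
		return false, fmt.Errorf("failed to read %s: %w", hostProcMounts, err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line: <device> <mount point> <fstype> <options> <dump> <pass>
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[1] == target {
			return true, nil
		}
	}
	return false, scanner.Err()
}

func main() {
	mounted, err := isMountPoint("/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv>/mount")
	fmt.Println(mounted, err)
}
```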

@psavva

psavva commented Apr 21, 2024

I'm facing the same issue on version 1.4.0.

E0421 17:49:31.700733       1 driver.go:97] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/1223b3bd-75c1-4ce8-ad56-f61ed72707e4/volumes/kubernetes.io~csi/s3-pv-appdata/mount": Failed to cat /proc/mounts: Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument
I0421 17:49:31.804114       1 node.go:188] NodeGetCapabilities: called with args
I0421 17:49:31.804755       1 node.go:188] NodeGetCapabilities: called with args
I0421 17:49:31.805458       1 node.go:188] NodeGetCapabilities: called with args
I0421 17:49:31.806119       1 node.go:49] NodePublishVolume: called with args volume_id:"s3-csi-driver-volume" target_path:"/var/lib/kubelet/pods/acc27aed-9202-4e11-995c-b5f0fadb70a5/volumes/kubernetes.io~csi/s3-pv-appdata/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"allow-overwrite" mount_flags:"region eu-central-1" mount_flags:"cache /tmp" mount_flags:"metadata-ttl 1200" mount_flags:"max-cache-size 2000" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"groovit-app-data" >
I0421 17:49:31.806171       1 node.go:81] NodePublishVolume: creating dir /var/lib/kubelet/pods/acc27aed-9202-4e11-995c-b5f0fadb70a5/volumes/kubernetes.io~csi/s3-pv-appdata/mount
I0421 17:49:31.806248       1 node.go:108] NodePublishVolume: mounting groovit-app-data at /var/lib/kubelet/pods/acc27aed-9202-4e11-995c-b5f0fadb70a5/volumes/kubernetes.io~csi/s3-pv-appdata/mount with options [--allow-delete --allow-overwrite --cache=/tmp --max-cache-size=2000 --metadata-ttl=1200 --region=eu-central-1]
E0421 17:49:31.806445       1 driver.go:97] GRPC error: rpc error: code = Internal desc = Could not mount "groovit-app-data" at "/var/lib/kubelet/pods/acc27aed-9202-4e11-995c-b5f0fadb70a5/volumes/kubernetes.io~csi/s3-pv-appdata/mount": Mount failed: Failed to start transient systemd service: Failed StartTransientUnit with Call.Err: dbus: connection closed by user output:
I0421 17:49:31.906146       1 node.go:144] NodeUnpublishVolume: called with args volume_id:"s3-csi-driver-volume" target_path:"/var/lib/kubelet/pods/1223b3bd-75c1-4ce8-ad56-f61ed72707e4/volumes/kubernetes.io~csi/s3-pv-appdata/mount"

@lekev

lekev commented May 28, 2024

Same problem with version 1.6.0.

@twellckhpx

Can confirm this happened on version 1.6.0 on some Karpenter-managed nodes. Restarting the s3-csi pods did solve it, but it would be good to find the root cause of this (especially if it happens when a cluster scales nodes).
I can't be 100% sure, but so far it seems to have happened on newly provisioned nodes.

@dannycjones added the bug label Jun 6, 2024
@adesito

adesito commented Jun 17, 2024


Same here with 1.6.0 too, and the same workaround of restarting the s3 pod after a GPU EC2 machine was created using Karpenter.

@augustkang

I am experiencing the same issue with Kubernetes version 1.28 on EKS with p4d.24xlarge instances.

@muddyfish
Contributor

Are you still experiencing this on 1.7.0?

If so, we need logs to investigate this issue; please see https://github.com/awslabs/mountpoint-s3-csi-driver/blob/main/docs/LOGGING.md for how to collect them.

@augustkang

Thank you for your response. I just upgraded from version 1.6.0 to 1.7.0. I will need some time to observe the system for any reoccurrence of the issue. If the problem persists, I will collect and share the necessary logs as per your provided guidelines.

@twellckhpx

Thanks for taking some time to look into this. Unfortunately, I just updated to v1.7.0 and I still face the same issue.

Here are my CSI Driver logs with the default logLevel of 4:

I0708 09:01:01.678376       1 driver.go:59] Driver version: 1.7.0, Git commit: 53b62cb27036138b46e51f34ddef454fd0f89c6c, build date: 2024-06-18T11:10:59Z, nodeID: ip-10-0-92-227.ap-southeast-2.compute.internal, mount-s3 version: 1.7.2
I0708 09:01:01.679523       1 driver.go:79] Found AWS_WEB_IDENTITY_TOKEN_FILE, syncing token
I0708 09:01:01.679708       1 driver.go:109] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0708 09:01:06.417348       1 node.go:222] NodeGetInfo: called with args 
I0708 09:01:35.412398       1 node.go:222] NodeGetInfo: called with args 
I0708 09:01:36.511799       1 node.go:206] NodeGetCapabilities: called with args 
I0708 09:01:36.513141       1 node.go:206] NodeGetCapabilities: called with args 
I0708 09:01:36.514163       1 node.go:206] NodeGetCapabilities: called with args 
I0708 09:01:36.515191       1 node.go:206] NodeGetCapabilities: called with args 
I0708 09:01:36.518005       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume" target_path:"/var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount" volume_capability:<mount:<mount_flags:"region ap-southeast-2" mount_flags:"uid=1000" mount_flags:"allow-other" > access_mode:<mode:MULTI_NODE_READER_ONLY > > volume_context:<key:"bucketName" value:"mdr3ivviap-hpx-ai" > 
I0708 09:01:36.518078       1 node.go:112] NodePublishVolume: mounting mdr3ivviap-hpx-ai at /var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount with options [--allow-other --read-only --region=ap-southeast-2 --uid=1000]
E0708 09:01:36.518202       1 driver.go:96] GRPC error: rpc error: code = Internal desc = Could not mount "mdr3ivviap-hpx-ai" at "/var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount": Could not check if "/var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount" is a mount point: stat /var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument

I have also tried to retrieve the Mountpoint logs but have not been able to retrieve any.
I was not able to retrieve the MOUNT_PID as per the instructions you shared, and journalctl --boot -t mount-s3 did not return any logs (see below).

-- No entries --

Please feel free to let me know if I missed a step, or if you would like me to follow more specific instructions to get you any more relevant information.

@dannycjones
Contributor

Thanks for sharing this, @twellckhpx!

It's really unclear right now why the driver cannot read /proc/mounts. It is understandable that there are no Mountpoint logs, since we don't get as far as launching Mountpoint.

If you still have access to that node or are able to reproduce, please can you check dmesg on the node to see if there's any log related to opening /proc/mounts. I'm hoping that will contain information that can give us a clue into what's going wrong with more granularity than "invalid argument".

Please can you also share what operating system you're using for your K8s nodes, and any other OS configurations (like SELinux) that may be interacting with the CSI driver.

@Shellmode
Contributor

Shellmode commented Jul 21, 2024

I can reproduce the issue 100% of the time: when scaling out nodes with Karpenter, the pod on the newly provisioned node cannot mount the S3 bucket.

The node OS is Amazon Linux 2, AMI ID amazon-eks-gpu-node-1.30-v20240703 (older versions of amazon-eks-gpu-node-xxx also have the same issue).

Here are some logs for your reference

Failed log

Defaulted container "s3-plugin" out of: s3-plugin, node-driver-registrar, liveness-probe, install-mountpoint (init)
I0718 16:22:01.176122       1 driver.go:59] Driver version: 1.7.0, Git commit: 53b62cb27036138b46e51f34ddef454fd0f89c6c, build date: 2024-06-18T11:10:59Z, nodeID: ip-10-2-126-165.us-west-2.compute.internal, mount-s3 version: 1.7.2                                                                                                                                                    
I0718 16:22:01.177147       1 driver.go:79] Found AWS_WEB_IDENTITY_TOKEN_FILE, syncing token
I0718 16:22:01.177329       1 driver.go:109] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}                                                           
I0718 16:22:01.962501       1 node.go:222] NodeGetInfo: called with args
I0718 16:22:39.128126       1 node.go:222] NodeGetInfo: called with args                                                                                                                     
I0718 16:22:59.729019       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.729166       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.732021       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:22:59.732073       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.732799       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.732898       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:22:59.733592       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.733592       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:22:59.734762       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume-outputs" target_path:"/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-west-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-outputs-930179054915-us-west-2" >
I0718 16:22:59.734824       1 node.go:112] NodePublishVolume: mounting comfyui-outputs-930179054915-us-west-2 at /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount with options [--allow-delete --region=us-west-2]
I0718 16:22:59.734840       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume-inputs" target_path:"/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-west-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-inputs-930179054915-us-west-2" >
I0718 16:22:59.734891       1 node.go:112] NodePublishVolume: mounting comfyui-inputs-930179054915-us-west-2 at /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount with options [--allow-delete --region=us-west-2]
E0718 16:22:59.734961       1 driver.go:96] GRPC error: rpc error: code = Internal desc = Could not mount "comfyui-outputs-930179054915-us-west-2" at "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount": Could not check if "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument
E0718 16:22:59.735036       1 driver.go:96] GRPC error: rpc error: code = Internal desc = Could not mount "comfyui-inputs-930179054915-us-west-2" at "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount": Could not check if "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument
I0718 16:23:00.333023       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.333022       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.333794       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.333796       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.334829       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.334832       1 node.go:206] NodeGetCapabilities: called with args

Restart and successfully mounted log

Defaulted container "s3-plugin" out of: s3-plugin, node-driver-registrar, liveness-probe, install-mountpoint (init)
I0718 16:00:54.819571       1 driver.go:59] Driver version: 1.7.0, Git commit: 53b62cb27036138b46e51f34ddef454fd0f89c6c, build date: 2024-06-18T11:10:59Z, nodeID: ip-10-2-135-107.us-west-2.compute.internal, mount-s3 version: 1.7.2
I0718 16:00:54.820723       1 driver.go:79] Found AWS_WEB_IDENTITY_TOKEN_FILE, syncing token
I0718 16:00:54.821032       1 driver.go:109] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0718 16:00:55.267048       1 node.go:222] NodeGetInfo: called with args
I0718 16:02:30.532895       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.532984       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.533770       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.533824       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.534440       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.534787       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.535870       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume-outputs" target_path:"/var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-west-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-outputs-930179054915-us-west-2" >
I0718 16:02:30.535929       1 node.go:112] NodePublishVolume: mounting comfyui-outputs-930179054915-us-west-2 at /var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount with options [--allow-delete --region=us-west-2]
I0718 16:02:30.535927       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume-inputs" target_path:"/var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-west-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-inputs-930179054915-us-west-2" >
I0718 16:02:30.535966       1 node.go:112] NodePublishVolume: mounting comfyui-inputs-930179054915-us-west-2 at /var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount with options [--allow-delete --region=us-west-2]
I0718 16:02:30.663455       1 node.go:132] NodePublishVolume: /var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount was mounted
I0718 16:02:30.663558       1 node.go:132] NodePublishVolume: /var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount was mounted
I0718 16:04:25.236179       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:04:25.237064       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:05:35.113343       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:05:35.114167       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:06:53.403359       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:06:53.404562       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:08:02.679829       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:08:02.680536       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:09:26.008572       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:09:26.009406       1 node.go:206] NodeGetCapabilities: called with args

I can also access the newly booted node, which cannot mount the S3 bucket.

The error log shows:

0s          Warning   FailedMount   pod/comfyui-54698dcb57-tzkp5   MountVolume.SetUp failed for volume "comfyui-outputs-pv" : rpc error: code = Internal desc = Could not mount "comfyui-outputs-930179054915-us-west-2" at "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount": Could not check if "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument

It seems there is no such file (dir) /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount on the node.

After I restart the s3-csi-node-xxx pod, the dir /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount is created.

I am still investigating; if you want to reproduce the bug, I'm willing to help.

@Shellmode
Contributor

It seems that the created directory is removed by this line.

I added some debug messages to the mountpoint-s3-csi-driver v1.7.0 source code and rebuilt the image to replace the current daemonsets/s3-csi-node.

Added some debug messages around the suspected code (Screenshot 2024-07-24 at 00 57 14).

Here are my findings:

When s3-csi-node-xxx runs for the first time, cleanupDir = true, and the created directory is cleaned up (Screenshot 2024-07-24 at 00 55 47).

But after restarting s3-csi-node-xxx, cleanupDir = false, and the created directory is not cleaned up (Screenshot 2024-07-24 at 00 58 54).
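
To make the suspected flow easier to follow, here is a simplified, hypothetical sketch of the pattern described above (create the target dir, check the host mount table, and clean the dir up again on failure). It is not the driver's real code; the function names, permissions, and paths are assumptions.

```go
// Hypothetical sketch of the suspected NodePublishVolume flow, not the
// driver's real implementation. If reading /host/proc/mounts fails, the
// freshly created target directory is removed again, which matches the
// "no such file or directory" stat errors seen until the pod is restarted.
package main

import (
	"fmt"
	"os"
	"strings"
)

// isMounted scans the host mount table for target (assumed path and logic).
func isMounted(target string) (bool, error) {
	data, err := os.ReadFile("/host/proc/mounts")
	if err != nil {
		return false, fmt.Errorf("failed to read /host/proc/mounts: %w", err)
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[1] == target {
			return true, nil
		}
	}
	return false, nil
}

func publishVolume(target string) error {
	cleanupDir := false
	if _, err := os.Stat(target); os.IsNotExist(err) {
		if err := os.MkdirAll(target, 0750); err != nil {
			return err
		}
		cleanupDir = true // we created the dir, so remove it again if anything below fails
	}

	if mounted, err := isMounted(target); err != nil {
		if cleanupDir {
			os.Remove(target) // the cleanup observed on the first run of s3-csi-node-xxx
		}
		return fmt.Errorf("could not check if %q is a mount point: %w", target, err)
	} else if mounted {
		return nil // already mounted, nothing to do
	}
	// ...otherwise launch Mountpoint against target and return its result...
	return nil
}

func main() {
	fmt.Println(publishVolume("/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv>/mount"))
}
```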

@Shellmode
Contributor

Because the driver cannot read /host/proc/mounts, the directory it had just created is then removed.

@Shellmode
Contributor

Retrying the read of /host/proc/mounts succeeds. I have submitted a pull request; please kindly review.

It is necessary to keep investigating the root cause, but this issue currently affects many Karpenter (and possibly other) users. To prevent users from having to manually restart s3-csi-node-xxx every time, it is better to solve the issue by retrying the read in the code.
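
Roughly, the idea looks like the sketch below; the attempt count, delay, and function name are assumptions, not the exact code in the pull request.

```go
// Rough illustration of retrying the /host/proc/mounts read before giving up;
// the attempt count, delay, and naming are assumptions, not the exact change
// submitted in the pull request.
package main

import (
	"fmt"
	"os"
	"time"
)

func readProcMountsWithRetry(path string, attempts int, delay time.Duration) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		data, err := os.ReadFile(path)
		if err == nil {
			return data, nil
		}
		lastErr = err
		time.Sleep(delay) // the transient "invalid argument" error has been seen to clear on retry
	}
	return nil, fmt.Errorf("failed to read %s after %d attempts: %w", path, attempts, lastErr)
}

func main() {
	data, err := readProcMountsWithRetry("/host/proc/mounts", 3, 100*time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("read %d bytes from the host mount table\n", len(data))
}
```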

@unexge
Contributor

unexge commented Jul 24, 2024

Thanks a lot for the deep-dive and the pull request @Shellmode!

I'm trying to reproduce the issue to understand the root cause. I've tried using karpenter.k8s.aws/instance-category IN ["p"] in my Karpenter node pool configuration to get a GPU node, and I was able to get a node with amazon-eks-gpu-node-1.30-* AMI and AL2 OS as you mentioned.

I tried spawning new nodes a couple of times but couldn't reproduce the issue. I didn't set up https://github.com/NVIDIA/k8s-device-plugin properly though, and that might be the issue. I'll try to properly set up the NVIDIA plugin and reproduce the issue.

@Shellmode
Contributor

You can reproduce the issue by building this solution. But it may take some time.

@longzhihun

Same issue here; just wondering whether there is any update on this.

@unexge
Contributor

unexge commented Jul 29, 2024

Hey @longzhihun, we haven't been able to find the root cause yet, but meanwhile we'll merge @Shellmode's fix as a workaround.

unexge added a commit that referenced this issue Jul 30, 2024
* Fix cannot mount PVC in driver (#174) (#229)

* Add logs to retry on reading `/proc/mounts`

* Update logging message on retry

Co-authored-by: Daniel Carl Jones <danny@danielcarl.info>

* Fix formatting of log message

---------

Co-authored-by: Array <shellmode13@gmail.com>
Co-authored-by: Daniel Carl Jones <danny@danielcarl.info>
@Shellmode
Contributor

When will the workaround be merged into the current releases or a new release like 1.8.0?
Our customer is still being impacted.

@unexge
Contributor

unexge commented Aug 12, 2024

We plan to make a new release this or next week and the fix will be included in that release.

@lgb861213

I encountered the same problem. I checked and found that gpu-feature-discovery could not start normally; I tried to restart it, but it still could not run normally. Later, I installed k8s-device-plugin from https://github.com/NVIDIA/k8s-device-plugin and that fixed it.

@twellckhpx

twellckhpx commented Aug 20, 2024

I've had the NVIDIA/k8s-device-plugin since I first faced this issue, so I'm not sure this is fully related or a solution.

I'll update to their latest version and see if there is any impact on this particular issue.

Edit: Still facing the same issue with the latest k8s-device-plugin v0.16.2; obviously that's without the workaround.

@lgb861213

lgb861213 commented Aug 20, 2024

I synced s3-csi, downgraded its version to v1.6.0-eksbuild.1, and then installed k8s-device-plugin. I then used the following to restart s3-csi:
kubectl get pods -A|grep s3-csi|awk '{print $2}'|xargs -n1 kubectl delete pod -n kube-system
That resolved it; you can try it.

@Shellmode
Contributor

Simply restarting the s3-csi-xxx pods will fix it.

@unexge
Contributor

unexge commented Aug 30, 2024

v1.8.0 has been released with @Shellmode's potential fix for this issue. Could you please try upgrading to 1.8.0 to see if that fixes the problem for you?

@Shellmode
Contributor

The recently released v1.8 added a retry in the ListMounts() function; however, I tried the new release, got the same error message, and still cannot mount S3. I found that if the ListMounts() function ever returns `nil, error`, it won't work.

Just leave the error handling in the parseProcMounts() function and retry reading /proc/mounts by calling the ListMounts() function from the other function, which will work.

The "retry" may be somewhat confusing; it may be that some other function/module refreshes or restarts, which fixes the issue (just like restarting the pod).

@dienhartd

I experienced the same failure to mount a PVC with driver v1.9.0 and k8s 1.30. After noticing the original poster's downgrade workaround, I decremented the minor version to v1.8.0, which resolved it for me:

eksctl update addon --name aws-mountpoint-s3-csi-driver --version v1.8.0-eksbuild.1 --cluster <my-cluster> --region <region>

@John-Funcity

Is there any update on this issue? Has anyone solved this problem?

@John-Funcity

Have you run into the "FailedMount" error? aws-samples/comfyui-on-eks#11

Any update?

@muddyfish
Contributor

Hi @John-Funcity and @dienhartd, thanks for reporting that you're having a similar issue to this. Given our changes in v1.8.0, we're interested in root causing the issue you're having.

Please could you open a new bug report on this repository. It would be helpful for us if you included the following:

  1. The version of the CSI Driver you're using
  2. If you're installing via helm, karpenter, the EKS plugin, or something else
  3. Mountpoint and CSI Driver logs following this runbook: https://github.com/awslabs/mountpoint-s3-csi-driver/blob/main/docs/LOGGING.md

I'm closing this issue. Anyone else who has similar symptoms, please open a new issue so we can track it better.
