Pod "Sometimes" cannot mount PVC in CSI version 1.4.0 #174
Comments
What underlying operating system are your node hosts running? The driver uses a host mount to read /proc/mounts from the host operating system to determine what is mounted on the system, and that error suggests there was an error reading that file. This has been known to cause compatibility issues in the past, though it is odd that the behavior is intermittent.
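For context on the mechanism described above, here is a minimal, self-contained Go sketch of reading a procfs mounts file (such as /proc/mounts on the host, or /host/proc/mounts when the host's /proc is bind-mounted into the driver container). It is an illustration only, not the driver's actual code; the failure discussed in this issue happens when a read like this errors out.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// readProcMounts reads a procfs mounts file (e.g. /proc/mounts, or
// /host/proc/mounts when the host's /proc is mounted into the container)
// and returns one line per mounted filesystem.
func readProcMounts(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, fmt.Errorf("failed to open %s: %w", path, err)
	}
	defer f.Close()

	var mounts []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line != "" {
			mounts = append(mounts, line)
		}
	}
	return mounts, scanner.Err()
}

func main() {
	mounts, err := readProcMounts("/proc/mounts")
	if err != nil {
		// This is the kind of read failure being reported in this issue.
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Printf("found %d mount entries\n", len(mounts))
}
```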
I'm facing the same on version 1.4.0
Same problem with version 1.6.0.
Can confirm this happened on version 1.6.0 on some Karpenter-managed nodes. Restarting the s3-csi pods did solve it, but it would be good to find the root cause of this (especially if it happens when a cluster scales nodes).
Same here with 1.6.0, and the same workaround of restarting the s3-csi pod after a GPU EC2 machine was created using Karpenter.
I am experiencing the same issue with Kubernetes version 1.28 on EKS with p4d.24xlarge instances.
Are you still experiencing this on 1.7.0? If so, we need logs to investigate this issue; please see https://github.com/awslabs/mountpoint-s3-csi-driver/blob/main/docs/LOGGING.md for how to collect them.
Thank you for your response. I just upgraded from version 1.6.0 to 1.7.0. I will need some time to observe the system for any recurrence of the issue. If the problem persists, I will collect and share the necessary logs as per your provided guidelines.
Thanks for taking some time to look into this. Unfortunately, I just updated to v1.7.0 and I still face the same issue. Here are my CSI Driver logs with the default logLevel of 4:
I have also tried to retrieve the Mountpoint logs but have not been able to retrieve any.
Please feel free to let me know if I missed a step, or if you would like me to follow some more specific instructions in order to get you more relevant information.
Thanks for sharing this, @twellckhpx! It's really unclear right now why the driver cannot read /host/proc/mounts. If you still have access to that node or are able to reproduce, please can you check whether that file can be read on the host. Please can you also share what operating system you're using for your K8s nodes, and any other OS configurations (like SELinux) that may be interacting with the CSI driver.
I can 100% reproduce the issue: when scaling out nodes with Karpenter, pods on a newly provisioned node cannot mount the S3 bucket. The node OS is Amazon Linux 2, AMI ID amazon-eks-gpu-node-1.30-v20240703 (older versions of amazon-eks-gpu-node-xxx also have the same issue). Here are some logs for your reference. Failed log:
Log after restarting, with a successful mount:
I can also access the newly booted node that cannot mount the S3 bucket. The error log shows that:
It seems like there's no such file (directory). After restarting the s3-csi-node-xxx pod, the directory is there. Still investigating; if you want to reproduce the bug, I'm willing to help.
It seems that the created directory is removed by this line. I added some debug messages around the suspected code in the mountpoint-s3-csi-driver v1.7.0 source, rebuilt the image, and used it to replace the current one. Here are my findings: when s3-csi-node-xxx runs for the first time, reading /host/proc/mounts fails; but after restarting s3-csi-node-xxx, it succeeds.
Because the driver cannot read /host/proc/mounts, the directory it created is then removed.
My suggestion is to retry reading /host/proc/mounts. It is necessary to keep investigating the root cause, but this issue currently affects many Karpenter (and possibly other) users. To prevent users from having to manually restart the s3-csi-node pods, I've opened a pull request that adds this retry.
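The suggestion above amounts to a bounded retry around the read of /host/proc/mounts. A minimal Go sketch of that idea follows; the attempt count, delay, and log wording are illustrative assumptions, not the values used in the actual pull request.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// readMountsWithRetry retries reading the host's mounts file a few times
// before giving up, so that a transient failure shortly after node start-up
// does not immediately fail the whole mount operation.
// Attempt count, delay, and path here are illustrative, not the driver's actual values.
func readMountsWithRetry(path string, attempts int, delay time.Duration) ([]byte, error) {
	var lastErr error
	for i := 1; i <= attempts; i++ {
		data, err := os.ReadFile(path)
		if err == nil {
			return data, nil
		}
		lastErr = err
		fmt.Fprintf(os.Stderr, "failed to read %s (attempt %d/%d): %v; retrying\n", path, i, attempts, err)
		time.Sleep(delay)
	}
	return nil, fmt.Errorf("giving up reading %s after %d attempts: %w", path, attempts, lastErr)
}

func main() {
	if _, err := readMountsWithRetry("/host/proc/mounts", 3, time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("mounts file read successfully")
}
```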
Thanks a lot for the deep-dive and the pull request, @Shellmode! I'm trying to reproduce the issue to understand the root cause. I tried spawning new nodes a couple of times but couldn't reproduce the issue. I didn't set up https://github.com/NVIDIA/k8s-device-plugin properly though, and that might be why. I'll try to properly set up the NVIDIA plugin and reproduce the issue.
You can reproduce the issue by building this solution, but it may take some time.
Same issue here; I just want to know if there is any update on this.
Hey @longzhihun, we haven't been able to find the root cause yet, but meanwhile we'll merge @Shellmode's fix as a workaround.
Fix cannot mount PVC in driver (#174) (#229)
* Add logs to retry on reading `/proc/mounts`
* Update logging message on retry
* Fix formatting of log message
Co-authored-by: Array <shellmode13@gmail.com>
Co-authored-by: Daniel Carl Jones <danny@danielcarl.info>
When will the workaround be merged into a current release or a new release like 1.8.0?
We plan to make a new release this week or next, and the fix will be included in that release.
I encountered the same problem, and when I checked I found that gpu-feature-discovery could not start normally. I tried to restart it but it still could not run normally. Later, I installed k8s-device-plugin from https://github.com/NVIDIA/k8s-device-plugin and that fixed it.
I've had the NVIDIA/k8s-device-plugin installed since I first faced this issue, so I'm not sure this is fully related or a solution. I'll update to their latest version and see if there is any impact on this particular issue. Edit: still facing the same issue with the latest k8s-device-plugin v0.16.2; obviously that's without the workaround.
I synced s3-csi, downgraded its version to v1.6.0-eksbuild.1, installed k8s-device-plugin, and then used the following to restart s3-csi:
Simply restarting the s3-csi-xxx pods will fix it.
v1.8.0 has been released with @Shellmode's potential fix for this issue. Could you please try upgrading to 1.8.0 to see if that fixes the problem for you?
The recently released v1.8 added a retry in the ListMounts() function; however, I tried the new release and got the same error message, still cannot mount S3. I found that if ListMounts() ever returns a (nil, error) result, the retry there doesn't help. Keeping the error handling in the parseProcMounts() function and instead retrying the read of /proc/mounts by calling ListMounts() again from another function does work. The word "retry" may be somewhat misleading here; it may be that another function/module being refreshed/restarted is what actually fixes the issue (just like restarting the pod does).
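To illustrate the distinction drawn in this comment, below is a hedged Go sketch of retrying at the call site rather than inside ListMounts() itself, and of treating an empty result as retryable. ListMounts and parseProcMounts are names taken from the comment above; the signature, structure, and retry parameters are assumptions for illustration, not the driver's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// listMountsFromCaller retries the mount listing from the caller's side,
// treating both an error and an empty result as reasons to try again,
// instead of retrying only inside the listing function itself.
// The function type and retry parameters are assumed for this sketch.
func listMountsFromCaller(listMounts func() ([]string, error), attempts int, delay time.Duration) ([]string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		mounts, err := listMounts()
		if err == nil && len(mounts) > 0 {
			return mounts, nil
		}
		if err != nil {
			lastErr = err
		}
		time.Sleep(delay)
	}
	if lastErr != nil {
		return nil, fmt.Errorf("ListMounts kept failing: %w", lastErr)
	}
	return nil, errors.New("ListMounts returned no entries after retries")
}

func main() {
	// Hypothetical stub standing in for the driver's real ListMounts.
	fakeListMounts := func() ([]string, error) { return nil, errors.New("simulated read failure") }
	if _, err := listMountsFromCaller(fakeListMounts, 3, 100*time.Millisecond); err != nil {
		fmt.Println("error:", err)
	}
}
```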
Experienced the same failure to mount the PVC with driver v1.9.0 and k8s 1.30. After noticing the original poster's downgrade workaround, I decremented the minor version to v1.8.0, which resolved it for me.
Has anyone got an update on this issue or managed to solve this problem?
Any update?
Hi @John-Funcity and @dienhartd, thanks for reporting that you're having a similar issue to this. Given our changes in v1.8.0, we're interested in root causing the issue you're having. Please could you open a new bug report on this repository. It would be helpful for us if you included the following:
I'm closing this issue - anyone else who has similar symptoms, please open a new issue so we can track it better.
Hello Team,
I am trying to test CSI driver 1.4.0 on K8s 1.27, but I found that "sometimes" the Pod cannot mount the PVC and the CSI driver Pod reports the error below:
Even in the case of a normal mount, the CSI driver Pod still posts logs like the ones above. It seems like the CSI driver keeps trying to mount the same PVC to the same Pod and failing. I'm not sure whether something is misconfigured.
These problems only happen with Karpenter scaling up worker nodes with a new deployment/pod in EKS v1.27.
All mount operations are normal for static k8s worker nodes.
Workaround:
Looking forward to your support, thanks.