
Timeout from geesefs leads to a wrong mount into a pod #107

Closed
berlic opened this issue Apr 5, 2024 · 3 comments

Comments


berlic commented Apr 5, 2024

Hi, thanks for this CSI driver — very useful indeed!

We were troubleshooting a transient problem with s3 mounts inside pods.
The problem is that sometimes, after a pod has started, the directory inside the pod that is expected to be an s3 mount actually points somewhere else.

We noticed some patterns in the issue and believe that there is a bug in the geesefs mounter here (it returns without calling waitForMount).

Chain of events to trigger the bug:

  • A pod is scheduled to a node
  • Kubelet tries to create a mountpoint through csi-s3
  • csi-s3 creates SystemD unit with geesefs process and starts it
  • geesefs is slow to create a mount for some reason (although the process is running)
  • csi-s3 waits for the mountpoint to appear and fails with "GRPC error: Timeout waiting for mount"
  • Kubelet retries the process
  • csi-s3 detects a running SystemD unit with the proper arguments and assumes that the mountpoint exists (this is the bug; see the sketch after this list)
  • Kubelet assumes that csi-s3 has done its job and creates a pod
  • Docker starts the container with the volume mount, which points to the wrong place (a directory on the disk where /var/lib/kubelet/pods/ is located instead of the geesefs mountpoint, because the original geesefs process is still slow to create the mount and the mountpoint does not exist yet)
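
For illustration, here is a minimal Go sketch of the suspected control flow; all function names are hypothetical stand-ins, not the actual csi-s3 code:

```go
package mounter

import "time"

// Hypothetical stand-ins for the real systemd/CSI plumbing; the actual
// csi-s3 code differs, this only illustrates the control flow.
func systemdUnitIsActive(unit string) bool            { return false }
func startSystemdUnit(unit string) error              { return nil }
func waitForMount(path string, d time.Duration) error { return nil }

// Suspected buggy flow: on kubelet's retry, the geesefs systemd unit from
// the previous (timed-out) attempt is still running, so the check below
// short-circuits and reports success even though the FUSE mount may not
// exist yet.
func ensureMounted(unit, target string) error {
	if systemdUnitIsActive(unit) {
		return nil // BUG: unit is active, but the mountpoint may not be ready
	}
	if err := startSystemdUnit(unit); err != nil {
		return err
	}
	return waitForMount(target, 10*time.Second)
}

// A fix in the spirit of the one later committed upstream: always verify
// the mountpoint, even when the unit was already active.
func ensureMountedFixed(unit, target string) error {
	if !systemdUnitIsActive(unit) {
		if err := startSystemdUnit(unit); err != nil {
			return err
		}
	}
	return waitForMount(target, 10*time.Second)
}
```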

The timeout for waitForMount is 10 seconds, while a slow geesefs start can take as long as 2 minutes:

```
Apr 04 15:28:03 host123 systemd[1]: Started GeeseFS mount for Kubernetes volume test/pvc-d0b3ec1f-2c7b-4e66-cc1b-2f1c5db11d87.
Apr 04 15:30:03 host123 geesefs[1946684]: 2024/04/04 15:30:03.670881 fuse.DEBUG Beginning the mounting kickoff process
```
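
To make the timing mismatch concrete, here is a sketch of what a waitForMount-style poll typically looks like (hypothetical code, not the driver's actual implementation); with a 10-second deadline it gives up long before a 2-minute geesefs startup completes:

```go
package mounter

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// waitForMountSketch polls /proc/mounts until target appears as a
// mountpoint or the deadline passes.
func waitForMountSketch(target string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		data, err := os.ReadFile("/proc/mounts")
		if err != nil {
			return err
		}
		for _, line := range strings.Split(string(data), "\n") {
			fields := strings.Fields(line)
			if len(fields) >= 2 && fields[1] == target {
				return nil // mountpoint is live
			}
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("timeout waiting for mount at %s", target)
}
```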

vitalif commented Apr 22, 2024

Thanks, I committed a fix to master.
That said, I think GeeseFS should never start that slowly; it doesn't do anything slow when it starts...


vitalif commented Apr 22, 2024

Released 0.40.8, please check


vitalif commented Apr 22, 2024

Please note that if you're not using Helm, you should delete the resources from the old provisioner.yaml when upgrading:
https://github.com/yandex-cloud/k8s-csi-s3/?tab=readme-ov-file#upgrading
