Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pod cannot be deleted if the S3 PVC fails to mount #309

Open
geniass opened this issue Dec 5, 2024 · 5 comments
Open

Pod cannot be deleted if the S3 PVC fails to mount #309

geniass opened this issue Dec 5, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@geniass
Copy link

geniass commented Dec 5, 2024

/kind bug

NOTE: If this is a filesystem related bug, please take a look at the Mountpoint repo to submit a bug report

What happened?
Mounting an s3 volume into a pod fails (FailedMount k8s event). Then the kubelet is unable to delete the pod because the vol_data.json file is missing. The pod hangs in the Terminating state forever.

Kubelet logs get flooded with:

Dec 05 10:46:44 gin k3s[3936577]: E1205 10:46:44.663100 3936577 reconciler_common.go:158] "operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume \"s3-mybucket-storage\" (UniqueName: \"kubernetes.io/csi/s3.csi.aws.com^s3-csi-driver-volume-mybucket-storage\") pod \"ee13ebc2-e424-47da-aa81-c8f627f69782\" (UID: \"ee13ebc2-e424-47da-aa81-c8f627f69782\") : UnmountVolume.NewUnmounter failed for volume \"s3-mybucket-storage\" (UniqueName: \"kubernetes.io/csi/s3.csi.aws.com^s3-csi-driver-volume-mybucket-storage\") pod \"ee13ebc2-e424-47da-aa81-c8f627f69782\" (UID: \"ee13ebc2-e424-47da-aa81-c8f627f69782\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/ee13ebc2-e424-47da-aa81-c8f627f69782/volumes/kubernetes.io~csi/s3-mybucket-storage/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/ee13ebc2-e424-47da-aa81-c8f627f69782/volumes/kubernetes.io~csi/s3-mybucket-storage/vol_data.json]: open /var/lib/kubelet/pods/ee13ebc2-e424-47da-aa81-c8f627f69782/volumes/kubernetes.io~csi/s3-mybucket-storage/vol_data.json: no such file or directory" err="UnmountVolume.NewUnmounter failed for volume \"s3-mybucket-storage\" (UniqueName: \"kubernetes.io/csi/s3.csi.aws.com^s3-csi-driver-volume-mybucket-storage\") pod \"ee13ebc2-e424-47da-aa81-c8f627f69782\" (UID: \"ee13ebc2-e424-47da-aa81-c8f627f69782\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/ee13ebc2-e424-47da-aa81-c8f627f69782/volumes/kubernetes.io~csi/s3-mybucket-storage/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/ee13ebc2-e424-47da-aa81-c8f627f69782/volumes/kubernetes.io~csi/s3-mybucket-storage/vol_data.json]: open /var/lib/kubelet/pods/ee13ebc2-e424-47da-aa81-c8f627f69782/volumes/kubernetes.io~csi/s3-mybucket-storage/vol_data.json: no such file or directory"

What you expected to happen?
S3 volume should be unmounted and the pod should terminate normally.

How to reproduce it (as minimally and precisely as possible)?
S3 PV and PVC:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-mybucket-storage
spec:
  accessModes:
  - ReadOnlyMany
  capacity:
    storage: 1200Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: s3-mybucket-storage
    namespace: coder
  csi:
    driver: s3.csi.aws.com
    volumeAttributes:
      bucketName: mybucket-af-south-1-storage
    volumeHandle: s3-csi-driver-volume-mybucket-storage
  mountOptions:
  - region af-south-1
  - uid=1000
  - gid=1000
  - allow-other
  - read-only
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3-mybucket-storage
  namespace: coder
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 1200Gi
  storageClassName: ""
  volumeMode: Filesystem
  volumeName: s3-mybucket-storage

Create a pod that mounts an s3 pvc:

apiVersion: v1
kind: Pod
metadata:
  name: s3-mybucket-storage-pod
  namespace: coder
spec:
  containers:
  - name: s3-mybucket-storage-container
    image: busybox
    command: ["sh", "-c", "while true; do sleep 3600; done"]
    volumeMounts:
    - name: s3-mybucket-storage
      mountPath: /mnt/s3
  volumes:
  - name: s3-mybucket-storage
    persistentVolumeClaim:
      claimName: s3-mybucket-storage
kubectl -n coder describe pod s3-mybucket-storage-pod

There will be a Warning event that the s3 PVC could not be mounted:

   Warning  FailedMount  6s    kubelet            MountVolume.SetUp failed for volume "s3-teneo-storage" : rpc error: code = Internal desc = Could not mount "isazi-hudson-teneo-af-south-1-storage" at "/var/lib/kubelet/pods/3dc25df6-2e3e-4c79-800d-ec4ad50ca70c/volumes/kubernetes.io~csi/s3-teneo-storage/mount": Mount failed: Failed to start service output: Error: mount point /var/lib/kubelet/pods/3dc25df6-2e3e-4c79-800d-ec4ad50ca70c/volumes/kubernetes.io~csi/s3-teneo-storage/mount is already mounted Error: Failed to create mount process

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version):
Client Version: v1.30.6+k3s1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.6+k3s1
  • Driver version: v1.10.0
@unexge
Copy link
Contributor

unexge commented Dec 17, 2024

Hey @geniass, sorry for the late response. I've applied your manifest (by replacing your bucket and region with mine) and it worked fine for me.

The error message you got suggest that there was a faulty mount from a previous run, and it didn't get unmounted somehow. We perform unmounting in our NodeUnpublishVolume CSI method, seems like there was an error in this call, and it resulted such an outcome.

For us to debug your issue, would you be able to share logs from the CSI Driver when you experience this error? You can get the logs via kubectl -n kube-system logs -lapp=s3-csi-node -c s3-plugin, especially log lines starting withNodePublishVolume and NodeUnpublishVolume would be very useful to understand what's going on.

@geniass
Copy link
Author

geniass commented Dec 20, 2024

Sorry forgot to respond to this.

This is pretty much all I can see in the logs:

	2024-12-18 11:01:31.317	
I1218 09:01:31.317148       1 node.go:207] NodeGetCapabilities: called with args 
	
	
2024-12-18 10:59:59.654	
I1218 08:59:59.653939       1 node.go:207] NodeGetCapabilities: called with args 
	
	
2024-12-18 10:59:50.689	
E1218 08:59:50.689314       1 driver.go:113] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/2c8efe52-28c0-4c75-be7a-ecb016d838a6/volumes/kubernetes.io~csi/s3-customer-storage/mount": Unmount failed: Non zero status code: 32 unmount output: umount: /var/lib/kubelet/pods/2c8efe52-28c0-4c75-be7a-ecb016d838a6/volumes/kubernetes.io~csi/s3-customer-storage/mount: not mounted. 
	
	
2024-12-18 10:59:50.662	
I1218 08:59:50.661856       1 node.go:189] NodeUnpublishVolume: unmounting /var/lib/kubelet/pods/2c8efe52-28c0-4c75-be7a-ecb016d838a6/volumes/kubernetes.io~csi/s3-customer-storage/mount
	
	
2024-12-18 10:59:50.661	
I1218 08:59:50.660447       1 node.go:163] NodeUnpublishVolume: called with args volume_id:"s3-csi-driver-volume-customer-storage" target_path:"/var/lib/kubelet/pods/2c8efe52-28c0-4c75-be7a-ecb016d838a6/volumes/kubernetes.io~csi/s3-customer-storage/mount" 
	
	
2024-12-18 10:58:52.705	
I1218 08:58:52.705911       1 node.go:207] NodeGetCapabilities: called with args 
	
	
2024-12-18 10:57:48.581	
E1218 08:57:48.580081       1 driver.go:113] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/2c8efe52-28c0-4c75-be7a-ecb016d838a6/volumes/kubernetes.io~csi/s3-customer-storage/mount": Unmount failed: Non zero status code: 32 unmount output: umount: /var/lib/kubelet/pods/2c8efe52-28c0-4c75-be7a-ecb016d838a6/volumes/kubernetes.io~csi/s3-customer-storage/mount: not mounted. 
	
	
2024-12-18 10:57:48.552	
I1218 08:57:48.552436       1 node.go:189] NodeUnpublishVolume: unmounting /var/lib/kubelet/pods/2c8efe52-28c0-4c75-be7a-ecb016d838a6/volumes/kubernetes.io~csi/s3-customer-storage/mount
	
	
2024-12-18 10:57:48.551	
I1218 08:57:48.550906       1 node.go:163] NodeUnpublishVolume: called with args volume_id:"s3-csi-driver-volume-customer-storage" target_path:"/var/lib/kubelet/pods/2c8efe52-28c0-4c75-be7a-ecb016d838a6/volumes/kubernetes.io~csi/s3-customer-storage/mount" 
	
	
2024-12-18 10:57:13.529	
I1218 08:57:13.529015       1 node.go:207] NodeGetCapabilities: called with args 
	
	
2024-12-18 10:55:46.454	
E1218 08:55:46.454495       1 driver.go:113] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/2c8efe52-28c0-4c75-be7a-ecb016d838a6/volumes/kubernetes.io~csi/s3-customer-storage/mount": Unmount failed: Non zero status code: 32 unmount output: umount: /var/lib/kubelet/pods/2c8efe52-28c0-4c75-be7a-ecb016d838a6/volumes/kubernetes.io~csi/s3-customer-storage/mount: not mounted. 
	
	
2024-12-18 10:55:46.428	
I1218 08:55:46.427474       1 node.go:189] NodeUnpublishVolume: unmounting /var/lib/kubelet/pods/2c8efe52-28c0-4c75-be7a-ecb016d838a6/volumes/kubernetes.io~csi/s3-customer-storage/mount
	
	
2024-12-18 10:55:46.423	
I1218 08:55:46.423286       1 node.go:163] NodeUnpublishVolume: called with args volume_id:"s3-csi-driver-volume-customer-storage" target_path:"/var/lib/kubelet/pods/2c8efe52-28c0-4c75-be7a-ecb016d838a6/volumes/kubernetes.io~csi/s3-customer-storage/mount" 

@unexge
Copy link
Contributor

unexge commented Dec 20, 2024

It looks like the mount operation in NodePublishVolume has been failed. Do you have any logs from that operation? Also, can you see mounts if you do mount | grep mountpoint-s3 in your node?

@geniass
Copy link
Author

geniass commented Jan 6, 2025

Sorry I forgot to respond to this.

Logs seem to be quite vague:

s3-plugin I0106 09:57:24.446170       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume mybucketstorage" target_path:"/var/lib/kubelet/pods/7b8b09c5-2200-47c0-b6cf-8258e8f30b46/volumes/kubernetes.io~csi s3-mybucket-storage/mount" volume_capability:<mount:<mount_flags:"region af-south-1" mount_flags:"uid=1000"mount_flags:"gid=1000" mount_flags:"allow-other" mount_flags:"read-only" > access_mode:<mode:MULTI_NODE_READER_ONLY > > readonly:true volume_context:<key:"bucketName" value:"org-hudson-mybucket-af-south-1-storage" >
s3-plugin I0106 09:57:24.446272       1 node.go:112] NodePublishVolume: mounting org-hudson-mybucket-af-south-1storage at /var/lib/kubelet/pods/7b8b09c5-2200-47c0-b6cf-8258e8f30b46/volumes/kubernetes.io~csi/s3-mybucket-storage/mount with options [--allow-other --gid=1000 --read-only --region=af-south-1 --uid=1000]
s3-plugin I0106 09:57:27.021072       1 node.go:207] NodeGetCapabilities: called with args
s3-plugin I0106 09:57:54.502772       1 mount.go:231] Mount: Failed to delete target dir: /var/lib/kubeletpods/7b8b09c5-2200-47c0-b6cf-8258e8f30b46/volumes/kubernetes.io~csi/s3-mybucket-storage/mount.
s3-plugin E0106 09:57:54.503426       1 driver.go:113] GRPC error: rpc error: code = Internal desc = Could not mount "org-hudson-mybucket-af-south-1-storage" at "/var/lib/kubelet/pods/7b8b09c5-2200-47c0-b6cf-8258e8f30b46/volumes/kubernetes.io~csi/s3-mybucket-storage/mount": Mount failed: Failed to start systemd unit, context cancelled output: Error: Timeout after 30 seconds while waiting for mount process to be ready

There are mounts on the node:

mountpoint-s3 on /var/lib/kubelet/pods/74f9f1c4-210e-4584-bc84-baf3022e08b7/
mountpoint-s3 on /var/lib/kubelet/pods/a48b67d8-26d7-4b46-a2b8-f6f60d7115c2/
mountpoint-s3 on /var/lib/kubelet/pods/d9a7cbea-999b-4ec8-96a7-7c3b1a17b70b/
mountpoint-s3 on /var/lib/kubelet/pods/f957c697-8254-44d9-9313-34916f5e72a8/
mountpoint-s3 on /var/lib/kubelet/pods/2051d02f-a427-430f-9709-0e100536fb21/
mountpoint-s3 on /var/lib/kubelet/pods/aa507778-b2ed-47a8-9aa6-48d64b4547be/
mountpoint-s3 on /var/lib/kubelet/pods/21a4feb4-6eee-46d6-afc2-fe9cd2e58ca9/
mountpoint-s3 on /var/lib/kubelet/pods/1b0a09f4-4615-4421-a6ea-889a0feafa94/
mountpoint-s3 on /var/lib/kubelet/pods/21612679-6b8a-4f68-81b0-6972eb9bc629/
mountpoint-s3 on /var/lib/kubelet/pods/7e15d868-e50b-43fb-8d89-9ff27af1c644/
mountpoint-s3 on /var/lib/kubelet/pods/98478e0f-5471-4e44-9be2-6431e3b462fa/
mountpoint-s3 on /var/lib/kubelet/pods/a016e1ab-6917-4743-be23-290d093c085c/
mountpoint-s3 on /var/lib/kubelet/pods/1a30cf04-764b-41e9-9848-3cbbea438996/
mountpoint-s3 on /var/lib/kubelet/pods/9b900198-0440-4697-8e24-69fdbf4603d3/
mountpoint-s3 on /var/lib/kubelet/pods/1fa988da-cfc4-4461-a6cf-5bbd79f55914/
mountpoint-s3 on /var/lib/kubelet/pods/10320956-df41-45cf-b4f8-d1c7a2b55f26/
mountpoint-s3 on /var/lib/kubelet/pods/6d9fa799-ebf8-4e07-a176-b878d0f26075/
mountpoint-s3 on /var/lib/kubelet/pods/a8219ab0-2700-4e15-985a-2b61f981397e/
mountpoint-s3 on /var/lib/kubelet/pods/9975cbfa-e227-4667-9300-3961ed6bde21/
mountpoint-s3 on /var/lib/kubelet/pods/b6096169-760e-4ba1-9579-e3b377fc0682/
mountpoint-s3 on /var/lib/kubelet/pods/bba4a438-147f-48b8-95f8-8ab07c124076/
mountpoint-s3 on /var/lib/kubelet/pods/92164762-144a-4ce5-8ed2-f7ba1cd81b3c/
mountpoint-s3 on /var/lib/kubelet/pods/5a5d7d87-fbc8-42ae-a89a-9d79cb4ffbd5/
mountpoint-s3 on /var/lib/kubelet/pods/20978155-96f4-4ed5-8a95-0f992157373b/
mountpoint-s3 on /var/lib/kubelet/pods/421b34c4-5085-41af-901f-ce193128c0d5/
mountpoint-s3 on /var/lib/kubelet/pods/545d2955-e874-4b99-8a15-ea76afc8925f/
mountpoint-s3 on /var/lib/kubelet/pods/d7cf7900-38fc-4aba-99dc-e718178c2f4e/
mountpoint-s3 on /var/lib/kubelet/pods/fc2d2200-e196-45c4-a121-9aa6c5c15367/
mountpoint-s3 on /var/lib/kubelet/pods/fd11c44a-2724-44bd-b2f9-d3078cfbb64b/
mountpoint-s3 on /var/lib/kubelet/pods/92fea603-e7f1-4cc5-8644-08934ce09eee/
mountpoint-s3 on /var/lib/kubelet/pods/fb77955a-1d33-4a5c-a87d-8b24f455c69d/
mountpoint-s3 on /var/lib/kubelet/pods/bd94014f-79dc-4315-a724-22ac5e5c2d99/

@unexge
Copy link
Contributor

unexge commented Jan 6, 2025

I now realized you're using k3s, which not officially supported I think. We don't run our tests against it, and you might be hitting some limitation.

By looking at the logs, it seems like the CSI Driver fails to get updates from systemd regarding Mountpoint, although Mountpoint successfully starts on the host as seen by the mounts you see.

Are you using SELinux by any change? That could be one reason why the CSI Driver is not receiving updates from systemd. See this documentation for getting audit logs for SELinux if so.

We plan to get away from using systemd (see #279) as it causes some problems, but I'm not sure if this is the only blocker for using the CSI Driver in k3s, there might be other incompatibilities.

@unexge unexge added the bug Something isn't working label Jan 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants