
Volumes stuck in Released state #9833

Closed
zc-devs opened this issue Mar 29, 2024 · 7 comments


zc-devs commented Mar 29, 2024

Environmental Info:
K3s Version: v1.29.3+k3s1
Local path provisioner: v0.0.26

Node(s) CPU architecture, OS, and Version:
Linux 5.14.0-362.24.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 20 04:52:13 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
3 servers, embedded etcd

Describe the bug:
After deploying a new 1.29.3 cluster, I noticed that PVs don't get deleted and are stuck in the Released state, even though they have persistentVolumeReclaimPolicy: Delete.

Steps To Reproduce:

  1. Install K3s cluster.
  2. Check Local path provisioner version:
# kubectl get deployment -n kube-system local-path-provisioner -o=jsonpath='{$.spec.template.spec.containers[:1].image}'
rancher/local-path-provisioner:v0.0.26
  3. Create a test pvc.yaml:
kubectl apply -f pvc.yaml
  4. Create a test pod.yaml:
kubectl apply -f pod.yaml
  5. Delete the Pod:
kubectl delete -f pod.yaml
  6. Delete the PVC:
kubectl delete -f pvc.yaml
  7. Check the volume:
# kubectl get persistentvolume | grep test-pvc
pvc-3f2233a9-795e-4ba0-a52f-e4bf335979a4   10Mi       RWO            Delete           Released   kube-system/test-pvc             local-ssd      <unset>                          17m
  8. Check the logs of the Local path provisioner (after applying a workaround from Local path provisioner disallowed from reading Pods logs #9834):
    local-path-provisioner.log

Expected behavior:
The persistent volume is deleted, there are no errors in the Local path provisioner, and there are no failed helper-pod-delete-pvc-* Pods.

Actual behavior:
The persistent volume is not deleted and stays stuck in the Released state; there are errors in the Local path provisioner's logs, and failed helper-pod-delete-pvc-* Pods appear.

Additional context / logs:
helper-pod-delete-pvc-3f2233a9-795e-4ba0-a52f-e4bf335979a4.yml

Workaround:
If the Local path provisioner is downgraded to v0.0.24, the PVs previously stuck in the Released state are automatically deleted.


zc-devs commented Mar 29, 2024

A first change may be to introduce a new option:

  setup: |-
    #!/bin/sh
    while getopts "m:s:p:a:" opt
    do
        case $opt in
            p)
            absolutePath=$OPTARG
            ;;
            s)
            sizeInBytes=$OPTARG
            ;;
            m)
            volMode=$OPTARG
            ;;
            a)
            action=$OPTARG
            ;;
        esac
    done
    if [ "$action" = "create" ]
    then
      mkdir -m 0777 -p ${absolutePath}
      chmod 700 ${absolutePath}/..
    fi
  teardown: |-
    #!/bin/sh
    set -x
    while getopts "m:s:p:a:" opt
    do
        case $opt in
            p)
            absolutePath=$OPTARG
            ;;
            s)
            sizeInBytes=$OPTARG
            ;;
            m)
            volMode=$OPTARG
            ;;
            a)
            action=$OPTARG
            ;;
        esac
    done
    if [ "$action" = "delete" ]
    then
      rm -rf ${absolutePath}
    fi
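The option parsing above can be exercised outside the cluster. A minimal sketch (the `parse` wrapper and its arguments are my own, for illustration) showing why the old optstring `m:s:p:` rejects the new `-a` flag while the extended `m:s:p:a:` accepts it:

```shell
#!/bin/sh
# Sketch (my own wrapper, not from the provisioner): compare the old
# optstring "m:s:p:" with the extended "m:s:p:a:" when the helper script
# is invoked with the new -a flag.
parse() {
    optstring=$1; shift
    action=unset
    OPTIND=1                      # reset getopts state between calls
    while getopts "$optstring" opt "$@" 2>/dev/null
    do
        case $opt in
            a) action=$OPTARG ;;
            \?) echo "illegal option"; return 1 ;;
        esac
    done
    echo "action=$action"
}

parse "m:s:p:"   -p /data -s 1024 -m Filesystem -a delete   # -> illegal option
parse "m:s:p:a:" -p /data -s 1024 -m Filesystem -a delete   # -> action=delete
```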

With that change I no longer get the Illegal option -a error, but the helper pod fails anyway.
Then I tried to debug:

  teardown: |-
    #!/bin/sh
    sleep infinity
/ # ls -lah /var/lib/rancher/k3s/storage/local/ssd/pvc-a9809075-743a-4a86-ba28-38ca3d8256cd_kube-system_test-pvc
ls: can't open '/var/lib/rancher/k3s/storage/local/ssd/pvc-a9809075-743a-4a86-ba28-38ca3d8256cd_kube-system_test-pvc': Permission denied
total 0

/ # ls -lah /var/lib/rancher/k3s/storage/local/ssd
total 3G
drwx------   14 root     root        4.0K Mar 29 22:12 .
drwxr-xr-x    3 root     root        4.0K Mar 29 22:33 ..
drwxrwxrwx    2 root     root        4.0K Mar 29 22:13 pvc-a9809075-743a-4a86-ba28-38ca3d8256cd_kube-system_test-pvc

/ # id
uid=0(root) gid=0(root) groups=0(root),10(wheel)

Then I compared Pod definitions between v0.24 and v0.26:
Screenshot 2024-03-30
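A minimal sketch of the difference I mean (an assumption for illustration; the container name, image, and command are placeholders, not copied from the screenshot): the v0.0.24 helper pod definition carries a privileged security context, which the v0.0.26 definition omits.

```yaml
# Hypothetical fragment of a v0.0.24-style helper pod spec (illustrative only):
spec:
  containers:
    - name: helper-pod
      image: busybox
      command: ["/bin/sh", "/script/teardown"]
      securityContext:
        privileged: true   # present in v0.0.24, absent in v0.0.26
```

Without privileged: true, the helper pod apparently cannot traverse the root-owned 0700 storage directory, which matches the Permission denied above.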

@VestigeJ

TL;DR from the above link, where I encountered this issue while testing the latest COMMIT_ID for the v1.28 branch:

$ kg pv -A

NAME            CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM              STORAGECLASS   REASON   AGE
checking-path   5Gi        RWO            Recycle          Failed   default/test-pvc   local-path              50m


0xMALVEE commented May 15, 2024

Is this resolved already?

@brandond
Member

Nope, the issue is still open and the PR is not merged. Waiting for the end of the code freeze.


zc-devs commented May 21, 2024

Hi, I've just tested #9964 with the Local path config: local-storage.yaml.

While the PV is created, it cannot be deleted; the helper pod fails.

helper-pod-create-pvc-v0.24.yaml
helper-pod-create-pvc-v0.26.yaml
helper-pod-delete-pvc-v0.24.yaml
helper-pod-delete-pvc-v0.26.yaml

I think the main difference is that v0.26 doesn't use the privileged security context flag, as I noticed in #9833 (comment). I should also mention that I use Oracle Linux 9 with SELinux enabled. Directory permissions are:

# ls -lanZ /var/lib/rancher/k3s/storage/
total 48
drwx------. 12    0    0 system_u:object_r:container_file_t:s0           4096 May 21 19:24 .
drwxr-xr-x.  6    0    0 system_u:object_r:container_var_lib_t:s0        4096 Jan 18 18:32 ..

Volume permissions (v0.24):

# ls -lanZ pvc-9391085e-05dc-42f4-9c0d-25a1b4e6fe4a_kube-system_test-pvc/
total 12
drwxrwxrwx.  2 0 0 system_u:object_r:container_file_t:s0:c131,c199 4096 May 21 19:26 .
drwx------. 12 0 0 system_u:object_r:container_file_t:s0           4096 May 21 19:24 ..
-rw-r--r--.  1 0 0 system_u:object_r:container_file_t:s0:c131,c199    5 May 21 19:29 test.txt

Volume permissions (v0.26):

# ls -lanZ pvc-1fdbed66-a718-463d-908e-21bd731934c4_kube-system_test-pvc/
total 12
drwxrwxrwx.  2 0 0 system_u:object_r:container_file_t:s0:c143,c741 4096 May 21 19:34 .
drwx------. 11 0 0 system_u:object_r:container_file_t:s0           4096 May 21 19:33 ..
-rw-r--r--.  1 0 0 system_u:object_r:container_file_t:s0:c143,c741    5 May 21 19:34 test.txt

Should I file a separate issue?

@brandond
Member

Yes, that sounds like a separate issue. I don't see any difference in the volume permissions or contexts between the two versions, though. It is expected that the numeric portion at the end (the SELinux MCS categories) will differ.

@VestigeJ

## Environment Details
Attempted to reproduce, but didn't hit it this time using VERSION=v1.29.5+k3s1 or VERSION=v1.30.1+k3s1
Validated using COMMIT=cff6f7aa1d7987a658b030b2bc69df7c25f515c8

Infrastructure

  • Cloud

Node(s) CPU architecture, OS, and version:

Linux 5.14.21-150500.53-default x86_64 GNU/Linux
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP5"

Cluster Configuration:

NAME               STATUS   ROLES                       AGE   VERSION
ip-3-3-8-8         Ready    control-plane,etcd,master   23m   v1.30.1+k3s-cff6f7aa

Config.yaml:

node-external-ip: 3.3.8.8
token: YOUR_TOKEN_HERE
write-kubeconfig-mode: 644
debug: true
cluster-init: true
embedded-registry: true

Reproduction

$ curl https://get.k3s.io --output install-"k3s".sh
$ sudo chmod +x install-"k3s".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "vm.panic_on_oom=0 \nvm.overcommit_memory=1 \nkernel.panic=10 \nkernel.panic_on_oops=1 \n" > ~/90-kubelet.conf
$ sudo cp 90-kubelet.conf /etc/sysctl.d/
$ sudo systemctl restart systemd-sysctl
$ COMMIT=cff6f7aa1d7987a658b030b2bc69df7c25f515c8
$ sudo INSTALL_K3S_COMMIT=$COMMIT INSTALL_K3S_EXEC=server ./install-k3s.sh
$ kg no,po -A
$ vim pvc.yaml
$ vim pv-pod.yaml
$ k apply -f pvc.yaml
$ kg pvc -A
$ k apply -f pv-pod.yaml
$ kg po,pv,pvc -A
$ k delete -f pv-pod.yaml
$ k delete -f pvc.yaml
$ kg pv -A
$ kubectl get deployment -n kube-system local-path-provisioner -o=jsonpath='{$.spec.template.spec.containers[:1].image}'
$ k apply -f pvc.yaml
$ k apply -f pv-pod.yaml
$ kg pv,pod,pvc -A
$ k delete -f pvc.yaml; sleep 40; k delete -f pv-pod.yaml
$ kg pv,pvc -A
$ k apply -f pvc.yaml
$ k apply -f pv-pod.yaml
$ k delete -f pvc.yaml
$ k delete -f pv-pod.yaml
$ kg pvc,pv -A
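(Aside: kg and k aren't defined anywhere in this issue; I assume they're the usual kubectl shorthands, e.g. shell functions like the following.)

```shell
# Assumed definitions for the "k" / "kg" shorthands used above; they are
# not shown anywhere in this issue.
k()  { kubectl "$@"; }
kg() { kubectl get "$@"; }
```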

Results:

No dangling leftover volume claims; they're reclaimed appropriately.

$ cat pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi

$ cat pv-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: kube-system
spec:
  containers:
    - name: debian
      image: digitalocean/doks-debug
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-pvc
