
amazon-efs-mount-watchdog may crash loop if restarted shortly after a volume is unmounted #74

Closed
wongma7 opened this issue Jul 2, 2020 · 3 comments

@wongma7 (Member) commented Jul 2, 2020

This bug only affects the latest release v1.26.2.

Steps to reproduce:

  1. Mount a volume fs-12345678 to /mountpoint with option tls. mount.efs writes a state file like /var/run/efs/stunnel-config.fs-12345678.mountpoint.20238
    os.rename(os.path.join(state_file_dir, temp_tls_state_file), os.path.join(state_file_dir, temp_tls_state_file[1:]))
  2. Unmount the volume fs-12345678 from /mountpoint. amazon-efs-mount-watchdog calls check_efs_mounts within the next 1 second*, sees that the mount of fs-12345678 at /mountpoint is absent from the OS mount table, and edits the state file to mark it unmounted
    state = mark_as_unmounted(state, state_file_dir, state_file, current_time)
  3. Stop amazon-efs-mount-watchdog in the next 30 seconds**. Kill the stunnel process of the mount. Start amazon-efs-mount-watchdog again.
  4. amazon-efs-mount-watchdog calls clean_up_previous_stunnel_pids, removing the 'pid' field from /var/run/efs/stunnel-config.fs-12345678.mountpoint.20238
  5. amazon-efs-mount-watchdog calls check_efs_mounts and throws an exception because the 'pid' field is unexpectedly absent from the state file:
Traceback (most recent call last):
  File "/usr/bin/amazon-efs-mount-watchdog", line 1047, in <module>
    main()
  File "/usr/bin/amazon-efs-mount-watchdog", line 1038, in main
    check_efs_mounts(config, child_procs, unmount_grace_period_sec)
  File "/usr/bin/amazon-efs-mount-watchdog", line 468, in check_efs_mounts
    clean_up_mount_state(state_file_dir, state_file, state['pid'], is_running, state.get('mountStateDir'))


  6. amazon-efs-mount-watchdog crash loops

*By default poll_interval_sec is 1 second so calls of check_efs_mounts occur every 1 second.
**By default unmount_grace_period_sec is 30 seconds, so the call of check_efs_mounts that would delete the state file occurs at least 30 seconds after the unmount. The window to trigger this bug is therefore the 30 seconds following the unmount.
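The sequence above can be reduced to a small standalone sketch. To be clear, this is not the real efs-utils code: the function names merely echo the watchdog's, and the JSON state-file layout, the field names ('pid', 'unmount_time'), and the temp directory standing in for /var/run/efs are all assumptions made for illustration. It shows how removing 'pid' on startup makes a later `state['pid']` lookup raise KeyError, and how a defensive `state.get('pid')` avoids the crash:

```python
import json
import os
import tempfile

def clean_up_previous_stunnel_pids(state_file_dir):
    # On startup the watchdog drops the 'pid' field from every state file,
    # since any stunnel from before the restart is assumed dead.
    for name in os.listdir(state_file_dir):
        path = os.path.join(state_file_dir, name)
        with open(path) as f:
            state = json.load(f)
        state.pop('pid', None)
        with open(path, 'w') as f:
            json.dump(state, f)

def check_efs_mounts_buggy(state_file_dir):
    for name in os.listdir(state_file_dir):
        with open(os.path.join(state_file_dir, name)) as f:
            state = json.load(f)
        if state.get('unmount_time'):
            # Crashes: 'pid' was removed by clean_up_previous_stunnel_pids
            return state['pid']

def check_efs_mounts_fixed(state_file_dir):
    for name in os.listdir(state_file_dir):
        with open(os.path.join(state_file_dir, name)) as f:
            state = json.load(f)
        if state.get('unmount_time'):
            # Defensive lookup tolerates a missing 'pid'
            return state.get('pid')

# A state file already marked unmounted, as after step 2
state_dir = tempfile.mkdtemp()
with open(os.path.join(state_dir, 'fs-12345678.mountpoint.20238'), 'w') as f:
    json.dump({'pid': 20238, 'unmount_time': 1593648000}, f)

clean_up_previous_stunnel_pids(state_dir)  # step 4: 'pid' removed
try:
    check_efs_mounts_buggy(state_dir)      # step 5: KeyError on 'pid'
except KeyError as e:
    print('crash:', e)
print('fixed:', check_efs_mounts_fixed(state_dir))
```

The crash repeats on every poll because the state file is never cleaned up, hence the crash loop in step 6.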

@Cappuccinuo (Contributor)

Thanks for the detailed feedback, we will have someone work on this.

@wongma7 (Member, Author) commented Jul 2, 2020

Oops, the crucial step I was missing: in step 3 you cannot just restart the watchdog, you must also kill the stunnel process of the mount.

Since I found this in Kubernetes using the CSI driver, here are the exact steps I took, translated into Kubernetes terms:

  1. Create a pod, PVC, and PV using EFS with tls https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/encryption_in_transit/specs
    kubectl create -f efs-pod.yaml
    kubectl create -f efs-pv.yaml
    kubectl create -f efs-pvc.yaml
  2. Delete the pod.
    kubectl delete pod efs-app
  3. Delete the CSI driver pod. This restarts amazon-efs-mount-watchdog AND kills all stunnel processes.
  4. Same as before (amazon-efs-mount-watchdog is running inside the CSI driver pod)
  5. Same as before (amazon-efs-mount-watchdog is running inside the CSI driver pod)
  6. Same as before (amazon-efs-mount-watchdog is running inside the CSI driver pod)

And the exact steps on a vanilla EC2 instance:

  1. sudo mount -t efs -o tls fs-fb878251.efs.us-west-2.amazonaws.com /mountpoint
  2. sudo umount /mountpoint
  3. sudo systemctl stop amazon-efs-mount-watchdog; find the stunnel PID (grep pid /var/run/efs/fs-fb878251.mountpoint.20437) and sudo kill $PID; then sudo systemctl start amazon-efs-mount-watchdog
  4. Same as before
  5. Same as before
  6. Same as before
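The grep in step 3 works but is easy to mistype; assuming the state file is JSON with a 'pid' field (as in the traceback above), a small hypothetical helper can pull the PID out instead. The function name and CLI usage here are inventions for illustration, not part of efs-utils:

```python
import json
import sys

def stunnel_pid(state_file_path):
    """Return the stunnel PID recorded in a watchdog state file, or None."""
    with open(state_file_path) as f:
        return json.load(f).get('pid')

if __name__ == '__main__':
    # e.g. python read_pid.py /var/run/efs/fs-fb878251.mountpoint.20437
    print(stunnel_pid(sys.argv[1]))
```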

@Cappuccinuo (Contributor)

Hey @wongma7 , fix is in v1.26.3, thanks for the report.
