
amazon-efs-mount-watchdog may crash loop if restarted shortly after a volume is unmounted #74

Closed
wongma7 opened this issue Jul 2, 2020 · 3 comments

@wongma7 (Member) commented Jul 2, 2020

This bug only affects the latest release v1.26.2.

Steps to reproduce:

  1. Mount a volume fs-12345678 to /mountpoint with option tls. mount.efs writes a state file like /var/run/efs/stunnel-config.fs-12345678.mountpoint.20238
    os.rename(os.path.join(state_file_dir, temp_tls_state_file), os.path.join(state_file_dir, temp_tls_state_file[1:]))
  2. Unmount the volume fs-12345678 from /mountpoint. amazon-efs-mount-watchdog calls check_efs_mounts within the next 1 second*, sees that the mount of fs-12345678 at /mountpoint is absent from the OS mount table, and edits the state file to mark it unmounted
    state = mark_as_unmounted(state, state_file_dir, state_file, current_time)
  3. Stop amazon-efs-mount-watchdog in the next 30 seconds**. Kill the stunnel process of the mount. Start amazon-efs-mount-watchdog again.
  4. amazon-efs-mount-watchdog calls clean_up_previous_stunnel_pids, removing the 'pid' field from /var/run/efs/stunnel-config.fs-12345678.mountpoint.20238
  5. amazon-efs-mount-watchdog calls check_efs_mounts and throws an exception because the 'pid' field is unexpectedly absent from the state file:
Traceback (most recent call last):
  File "/usr/bin/amazon-efs-mount-watchdog", line 1047, in <module>
    main()
  File "/usr/bin/amazon-efs-mount-watchdog", line 1038, in main
    check_efs_mounts(config, child_procs, unmount_grace_period_sec)
  File "/usr/bin/amazon-efs-mount-watchdog", line 468, in check_efs_mounts
    clean_up_mount_state(state_file_dir, state_file, state['pid'], is_running, state.get('mountStateDir'))


  6. amazon-efs-mount-watchdog crash loops

*By default poll_interval_sec is 1 second so calls of check_efs_mounts occur every 1 second.
**By default unmount_grace_period_sec is 30 seconds, so the call of check_efs_mounts that would delete the state file occurs at least 30 seconds after the unmount. The window to trigger this bug is therefore the 30 seconds following the unmount.
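The sequence above can be reduced to a small standalone sketch. To be clear, this is not the real efs-utils code: the function names merely echo the watchdog's, and the JSON state-file layout, the field names ('pid', 'unmount_time'), and the temp directory standing in for /var/run/efs are all assumptions made for illustration. It shows how removing 'pid' on startup makes a later `state['pid']` lookup raise KeyError, and how a defensive `state.get('pid')` avoids the crash:

```python
import json
import os
import tempfile

def clean_up_previous_stunnel_pids(state_file_dir):
    # On startup the watchdog drops the 'pid' field from every state file,
    # since any stunnel from before the restart is assumed dead.
    for name in os.listdir(state_file_dir):
        path = os.path.join(state_file_dir, name)
        with open(path) as f:
            state = json.load(f)
        state.pop('pid', None)
        with open(path, 'w') as f:
            json.dump(state, f)

def check_efs_mounts_buggy(state_file_dir):
    for name in os.listdir(state_file_dir):
        with open(os.path.join(state_file_dir, name)) as f:
            state = json.load(f)
        if state.get('unmount_time'):
            # Crashes: 'pid' was removed by clean_up_previous_stunnel_pids
            return state['pid']

def check_efs_mounts_fixed(state_file_dir):
    for name in os.listdir(state_file_dir):
        with open(os.path.join(state_file_dir, name)) as f:
            state = json.load(f)
        if state.get('unmount_time'):
            # Defensive lookup tolerates a missing 'pid'
            return state.get('pid')

# A state file already marked unmounted, as after step 2
state_dir = tempfile.mkdtemp()
with open(os.path.join(state_dir, 'fs-12345678.mountpoint.20238'), 'w') as f:
    json.dump({'pid': 20238, 'unmount_time': 1593648000}, f)

clean_up_previous_stunnel_pids(state_dir)  # step 4: 'pid' removed
try:
    check_efs_mounts_buggy(state_dir)      # step 5: KeyError on 'pid'
except KeyError as e:
    print('crash:', e)
print('fixed:', check_efs_mounts_fixed(state_dir))
```

The crash repeats on every poll because the state file is never cleaned up, hence the crash loop in step 6.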

@Cappuccinuo (Contributor)

Thanks for the detailed feedback, we will have someone work on this.

@wongma7 (Member, Author) commented Jul 2, 2020

Oops, the crucial step I was missing: in step 3 you cannot just restart the watchdog, you must also kill the stunnel process of the mount.

Since I found this in Kubernetes using the CSI driver, here are the exact steps I took, translated into Kubernetes terms:

  1. Create a pod, PVC, and PV using EFS with tls https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/encryption_in_transit/specs
    kubectl create -f efs-pod.yaml
    kubectl create -f efs-pv.yaml
    kubectl create -f efs-pvc.yaml
  2. Delete the pod.
    kubectl delete pod efs-app
  3. Delete the CSI driver pod. This restarts amazon-efs-mount-watchdog AND kills all stunnel processes.
  4. Same as before (amazon-efs-mount-watchdog is running inside the CSI driver pod)
  5. Same as before (amazon-efs-mount-watchdog is running inside the CSI driver pod)
  6. Same as before (amazon-efs-mount-watchdog is running inside the CSI driver pod)

And the exact steps on a vanilla EC2 instance:

  1. sudo mount -t efs -o tls fs-fb878251.efs.us-west-2.amazonaws.com /mountpoint
  2. sudo umount /mountpoint
  3. sudo systemctl stop amazon-efs-mount-watchdog; find the stunnel PID (grep pid /var/run/efs/fs-fb878251.mountpoint.20437) and sudo kill $PID; then sudo systemctl start amazon-efs-mount-watchdog
  4. Same as before
  5. Same as before
  6. Same as before
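The grep in step 3 works but is easy to mistype; assuming the state file is JSON with a 'pid' field (as in the traceback above), a small hypothetical helper can pull the PID out instead. The function name and CLI usage here are inventions for illustration, not part of efs-utils:

```python
import json
import sys

def stunnel_pid(state_file_path):
    """Return the stunnel PID recorded in a watchdog state file, or None."""
    with open(state_file_path) as f:
        return json.load(f).get('pid')

if __name__ == '__main__':
    # e.g. python read_pid.py /var/run/efs/fs-fb878251.mountpoint.20437
    print(stunnel_pid(sys.argv[1]))
```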

@Cappuccinuo (Contributor)

Hey @wongma7 , fix is in v1.26.3, thanks for the report.
