Restarting the efs-csi-node Pod will cause mounts to hang on v1.6.0 #1270
Comments
We are also seeing this issue when upgrading from 1.6.0 to 1.7.5.
We are also seeing this issue when upgrading from 1.6.0 to 1.7.2; any resolution for this?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
We were able to resolve this by releasing EFS CSI v1.7.2 with the DaemonSet updateStrategy type set to OnDelete (to avoid restarting the EFS CSI v1.6.0 DaemonSet pods), and then rotating all the nodes in the cluster so that the new nodes run the EFS CSI DaemonSet v1.7.2.
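The OnDelete strategy described above is set on the DaemonSet itself. A minimal sketch of the relevant fragment, assuming the default DaemonSet name and namespace from a standard install (adjust to yours):

```yaml
# Fragment of the efs-csi-node DaemonSet spec (name and namespace are
# assumptions; adjust to your install). With OnDelete, new pods are
# created only when existing ones are deleted, e.g. when the node is
# rotated out of the cluster.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node
  namespace: kube-system
spec:
  updateStrategy:
    type: OnDelete
```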
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/kind bug
Issue discovered with v1.6.0 of the aws-efs-csi-driver
When the EFS client mounts a file system, we redirect a local NFS mount from the Linux kernel to localhost, and then use a proxy process, stunnel, to receive the NFS traffic and forward it to EFS. The stunnel process runs in the efs-csi-node Pods.
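Concretely, the redirect described above means the kernel's NFS mount targets 127.0.0.1 rather than the EFS endpoint, with stunnel bridging the two. A rough sketch (the stunnel port and file-system ID are illustrative, not taken from this issue):

```
# The kernel NFS client mounts localhost, not EFS directly (port illustrative):
#   mount -t nfs4 -o nfsvers=4.1,port=20049 127.0.0.1:/ /mnt/efs
# stunnel, running inside the efs-csi-node Pod, listens on that port and
# forwards the traffic over TLS to the real endpoint (ID illustrative):
#   127.0.0.1:20049 -> fs-12345678.efs.us-east-1.amazonaws.com:2049
```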
Version v1.6.0 of the CSI driver switched `hostNetwork=true` to `hostNetwork=false`. This means that Pods in the efs-csi-node DaemonSet launch into a new network namespace whenever they are restarted, which causes an issue: any time these Pods are restarted, stunnel launches in a new network namespace, while the local NFS mount from the kernel to localhost remains in the previous network namespace. The mount hangs because the localhost NFS mount can no longer reach the stunnel process once the Pod has restarted. When mounts hang, they go into uninterruptible sleep.

The issue was resolved in v1.7.0 of the driver, where we reverted the `hostNetwork` change and set `hostNetwork=true` again. Thus, this issue only affects customers that established mounts while using v1.6.0 of the CSI driver.

Work-arounds
Any attempt to upgrade or restart the v1.6.0 efs-csi-node DaemonSet will result in EFS mounts on the node hanging.
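Hung mounts leave their processes in uninterruptible sleep (state `D`), so affected processes can be spotted from the process table. A generic check, not taken from this issue:

```shell
# List processes stuck in uninterruptible sleep (state "D"), the symptom
# of a hung NFS mount described above. Output is empty on a healthy node.
ps -eo state,pid,comm | awk '$1 == "D" {print $2, $3}'
```

Processes in this state cannot be killed, even with SIGKILL, until the underlying I/O completes, which is why rotating the node is the reliable recovery path.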
To work around this issue, you can launch new EKS nodes into your cluster and then deploy a new efs-csi-node DaemonSet, with `hostNetwork=true`, that targets these new nodes using Kubernetes selectors. A rolling migration of your application to these new nodes will allow you to upgrade to a new aws-efs-csi-driver version while ensuring that your application doesn't experience any downtime due to hanging mounts.

This issue was originally discovered here, but I'm making this post to raise visibility.
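A sketch of what such a DaemonSet could look like. The DaemonSet name, the node label, and the image tag are all assumptions for illustration, not values from this issue:

```yaml
# Hypothetical DaemonSet targeting only the newly launched nodes.
# All names, the node label, and the image tag are illustrative.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node-new
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: efs-csi-node-new
  template:
    metadata:
      labels:
        app: efs-csi-node-new
    spec:
      hostNetwork: true        # keeps stunnel reachable across pod restarts
      nodeSelector:
        efs-upgrade: "true"    # label applied to the freshly launched nodes
      containers:
        - name: efs-plugin
          image: amazon/aws-efs-csi-driver:v1.7.0
```

Label the new nodes (for example with `kubectl label node <node> efs-upgrade=true`) before migrating workloads onto them, so the new DaemonSet is scheduled there first.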