
Upgrade from 1.5.9 to 1.6.0 breaks the EFS #1111

Closed
sumeet-zuora opened this issue Aug 22, 2023 · 11 comments · Fixed by #1124 or #1130
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@sumeet-zuora

/kind bug

What happened?

After upgrading from 1.5.9 to 1.6.0, we started getting mount errors: Output: Error retrieving region. Please set the "region" parameter in the efs-utils configuration file.

What you expected to happen?

The EFS volume should get mounted

How to reproduce it (as minimally and precisely as possible)?

Upgrade the EFS CSI driver from 1.5.9 to 1.6.0

Anything else we need to know?:

Verified that the IAM policy does include "ec2:DescribeAvailabilityZones".
As a side note, we use Cilium, and noticed that hostNetwork was removed from the node DaemonSet in the 1.6.0 Helm chart deployment.

Environment

  • Kubernetes version (use kubectl version): 1.24
  • Driver version: 1.6.0

Please also attach debug logs to help us better diagnose

[home-init kube-api-access-wqlzk vault-secrets aws-iam-token downloads]: timed out waiting for the condition
  Warning  FailedMount  4m52s (x6 over 43m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[downloads], unattached volumes=[aws-iam-token downloads home-init kube-api-access-wqlzk vault-secrets]: timed out waiting for the condition
  Warning  FailedMount  29s (x30 over 45m)   kubelet            MountVolume.SetUp failed for volume "pvc-c07f079e-e4c8-4c6e-9beb-2308ebefa88a" : rpc error: code = Internal desc = Could not mount "fs-0597418a1c1470d57:/" at "/var/lib/kubelet/pods/e685202f-4b90-4a14-8bba-14307d26bea0/volumes/kubernetes.io~csi/pvc-c07f079e-e4c8-4c6e-9beb-2308ebefa88a/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0ad8341ef6779e9f1,tls fs-0597418a1c1470d57:/ /var/lib/kubelet/pods/e685202f-4b90-4a14-8bba-14307d26bea0/volumes/kubernetes.io~csi/pvc-c07f079e-e4c8-4c6e-9beb-2308ebefa88a/mount
Output: Error retrieving region. Please set the "region" parameter in the efs-utils configuration file.
  • Instructions to gather debug logs can be found here
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 22, 2023
@david-a-morgan

This is similar to an issue we are seeing, so I'll add some additional context. We have a Cilium network policy to allow the controller egress access to AWS but not IMDS. CSI nodes do not have egress access to anything. The controller is using IRSA.

The controller logs indicate that it will use Kubernetes for metadata, but when trying to provision or delete a PV it is reaching out to IMDS and timing out.

@RyanStan
Contributor

@david-a-morgan and others experiencing the issue: how are you installing the driver? Through kustomize?

This recent commit removed hostNetwork = true from the node daemonset. This means that the Node Daemonset Pods cannot use IMDS for getting the region.

A while ago, this commit was merged, which allows us to pull EC2 info from Kubernetes instead of IMDS when IMDS is not available. However, it requires the CSI_NODE_NAME environment variable to be set. The author of that commit only added it to the controller Deployment, but our node DaemonSet needs it as well. That is why the commit above, which removed hostNetwork = true, also introduced CSI_NODE_NAME into our Helm chart. However, it was never added to the node DaemonSet spec that our kustomize files use.
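
For kustomize installs, the missing wiring would look roughly like the following strategic-merge patch. This is a sketch: the efs-csi-node and efs-plugin names are taken from the upstream manifests, so adjust them if your install differs.

# Sketch: expose the node's name to the efs-plugin container so the driver can
# read region/AZ information from the Kubernetes Node object instead of IMDS.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: efs-plugin
          env:
            - name: CSI_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName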

I'll open a PR to add it in. However, this brings up two additional points:

  1. We should update the E2E tests to test the driver installed via Helm, and the driver installed via Kustomize.
  2. We need a way to run the E2E tests against multiple clusters with different configurations, like IMDS disabled.

I'll open up issues on the project to track the above two items.

@david-a-morgan

We install the driver using Helm. I did notice the new CSI_NODE_NAME on the node DaemonSet when comparing chart version differences.

Here are more details as to what we are experiencing:

  • Volume provisioning and deprovisioning usually times out multiple times before eventually succeeding, but only if egress to IMDS is blocked. It can take 10-20 minutes of retries before de/provisioning succeeds.
  • Volume provisioning and deprovisioning never succeed if egress to IMDS is not blocked.

In both cases Hubble shows that there are repeated attempts to egress to IMDS even when the controller is using Kubernetes metadata.

@RyanStan
Contributor

/reopen

@k8s-ci-robot
Contributor

@RyanStan: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Ashley-wenyizha
Contributor

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Sep 13, 2023
@k8s-ci-robot
Contributor

@Ashley-wenyizha: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@RyanStan
Contributor

Ok, I think I figured out the issue. The CSI driver uses the region it pulls from Kubernetes metadata to build a client for the EFS API. This is working as expected. However, efs-utils, the utility the CSI driver uses under the hood to perform EFS mounts, requires IMDS to find the region, which it then uses to construct the DNS name of the mount target.

The reason I didn't run into this issue when initially trying to recreate it is that my efs-utils configuration file had been hardcoded with the correct region, so IMDS was not needed.

The immediate solution here is to add hostNetwork=true back into the Node Daemonset.
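
For anyone who needs that workaround before a release ships it, here is a minimal sketch as a kustomize strategic-merge patch (or the equivalent edit to the Helm-rendered manifest); the resource names follow the upstream manifests and may differ in your install.

# Sketch: restore host networking on the node DaemonSet so efs-utils can reach
# IMDS (169.254.169.254) when it needs to resolve the region during a mount.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node
  namespace: kube-system
spec:
  template:
    spec:
      hostNetwork: true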

And for the long term solution:
I'd like to see us add a mount option in efs-utils to support configuring the region that way, instead of through the config file. There is already an open PR for this: https://github.com/aws/efs-utils/pull/171/files. Once that is merged, we can modify this driver to pass in the region it pulls from the Kubernetes Node spec as a mount option to efs-utils.

We will also need to update our testing infra to test against an IMDS-disabled cluster.

@RyanStan
Contributor

RyanStan commented Sep 18, 2023

@david-a-morgan and others that experienced this issue:

Were you performing a cross-region mount? Also, were you using IRSA with your Node Daemonset Pods (e.g. annotating them with an IAM Role)? I assume the answer to this second question is no, because our current documentation doesn't list this as a requirement, but this will need to change.

I was looking into this a bit more, and I found that the watchdog process should overwrite the region in the efs-utils configuration with AWS_DEFAULT_REGION. This environment variable is set by IRSA to the region the cluster is in. However, for @sumeet-zuora (and I assume others), this variable was not set, so efs-utils fell back on IMDS (which isn't reachable in this case) and failed with:
Output: Error retrieving region. Please set the "region" parameter in the efs-utils configuration file.
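
One possible stopgap implied by that behavior, not an official recommendation, is to set AWS_DEFAULT_REGION explicitly on the node DaemonSet so the watchdog can fill in the region without IRSA or IMDS. In this sketch the region value is a placeholder and the container name is assumed from the upstream manifests.

# Sketch: hard-set the region that efs-utils' watchdog writes into its config,
# avoiding the IMDS fallback. Replace us-east-1 with your cluster's region.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: efs-plugin
          env:
            - name: AWS_DEFAULT_REGION
              value: us-east-1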

@bd-spl

bd-spl commented Nov 28, 2023

"watchdog process should overwrite the region in the efs-utils configuration with the AWS_DEFAULT_REGION"

It seems that the Region value is not wired into the config template:

source={{.EfsClientSource}}
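
For illustration only, wiring it in would presumably mean the template also rendering a region entry next to the existing source line; the .Region field below is hypothetical and does not exist in the template today.

# hypothetical template addition (the .Region field is an assumption):
region = {{.Region}}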

@codeMehtab

I am still facing this issue; below are the details:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:49:13Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.12-eks-5e0fdde", GitCommit:"95c835ee1111774fe5e8b327187034d8136720a0", GitTreeState:"clean", BuildDate:"2024-01-02T20:34:50Z", GoVersion:"go1.20.12", Compiler:"gc", Platform:"linux/amd64"}

I am using HELM to install the driver:

{ chart   = "aws-efs-csi-driver",
  repo    = "https://kubernetes-sigs.github.io/aws-efs-csi-driver",
  version = "2.4.9" }

Error: Output: Error retrieving region. Please set the "region" parameter in the efs-utils configuration file.
Warning FailedMount 7 secs ago
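
If the chart version in use exposes a toggle for host networking, re-enabling it (or pinning a driver/chart version that still sets it) may unblock mounts. The values key below is an assumption; check the chart's values.yaml for the actual name, if one exists.

# Hypothetical Helm values override; verify the key against your chart version.
node:
  hostNetwork: true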
