[aws-eks] EKS fails to deploy with "CancelRequest not implemented by *exec.roundTripper" / error: unable to recognize "/tmp/manifest.yaml" #4087
Comments
Looks like some race condition. The stack gets successfully deployed sometimes; more often it fails on random resources.
|
This time it failed with no resources manually added. I guess the AwsAuth resource comes from the AWS IAM Mapping feature.
|
It looks like kubectl launched from the Lambda is hitting a timeout when calling the EKS API server. I suggest implementing some kind of retry mechanism here (or passing the right arguments to kubectl) so that kubectl doesn't fail on a transient timeout.
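For illustration, here is a minimal sketch of that kind of retry, assuming a Python handler that shells out to kubectl (the function and parameter names are made up for this example, not taken from the actual CDK handler):

```python
import subprocess
import time

def apply_manifest_with_retry(manifest_path, kubeconfig_path, attempts=3, backoff_seconds=5):
    """Run `kubectl apply` and retry on failure, to ride out transient API server timeouts."""
    for attempt in range(1, attempts + 1):
        try:
            subprocess.check_call([
                'kubectl', 'apply',
                '--kubeconfig', kubeconfig_path,
                '--request-timeout', '30s',
                '-f', manifest_path,
            ])
            return
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(backoff_seconds * attempt)  # back off a little longer before each retry
```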
|
I found that the lambda has a 13m timeout. If EKS is having a bad day, this fails and everything rolls back. This week, as I was playing with CDK and EKS, the EKS cluster would OFTEN take longer than 13m to become active. See aws-cdk/packages/@aws-cdk/aws-eks/lib/cluster-resource/index.py, lines 78 to 84 (commit 4e11d86); a rough sketch of such a waiter is below.
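For context, a blocking waiter along these lines (an illustrative sketch, not the actual index.py code) has to sit inside a single Lambda invocation, so it can never wait longer than the function's own timeout:

```python
import time
import boto3

eks = boto3.client('eks')

def wait_for_cluster_active(cluster_name, poll_seconds=30):
    """Poll until the cluster reports ACTIVE; blocks for the whole wait inside one invocation."""
    while True:
        status = eks.describe_cluster(name=cluster_name)['cluster']['status']
        if status == 'ACTIVE':
            return
        if status in ('FAILED', 'DELETING'):
            raise RuntimeError(f'cluster entered unexpected status: {status}')
        time.sleep(poll_seconds)
```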
How do you address this since a lambda can only run for 15m max? Also, why are we paying metered lambda costs for a waiter routine? |
I can verify this for a CDK solution for EKS that I've been working on. When it fails, I get the following error:
I'd be glad to provide further details. My code is written in TypeScript. |
Thanks @runlevel-six. We need to modify our resource to allow a much longer wait time. |
Providing some more detail: in repeated runs, this issue does not occur until after the cluster is created. Here is a broader view of the activity:
|
I'm not sure what is timing out, because the creation of the cluster completes. Is there anything else I can provide? On the last pass I stripped my code down to just the creation of the cluster admin role and then the creation of the cluster; everything else I removed. It almost always fails in |
I am also having this issue, using us-east-1. Let me know if there is any information I can provide or tests I can perform to help with troubleshooting.
|
For those that are having this issue, I got around it by not setting a masters role in the cluster stack, and instead creating a second stack that builds the aws-auth manifest and applies it as a new KubernetesResource. That second stack, which I apply after the cluster stack has completed creation, uses an export of the cluster from the cluster stack.

Cluster definition (from the initial stack):

```ts
this.cluster = new eks.Cluster(this, `${props.deploy}-${props.clusterNum}`, {
  clusterName: `${props.deploy}-${props.clusterNum}`,
  vpc: props.vpc,
  defaultCapacity: 0,
  vpcSubnets: [{
    subnetType: ec2.SubnetType.PRIVATE
  }, {
    subnetType: ec2.SubnetType.PUBLIC
  }],
  kubectlEnabled: true,
  role: props.clusterAdmin,
  // mastersRole: props.kubeAdminRole,
  outputConfigCommand: false,
  version: environment.k8sVersion
});
```

Cluster auth stack (hopefully temporary):

```ts
export class K8SClusterAuthStack extends Stack {
  constructor(scope: App, id: string, props: K8SExtendedStackProps) {
    super(scope, id, props);
    try {
      const awsAwthRoles = {
        apiVersion: 'v1',
        kind: 'ConfigMap',
        metadata: {
          name: 'aws-auth',
          namespace: 'kube-system'
        },
        data: {
          mapRoles: `[{"rolearn": "${props.nodeGroupRole.roleArn}","username":"system:node:{{EC2PrivateDNSName}}","groups": ["system:bootstrappers","system:nodes"]},{"rolearn": "${props.kubeAdminRole.roleArn}","groups": ["system:masters"]}]`,
          mapUsers: "[]",
          mapAccounts: "[]"
        }
      };
      new eks.KubernetesResource(this, 'k8sRolesAwsAuthManifest', {
        cluster: props.cluster,
        manifest: [
          awsAwthRoles
        ]
      });
      new CfnOutput(this, `${props.deploy}-${props.clusterNum}-Kubeconfig-Command`, {
        value: `aws eks update-kubeconfig --name ${props.deploy}-${props.clusterNum} --region ${this.region} --role-arn ${props.kubeAdminRole.roleArn}`
      });
    } catch (e) {
      console.log(e);
    }
  }
}
```

It then rebuilds the update-kubeconfig command and provides it as output. A few items to note, though:
|
@runlevel-six Thanks for sharing your workaround! There is one item I don't understand: could you tell me what role `props.nodeGroupRole` maps to in your code? Is that first role required to make the masters role mapping work, or is it something unrelated? Thanks for your time, I really appreciate it. |
@cseickel: that role was a second role I created for the nodegroups so that they could properly join the cluster as worker nodes. If it doesn't apply to what you are doing and you just need the masters role, you could change that line to:
|
Unfortunately for me, the two-stage KubernetesResource workaround did not work out. It does avoid the bug, but I am left with permissions issues that I have opted not to debug any further right now. At least for my setup, I do need to add more than just the masters role to enable the worker nodes to join properly. |
In that case you need to use:
Where |
The lambda can also fail in another place. This is what happened to me a few times:
aws-cdk/packages/@aws-cdk/aws-eks/lib/k8s-resource/index.py, lines 41 to 45 (commit 8911e7a)
It's strange because my cluster-resource lambda received info that the cluster is active 30 seconds before that. |
In addition to what has already been reported, this is what I experienced yesterday.
Log snippet from the cluster resource handler Lambda:
|
Can anyone else confirm whether this issue is primarily with us-east-1? |
No, it happens quite often in eu-west-1 too. |
Hello, any updates on this? |
Hello, this is still high on our priority list, but we are a bit heads down towards re:Invent next week, and will get to this as soon as possible. |
There were two causes of timeouts for EKS cluster creation: a creation time longer than the AWS Lambda timeout (15 min), and no retries when applying manifests with kubectl after the cluster has been created. The change fixes the first issue by leveraging the custom resource provider framework to implement the cluster resource as an async resource. The second issue is fixed by adding 3 retries to "kubectl apply".

Fixes #4087
Fixes #4695
There were two causes of timeouts for EKS cluster creation: a creation time longer than the AWS Lambda timeout (15 min), and no retries when applying manifests with kubectl after the cluster has been created.

The change fixes the first issue by leveraging the custom resource provider framework to implement the cluster resource as an async resource. The custom resource providers are now bundled as nested stacks so they don't take up too many resources from users, and are also reused by multiple clusters within the same stack. This required that the creation role not be the same as the lambda role, so we define this role separately and assume it within the providers. The second issue is fixed by adding 3 retries to "kubectl apply".

**Backwards compatibility**: as described in #5544, since the resource provider handler of `Cluster` and `KubernetesResource` has been changed, this change requires a replacement of existing clusters (deployment fails with a "service token cannot be changed" error). Since this can be disruptive to users, this change includes an exact copy of the previous version under a new module called `@aws-cdk/aws-eks-legacy`, which can be used as a drop-in replacement until users decide to upgrade to the new version. Using the legacy cluster will emit a synthesis warning that this module will no longer be released as part of the CDK starting March 1st, 2020.

- Fixes #4087
- Fixes #4695
- Fixes #5259
- Fixes #5501

---

BREAKING CHANGE: (in experimental module) the providers behind the AWS EKS module have been rewritten to address multiple stability issues. Since this change requires cluster replacement, the old version of this module is available under `@aws-cdk/aws-eks-legacy`. Please read #5544 carefully for upgrade instructions.

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
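To make the async approach concrete: with the custom resource provider framework, one handler starts the long-running operation and a separate handler is polled until it reports completion, so no single Lambda invocation has to outlive cluster creation. A hypothetical sketch of that split, assuming the cluster configuration is passed in a `Config` resource property (this is not the actual provider code):

```python
import boto3

eks = boto3.client('eks')

def on_event(event, context):
    """Kick off the long-running operation and return right away."""
    if event['RequestType'] == 'Create':
        # 'Config' is a hypothetical property carrying the CreateCluster arguments
        eks.create_cluster(**event['ResourceProperties']['Config'])
    return {'PhysicalResourceId': event['ResourceProperties']['Config']['name']}

def is_complete(event, context):
    """Polled by the provider framework until it returns IsComplete: True."""
    status = eks.describe_cluster(name=event['PhysicalResourceId'])['cluster']['status']
    if status == 'FAILED':
        raise RuntimeError('cluster creation failed')
    return {'IsComplete': status == 'ACTIVE'}
```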
❓ General Issue
The Question
Manifests are Tiller deployment files. They work when deployed manually. I can attach the files if needed.
Deployment fails with:
Rest of the stack trace omitted.
Environment
Other information