[aws-eks] EKS fails to deploy with "CancelRequest not implemented by *exec.roundTripper" / error: unable to recognize "/tmp/manifest.yaml" #4087
Comments
Looks like some race condition. The stack gets successfully deployed sometimes; more often it fails on random resources.
|
This time it failed with no resources manually added. I guess the AwsAuth resource comes from the AWS IAM Mapping feature.
|
It looks like kubectl launched from the Lambda is hitting a timeout when calling the EKS API server. I suggest implementing some kind of retry mechanism here (or passing the right arguments to kubectl) so that kubectl doesn't fail on a transient timeout.
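For illustration, here is a minimal sketch of that kind of retry, assuming a Python handler that shells out to kubectl (the function and parameter names are made up for this example, not taken from the actual CDK handler):

```python
import subprocess
import time

def apply_manifest_with_retry(manifest_path, kubeconfig_path, attempts=3, backoff_seconds=5):
    """Run `kubectl apply` and retry on failure, to ride out transient API server timeouts."""
    for attempt in range(1, attempts + 1):
        try:
            subprocess.check_call([
                'kubectl', 'apply',
                '--kubeconfig', kubeconfig_path,
                '--request-timeout', '30s',
                '-f', manifest_path,
            ])
            return
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(backoff_seconds * attempt)  # back off a little longer before each retry
```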
|
I found that the lambda has a 13m timeout. If EKS is having a bad day, this fails and everything rolls back. This week, as I was playing with CDK and EKS, the EKS cluster would OFTEN take longer than 13m to become active. See aws-cdk/packages/@aws-cdk/aws-eks/lib/cluster-resource/index.py, lines 78 to 84 (commit 4e11d86); a rough sketch of such a waiter is below.
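For context, a blocking waiter along these lines (an illustrative sketch, not the actual index.py code) has to sit inside a single Lambda invocation, so it can never wait longer than the function's own timeout:

```python
import time
import boto3

eks = boto3.client('eks')

def wait_for_cluster_active(cluster_name, poll_seconds=30):
    """Poll until the cluster reports ACTIVE; blocks for the whole wait inside one invocation."""
    while True:
        status = eks.describe_cluster(name=cluster_name)['cluster']['status']
        if status == 'ACTIVE':
            return
        if status in ('FAILED', 'DELETING'):
            raise RuntimeError(f'cluster entered unexpected status: {status}')
        time.sleep(poll_seconds)
```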
How do you address this since a lambda can only run for 15m max? Also, why are we paying metered lambda costs for a waiter routine? |
I can verify this for a CDK solution for EKS that I've been working on. When it fails, I get the following error:
I'd be glad to provide further details. My code is written in TypeScript. |
Thanks @runlevel-six. We need to modify our resource to allow a much longer wait time. |
Providing some more detail: in repeated runs, this issue does not occur until after the cluster is created. Here is a broader view of the activity:
|
I'm not sure what is timing out, because the creation of the cluster completes. Is there anything else I can provide? On the last pass I stripped my code down to just the creation of the cluster admin role and then the creation of the cluster; everything else I removed. It almost always fails in |
I am also having this issue, using us-east-1. Let me know if there is any information I can provide or tests I can perform to help with troubleshooting.
|
For those that are having this issue, I got around it by not setting a masters role in the cluster stack, and instead creating a second stack that builds the aws-auth manifest and applies it as a new KubernetesResource. That second stack, which I apply after the cluster stack has completed creation, uses an export of the cluster from the cluster stack.

Cluster definition (from the initial stack):

```ts
this.cluster = new eks.Cluster(this, `${props.deploy}-${props.clusterNum}`, {
  clusterName: `${props.deploy}-${props.clusterNum}`,
  vpc: props.vpc,
  defaultCapacity: 0,
  vpcSubnets: [{
    subnetType: ec2.SubnetType.PRIVATE
  }, {
    subnetType: ec2.SubnetType.PUBLIC
  }],
  kubectlEnabled: true,
  role: props.clusterAdmin,
  // mastersRole: props.kubeAdminRole,
  outputConfigCommand: false,
  version: environment.k8sVersion
});
```

Cluster auth stack (hopefully temporary):

```ts
export class K8SClusterAuthStack extends Stack {
  constructor(scope: App, id: string, props: K8SExtendedStackProps) {
    super(scope, id, props);
    try {
      const awsAwthRoles = {
        apiVersion: 'v1',
        kind: 'ConfigMap',
        metadata: {
          name: 'aws-auth',
          namespace: 'kube-system'
        },
        data: {
          mapRoles: `[{"rolearn": "${props.nodeGroupRole.roleArn}","username":"system:node:{{EC2PrivateDNSName}}","groups": ["system:bootstrappers","system:nodes"]},{"rolearn": "${props.kubeAdminRole.roleArn}","groups": ["system:masters"]}]`,
          mapUsers: "[]",
          mapAccounts: "[]"
        }
      };
      new eks.KubernetesResource(this, 'k8sRolesAwsAuthManifest', {
        cluster: props.cluster,
        manifest: [
          awsAwthRoles
        ]
      });
      new CfnOutput(this, `${props.deploy}-${props.clusterNum}-Kubeconfig-Command`, {
        value: `aws eks update-kubeconfig --name ${props.deploy}-${props.clusterNum} --region ${this.region} --role-arn ${props.kubeAdminRole.roleArn}`
      });
    } catch (e) {
      console.log(e);
    }
  }
}
```

It then rebuilds the update-kubeconfig command and provides it as output. A few items to note, though:
|
@runlevel-six Thanks for sharing your workaround! There is one item I don't understand: could you tell me what role `props.nodeGroupRole` maps to in your code? Is that first role required to make the masters role mapping work, or is it something unrelated? Thanks for your time, I really appreciate it. |
@cseickel: that role was a second role I created for the nodegroups so that they could properly join the cluster as worker nodes. If it doesn't apply to what you are doing and you just need the masters role, you could change that line to:
|
Unfortunately for me, the two-stage KubernetesResource workaround did not work out. It does avoid the bug, but I am left with permissions issues that I have opted not to debug any further right now. At least for my setup, I do need to add more than just the masters role to enable the worker nodes to join properly. |
In that case you need to use:
Where |
The lambda can also fail in another place. This is what happened to me a few times:
aws-cdk/packages/@aws-cdk/aws-eks/lib/k8s-resource/index.py, lines 41 to 45 (commit 8911e7a)
It's strange because my cluster-resource lambda received info that the cluster is active 30 seconds before that. |
In addition to what has already been reported, this is what I experienced yesterday.
Log snippet from the cluster resource handler Lambda:
|
Can anyone else confirm whether this issue is primarily with us-east-1? |
No, it happens quite often in eu-west-1 too. |
Hello, any updates on this? |
Hello, this is still high on our priority list, but we are a bit heads down towards re:Invent next week, and will get to this as soon as possible. |
There were two causes of timeouts for EKS cluster creation: a creation time longer than the AWS Lambda timeout (15 min), and no retries when applying manifests with kubectl after the cluster has been created. The change fixes the first issue by leveraging the custom resource provider framework to implement the cluster resource as an async resource. The second issue is fixed by adding 3 retries to "kubectl apply".

Fixes #4087
Fixes #4695
There were two causes of timeouts for EKS cluster creation: a creation time longer than the AWS Lambda timeout (15 min), and no retries when applying manifests with kubectl after the cluster has been created.

The change fixes the first issue by leveraging the custom resource provider framework to implement the cluster resource as an async resource. The custom resource providers are now bundled as nested stacks so they don't take up too many resources from users, and are also reused by multiple clusters within the same stack. This required that the creation role not be the same as the lambda role, so we define this role separately and assume it within the providers. The second issue is fixed by adding 3 retries to "kubectl apply".

**Backwards compatibility**: as described in #5544, since the resource provider handler of `Cluster` and `KubernetesResource` has been changed, this change requires a replacement of existing clusters (deployment fails with a "service token cannot be changed" error). Since this can be disruptive to users, this change includes an exact copy of the previous version under a new module called `@aws-cdk/aws-eks-legacy`, which can be used as a drop-in replacement until users decide to upgrade to the new version. Using the legacy cluster will emit a synthesis warning that this module will no longer be released as part of the CDK starting March 1st, 2020.

- Fixes #4087
- Fixes #4695
- Fixes #5259
- Fixes #5501

---

BREAKING CHANGE: (in experimental module) the providers behind the AWS EKS module have been rewritten to address multiple stability issues. Since this change requires cluster replacement, the old version of this module is available under `@aws-cdk/aws-eks-legacy`. Please read #5544 carefully for upgrade instructions.

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
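To make the async approach concrete: with the custom resource provider framework, one handler starts the long-running operation and a separate handler is polled until it reports completion, so no single Lambda invocation has to outlive cluster creation. A hypothetical sketch of that split, assuming the cluster configuration is passed in a `Config` resource property (this is not the actual provider code):

```python
import boto3

eks = boto3.client('eks')

def on_event(event, context):
    """Kick off the long-running operation and return right away."""
    if event['RequestType'] == 'Create':
        # 'Config' is a hypothetical property carrying the CreateCluster arguments
        eks.create_cluster(**event['ResourceProperties']['Config'])
    return {'PhysicalResourceId': event['ResourceProperties']['Config']['name']}

def is_complete(event, context):
    """Polled by the provider framework until it returns IsComplete: True."""
    status = eks.describe_cluster(name=event['PhysicalResourceId'])['cluster']['status']
    if status == 'FAILED':
        raise RuntimeError('cluster creation failed')
    return {'IsComplete': status == 'ACTIVE'}
```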
❓ General Issue
The Question
Manifests are Tiller deployment files. They work when deployed manually. I can attach the files if needed.
Deployment fails with:
Rest of the stack trace omitted.
Environment
Other information