(aws-eks): Cluster rollback fails all future deployments #31626
Reproducible in CDK version
The stack in CloudFormation appears to successfully roll back.
Looks like this happens when two things happen at the same time: the cluster is replaced (renamed) and a broken manifest is added.
Because step 2 would fail, it would trigger the cluster-replacement rollback as well. There could be a bug where, when this case happens, the cluster cannot roll back correctly, and we need to figure out the root cause. Before that, let me ask: if we just add a broken manifest (step 2) without renaming the cluster, will it roll back correctly?
Yes, this does not occur if there is only a broken manifest; it will roll back correctly. This happens when there is both a cluster replacement and a subsequent deployment failure. It does not matter what triggers the cluster replacement or the deployment failure after it. Even if there are issues with removing a broken manifest (semi-frequent), it is possible to skip rolling back the manifest and the stack will be usable.

For us this occurred when someone tried adding a new availability zone, which triggered a cluster replacement. We have large manifests applied, so when the input to the custom resource lambda included OldResourceProperties in addition to ResourceProperties, this pushed the lambda input over the size limit, failing the deployment.
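Roughly, the kind of change that triggered the replacement looked like the sketch below (the construct names and subnet layout are illustrative, not our actual stack):

```ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as eks from 'aws-cdk-lib/aws-eks';

class ExampleStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // Illustrative only: raising maxAzs from 2 to 3 adds subnets in a new
    // availability zone to the cluster's VPC configuration, which the EKS
    // custom resource treats as a cluster replacement.
    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 3 });

    new eks.Cluster(this, 'Cluster', {
      version: eks.KubernetesVersion.V1_27,
      vpc,
    });
  }
}
```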
One more note: it is possible to temporarily fix the stack by overriding the LazyAny at …
Describe the bug
Cluster rollbacks can persistently break the CloudFormation stack.
The issue occurs when (1) a cluster re-creation is triggered and (2) the deployment rolls back after the new cluster is created. When rolling back to the original cluster, parameters from the new (but now deleted) cluster are retained, and they fail all future deployments even if the commit that caused the cluster re-creation is rolled back.
Essentially, when rolling back the stack, it also needs to roll back the cached cluster information to the original cluster's details.
Regression Issue
Last Known Working CDK Version
No response
Expected Behavior
Rollbacks should leave the stack in a functional state; reverting or fixing the CDK code should allow new deployments to succeed.
Current Behavior
Rollbacks leave the stack in a non-functional state; reverting the CDK code still results in failed deployments.
Reproduction Steps
Deployment 1: Create the cluster.
Deployment 2: Change `addBreakingChange` to `true`; the deployment fails.
Deployment 3: Revert `addBreakingChange` to `false`; deployments will still fail.

Without any additional complications, the failure message is
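A minimal sketch of a stack shaped like the reproduction above; the `addBreakingChange` flag, the cluster names, and the broken manifest are illustrative assumptions, not the exact reproduction code:

```ts
import * as cdk from 'aws-cdk-lib';
import * as eks from 'aws-cdk-lib/aws-eks';

class ReproStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, addBreakingChange: boolean) {
    super(scope, id);

    const cluster = new eks.Cluster(this, 'Cluster', {
      version: eks.KubernetesVersion.V1_27,
      // Changing the physical name forces CloudFormation to replace the cluster.
      clusterName: addBreakingChange ? 'repro-cluster-v2' : 'repro-cluster-v1',
    });

    if (addBreakingChange) {
      // An invalid manifest makes the deployment fail after the new cluster
      // is created, which triggers the rollback that breaks the stack.
      cluster.addManifest('Broken', {
        apiVersion: 'v1',
        kind: 'NoSuchKind', // invalid kind; kubectl apply will fail
        metadata: { name: 'broken' },
      });
    }
  }
}
```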
Possible Solution
Workaround: I have found that adding a tag to the cluster successfully triggers IsComplete to update the cluster parameters (e.g., the security group); see the sketch below.
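A sketch of the workaround, assuming `cluster` is the `eks.Cluster` construct and using a hypothetical tag name; any tag change on the cluster produces a non-replacing update, so the custom resource runs its update/IsComplete cycle and refreshes the cached cluster attributes:

```ts
import * as cdk from 'aws-cdk-lib';

// `rollback-recovery` is a hypothetical tag name; any new or changed tag
// works. Bump the value whenever the stack needs to be nudged again.
cdk.Tags.of(cluster).add('rollback-recovery', '1');
```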
Additional Information/Context
Support Case ID: 172677804000994
CDK CLI Version
2.113.0 (build ccd534a)
Framework Version
No response
Node.js Version
18
OS
Amazon Linux 2 x86_64
Language
TypeScript
Language Version
TypeScript (5.0.4)
Other information
No response