(aws-eks): Cluster rollback fails all future deployments #31626
Reproducible in CDK version
The stack in CloudFormation appears to successfully roll back.
Looks like this happens when two things happen at the same time: the cluster is replaced (renamed) and a broken manifest is added.
Because step 2 would fail, it would trigger the cluster-replacement rollback as well. There could be a bug where, when this case happens, the cluster cannot roll back correctly, and we need to figure out the root cause. Before that, let me ask: if we just add a broken manifest (step 2) without renaming the cluster, will it roll back correctly?
Yes, this does not occur if there is only a broken manifest; it will roll back correctly. This happens when there is both a cluster replacement and a subsequent deployment failure. It does not matter what triggers the cluster replacement or the deployment failure after it. Even if there are issues with removing a broken manifest (semi-frequent), it is possible to skip rolling back the manifest and the stack will be usable.

For us this occurred when someone tried adding a new availability zone, which triggered a cluster replacement. We have large manifests applied, so when the input to the custom resource lambda included OldResourceProperties in addition to ResourceProperties, this pushed the lambda input over the size limit, failing the deployment.
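Roughly, the kind of change that triggered the replacement looked like the sketch below (the construct names and subnet layout are illustrative, not our actual stack):

```ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as eks from 'aws-cdk-lib/aws-eks';

class ExampleStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // Illustrative only: raising maxAzs from 2 to 3 adds subnets in a new
    // availability zone to the cluster's VPC configuration, which the EKS
    // custom resource treats as a cluster replacement.
    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 3 });

    new eks.Cluster(this, 'Cluster', {
      version: eks.KubernetesVersion.V1_27,
      vpc,
    });
  }
}
```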
One more note: it is possible to temporarily fix the stack by overriding the LazyAny at …
Describe the bug
Cluster rollbacks can persistently break the CloudFormation stack.
The issue occurs when (1) a cluster re-creation is triggered and (2) the deployment rolls back after the new cluster is created. When rolling back to the original cluster, parameters from the new (but now deleted) cluster are retained, and they fail all future deployments even if the commit that caused the cluster re-creation is rolled back.
Essentially, when rolling back the stack, it also needs to roll back the cached cluster information to the original cluster's details.
Regression Issue
Last Known Working CDK Version
No response
Expected Behavior
Rollbacks should leave the stack in a functional state; reverting or fixing the CDK code should allow new deployments to succeed.
Current Behavior
Rollbacks leave the stack in a non-functional state; reverting the CDK code still results in failed deployments.
Reproduction Steps
Deployment 1: Create the cluster.
Deployment 2: Change `addBreakingChange` to `true`; the deployment fails.
Deployment 3: Revert `addBreakingChange` to `false`; deployments will still fail.

Without any additional complications, the failure message is
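A minimal sketch of a stack shaped like the reproduction above; the `addBreakingChange` flag, the cluster names, and the broken manifest are illustrative assumptions, not the exact reproduction code:

```ts
import * as cdk from 'aws-cdk-lib';
import * as eks from 'aws-cdk-lib/aws-eks';

class ReproStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, addBreakingChange: boolean) {
    super(scope, id);

    const cluster = new eks.Cluster(this, 'Cluster', {
      version: eks.KubernetesVersion.V1_27,
      // Changing the physical name forces CloudFormation to replace the cluster.
      clusterName: addBreakingChange ? 'repro-cluster-v2' : 'repro-cluster-v1',
    });

    if (addBreakingChange) {
      // An invalid manifest makes the deployment fail after the new cluster
      // is created, which triggers the rollback that breaks the stack.
      cluster.addManifest('Broken', {
        apiVersion: 'v1',
        kind: 'NoSuchKind', // invalid kind; kubectl apply will fail
        metadata: { name: 'broken' },
      });
    }
  }
}
```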
Possible Solution
Workaround: I have found that adding a tag to the cluster successfully triggers IsComplete to update the cluster parameters (e.g., the security group); see the sketch below.
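A sketch of the workaround, assuming `cluster` is the `eks.Cluster` construct and using a hypothetical tag name; any tag change on the cluster produces a non-replacing update, so the custom resource runs its update/IsComplete cycle and refreshes the cached cluster attributes:

```ts
import * as cdk from 'aws-cdk-lib';

// `rollback-recovery` is a hypothetical tag name; any new or changed tag
// works. Bump the value whenever the stack needs to be nudged again.
cdk.Tags.of(cluster).add('rollback-recovery', '1');
```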
Additional Information/Context
Support Case ID: 172677804000994
CDK CLI Version
2.113.0 (build ccd534a)
Framework Version
No response
Node.js Version
18
OS
Amazon Linux 2 x86_64
Language
TypeScript
Language Version
TypeScript (5.0.4)
Other information
No response