aws-sdk: Increase in TimeoutErrors in Amplify hotswap builds #32219

Closed
1 task done
ShadowCat567 opened this issue Nov 20, 2024 · 10 comments · Fixed by #32301
Assignees
Labels
@aws-cdk/aws-amplify Related to AWS Amplify @aws-cdk/core Related to core CDK functionality bug This issue is a bug. p0 potential-regression Marking this issue as a potential regression to be checked by team member

Comments

@ShadowCat567
Contributor

ShadowCat567 commented Nov 20, 2024

Describe the bug

Since 11/14, we have been seeing TimeoutError: Resource is not in the expected state due to waiter status: TIMEOUT at a frequency we have never seen before. This error was added here in version 2.167.0.

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Version

2.166.0

Expected Behavior

Few or no TimeoutError: Resource is not in the expected state due to waiter status: TIMEOUT errors shown to our customers.

Current Behavior

TimeoutError: Resource is not in the expected state due to waiter status: TIMEOUT has become one of the most common error messages our customers are receiving.

Reproduction Steps

  1. Set up Amplify (Gen 2): https://docs.amplify.aws/react/start/manual-installation/
  2. Set up a function: https://docs.amplify.aws/react/build-a-backend/functions/set-up-function/
  3. Make a change in the Lambda code

Possible Solution

Roll back this PR: #31702

Additional Information/Context

No response

CDK CLI Version

2.167.0

Framework Version

No response

Node.js Version

=Node18

OS

Linux/Mac/Windows

Language

TypeScript

Language Version

No response

Other information

No response

@ShadowCat567 ShadowCat567 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Nov 20, 2024
@github-actions github-actions bot added the @aws-cdk/aws-amplify Related to AWS Amplify label Nov 20, 2024
@mrgrain mrgrain added potential-regression Marking this issue as a potential regression to be checked by team member @aws-cdk/core Related to core CDK functionality labels Nov 20, 2024
@pahud
Contributor

pahud commented Nov 20, 2024

I can't reproduce this using 2.168.0; Lambda function hotswapping works for me. Please provide more detailed reproduction steps, and screenshots if possible. Thank you.


@pahud pahud added p3 and removed needs-triage This issue or PR still needs to be triaged. labels Nov 20, 2024
@pahud
Contributor

pahud commented Nov 20, 2024

Also, our latest version is 2.168.0 now. Can you verify with this version?

@pahud pahud added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 20, 2024
@ShadowCat567
Contributor Author

I have edited the description to provide some additional info about repro instructions, but this issue is very fickle and is hard to reproduce consistently. It does appear in the most recent cdk version.

@pahud
Contributor

pahud commented Nov 21, 2024

I can't reproduce it but this could be the potential root cause:

The timeout occurs because AWS Lambda requires the function to be in an Active state and have LastUpdateStatus=Successful before it can be updated again. CDK waits for this state before proceeding, but has different wait times depending on the Lambda configuration:

await lambda.waitUntilFunctionUpdated(delaySeconds, {
  FunctionName: functionName,
});

AWS SDK v3 has a different waiter implementation, timeout handling and delay strategy that handle requests differently from v2. This could affect how timeouts are processed.

I guess we should either increase the timeout somehow from here or add status checking like

const response = await lambda.getFunctionConfiguration({
  FunctionName: functionName
});
if (response.State === 'Active' && response.LastUpdateStatus === 'Successful') {
  // Proceed with hotswap
}
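
For illustration, the status check above could be wrapped in a small polling loop. This is only a sketch, not the CDK implementation: `waitUntilReady`, `getStatus`, `maxWaitMs` and `delayMs` are hypothetical names, and the status lookup is injected so the loop can be run without AWS credentials. In real use, `getStatus` would call `GetFunctionConfiguration` (or `GetFunction`) and return the relevant fields.

```typescript
// The two fields checked in the snippet above.
interface FunctionStatus {
  State?: string;
  LastUpdateStatus?: string;
}

// Hypothetical helper: polls `getStatus` until the function is ready for
// another update, the update fails, or the time budget is used up.
async function waitUntilReady(
  getStatus: () => Promise<FunctionStatus>,
  maxWaitMs: number,
  delayMs: number,
): Promise<boolean> {
  const deadline = Date.now() + maxWaitMs;
  for (;;) {
    const status = await getStatus();
    if (status.State === 'Active' && status.LastUpdateStatus === 'Successful') {
      return true; // safe to proceed with the hotswap
    }
    if (status.LastUpdateStatus === 'Failed') {
      throw new Error('Function update failed');
    }
    if (Date.now() + delayMs > deadline) {
      return false; // budget exhausted; the caller decides how to report it
    }
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
}
```

Returning `false` on timeout rather than throwing is a design choice of this sketch; it lets the caller distinguish a slow update from a failed one.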

I'll bring this up to the core team for further investigation.

@pahud pahud added p1 and removed p3 labels Nov 21, 2024
@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 21, 2024
@otaviomacedo
Contributor

Just to be clear, this message was introduced in version 2.167.0. Previously, the same error would result in a generic error message. So, the fact that we are now seeing it in the wild, by itself, doesn't imply that there is anything wrong there. Indeed, it would be surprising if this message didn't start appearing.

To establish that there is an issue, we need to find a case in which this error message is shown in version 2.167.0 or later, and no error happens at all in earlier versions. Given that it's hard to reproduce it consistently, we would need to run each version a few times and compare the error rates.
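
As a rough sketch of that comparison: record each run with the CLI version and whether it ended in the timeout, then compute the timeout rate per version. The `RunResult` shape and the `timeoutRateByVersion` name are hypothetical and only for illustration; nothing here is part of the CDK.

```typescript
// One hotswap attempt: which CLI version ran it and whether it timed out.
interface RunResult {
  version: string;
  timedOut: boolean;
}

// Returns, for each version, the fraction of runs that ended in a timeout.
function timeoutRateByVersion(runs: RunResult[]): Map<string, number> {
  const totals = new Map<string, { runs: number; timeouts: number }>();
  for (const run of runs) {
    const entry = totals.get(run.version) ?? { runs: 0, timeouts: 0 };
    entry.runs += 1;
    if (run.timedOut) {
      entry.timeouts += 1;
    }
    totals.set(run.version, entry);
  }

  const rates = new Map<string, number>();
  totals.forEach((entry, version) => {
    rates.set(version, entry.timeouts / entry.runs);
  });
  return rates;
}
```

A clearly higher rate for 2.167.0 or later than for 2.166.0, over enough runs, would point to a regression rather than just a newly worded error message.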

@sobolk

sobolk commented Nov 26, 2024

@otaviomacedo

The issue surfaced immediately in our e2e tests after we merged the CDK version bump in aws-amplify/amplify-backend#2269.

The failure: https://github.com/aws-amplify/amplify-backend/actions/runs/12036537741/job/33558355227

It happened in all three jobs of the same kind on the first try. I can't find or recall examples of these tests failing due to a timeout before.


@sobolk

sobolk commented Nov 26, 2024

It has consistently failed in 5/6 runs.
From the E2E test logs, it seems this is affected by:

  1. A growing number of resources being hotswapped
  2. Hotswaps being executed several times in a row (they may trigger throttling, and the new code might be less tolerant of that than before).

@sobolk

sobolk commented Nov 26, 2024

I was able to establish a local repro that explains why our tests are failing.

  1. Assume an IAM principal that has only the AmplifyBackendDeployFullAccess policy attached.
  2. Execute https://github.com/aws-amplify/amplify-backend/blob/main/packages/integration-tests/src/test-e2e/sandbox/data_storage_auth_with_triggers.sandbox.test.ts

It seems that the new waiter implementation now requires additional IAM permissions to function.


@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 27, 2024
rix0rrr pushed a commit that referenced this issue Nov 27, 2024
…ction is not allowed (#32301)

Closes #32219

### Reason for this change



In SDKv3, the standard `waitUntilFunctionUpdated` function invokes the `GetFunctionConfiguration` API, as opposed to SDKv2, which invoked `GetFunction`. This means that consumers of SDKv3 must allow the `lambda:GetFunctionConfiguration` action in their IAM role policy.
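
As an illustration of the permission described above, a policy statement allowing that action might look like the sketch below. The `PolicyStatement` and `PolicyDocument` types and the `hotswapWaiterPolicy` name are hypothetical, and `Resource: '*'` is a placeholder assumption; a real policy would normally be scoped to specific function ARNs.

```typescript
// Minimal, hypothetical typing of an IAM policy document for this example.
interface PolicyStatement {
  Effect: 'Allow' | 'Deny';
  Action: string[];
  Resource: string | string[];
}

interface PolicyDocument {
  Version: '2012-10-17';
  Statement: PolicyStatement[];
}

// Grants the action that the SDKv3 `waitUntilFunctionUpdated` waiter calls.
const hotswapWaiterPolicy: PolicyDocument = {
  Version: '2012-10-17',
  Statement: [
    {
      Effect: 'Allow',
      Action: ['lambda:GetFunctionConfiguration'],
      Resource: '*', // placeholder: narrow this to the relevant function ARNs
    },
  ],
};

// Print the JSON form that would be attached to the role.
console.log(JSON.stringify(hotswapWaiterPolicy, null, 2));
```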

### Description of changes



Use a different waiter function provided by the SDK, which invokes `GetFunction` instead of `GetFunctionConfiguration`, thus restoring the required IAM permissions to what they were in SDKv2.

See https://github.com/aws/aws-sdk-js-v3/blob/main/clients/client-lambda/src/waiters/waitForFunctionUpdatedV2.ts#L10

> As opposed to https://github.com/aws/aws-sdk-js-v3/blob/main/clients/client-lambda/src/waiters/waitForFunctionUpdated.ts#L13
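
A minimal sketch of calling such a waiter follows; it is not the code from this PR. The waiter is passed in as a parameter so the example runs without the SDK installed. The intent is that `waitUntilFunctionUpdatedV2` and a `LambdaClient` from `@aws-sdk/client-lambda` are passed in real use; that `waitUntilFunctionUpdatedV2` fits the hypothetical `FunctionWaiter` type below is an assumption. `waitForHotswappedFunction` and the 300-second default are also hypothetical.

```typescript
// Structural types covering only the parts of the SDK v3 waiter call used here.
interface WaiterParams<C> {
  client: C;
  maxWaitTime: number; // seconds
}

interface WaiterOutcome {
  state: string; // e.g. 'SUCCESS' or 'TIMEOUT'
}

type FunctionWaiter<C> = (
  params: WaiterParams<C>,
  input: { FunctionName: string },
) => Promise<WaiterOutcome>;

// Hypothetical wrapper: waits for the function update and rejects on any
// outcome other than success.
async function waitForHotswappedFunction<C>(
  waiter: FunctionWaiter<C>,
  client: C,
  functionName: string,
  maxWaitSeconds: number = 300, // assumed default, not taken from the CDK
): Promise<void> {
  const result = await waiter(
    { client, maxWaitTime: maxWaitSeconds },
    { FunctionName: functionName },
  );
  if (result.state !== 'SUCCESS') {
    throw new Error(`Waiter ended in state ${result.state} for ${functionName}`);
  }
}

// Intended real-world call (requires @aws-sdk/client-lambda):
//   await waitForHotswappedFunction(waitUntilFunctionUpdatedV2, new LambdaClient({}), 'my-function');
```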

### Description of how you validated changes

Manual test. Assumed a role with the following policies:

![Screenshot 2024-11-27 at 9 34 25](https://github.com/user-attachments/assets/69415c37-6fe8-44d3-972c-1373ec55f46e)

```console
 ❯ cdk deploy --hotswap  [09:29:11]

✨  Synthesis time: 2.72s

⚠️ The --hotswap and --hotswap-fallback flags deliberately introduce CloudFormation drift to speed up deployments
⚠️ They should only be used for development - never use them for your production Stacks!

AwsCdkPlaygroundStack: deploying... [1/1]

✨ hotswapping resources:
   ✨ Lambda Function 'AwsCdkPlaygroundStack-Function76856677-7Rl7hiwwO5LQ'
❌  AwsCdkPlaygroundStack failed: TimeoutError: Resource is not in the expected state due to waiter status: TIMEOUT. Waiter has timed out.
```

Then, run the CLI from the PR.

```console
❯ /Users/epolon/dev/src/github.com/aws/aws-cdk/packages/aws-cdk/bin/cdk deploy --hotswap  [10:03:00]

✨  Synthesis time: 3.46s

⚠️ The --hotswap and --hotswap-fallback flags deliberately introduce CloudFormation drift to speed up deployments
⚠️ They should only be used for development - never use them for your production Stacks!

AwsCdkPlaygroundStack: deploying... [1/1]

✨ hotswapping resources:
   ✨ Lambda Function 'AwsCdkPlaygroundStack-Function76856677-7Rl7hiwwO5LQ'
✨ Lambda Function 'AwsCdkPlaygroundStack-Function76856677-7Rl7hiwwO5LQ' hotswapped!

 ✅  AwsCdkPlaygroundStack

✨  Deployment time: 12.72s

Stack ARN:
arn:aws:cloudformation:us-east-1:01234567890:stack/AwsCdkPlaygroundStack/22f2b380-a7cd-11ef-badd-0e08a8e0b5b1

✨  Total time: 16.19s

>>> elapsed time 23s
```



### Checklist
- [x] My code adheres to the [CONTRIBUTING GUIDE](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md) and [DESIGN GUIDELINES](https://github.com/aws/aws-cdk/blob/main/docs/DESIGN_GUIDELINES.md)

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
6 participants