Artificial limiting of Lambda Function creation #2206
Note that this issue is tracking whether we should make a different decision than upstream on this behaviour, so it is not "awaiting-upstream". The current locking incurs a very significant performance penalty in cases like the one above and provides no way to work around it, whereas the original issue that this locking was introduced to fix is much rarer and more within the control of the user, who can apply
We may want to prioritize a decision here. #3740 reported the same issue; however, SnapStart+Publish settings make the code updates much longer (minutes instead of seconds), so the whole UX gets much worse. Note that the issue applies to Updates as well as Creates.
I'm leaning towards taking a different decision than upstream and consequently removing the serialization for lambda upload. The underlying issue that was fixed upstream (hashicorp/terraform#9364) only surfaces when dealing with many (10+) Lambdas in a memory-constrained environment. Additionally, that fix is ~8 years old, so it's worth checking whether the issue is still reproducible; for example, the Lambda resource in upstream has since been switched to the AWS SDK v2. Apart from that, there are two user-facing workarounds that I can think of.
I'm going to create a prototype that removes the artificial serialization and will test the mentioned workarounds for feasibility.
After removing the artificial serialization I was able to verify that the underlying issue still exists and that both workarounds mentioned above are viable. This prototype PR includes the patch that removes the serialization as well as the necessary Pulumi programs to replicate the issue and test workarounds: #3976

I'm testing with 25MB artifacts as that's right in the middle of the max size (50MB) for direct uploads. The tests create 50 Lambdas with those artifacts.

Findings
For reference, the baseline without the fix took 26 minutes to complete. What's interesting is the memory usage when directly uploading the lambda code: 50 lambdas at 25MB each add up to 1.25GB, which is only roughly 5% of the observed memory usage. The same could be seen in the original upstream bug report: hashicorp/terraform#9364.
After profiling the provider with pprof I found two main places where memory is allocated.
After removing the request logging for Lambda the issue is fixed!

To compare the effects of removing the artificial limiting I've executed a set of tests on a

1. Baseline with artificial limiting (pulumi-aws
So to summarize the findings:
Based on those findings I'm proposing to fix the HTTP request logger and remove the artificial limiting of Lambda creation/updates. This should result in a drastic speed-up for users with many lambdas. My tests have shown a 5.5x improvement for directly uploading 25MB lambda archives. Users with smaller archives, or users that upload the archives to S3, should see even bigger speed-ups.
I was able to create a patch that fixes the memory issues: #3976. I'm drafting upstream PRs right now, but I'm proposing that we still go forward with the patch for the time being, because this change will touch two different repos upstream (hashicorp/aws-sdk-go-base & hashicorp/terraform-provider-aws). @t0yv0 @mjeffryes @corymhall What do you think about that approach?
Great work! I'd patch and release first, then propose the upstream PRs; any reason to keep our users waiting? One reason would be if there is uncertainty or risk in the patch. Looking forward to reviewing this next week.
I'd say the main risk with the patch is the maintenance effort. Given that the patch is fairly isolated (it only touches the AWS SDK middleware), I'd not keep users waiting any longer.
If that becomes a problem we can revert, right? I'm for merging.
The root cause behind why the serialization of Lambda creation/update was introduced upstream is excessive memory usage (see hashicorp/terraform#9364). After investigation we found that this is caused by the HTTP request logging middleware. It logs the lambda archive as a base64-encoded string; in order to do so, multiple copies of the body are created in memory, which leads to memory bloat. This change fixes that by redacting the body in the logs for the Create/Update Lambda calls. The PR introduces two patches: one removes the Lambda serialization, and the other fixes the HTTP request logging middleware for the Lambda `CreateFunction` and `UpdateFunctionCode` operations. After this, Lambdas are created/updated in parallel and don't suffer from excessive memory usage. Users can still limit the parallelism with the CLI flag `--parallel` if they wish. Relates to #2206
This was fixed and released as part of v6.39.0.
What happened?
The underlying bridged provider artificially limits the number of lambda functions that can be uploaded in parallel.
The cause of this is hashicorp/terraform#9364. To resolve it, a `mutex.Lock` was added to the lambda function upload, which slows down the provisioning of lambdas. You can see the effect of this here: https://github.com/hashicorp/terraform/pull/9667/files
Steps to reproduce
Expected Behavior
It creates the functions in parallel
Actual Behavior
This takes around 4 to 5 minutes, and you can observe the creation slowing down and operating in serial.
Output of `pulumi about`
No response
Additional context
No response
Contributing
Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).