Artificial limiting of Lambda Function creation #2206

Closed
jaxxstorm opened this issue Nov 9, 2022 · 12 comments
Labels
customer/feedback Feedback from customers customer/lighthouse Lighthouse customer bugs impact/performance Something is slower than expected kind/enhancement Improvements or new features resolution/fixed This issue was fixed

Comments

@jaxxstorm
Contributor

jaxxstorm commented Nov 9, 2022

What happened?

The underlying bridged provider artificially limits the number of lambda functions that can be uploaded in parallel.

The cause of this is hashicorp/terraform#9364. To resolve it, a mutex lock was added around the Lambda function upload, which serializes uploads and slows down the provisioning of Lambdas. You can see the change here:

https://github.com/hashicorp/terraform/pull/9667/files

Steps to reproduce

import pulumi
import pulumi_aws as aws

# Create 50 identical Lambda functions from the same archive.
for i in range(50):
    name = f'pulumi-{str(i).rjust(2, "0")}'
    aws.lambda_.Function(
        name,
        aws.lambda_.FunctionArgs(
            name=name,
            code=pulumi.FileArchive("hello-world.zip"),
            role="arn:aws:iam::616138583583:role/lbriggs-lambda-test",
            handler="hello-world.handler",
            runtime="nodejs12.x",
        ),
    )

Expected Behavior

It creates the functions in parallel

Actual Behavior

This takes around 4 to 5 minutes, and you can observe the creation slowing down and operating serially.

Output of pulumi about

No response

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@jaxxstorm jaxxstorm added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels Nov 9, 2022
@lblackstone lblackstone added impact/performance Something is slower than expected awaiting-upstream The issue cannot be resolved without action in another repository (may be owned by Pulumi). and removed needs-triage Needs attention from the triage team labels Nov 11, 2022
@lukehoban
Contributor

Note that this issue is tracking our decision on whether to diverge from upstream on this behaviour, so it is not "awaiting-upstream".

The current locking incurs a very significant performance penalty in cases like the one above and provides no way to work around it. The original issue that this locking was introduced to fix is much rarer and is more within the control of the user, who can apply --parallel to limit the parallelism (and thus concurrent memory usage).

@lukehoban lukehoban removed the awaiting-upstream The issue cannot be resolved without action in another repository (may be owned by Pulumi). label Nov 13, 2022
@lukehoban lukehoban added kind/enhancement Improvements or new features and removed kind/bug Some behavior is incorrect or out of spec labels Nov 21, 2022
@mnlumi mnlumi added customer/feedback Feedback from customers customer/lighthouse Lighthouse customer bugs labels Oct 3, 2023
@mikhailshilkov
Member

mikhailshilkov commented Mar 27, 2024

We may want to prioritize a decision here. #3740 reported the same issue; there, SnapStart+Publish settings make code updates take much longer (minutes instead of seconds), so the whole UX gets much worse.

Note that the issue applies to Updates as well as Creates.

@flostadler
Contributor

I'm leaning towards making a different decision than upstream and consequently removing the serialization of Lambda uploads.

The underlying issue that was fixed upstream (hashicorp/terraform#9364) only surfaces when dealing with many (10+) Lambdas in a memory-constrained environment. Additionally, that fix is ~8 years old, so it's worth checking whether the issue is still reproducible; for example, the upstream Lambda resource has since been switched to the AWS SDK v2.

Apart from that, there are two user-facing workarounds that I can think of.

  1. The --parallel flag @lukehoban mentioned, which controls the number of parallel resource operations.
  2. Uploading the Lambda code to S3 using aws.s3.BucketObjectv2 (see the sketch after this list). This is the mandatory approach for Lambdas over 50MB. Instead of loading the whole blob into memory, it streams the content to S3.
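A minimal sketch of workaround 2, adapting the repro program above; the bucket resource and object keys are illustrative, and pulumi.FileAsset is assumed for uploading the existing zip as-is:

import pulumi
import pulumi_aws as aws

# Bucket that holds the Lambda artifacts (resource name is illustrative).
artifacts = aws.s3.BucketV2("lambda-artifacts")

for i in range(50):
    name = f'pulumi-{str(i).rjust(2, "0")}'
    # Upload the archive to S3 instead of inlining it in the CreateFunction request.
    code = aws.s3.BucketObjectv2(
        f"{name}-code",
        bucket=artifacts.id,
        key=f"{name}.zip",
        source=pulumi.FileAsset("hello-world.zip"),
    )
    aws.lambda_.Function(
        name,
        name=name,
        s3_bucket=artifacts.id,
        s3_key=code.key,
        role="arn:aws:iam::616138583583:role/lbriggs-lambda-test",
        handler="hello-world.handler",
        runtime="nodejs12.x",
    )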

I'm going to create a prototype that removes the artificial serialization and will test the mentioned workarounds for feasibility.

@flostadler
Contributor

flostadler commented May 23, 2024

After removing the artificial serialization I was able to verify that the underlying issue still exists and both workarounds mentioned above are viable.

This prototype PR includes the patch that removes the serialization as well as the necessary Pulumi programs to replicate the issue and test workarounds: #3976

I'm testing with 25MB artifacts, half of the maximum size (50MB) for direct uploads. The tests create 50 Lambdas with those artifacts.

Findings

  • With default settings (--parallel 48) the aws provider peaks at ~28GB virtual memory usage (~3GB RSS, indicating heavy swapping) and fails after ~20min because the requests to AWS time out (Sigv4 signature expired).
  • When limiting allowed parallel resource operations to 8 everything worked, but the memory usage of the AWS provider peaked at 2.8GB. pulumi up took ~15min.
  • When using the workaround of uploading the lambda.zip to S3 there were no limits for parallel resource operations and pulumi up took ~4.5min. The memory usage of the AWS provider stayed below 300MB throughout the operation.

For reference, the baseline without the fix took 26 minutes to complete pulumi up.

What's interesting is the memory usage when directly uploading the lambda code. 50 lambdas, 25MB each, add up to 1.25GB, which is only roughly 5% of the observed memory usage. The same could be seen in the original upstream bug report: hashicorp/terraform#9364.
I'm gonna have a look into this as there might be some unnecessary memory allocations/copying going on.

@flostadler
Contributor

flostadler commented May 23, 2024

After profiling the provider with pprof I found two main places where memory is allocated.

awsRestjson1_serializeDocumentFunctionCode

This converts the zip archive to a base64 encoded string https://github.com/aws/aws-sdk-go-v2/blob/main/service/lambda/serializers.go#L6178

In my test run with 50 lambdas à 25MB (1.25GB total) this allocated 3.49GB. This sounds reasonable (see the rough arithmetic after this list), given that this needs to:

  • copy the binary slice
  • convert it and store a new base64 string
  • create the request body with that base64 string
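A rough back-of-the-envelope check of that figure, assuming one copy of the raw bytes, one base64 string (base64 expands data by ~4/3), and one request body holding that string:

# Rough estimate of the serializer's allocations; constants match the test setup above.
RAW_MB = 25
N_FUNCTIONS = 50
B64_MB = RAW_MB * 4 / 3                     # base64-encoded copy of the archive
PER_FUNCTION_MB = RAW_MB + B64_MB + B64_MB  # raw copy + base64 string + request body
print(f"~{PER_FUNCTION_MB * N_FUNCTIONS / 1024:.1f} GB")  # ~4.5 GB

That lands in the same ballpark as the observed 3.49GB (some of these buffers are presumably freed or reused before the peak).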

defaultRequestBodyLogger::Log

This is the upstream provider's HTTP logging middleware for the AWS SDK: https://github.com/hashicorp/aws-sdk-go-base/blob/main/logging/http.go#L118

It is used to emit debug logs with the request body here: https://github.com/hashicorp/aws-sdk-go-base/blob/main/logger.go#L140

Under the hood this creates multiple copies of the request body, which explains the high memory usage.

Even though this log message is only emitted when debug logging is turned on, the code for preparing the message is executed regardless of the log level.

This is my main suspicion right now. I'm gonna test whether removing the lambda binary from the log fixes the problem.
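As a language-agnostic illustration (a Python analogy, not the provider's actual Go middleware), the pattern is equivalent to building the debug message before the log level is checked:

import base64
import logging

logger = logging.getLogger("aws-request")

def log_request_eager(body: bytes) -> None:
    # The base64 copy is created even when DEBUG logging is disabled.
    message = "request body: " + base64.b64encode(body).decode()
    logger.debug(message)

def log_request_lazy(body: bytes) -> None:
    # The copy is only created when DEBUG logging is actually enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("request body: %s", base64.b64encode(body).decode())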

@flostadler
Contributor

flostadler commented May 24, 2024

After removing the request logging for Lambda the issue is fixed!

To compare the effects of removing the artificial limiting I've executed a set of tests on a c6i.2xlarge EC2 instance (each test was executed 5 times, measurements are the max). The EC2 instance does not have swap enabled (by default):

1. Baseline with artificial limiting (pulumi-aws 6.37.0). Default --parallel of 32

pulumi up takes ~8.5 minutes and memory peaks at ~0.9GB RSS

2. Without artificial limiting (d815227). Default --parallel of 32

pulumi up takes ~2 minutes and memory peaks at ~7.1GB RSS
This is very interesting compared to the 28GB memory usage on my M3 MacBook mentioned before.

My current working theory is that this is driven by the lower network bandwidth locally in combination with retries.
The SigV4 signature of an AWS API call expires if the request takes longer than 5 minutes. When that happens the SDK retries the call, and the HTTP request logger copies the body again and logs it, leading to increased memory usage.

3. Without artificial limiting (d815227). --parallel of 48, bandwidth limited to 25Mbit/s UP and 50Mbit/s down

pulumi up fails after ~19 minutes and consumes 14.8GB RSS (93% of the instance's memory).
The errors are Sigv4 signature expired. This is in line with the behavior locally on my M3 Mac and shows that the memory issues around HTTP request logging get amplified through retries, proving the theory in 2).

4. Without artificial limiting and Lambda HTTP body logging turned off (0c4044c). Default --parallel of 32

pulumi up takes ~1:30 minutes and memory peaks at 4.1GB RSS

5. Without artificial limiting and Lambda HTTP body logging turned off (0c4044c). --parallel 8

pulumi up takes ~1:50 minutes and memory peaks at 1.2GB RSS

@flostadler
Contributor

flostadler commented May 24, 2024

So to summarize the findings:

  • The high memory usage during Lambda creation/updates was indeed caused by logging the request bodies as this involves multiple copies & memory allocations.
  • Removing the artificial serialization of Lambda creation/updates does not lead to unusual memory consumption when the request body logging is removed. The memory usage is in line with what's expected to assemble the AWS API requests for directly uploading Lambda code (converting archive to base64 and creating request body out of that).
  • Using --parallel or uploading the Lambda Code to S3 gives users control over the memory usage

Based on those findings I'm proposing to fix the HTTP request logger and remove the artificial limiting of Lambda creation/updates. This should result in a drastic speed-up for users with many Lambdas. My tests have shown a 5.5x speed-up for directly uploading 25MB Lambda archives; users with smaller archives, or users that upload the archives to S3, should see even bigger speed-ups.

@flostadler
Contributor

I was able to create a patch that fixes the memory issues: #3976

I'm drafting upstream PRs right now, but I'm proposing that we still go forward with the patch for the time being because this change will touch two different repos upstream (hashicorp/aws-sdk-go-base & hashicorp/terraform-provider-aws).
My hunch is that it will take a substantial amount of time to get merged upstream (even small patches for panics take weeks). If it goes faster, we can remove the patch again.

@t0yv0 @mjeffryes @corymhall What do you think about that approach?

@t0yv0
Member

t0yv0 commented May 31, 2024

Great work! I'd patch and release first and propose the upstream PRs; any reason to keep our users waiting? One reason would be if there is uncertainty or risk in the patch. Looking forward to reviewing this next week.

@flostadler
Contributor

I'd say the main risk with the patch is the maintenance effort. Given that the patch is fairly isolated (it only touches the AWS SDK middleware), I'd not keep users waiting longer.

@t0yv0
Member

t0yv0 commented Jun 4, 2024

I'd say the main risk with the patch is the maintenance effort.

If that becomes a problem we can revert, right? I'm for merging.

flostadler added a commit that referenced this issue Jun 7, 2024
The root cause behind why the serialization of Lambda creation/update
was introduced upstream is excessive memory usage (see
hashicorp/terraform#9364).
After investigation we found that this is caused by the HTTP request
logging middleware. It logs the lambda archive as a base64 encoded
string. In order to do so, multiple copies of the body are created in
memory, which leads to memory bloating.
This change fixes that by redacting the body in the logs for the
Create/Update Lambda calls.

The PR introduces two patches. One removes the Lambda serialization and
the other fixes the HTTP request logging middleware for the Lambda
`CreateFunction` and `UpdateFunctionCode` operations.
After this, Lambdas are created/updated in parallel and don't suffer
from excessive memory usage. Users can still limit the parallelism with
the CLI flag `--parallel` if they wish.

Relates to #2206
@flostadler
Contributor

This was fixed and released as part of v6.39.0.

@flostadler flostadler added the resolution/fixed This issue was fixed label Jun 11, 2024
@mjeffryes mjeffryes added this to the 0.106 milestone Aug 9, 2024