[Bug]: Memory consumption has increased since v4.67.0 #31722

at-wat · 2023-06-02T02:12:10Z

Terraform Core Version

1.4.5

AWS Provider Version

4.67.0, 5.0.1, 5.1.0

Affected Resource(s)

Seems not related to specific resources.
For example, memory usage increases even when only declaring provider without resources.

Expected Behavior

Memory usage is kept the same by updating the provider.

Actual Behavior

In our actual environment setting up GuardDuty and EventBus on all available regions (17) on 5 AWS accounts, max RSS during refresh was measured as:

/usr/bin/time -v terraform plan -refresh-only

provider version	v4.66.1	v4.67.0	v5.0.1
Max RSS	2,645M	3,654M	3,634M
Increase	-	+38%	+37%

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

https://github.com/at-wat/terraform-provider-aws-v4.67-memory-usage/

Steps to Reproduce

See https://github.com/at-wat/terraform-provider-aws-v4.67-memory-usage/blob/main/README.md

This measures max RSS when refreshing aws_iam_policy_document data resources on 17 regions.
It results in a similar increase rate to our actual infrastructure code.

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

None

The text was updated successfully, but these errors were encountered:

github-actions · 2023-06-02T02:12:24Z

Community Note

Voting for Prioritization

Please vote on this issue by adding a 👍 reaction to the original post to help the community and maintainers prioritize this request.
Please see our prioritization guide for information on how we prioritize.
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.

Volunteering to Work on This Issue

If you are interested in working on this issue, please leave a comment.
If this would be your first contribution, please review the contribution guide.

at-wat · 2023-06-02T03:07:04Z

Added setting for v5.1.0 to the reproduction repo.
Memory consumption increases even more on v5.1.0. (almost doubled from v4.66.1)

at-wat · 2023-06-02T07:04:19Z

Pprof output right before tf5server.Serve()

v4.66.1	v4.67.0	v5.1.0

n00borama · 2023-06-07T09:15:40Z

Thanks for the research @at-wat , we've had to double the size of our codebuild runners to avoid these kind of errors that are due to memory exhaustion:
[“The plugin.(*GRPCProvider).ValidateDataResourceConfig request was cancelled.”]

at-wat · 2023-06-08T08:02:55Z

#31722 (comment)
New large object appeared in pprof graph of v5.1.0 is from internal/service/quicksight/schema.visualsSchema()

It may be related to the new quicksight resources:

kieran-lowe · 2023-06-12T12:08:24Z

Like @n00borama, we've had to double the size of our runners to avoid these errors which of course has a negative cost impact 🙁

Also tried to play around with parallelism by lowering it from the default of 10 to 4 but no joy.

breathingdust · 2023-06-21T18:37:28Z

Hi @at-wat 👋, thank you for the legwork in profiling this issue! Just to let you know we are actively investigating this with high priority in order to find a resolution. Will update this thread once we have a way forward.

stephen-hames · 2023-06-27T02:29:02Z

This has left us unable to run one of our tf pipelines at all, as we were already on the largest possible size in bitbucket cloud... having to look into other workarounds now.

Reference: #31722 Rather than allocating memory for each shared tag schema, reference the same schema data. Each `schema.Schema` struct allocation is ~304 bytes, which adds up when multiplied times a few hundred resources using it. Not a big reduction by any means, but seems like a quick way to reduce some unnecessary memory usage. Updated the two code locations which were directly mutating the result of shared tag schemas.

Reference: #31722 Happened to notice this was generating a decent number of memory allocations while profiling and its a small change. Output from acceptance testing: ``` --- PASS: TestAccWAFV2RuleGroup_basic (30.05s) --- PASS: TestAccWAFV2RuleGroup_disappears (32.14s) --- PASS: TestAccWAFV2RuleGroup_Operators_maxNested (34.74s) --- PASS: TestAccWAFV2RuleGroup_changeMetricNameForceNew (58.55s) --- PASS: TestAccWAFV2RuleGroup_changeNameForceNew (60.83s) --- PASS: TestAccWAFV2RuleGroup_changeCapacityForceNew (65.89s) --- PASS: TestAccWAFV2RuleGroup_updateRule (68.22s) --- PASS: TestAccWAFV2RuleGroup_logicalRuleStatements (85.48s) --- PASS: TestAccWAFV2RuleGroup_LabelMatchStatement (58.55s) --- PASS: TestAccWAFV2RuleGroup_byteMatchStatement (60.96s) --- PASS: TestAccWAFV2RuleGroup_ipSetReferenceStatement (35.66s) --- PASS: TestAccWAFV2RuleGroup_RateBased_maxNested (35.63s) --- PASS: TestAccWAFV2RuleGroup_updateRuleProperties (105.28s) --- PASS: TestAccWAFV2RuleGroup_sizeConstraintStatement (59.81s) --- PASS: TestAccWAFV2RuleGroup_xssMatchStatement (59.43s) --- PASS: TestAccWAFV2RuleGroup_IPSetReferenceStatement_ipsetForwardedIP (112.62s) --- PASS: TestAccWAFV2RuleGroup_sqliMatchStatement (58.73s) --- PASS: TestAccWAFV2RuleGroup_geoMatchStatement (58.94s) --- PASS: TestAccWAFV2RuleGroup_GeoMatchStatement_forwardedIP (58.18s) --- PASS: TestAccWAFV2RuleGroup_RuleLabels (58.66s) --- PASS: TestAccWAFV2RuleGroup_tags (83.04s) --- PASS: TestAccWAFV2RuleGroup_rateBasedStatement (109.78s) --- PASS: TestAccWAFV2RuleGroup_regexMatchStatement (32.72s) --- PASS: TestAccWAFV2RuleGroup_minimal (25.63s) --- PASS: TestAccWAFV2RuleGroup_regexPatternSetReferenceStatement (32.67s) --- PASS: TestAccWAFV2RuleGroup_ruleAction (75.28s) --- PASS: TestAccWAFV2RuleGroup_RuleAction_customRequestHandling (50.20s) --- PASS: TestAccWAFV2RuleGroup_RuleAction_customResponse (65.41s) --- PASS: TestAccWAFV2RuleGroup_ByteMatchStatement_fieldToMatch (264.78s) ```

breathingdust · 2023-06-30T15:51:14Z

Hi all! 👋 Wanted to give an update with our current status. The Terraform AWS Provider, Terraform DevEx and Terraform Core teams are all engaged in working towards a solution. The main problem is peak memory usage and this peak occurs when Terraform makes calls to configured providers to load their resource/data source schemas. The Terraform protocol contains a single RPC which asks for all schemas regardless of the specific resources configured. This means that as the AWS provider grows, the memory requirements for using it also grow. Some specific resources such as QuickSight and WAFv2 have extremely large nested schemas which can have an outsized effect on memory as we have seen.

Changing Terraform to load only the schemas for configured resources is likely feasible, but it is a very large undertaking requiring changes to the protocol (and thus changes to Terraform, terraform-plugin-framework, terraform-plugin-sdk and the provider). This change should decrease the memory footprint significantly for most users, and memory requirements would vary based on the particular resource configured. At this point we are investigating the feasibility of this change, but can make no indication of timings.

Short term however, there are a number of smaller optimizations we can do which we hope will alleviate some of the pressure while we research more longer term improvements. Some of these modifications will ship in the next version of the AWS provider. We will also be more intentional about accepting PR submissions for resources/data sources with particularly massive schema at least until their memory requirements are more closely understood. In some cases we may need to leave these out until we can resolve the root issue. More updates to come when we have details we can share.

fatbasstard · 2023-07-03T14:19:42Z

To jump on board: We're now hitting memory limitations as well. To make it more interesting: We're using TF Cloud and their runners are fixed (and limited) to 2Gb of memory. So that means TF is now "stuck" on certain workspaces...

Going to play with Parallelism and/or reverting the version back to prior 5.1.0

But the solution mentioned by @breathingdust (to only load the used resources) seems/feels like it should be, looking forward to the result of that refactoring

kieran-lowe · 2023-08-08T13:55:50Z

Hi @breathingdust, hope all is well. Are there any updates on this you could share out of curiosity? Have there been any developments?

breathingdust · 2023-08-25T20:27:19Z

Happy Friday everyone!

I do have an update for you. Terraform 1.6 (currently in alpha) will include new functionality that will allow a cached provider schema to be used which should significantly reduce memory consumption for configurations including multiple instances of the same provider. In the near future terraform-plugin-go / terraform-plugin-sdk and terraform-plugin-framework will all be updated to take advantage of this, and provider authors (including the AWS Provider maintainers) will need to update these dependencies for this to take effect. I can’t give timings for when these changes will all be available, but this should be a significant step forward for those experiencing memory issues with multiple AWS providers configured.

Additionally, we have introduced a regex cache to the provider (released in v5.14.0) which in testing seems to have a significant impact on memory consumption.

We are continuing to work with the Terraform Core and DevEx teams on further improvements, and will update this issue with new information as I have it.

azarudeena · 2023-08-29T07:52:45Z

I am using aws provider v5.14.0 on Mac M1 laptop. Significant cpu hogging there on terraform validate and terraform plan it spins up like many number processes. attached is the screen shot from my activity monitor. not sure if this qualifies for a separate issue. thought I will tagging it here as there is addressing memory issue.

at-wat · 2023-08-31T04:03:27Z

@azarudeena Terraform provider processes are spawned for each provider {}.
So, if you use 100 AWS accounts/regions in your code, it's normal to have 100 processes.

azarudeena · 2023-09-03T07:05:19Z

@at-wat this is only for one aws account and one provider block. I am doing validate in a our staging region. with some 20 ECS services custom modules which dependencies like RDS, sqs, dynamodb, KMS etc. addition to that even our CI time in our self hosted runners have doubled after this upgrade.

at-wat · 2023-10-10T02:40:10Z

Updated the reproduction repo: https://github.com/at-wat/terraform-provider-aws-v4.67-memory-usage/

provider version	4.66.1	4.67.0	5.0.1	5.1.0	5.7.0	5.16.2	5.20.0
terraform version	1.4.5	1.4.5	1.4.5	1.4.5	1.4.5	1.6.0-beta1	1.6.0
Max RSS (average)	558M	729M	738M	1102M	1101M	1027M	194M
RSS increase	-	+31%	+32%	+97%	+97%	+84%	-65%

and confirmed that the memory usage is significantly reduced on terraform-provider 5.20.0 and terraform 1.6.0

fatbasstard · 2023-10-10T07:28:16Z

Looking really good @at-wat !

Question, is this related to the new "Global Schema Cache"? (hashicorp/terraform#33482)

at-wat added bug Addresses a defect in current functionality. needs-triage Waiting for first response or review from a maintainer. labels Jun 2, 2023

ewbankkit added the provider Pertains to the provider itself, rather than any interaction with AWS. label Jun 2, 2023

justinretzolk removed the needs-triage Waiting for first response or review from a maintainer. label Jun 2, 2023

bflad mentioned this issue Jun 21, 2023

Delay Schema Caching Until Resource Operations hashicorp/terraform-plugin-framework#784

Closed

bflad mentioned this issue Jun 23, 2023

helper/schema: Provide Functionality to Reduce Resident Memory Usage from Schemas hashicorp/terraform-plugin-sdk#1217

Closed

This was referenced Jun 26, 2023

[Bug]: The terraform-provider-aws_v5.4.0_x5 plugin crashed #32212

Closed

[Bug]: Error: The terraform-provider-aws_v4.67.0_x5 plugin crashed! #32165

Open

bflad mentioned this issue Jun 26, 2023

Remove Extraneous Caching of Resource Schemas hashicorp/terraform-plugin-mux#168

Closed

This was referenced Jun 27, 2023

tfprotov5+tfprotov6: Add DynamicValue Null Checking Without Type Information hashicorp/terraform-plugin-go#305

Closed

internal/tags: Used shared memory allocation for shared tag schemas #32250

Merged

bflad mentioned this issue Jun 28, 2023

service/wafv2: Allocate memory once for emptySchema() #32258

Merged

ewbankkit mentioned this issue Jun 28, 2023

Use SchemaFunc for some resources with large schemas #32274

Merged

breathingdust mentioned this issue Jun 29, 2023

[Enhancement]: Regularly monitor peak and resident memory usage of the provider #32289

Closed

rjinski mentioned this issue Jul 3, 2023

Significant memory usage at init with 4.29.0 provider #26742

Closed

ytsarev mentioned this issue Jul 17, 2023

Memory continously increasing after 6 workspaces upbound/provider-terraform#180

Open

apparentlymart mentioned this issue Aug 21, 2023

Limit memory usage with many provider aliases hashicorp/terraform#32744

Open

YakDriver mentioned this issue Sep 20, 2023

[Enhancement]: Improve memory allocation in the provider #33553

Open

YakDriver mentioned this issue Oct 11, 2023

[Bug]: Memory testing and reporting improvements #33898

Open

bebold-jhr mentioned this issue Oct 19, 2023

[Bug]: data sources refresh every time which is causing resource to be replaced unnecessarily #29421

Closed

gdavison mentioned this issue Oct 2, 2024

provider: Reduce schema memory usage by 18 MB #39569

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: Memory consumption has increased since v4.67.0 #31722

[Bug]: Memory consumption has increased since v4.67.0 #31722

at-wat commented Jun 2, 2023 •

edited

Loading

github-actions bot commented Jun 2, 2023

at-wat commented Jun 2, 2023

at-wat commented Jun 2, 2023

n00borama commented Jun 7, 2023

at-wat commented Jun 8, 2023

kieran-lowe commented Jun 12, 2023 •

edited

Loading

breathingdust commented Jun 21, 2023

stephen-hames commented Jun 27, 2023

breathingdust commented Jun 30, 2023

fatbasstard commented Jul 3, 2023 •

edited

Loading

kieran-lowe commented Aug 8, 2023

breathingdust commented Aug 25, 2023

azarudeena commented Aug 29, 2023 •

edited

Loading

at-wat commented Aug 31, 2023

azarudeena commented Sep 3, 2023 •

edited

Loading

at-wat commented Oct 10, 2023

fatbasstard commented Oct 10, 2023

[Bug]: Memory consumption has increased since v4.67.0 #31722

[Bug]: Memory consumption has increased since v4.67.0 #31722

Comments

at-wat commented Jun 2, 2023 • edited Loading

Terraform Core Version

AWS Provider Version

Affected Resource(s)

Expected Behavior

Actual Behavior

Relevant Error/Panic Output Snippet

Terraform Configuration Files

Steps to Reproduce

Debug Output

Panic Output

Important Factoids

References

Would you like to implement a fix?

github-actions bot commented Jun 2, 2023

Community Note

at-wat commented Jun 2, 2023

at-wat commented Jun 2, 2023

n00borama commented Jun 7, 2023

at-wat commented Jun 8, 2023

kieran-lowe commented Jun 12, 2023 • edited Loading

breathingdust commented Jun 21, 2023

stephen-hames commented Jun 27, 2023

breathingdust commented Jun 30, 2023

fatbasstard commented Jul 3, 2023 • edited Loading

kieran-lowe commented Aug 8, 2023

breathingdust commented Aug 25, 2023

azarudeena commented Aug 29, 2023 • edited Loading

at-wat commented Aug 31, 2023

azarudeena commented Sep 3, 2023 • edited Loading

at-wat commented Oct 10, 2023

fatbasstard commented Oct 10, 2023

at-wat commented Jun 2, 2023 •

edited

Loading

kieran-lowe commented Jun 12, 2023 •

edited

Loading

fatbasstard commented Jul 3, 2023 •

edited

Loading

azarudeena commented Aug 29, 2023 •

edited

Loading

azarudeena commented Sep 3, 2023 •

edited

Loading