Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v6.8.0 and upwards lags when multiple AWS profiles exist in the environment #3044

Closed
megbetts opened this issue Nov 25, 2023 · 13 comments
Closed
Assignees
Labels
impact/performance Something is slower than expected impact/regression Something that used to work, but is now broken kind/bug Some behavior is incorrect or out of spec p1 A bug severe enough to be the next item assigned to an engineer resolution/fixed This issue was fixed

Comments

@megbetts
Copy link

What happened?

With v6.8.0 and upwards, out deployment pipeline has started experiencing massive delays with Pulumi.

We saw this change v6.7.0...v6.8.0#diff-34c57e622183cb0d8dd0d3f9eaa0861b3340120e9b2ad811bac7ac7be4cea4b1R619 which could be why.

Our pipeline has multiple AWS profiles loaded in the environment - could the new version of pulumi-aws be trying to validate all the credentials it finds before the run?

We tried setting the env var to skip credentials validation but we still ran into the delay.

Attached is a trace from a test deploy where we tried to deploy a single s3 bucket. We saw it stuck on
Marshaling property for RPC[config]: version={6.9.0} and then on Terraform input __defaults = []interface {}{}

Example

console (4).txt

Output of pulumi about

pulumi_aws: v6.9.0

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@megbetts megbetts added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels Nov 25, 2023
@megbetts megbetts changed the title v6.8.0 and upwards running slow when multiple AWS profiles exist in the environment v6.8.0 and upwards lags when multiple AWS profiles exist in the environment Nov 25, 2023
@lukehoban
Copy link
Contributor

lukehoban commented Nov 25, 2023

Thanks for reporting and for sharing the logs @megbetts. We’ve also been investigating an issue that sounds similar in #2987, but have so far been unable to reproduce and root cause. The details you provided here already may help narrow this down.

See in particular #2987 (comment), which matches what I see in the logs you provided here.

If by any chance you could also capture a trace that includes upstream provider logs - that may be very helpful in pinning this down. (Do note that you may need to redact some account details before sharing this):

TF_LOG=TRACE pulumi up --logtostderr --logflow -v=10 2> out.txt

@lukehoban lukehoban added impact/regression Something that used to work, but is now broken p1 A bug severe enough to be the next item assigned to an engineer impact/performance Something is slower than expected labels Nov 25, 2023
@t0yv0
Copy link
Member

t0yv0 commented Nov 27, 2023

Thanks for reporting this @megbetts indeed. I'm going to go ahead with editing the suspect code to switch to a lock-free implementation, and add ability to turn it off, on the hypothesis that sync.Once locks things up further. In the meanwhile any more bits of information you can provide are super helpful, in particular I am wondering if:

  • 6.7.0 has no performance degradation compared to prior versions
  • 6.8.0 makes the pre-existing performance degradation worse

Thank you for your patience.

@lukehoban we have a few traces and log files in the context of #2987 from a different customer and I'm working with @jaxxstorm to make a better instrumented build of AWS provider. I'll post it here once available. In particular I'd love to land improved TF_LOG=trace capture and additional spans around Configure before trying to repro again.

@t0yv0
Copy link
Member

t0yv0 commented Nov 27, 2023

We've confirmed that sync.Once can cause locking (per @iwahbe pointer do documentation). #3051 should remove sync.Once.

@iwahbe iwahbe removed the needs-triage Needs attention from the triage team label Nov 27, 2023
@weirdion
Copy link

We (w/ @megbetts) tested 6.7.0 last week and found no performance regressions. The issue didn't start till 6.8.0.

@t0yv0
Copy link
Member

t0yv0 commented Nov 27, 2023

Hi @weirdion we're releasing v6.11.0-alpha.1 https://github.com/pulumi/pulumi-aws/actions/runs/7010825633 that should have a mitigation for the sync.Once change (allow an hour or so for automation to finish up propagating the artifacts). If you could try that version out that'd be a great help.

While the above change is likely to mitigate the issue, we're still looking for a repro to understand and fix why Configure became slow. A minimal Pulumi program reproducing the issue would be surer helpful, absent that:

  • "Our pipeline has multiple AWS profiles loaded in the environment" - how would we go about reproducing that? Are you setting multiple environment variables, or setting up multiple AWS provider instances with distinct profiles?

  • If you are able to share new verbose logs obtained via https://www.pulumi.com/docs/support/troubleshooting/#verbose-logging (note TF_LOG=trace) or the trace file as described on that page that could narrow it down a bit further.

@weirdion
Copy link

@t0yv0 I can test this first thing tomorrow morning, I see the release on pypi.

Regarding the minimal Pulumi program, I believe @megbetts actually tested with a simple s3 bucket creation python script to isolate the issue - so nothing special about the program.

  • "Our pipeline has multiple AWS profiles loaded in the environment" - The aws profiles in the test run were configured in a container environment under ~/.aws/credentials with STS roles to be assumed during execution.

@t0yv0
Copy link
Member

t0yv0 commented Nov 28, 2023

Thanks so much. If you can share that simple program that could help. I've got 4 profiles in ~/.aws/credentials but it does not seem to be affecting the speed of Configure/CheckConfig. I'm starting to think perhaps how you configure the AWS Provider is relevant for the repro here - are you using an explicit provider or configure it with pulumi config set .. and which options you configure it with.

@weirdion
Copy link

weirdion commented Nov 28, 2023

OK, tried several times - the issue still persists in 6.11.0-alpha1.

  • The python program -
"""A Python Pulumi program"""

import pulumi
import pulumi_aws

a = pulumi_aws.s3.Bucket("pewlewmibucket")

pulumi.export("a", a)

with config

config:
  aws:region: us-west-2
  aws:skipCredentialsValidation: true
  aws:skipMetadataApiCheck: true
  aws:skipRegionValidation: true
  aws:skipRequestingAccountId: true
  • Creating a new s3 bucket with 6.11.0-alpha1 with TF_LOG=trace (took ~16 minutes) - console.log (sanitized to remove account information).

We are not configuring the provider in any specific way here, simply using it out of the box as you can see from the code above.
As for the environment itself that runs it, we do assume role using STS for each deploy then expose the temporary credentials with AWS_PROFILE env variable to the pulumi python script, which you should see in the logs Using profile: tf_aws.profile=tmp-account1 tf_aws.profile.source=envvar provider=aws@v6.11.0-alpha.1 - but besides that nothing else fancy here

@t0yv0
Copy link
Member

t0yv0 commented Nov 28, 2023

Very interesting, thank you for cleaning up the logs! This program takes 14s for me but your logs indicate 8min slowdown 2x around provider configuration under SSO-enabled AWS_PROFILE. We'll need to try to repo with an assumed role under STS.

@t0yv0
Copy link
Member

t0yv0 commented Nov 29, 2023

Noticing that https://github.com/hashicorp/terraform-provider-aws/blob/816aa99a12ca13bf225d34a226e797f1494ce8ca/internal/conns/config.go#L132 call is suspect for slowing things down. Unfortunately it seems to remain a black box in the logs.

git diff v6.7.0..v6.8.0 -- provider/go.mod | grep 'aws-sdk-go-base'                                                                                                                                              ~/code/pulumi-aws
-       github.com/hashicorp/aws-sdk-go-base/v2 v2.0.0-beta.37
+       github.com/hashicorp/aws-sdk-go-base/v2 v2.0.0-beta.38
        github.com/hashicorp/aws-sdk-go-base v1.1.0 // indirect
-       github.com/hashicorp/aws-sdk-go-base/v2/awsv1shim/v2 v2.0.0-beta.38 // indirect
+       github.com/hashicorp/aws-sdk-go-base/v2/awsv1shim/v2 v2.0.0-beta.39 // indirect

And skimming through upstream issues found hashicorp/aws-sdk-go-base#453

@lukehoban
Copy link
Contributor

The logs helped to track this down. I feel confident it is due to this set of issues:

Notably, the rollback fix from upstream (hashicorp/terraform-provider-aws#34300) didn't take affect for us because something in our full set of dependencies was forcing a higher version than what the upstream was using (which had the perf regression).

However, the fix in aws/aws-sdk-go-v2#2355 was just picked up in our provider in #3056 - which is about to be released. So most likely this will be addressed via that. But we may want to do a deeper review of all of the relevant dependencies.

@lukehoban
Copy link
Contributor

However, the fix in aws/aws-sdk-go-v2#2355 was just picked up in our provider in #3056 - which is about to be released. So most likely this will be addressed via that. But we may want to do a deeper review of all of the relevant dependencies.

We released https://github.com/pulumi/pulumi-aws/releases/tag/v6.12.0 this morning, which includes the fix from aws/aws-sdk-go-v2#2355. We've also verified that all versions of http://github.com/aws/aws-sdk-go-v2/config and github.com/aws/aws-sdk-go-v2/internal/ini match upstream.

So we'll closed this out as fixed with the latest release. If you continue to see any performance issue, please do reopen!

@weirdion
Copy link

Was able to test 6.12, and can confirm that the same deploy from my last comment (was ~16 minutes with 6.8.0+) is now back to ~30 seconds. Thank you everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
impact/performance Something is slower than expected impact/regression Something that used to work, but is now broken kind/bug Some behavior is incorrect or out of spec p1 A bug severe enough to be the next item assigned to an engineer resolution/fixed This issue was fixed
Projects
None yet
Development

No branches or pull requests

5 participants