
Cache credentials that are purely environment based #369

Closed
theipster opened this issue Mar 18, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@theipster

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Description

For context, the AWS provider block already supports reading configuration from:

  • AWS_* environment variables (e.g. AWS_ACCESS_KEY_ID etc. for long-lived credentials, and AWS_WEB_IDENTITY_TOKEN_FILE etc. for OIDC auth flows); or,
  • Terraform configuration (i.e. assume_role { ... } or assume_role_with_web_identity { ... } blocks); or,
  • a combination of the two.

In particular, the AWS provider already supports role chaining. For example, if the AWS_WEB_IDENTITY_TOKEN_FILE etc. environment variables are provided in combination with a Terraform configuration containing an assume_role { ... } block, the provider will first use the environment variables to perform AssumeRoleWithWebIdentity into the "initial" role A, and then use role A's short-lived credentials to perform AssumeRole into role B.
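The two-hop flow can be sketched with stubbed-out calls; the function names, field names, and ARNs below are purely illustrative, not this library's actual API:

```go
package main

import "fmt"

// creds is a minimal stand-in for an STS credential set.
type creds struct {
	accessKeyID string
	roleARN     string
}

// assumeRoleWithWebIdentity stands in for the first hop: the provider reads
// AWS_ROLE_ARN, AWS_WEB_IDENTITY_TOKEN_FILE etc. from the environment and
// exchanges the token for role A's short-lived credentials.
func assumeRoleWithWebIdentity(roleARN, token string) creds {
	return creds{accessKeyID: "ASIA-initial", roleARN: roleARN}
}

// assumeRole stands in for the second hop: role A's credentials authenticate
// the AssumeRole call into the role named by the assume_role block.
func assumeRole(with creds, roleARN string) creds {
	return creds{accessKeyID: "ASIA-chained", roleARN: roleARN}
}

// chain wires the two hops together in the order described above.
func chain(envRoleARN, envToken, configRoleARN string) creds {
	roleA := assumeRoleWithWebIdentity(envRoleARN, envToken)
	return assumeRole(roleA, configRoleARN)
}

func main() {
	final := chain("arn:aws:iam::111111111111:role/A", "token",
		"arn:aws:iam::222222222222:role/B")
	fmt.Println("final role:", final.roleARN)
}
```

The point of the sketch is only the ordering: the environment drives hop one, the Terraform configuration drives hop two.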

Also, environment variables are fixed for the duration of a Terraform execution; this might seem obvious, but the constraint is important below.

Problem statement

When a single Terraform configuration contains multiple provider "aws" { ... } instances that can each obtain their "initial" set of credentials purely from environment variables (regardless of whether that initial set is then used to role chain into a second set of credentials), these "initial" sets of credentials should in theory be equivalent to each other.

This equivalence is both in terms of IAM permissions (because the same environment variables would lead to assuming the same role) and in terms of the STS session (since all provider blocks are parsed at the beginning, so any time differences in session initialisation would be negligible anyway).
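This equivalence can be made concrete: a cache key derived only from the relevant AWS_* environment variables is identical for every provider block that supplies no credential configuration of its own. The key-derivation scheme below is a hypothetical illustration, not the library's actual caching logic:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// initialCredentialKey derives a cache key from only the environment
// variables that feed the "initial" credential lookup. Every provider
// instance in the same Terraform execution sees the same environment,
// so every instance derives the same key.
func initialCredentialKey(env map[string]string) string {
	relevant := []string{
		"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN",
		"AWS_ROLE_ARN", "AWS_ROLE_SESSION_NAME", "AWS_WEB_IDENTITY_TOKEN_FILE",
	}
	sort.Strings(relevant) // deterministic ordering for a stable key
	var parts []string
	for _, k := range relevant {
		parts = append(parts, k+"="+env[k])
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return fmt.Sprintf("%x", sum)
}

func main() {
	env := map[string]string{
		"AWS_ROLE_ARN":                "arn:aws:iam::111111111111:role/A",
		"AWS_WEB_IDENTITY_TOKEN_FILE": "/var/run/token",
	}
	// Same environment, two provider instances: identical keys.
	fmt.Println(initialCredentialKey(env) == initialCredentialKey(env))
}
```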

Despite this equivalence, with the current implementation, each provider instance will blissfully ignore the other instances and proceed to obtain its own set of "initial" credentials.

Not only are these duplicated API calls inefficient, but at large scale (20+ providers, think cross-account role chaining) it can unnecessarily lead to AWS STS throttling the API requests (see hashicorp/terraform-provider-aws#27071 etc.) and ultimately failing the Terraform execution.

Scenario A (no role chaining)
provider "aws" {
  alias = "alias1"
}

provider "aws" {
  alias = "alias2"
}

provider "aws" {
  alias = "alias3"
}
$ export AWS_ROLE_ARN=...
$ export AWS_ROLE_SESSION_NAME=...
$ export AWS_WEB_IDENTITY_TOKEN_FILE=...
$ terraform plan

Currently: invokes STS AssumeRoleWithWebIdentity API 3 times (same role).
Potentially: invokes STS AssumeRoleWithWebIdentity API 1 time.

Scenario B (with role chaining)
provider "aws" {
  alias = "alias1"

  assume_role {
    role_arn = <role A>
  }
}

provider "aws" {
  alias = "alias2"

  assume_role {
    role_arn = <role B>
  }
}

provider "aws" {
  alias = "alias3"

  assume_role {
    role_arn = <role C>
  }
}
$ export AWS_ROLE_ARN=...
$ export AWS_ROLE_SESSION_NAME=...
$ export AWS_WEB_IDENTITY_TOKEN_FILE=...
$ terraform plan

Currently: invokes STS AssumeRoleWithWebIdentity API 3 times (same role) and invokes STS AssumeRole API 3 times (different roles).
Potentially: invokes STS AssumeRoleWithWebIdentity API 1 time and invokes STS AssumeRole API 3 times (different roles).

Potential Library Implementation

(Apologies in advance if I've misunderstood the code...)

From my understanding, the GetAwsConfig() (aws_config.go) function seems to orchestrate the role chaining (if needed), and the getCredentialsProvider() (credentials.go) function seems to handle the parsing/merging of environment variables with Terraform configuration.

Between these two functions, there seems to already be a concept of cached credentials, but the problems are:

  1. the "cache key" depends on the final parsed/merged environment variables and Terraform configuration (i.e. in the role chaining scenario, even though the "initial" configurations may be equivalent, the final parsed/merged configurations would be different and therefore they would cache miss); and,
  2. the caching only works in the scope of role chaining (i.e. within the scope of GetAwsConfig()) for a single provider instance, but is not shareable between multiple provider instances.

If there were some way to 1) detect and short-circuit when the "initial" credentials only depend on environment variables, and 2) apply memoization / flyweight pattern / singleton pattern to share the cache space across multiple provider instances, then the problem would be solved.
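A minimal sketch of point 2, assuming a process-wide memoized cache keyed by an env-derived string (all names here are hypothetical, not this library's actual API; the simulated STS call stands in for the real AssumeRoleWithWebIdentity request):

```go
package main

import (
	"fmt"
	"sync"
)

// credentials stands in for the resolved "initial" credential set.
type credentials struct{ accessKeyID string }

var (
	mu    sync.Mutex
	cache = map[string]*sync.Once{}
	store = map[string]credentials{}
	calls int // counts simulated STS invocations
)

// fetchInitialCredentials simulates the expensive STS call.
func fetchInitialCredentials(envKey string) credentials {
	calls++
	return credentials{accessKeyID: "ASIA-" + envKey}
}

// sharedInitialCredentials memoizes the initial credential lookup per
// env-derived key, so N provider instances trigger one STS call.
func sharedInitialCredentials(envKey string) credentials {
	mu.Lock()
	once, ok := cache[envKey]
	if !ok {
		once = new(sync.Once)
		cache[envKey] = once
	}
	mu.Unlock()
	once.Do(func() {
		c := fetchInitialCredentials(envKey)
		mu.Lock()
		store[envKey] = c
		mu.Unlock()
	})
	mu.Lock()
	defer mu.Unlock()
	return store[envKey]
}

func main() {
	for i := 0; i < 3; i++ { // three provider instances, same environment
		sharedInitialCredentials("env-hash")
	}
	fmt.Println("STS calls:", calls) // one call, not three
}
```

As the maintainer response below notes, each provider instance runs in its own process, so a process-wide cache like this would not actually be shared across instances; the sketch only illustrates the shape of the requested behaviour.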

Potential Terraform Backend/Provider Configuration

See Scenario A (no role chaining) and Scenario B (with role chaining) above.

References

Other notes

We could (and in fact, currently do, as a temporary workaround) achieve the same objective by pre-authenticating outside of Terraform, e.g.:

$ export AWS_ROLE_ARN=...
$ export AWS_ROLE_SESSION_NAME=...
$ export AWS_WEB_IDENTITY_TOKEN=...
$ creds=$(aws sts assume-role-with-web-identity ... \
    --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' --output text)
$ read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$creds"
$ export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
$ terraform plan

which successfully achieves the one STS AssumeRoleWithWebIdentity call plus N STS AssumeRole calls (role chaining), but obviously this solution adds dependencies to the Terraform environment (e.g. the AWS CLI) and is unnecessarily cumbersome.

@theipster added the enhancement label on Mar 18, 2023
@gdavison
Contributor

Thanks for all of the work you've put into this, @theipster. However, because of the way that Terraform works with the provider plugins (https://developer.hashicorp.com/terraform/plugin/how-terraform-works#terraform-plugins), each instance of the provider is a separate process, so they unfortunately aren't able to share these steps across instances.

We do now support retries on authentication, which may resolve this problem. The number of retries can also be configured using the environment variable AWS_MAX_ATTEMPTS.

@gdavison closed this as not planned on Mar 21, 2023
@theipster
Author

theipster commented Mar 27, 2023

@gdavison Thanks for the quick response, I'll have a read through the plugin documentation.

That said though, ignoring "the way that Terraform works with the provider plugins" (which I perhaps selfishly interpret as: implementation detail), do you not agree that from a black box / an outside perspective (e.g. from AWS STS's perspective) these STS API calls are effectively duplicated and unnecessary?

In that sense, I appreciate that the sharing of credentials across provider instances might not be a valid solution in the solution space, but that doesn't detract from the duplicate STS API calls being a valid problem in the problem space.

Furthermore, this problem does arguably lie within the scope of this library, because the entire authentication process from start (the initial reading of the environment variables) to finish (making the STS API calls) is wholly orchestrated by this library and not, say, the AWS SDK for Go library. I can't see where else this responsibility could lie / be fixed instead?

The AWS_MAX_ATTEMPTS approach feels firmly like treating the symptom (i.e. knowingly letting the duplicate STS API calls happen, and then dealing with the fallout) rather than treating the cause (i.e. just stop making duplicate STS API calls in the first place).

For the benefit of everyone else chasing their tails at this rate-limiting problem, can we please at least make it explicitly clear (by merging #370 perhaps 😉) that the API calls are being knowingly invoked by this library and at least acknowledging that there is room for improvement in this space?
