Add static stability support to IMDS credentials provider #2191

Closed

ysaito1001
Contributor

Motivation and Context

Implements #2117

Description

This PR adds static stability support to the IMDS credentials provider. The basic idea is that ImdsCredentialsProvider now holds on to the last retrieved credentials, and if it cannot fetch the latest credentials from IMDS, possibly because the service is unavailable, it serves the last retrieved credentials instead. That way, the Rust SDK does not fail fast when it would otherwise send a request with expired credentials; it sends the request and lets the target service make the ultimate decision as to whether the request is valid.
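
To make the fallback concrete, here is a minimal, synchronous sketch of the idea. It is not the code in this PR: `Credentials`, `StaticallyStableProvider`, and the `fetch` closure are hypothetical stand-ins, and the real provider is async.

```rust
use std::sync::RwLock;
use std::time::SystemTime;

/// Simplified, hypothetical stand-in for the SDK's credentials type.
#[derive(Clone, Debug, PartialEq)]
pub struct Credentials {
    pub access_key_id: String,
    pub expiry: SystemTime,
}

/// Hypothetical provider that remembers the last credentials it fetched.
pub struct StaticallyStableProvider<F> {
    fetch: F,
    last_retrieved: RwLock<Option<Credentials>>,
}

impl<F> StaticallyStableProvider<F>
where
    F: Fn() -> Result<Credentials, String>,
{
    pub fn new(fetch: F) -> Self {
        Self {
            fetch,
            last_retrieved: RwLock::new(None),
        }
    }

    /// Returns fresh credentials when the fetch succeeds. When it fails,
    /// returns the last retrieved credentials if there are any (even if
    /// they have expired), and the fetch error otherwise.
    pub fn provide_credentials(&self) -> Result<Credentials, String> {
        match (self.fetch)() {
            Ok(creds) => {
                // Remember the credentials for later failures.
                *self.last_retrieved.write().unwrap() = Some(creds.clone());
                Ok(creds)
            }
            Err(err) => {
                let cached = self.last_retrieved.read().unwrap().clone();
                cached.ok_or(err)
            }
        }
    }
}
```

In this sketch, a provider whose first fetch succeeds and whose second fetch fails returns the same credentials both times, while a provider that has never fetched successfully surfaces the fetch error.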

Even when IMDS is available, it may return stale credentials. Static stability support specifies that ImdsCredentialsProvider should extend the credentials expiry by a calculated amount in that case. This is implemented in ImdsCredentialsProvider::extend_expiration, which follows the corresponding logic in botocore.
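
As a rough illustration of what extending the expiry can look like, the sketch below pushes an expiry that is too close (or already past) out to "now plus a refresh window plus jitter". The constant `REFRESH_WINDOW`, the explicit `jitter` parameter, and the function signature are assumptions made for the example; they are not the values or the signature used by `ImdsCredentialsProvider::extend_expiration` or by botocore.

```rust
use std::time::{Duration, SystemTime};

/// Illustrative refresh window; an assumption, not the value used in the PR.
const REFRESH_WINDOW: Duration = Duration::from_secs(10 * 60);

/// Returns the expiry to attach to credentials received from IMDS.
///
/// If the reported `expiry` is already later than `now + REFRESH_WINDOW + jitter`,
/// it is returned unchanged. Otherwise (including when it is in the past),
/// the expiry is extended to `now + REFRESH_WINDOW + jitter`.
/// `jitter` is passed in so the example stays deterministic; real code
/// would draw it at random.
pub fn extend_expiration(expiry: SystemTime, now: SystemTime, jitter: Duration) -> SystemTime {
    let extended = now + REFRESH_WINDOW + jitter;
    std::cmp::max(expiry, extended)
}
```

With a two-minute jitter, credentials that expired a minute ago get an expiry twelve minutes from now, while credentials that are still valid for an hour keep their original expiry.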

The code changes in the PR that accomplish the above are mostly in aws-config. However, we found that those changes alone are not enough to meet the requirements for static stability support. Specifically, the timeout logic in LazyCredentialsCache prior to this PR did not work well with the timeout logic required by the updated ImdsCredentialsProvider (see cfeab3b for more details). The bottom line is that additional changes have been made in aws-credential-types, and the ProvideCredentials trait now has a new method, provide_credentials_with_timeout, to address this impedance mismatch.

Testing

  • Expired Credential Test Cases
    • SDK can use IMDS credential provider if first IMDS call returns expired credentials (implemented by test_request_should_be_sent_when_first_call_to_imds_returns_expired_credentials).
    • SDK can send request if expired credentials are available (there's no explicit test added for this as it is implicitly handled by two tests for Refresh Failures).
    • SDK can perform 3 successive requests with expired credentials. IMDS must only be called once (implemented by test_successive_requests_should_be_sent_with_expired_credentials_and_imds_being_called_only_once).
  • Refresh Failures
    • SDK can send a request after a read timeout during a refresh (implemented by read_timeout_during_credentials_refresh_should_yield_last_retrieved_credentials).
    • SDK can send a request after receiving a 500 HTTP response from IMDS during a refresh (implemented by test_request_should_be_sent_with_expired_credentials_after_imds_returns_500_during_credentials_refresh).
  • Logging
    • SDK must log a message when expired credentials are used (implemented by log_message_informing_expired_credentials_are_used).
  • Passed tests in CI

Checklist

  • I have updated CHANGELOG.next.toml if I made changes to the AWS SDK, generated SDK code, or SDK runtime crates

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

This commit removes logic of provider timeout from `LazyCredentialsCache`.
Prior to the commit, it raced provider's `provide_credentials` method
against a timeout future; if the timeout future won, it always yielded
a `ProviderTimeOut`. However, this did not work well with the timeout
behavior of `ImdsCredentialsProvider` described in the static stability
design, which says it should use expired credentials even in the face of
a credentials read timeout.

The logic in question has been moved to the `ProvideCredentials` trait
with the aim of each trait implementer defining what to do in the case of
read timeout.
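
The difference between the two placements of the timeout can be sketched as follows. This is a synchronous approximation using threads and channels rather than the futures the SDK actually races; `CredentialsError`, `race_against_timeout`, and `imds_style_provide` are hypothetical names for the example.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum CredentialsError {
    ProviderTimedOut(Duration),
    Other(String),
}

/// Credentials are modeled as a plain string to keep the sketch short.
pub type Creds = String;

/// Cache-level behavior before the commit: race the fetch against a
/// timeout and report a timeout error whenever the timeout wins.
pub fn race_against_timeout<F>(fetch: F, timeout: Duration) -> Result<Creds, CredentialsError>
where
    F: FnOnce() -> Result<Creds, CredentialsError> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    // The fetch runs on a helper thread; a late result is simply dropped.
    thread::spawn(move || {
        let _ = tx.send(fetch());
    });
    match rx.recv_timeout(timeout) {
        Ok(result) => result,
        Err(mpsc::RecvTimeoutError::Timeout) => Err(CredentialsError::ProviderTimedOut(timeout)),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(CredentialsError::Other(
            "fetch thread exited without a result".to_string(),
        )),
    }
}

/// Provider-level behavior after the commit: the provider decides what a
/// timeout means. Here it serves the last retrieved credentials, if any.
pub fn imds_style_provide<F>(
    fetch: F,
    timeout: Duration,
    last_retrieved: Option<Creds>,
) -> Result<Creds, CredentialsError>
where
    F: FnOnce() -> Result<Creds, CredentialsError> + Send + 'static,
{
    match (race_against_timeout(fetch, timeout), last_retrieved) {
        (Err(CredentialsError::ProviderTimedOut(_)), Some(stale)) => Ok(stale),
        (result, _) => result,
    }
}
```

The point of the sketch is only that the fallback has to live next to the stored credentials: a generic race in the cache has no access to them and can do nothing but fail.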

This commit enhances the reliability of `ImdsCredentialsProvider`. It
allows requests to be dispatched even with expired credentials. This in
turn allows the target service to make the ultimate decision as to
whether requests sent are valid or not, rather than the client SDK
determining their validity.

The basic idea is that `ImdsCredentialsProvider` now stores the last
retrieved credentials and can provide them if it cannot reach the IMDS
endpoint.

This commit adds integration tests exercising use cases concerned with
static stability. They use S3 as an example service for which credentials
retrieved from IMDS are used, but it can be any service.

This commit reverts the attribute `#[cfg(feature = "test-util")]` given
to `aws_credential_types::lazy_caching::Builder::time_source` to make it
`#[doc(hidden)]` as the method is used in `aws_config::ConfigLoader` where
`#[cfg(feature = "test-util")]` cannot be specified.
@github-actions

A new generated diff is ready to view.

A new doc preview is ready to view.

timeout: Duration,
) -> provider::Result {
for (name, provider) in &self.providers {
let timeout_per_provider = timeout.div_f64(self.providers.len() as f64);
Contributor Author

This cannot be a simple division, as each provider has a different idea of how long its timeout should be. This implies that we should consider, in a broader scope, what it means for a credentials provider to time out.
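
A small worked example of why an equal split is a poor fit. The helper names and the numbers are made up for illustration; only `Duration::div_f64` comes from the snippet under review.

```rust
use std::time::Duration;

/// Equal share of the overall timeout, as in the snippet under review.
pub fn equal_share(total: Duration, provider_count: usize) -> Duration {
    assert!(provider_count > 0, "the chain must contain at least one provider");
    total.div_f64(provider_count as f64)
}

/// Hypothetical check: does a provider that needs `needed` to complete
/// get less than that from an equal split?
pub fn is_starved(needed: Duration, total: Duration, provider_count: usize) -> bool {
    equal_share(total, provider_count) < needed
}
```

With a 5 second overall timeout and four providers, each one gets 1.25 seconds, so a provider that needs 2 seconds for its own retries is cut short even when the providers ahead of it fail immediately.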

Contributor Author

ysaito1001 Jan 10, 2023

4592ca5 at least preserves today's behavior, but the pair of APIs, provide_credentials and provide_credentials_with_timeout, needs to be reconsidered.

This commit moves tests in aws/sdk/integration-tests/imds to
aws/sdk/integration-tests/s3, as there are no IMDS integration
tests of code generation.

This commit addresses #2191 (comment).
It attempts to preserve the same read timeout behavior for
`CredentialsProviderChain` but certainly is not the best way to implement
the read timeout aware credentials provider API.
@github-actions

A new generated diff is ready to view.

A new doc preview is ready to view.

Ok(creds) => creds,
_ => match &*self.last_retrieved_credentials.read().await {
Some(creds) => Ok(creds.clone()),
_ => Err(CredentialsError::provider_timed_out(timeout)),
Collaborator

Seems like this should return the original error rather than a timeout error since this error case can also happen in non-timeout cases (such as the IMDS calls failing).

Contributor Author

this error case can also happen in non-timeout cases

Hmm, I could be wrong but isn't the original error (one coming out of non-timeout cases) captured in creds whose type is Result<Credentials, CredentialsError>?
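
To spell out that reading, here is a sketch under the assumption that the raced value has two layers: an outer `Result` that is `Err` only when the timeout wins, and an inner `Result` that carries whatever the provider itself returned. The types and the `resolve` function are hypothetical and only mirror the shape of the snippet under review.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum CredentialsError {
    ProviderTimedOut(Duration),
    ProviderError(String),
}

/// Marker for "the timeout future won the race" (hypothetical).
#[derive(Debug)]
pub struct TimedOut;

/// `raced` is `Err(TimedOut)` only on a timeout; a provider failure
/// arrives as `Ok(Err(original_error))` and is passed through untouched.
pub fn resolve(
    raced: Result<Result<String, CredentialsError>, TimedOut>,
    last_retrieved: Option<String>,
    timeout: Duration,
) -> Result<String, CredentialsError> {
    match raced {
        // Provider finished in time: return its result, success or original error.
        Ok(provider_result) => provider_result,
        // Timeout: fall back to stored credentials, else report the timeout.
        Err(TimedOut) => last_retrieved.ok_or(CredentialsError::ProviderTimedOut(timeout)),
    }
}
```

Under that assumption, a timeout error can only be produced on the timeout branch, and a non-timeout failure keeps its original error.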

// does not respect provider-specific read timeout behavior, e.g. the IMDS credentials provider
// wants to provide expired credentials, if any, in the case of read timeout.
let sleep_future = sleeper.sleep(timeout);
let timeout_future = Timeout::new(
Collaborator

Won't this top-level timeout cause a race condition with the IMDS provider's timeout?

Contributor Author

Yeah, I realized that after pushing the commit. One possible workaround is to update the signature of provide_credentials_with_timeout to take something like aws_smithy_async::rt::Sleep, but sharable.
CredentialsProviderChain::provide_credentials_with_timeout would then pass it through the chain, "passing the torch" from one provider to the next.
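
One way to picture "passing the torch" is shown below. Note that this sketch swaps in a shared deadline for the sharable sleep mentioned above, because a deadline is easy to show with the standard library alone; `Deadline` and `provide_from_chain` are hypothetical names, not part of any SDK crate.

```rust
use std::time::{Duration, Instant};

/// A single deadline shared by every provider in the chain (hypothetical).
pub struct Deadline {
    end: Instant,
}

impl Deadline {
    pub fn after(timeout: Duration) -> Self {
        Self {
            end: Instant::now() + timeout,
        }
    }

    /// Time left before the deadline, or zero once it has passed.
    pub fn remaining(&self) -> Duration {
        self.end.saturating_duration_since(Instant::now())
    }
}

/// Tries each provider in order, handing the same deadline to each one.
/// A provider returns `None` when it has no credentials to offer.
pub fn provide_from_chain(
    providers: &[fn(&Deadline) -> Option<String>],
    deadline: &Deadline,
) -> Option<String> {
    for provider in providers {
        // Stop as soon as the overall budget is used up.
        if deadline.remaining().is_zero() {
            return None;
        }
        if let Some(creds) = provider(deadline) {
            return Some(creds);
        }
    }
    None
}
```

Because every provider sees the same deadline, there is one clock instead of a top-level timer racing against provider-level timers, and each provider can still decide for itself what to do when `remaining()` runs out.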

Regardless of whether the above idea is implemented as part of the PR, we will work on an RFC because the design around timeout and credentials providers affects not just ImdsCredentialsProvider.

@ysaito1001
Contributor Author

Will close this draft for now. Need to rework which method to add to the ProvideCredentials trait. provide_credentials_with_timeout means delegating the overall timeout to the individual provider, which can be tricky to get right in the face of races.
