Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock acquiring credentials when GlobalRuntimeDependencyRegistry used #3153

Open
martincostello opened this issue Jan 21, 2024 · 4 comments
Assignees
Labels
bug This issue is a bug. credentials needs-investigation p2 This is a standard priority issue queued

Comments

@martincostello
Copy link
Contributor

martincostello commented Jan 21, 2024

Describe the bug

While doing research into publishing a .NET 8 application using native AoT to EKS, I found at deploy-time that the application would fail on a code path we have in an internal library where the AWS SDK would attempt to get credentials using STS. The exception would be the following:

Unhandled Exception: Amazon.RuntimeDependencies.MissingRuntimeDependencyException: Operation failed because of a missing runtime dependency. In Native AOT builds runtime dependencies can not be dynamically loaded from assembles. Instead the runtime dependency needs to be explicitly registered. To complete this operation register an instance of Amazon.SecurityToken.AmazonSecurityTokenServiceClient from package AWSSDK.SecurityToken using the operation Amazon.RuntimeDependencies.GlobalRuntimeDependencyRegistry.Instance.RegisterSecurityTokenServiceClient.

I added the code mentioned to the application and redeployed. At this point it appeared to no longer be responding and crash-looping.

Through various trial and error, I've distilled the problematic code path down to the code snippet below.

My hunch is that the code path is somehow re-entrantly trying to access this lock, causing it to block indefinitely:

public override async System.Threading.Tasks.Task<ImmutableCredentials> GetCredentialsAsync()
{
await _updateGeneratedCredentialsSemaphore.WaitAsync().ConfigureAwait(false);

That's just a hunch - the issue could be somewhere completely different, but it's the first bit of synchronization I could find on the method call that is blocking. It also blocks using the synchronous version, GetCredentials(), so I'm pretty sure it's not something to do with sync-over-async.

The code appears to deadlock regardless of whether native AoT is actually used at runtime or not.

Expected Behavior

The operation does not block.

With the repro, the following messages should be printed to the console:

GlobalRuntimeDependencyRegistry.RegisterSecurityTokenServiceClient()
AssumeRoleWithWebIdentityCredentials.FromEnvironmentVariables()
AssumeRoleWithWebIdentityCredentials.GetCredentialsAsync()
Credentials obtained

Current Behavior

The application blocks on the call to AssumeRoleWithWebIdentityCredentials.GetCredentialsAsync().

GlobalRuntimeDependencyRegistry.RegisterSecurityTokenServiceClient()
AssumeRoleWithWebIdentityCredentials.FromEnvironmentVariables()
AssumeRoleWithWebIdentityCredentials.GetCredentialsAsync()

Reproduction Steps

Run the following code within EKS where the instance credentials are available.

using Amazon.Runtime;
using Amazon.RuntimeDependencies;
using Amazon.SecurityToken;

Console.WriteLine("GlobalRuntimeDependencyRegistry.RegisterSecurityTokenServiceClient()");

GlobalRuntimeDependencyRegistry.Instance.RegisterSecurityTokenServiceClient(
    (_) =>
        new AmazonSecurityTokenServiceClient());

Console.WriteLine("AssumeRoleWithWebIdentityCredentials.FromEnvironmentVariables()");

var credentials = AssumeRoleWithWebIdentityCredentials.FromEnvironmentVariables();

Console.WriteLine("AssumeRoleWithWebIdentityCredentials.GetCredentialsAsync()");

_ = await credentials.GetCredentialsAsync();

Console.WriteLine("Credentials obtained");

Possible Solution

No response

Additional Information/Context

No response

AWS .NET SDK and/or Package version used

AWSSDK.SecurityToken 3.7.300.38

Targeted .NET Platform

.NET 8

Operating System and version

Ubuntu 22.04.3 LTS

@martincostello martincostello added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 21, 2024
@martincostello
Copy link
Contributor Author

martincostello commented Jan 22, 2024

Doing a bit more digging, this might be a case of more of a docs/footgun issue than a bug per-se.

Digging around some more, I see here that when not using native AoT, the client is created with explicit anonymous credentials:

ServiceClientHelpers.STS_ASSEMBLY_NAME, ServiceClientHelpers.STS_SERVICE_CLASS_NAME, new AnonymousAWSCredentials(), region);

However, in my repro, I'm just calling the constructor with a region, it calls back into FallbackCredentialsFactory, which is what I traced through in a decompiler to get to the minimal repro in the first place:

: base(FallbackCredentialsFactory.GetCredentials(), new AmazonSecurityTokenServiceConfig{RegionEndpoint = region}) { }

I haven't tried changing my code yet to do the same, but it raises a few points

  1. From the original error message, it isn't clear that the user needs to do this to prevent a loop when default/fallback credentials are used. It seems like an implementation detail the average user wouldn't know of or find.
  2. Without me digging more in a debugger, if this is the case, I would have thought it would have caused a StackOverflowException rather than hang (unless the runtime is just failing to show that, so it appears to be hanging).

If this does turn out to be what is happening, then I guess the bug is less a deadlock, and more a need for clearer documentation on what the user needs to do in this case.

@martincostello
Copy link
Contributor Author

Yep, that's what it was.

If I change the code to this, the problem goes away:

using Amazon.Runtime;
using Amazon.RuntimeDependencies;
using Amazon.SecurityToken;

Console.WriteLine("GlobalRuntimeDependencyRegistry.RegisterSecurityTokenServiceClient()");

GlobalRuntimeDependencyRegistry.Instance.RegisterSecurityTokenServiceClient(
-   (_) =>
-       new AmazonSecurityTokenServiceClient());
+   (context) =>
+       new AmazonSecurityTokenServiceClient(
+           new AnonymousAWSCredentials(),
+           context.SecurityTokenServiceClientContextData.Region));

Console.WriteLine("AssumeRoleWithWebIdentityCredentials.FromEnvironmentVariables()");

var credentials = AssumeRoleWithWebIdentityCredentials.FromEnvironmentVariables();

Console.WriteLine("AssumeRoleWithWebIdentityCredentials.GetCredentialsAsync()");

_ = await credentials.GetCredentialsAsync();

Console.WriteLine("Credentials obtained");

Maybe it would help give a hint for the user in this case if SecurityTokenServiceClientContext included a property for the credentials to use, which could be set to an instance of AnonymousAWSCredentials?

public class SecurityTokenServiceClientContext

public class SecurityTokenServiceClientContext
{
    public enum ActionContext { AssumeRoleAWSCredentials, AssumeRoleWithWebIdentityCredentials, FederatedAWSCredentials };
    public ActionContext Action { get; set; }
    public RegionEndpoint Region { get; set; }
    public IWebProxy ProxySettings { get; set; }
+   public AWSCredentials Credentials { get; set; } = new AnonymousAWSCredentials();
}

@dscpinheiro dscpinheiro added needs-review and removed needs-triage This issue or PR still needs to be triaged. labels Jan 24, 2024
@bhoradc bhoradc added needs-investigation p2 This is a standard priority issue and removed needs-review labels Jan 26, 2024
@normj
Copy link
Member

normj commented Jan 27, 2024

The STS dependency is the hardest one to deal with in the SDK for AOT or more specifically now the GlobalRuntimeDependencyRegistry. Even in the assume role case, which requires AWS credentials and not the anonymous credentials, you want to make sure your not just using the default credentials because that would go back to the profile that needs to assume a role instead of the profile that should make the STS assume role call.

I agree this area needs some smoothing out and at the very least better documentation.

@bhoradc bhoradc added the queued label Jan 29, 2024
@beeradmoore
Copy link

If you are using code that gets the role from the running instance which has role assumed in its execution context like Fargate or Lambda (which I believe is using assume role?), is the current correct action to keep <Assembly Name="AWSSDK.SecurityToken" Dynamic="Required All"></Assembly> in the rd.xml file? Or is there another way to use AssumeRoleAWSCredentials without hardcoding the role name itself?

Side note, it's crazy to me that this GitHub issue is the only reference found on Google for RegisterSecurityTokenServiceClient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. credentials needs-investigation p2 This is a standard priority issue queued
Projects
None yet
Development

No branches or pull requests

5 participants