[BUG] ManagedIdentityCredential fails sometimes in AKS #18312

Closed
PSanetra opened this issue Feb 1, 2021 · 15 comments
Assignees
Labels

  • Azure.Identity
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@PSanetra

PSanetra commented Feb 1, 2021

Describe the bug
We are using the ManagedIdentityCredential class to get access tokens for managed identities. We are deploying the application to AKS, where we are using https://github.com/Azure/aad-pod-identity.

The instance of the ManagedIdentityCredential is a singleton.

Sometimes after starting a new pod, we get the exception Azure.Identity.CredentialUnavailableException: ManagedIdentityCredential authentication unavailable. No Managed Identity endpoint found. every time the pod tries to get an access token. If the pod manages to start without this exception, the exception is never observed during the lifetime of the pod.

The problem seems to be that there can be a delay after pod startup before the IMDS endpoint becomes available to the pod in AKS. When the pod tries to get an access token before the endpoint is available, the credential ends up in a bad state from which it can never recover.

The cause of the issue is probably this code:

private protected virtual async ValueTask<ManagedIdentitySource> GetManagedIdentitySourceAsync(bool async, CancellationToken cancellationToken)
{
    using var asyncLock = await _identitySourceAsyncLock.GetLockOrValueAsync(async, cancellationToken).ConfigureAwait(false);
    if (asyncLock.HasValue)
    {
        return asyncLock.Value;
    }

    ManagedIdentitySource identitySource = AppServiceV2017ManagedIdentitySource.TryCreate(_options) ??
        CloudShellManagedIdentitySource.TryCreate(_options) ??
        AzureArcManagedIdentitySource.TryCreate(_options) ??
        ServiceFabricManagedIdentitySource.TryCreate(_options) ??
        await ImdsManagedIdentitySource.TryCreateAsync(_options, async, cancellationToken).ConfigureAwait(false);

    asyncLock.SetValue(identitySource);
    return identitySource;
}

The ManagedIdentityClient tries several strategies to get a ManagedIdentitySource. If all of them fail, it caches null in _identitySourceAsyncLock and will therefore never try to resolve the ManagedIdentitySource again.

Expected behavior
The exception should not occur, or it should be possible to recover from this failed state once the IMDS endpoint becomes available.

Actual behavior
The exception occurs and it is not possible to recover from the failed state.

To Reproduce

  1. Deploy AKS
  2. Deploy the aad-pod-identity services
  3. Deploy a pod using the ManagedIdentityCredentials
  4. Pray for observing the exception
  5. Observe the exception

Environment:

  • AKS 1.17.11
  • AAD Pod Identity 1.7.1
  • Azure.Identity 1.2.3
@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Feb 1, 2021
PSanetra added a commit to PSanetra/azure-sdk-for-net that referenced this issue Feb 1, 2021
@jsquire jsquire added Azure.Identity Client This issue points to a problem in the data-plane of the library. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Feb 1, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Feb 1, 2021
@jsquire
Member

jsquire commented Feb 1, 2021

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@christothes
Member

Hi @PSanetra - In your testing, how long does it take the IMDS endpoint to become available? The reason I ask is that I am wondering if changing to a less aggressive timeout would be a better way to approach this issue.

We currently time out after 1 second; however, the recommendation is to retry in the event of a timeout.
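
(For context, the availability check is conceptually just an HTTP request to the well-known IMDS address with a short timeout. The sketch below is only an illustration of that idea, not the SDK's actual probe code:)

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: roughly what "is the IMDS endpoint reachable?" means.
// The real probe in Azure.Identity differs in detail (headers, status handling).
static async Task<bool> IsImdsReachableAsync()
{
    using var http = new HttpClient();
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(1)); // ~1 second probe timeout
    try
    {
        await http.GetAsync(
            "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01",
            cts.Token).ConfigureAwait(false);
        return true; // any response means something is listening on the IMDS address
    }
    catch (OperationCanceledException)
    {
        return false; // no response within the timeout
    }
    catch (HttpRequestException)
    {
        return false; // connection failed outright
    }
}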

@PSanetra
Author

PSanetra commented Feb 1, 2021

Hi @christothes, I am not sure how long it takes for the endpoint to become available in these problematic edge cases. My gut feeling is that the pods currently start in a correct state about 7 out of 10 times.

I think most of the time the first attempt to get an access token is triggered by a readiness probe, and our readiness probes are configured to run after an initial delay of 5 seconds.

@christothes
Member

It seems like either your initialization code would need to implement a retry or ImdsManagedIdentitySource could just handle that. I think my preference would be to have ImdsManagedIdentitySource handle the retry.

@PSanetra
Author

PSanetra commented Feb 1, 2021

@christothes I am not sure if a less aggressive timeout would help. I think it would be better to quickly throw an exception and let the application decide to retry.

It might cause some weird trouble with ChainedTokenCredential if ImdsManagedIdentitySource handled the retry, as it would block trying other TokenCredentials, wouldn't it?

I think for our services it would not be a big problem to retry, as they wouldn't report as ready until an access token could be obtained for the first time.

@christothes
Member

@christothes I am not sure if a less aggressive timeout would help. I think it would be better to quickly throw an exception and let the application decide to retry.

It might cause some weird trouble with ChainedTokenCredential if ImdsManagedIdentitySource handled the retry, as it would block trying other TokenCredentials, wouldn't it?

I think for our services it would not be a big problem to retry, as they wouldn't report as ready until an access token could be obtained for the first time.

Could you expand a bit more on the scenario where the retry policy would block trying other TokenCredentials? Since ImdsManagedIdentitySource only gets invoked when all other request mechanisms fail, and other code paths requiring ImdsManagedIdentitySource would be blocked either way, I'm wondering how this would occur. There may be a scenario here I am unfamiliar with.

ManagedIdentitySource identitySource = AppServiceV2017ManagedIdentitySource.TryCreate(_options) ??
    CloudShellManagedIdentitySource.TryCreate(_options) ??
    AzureArcManagedIdentitySource.TryCreate(_options) ??
    ServiceFabricManagedIdentitySource.TryCreate(_options) ??
    await ImdsManagedIdentitySource.TryCreateAsync(_options, async, cancellationToken).ConfigureAwait(false);

@PSanetra
Author

PSanetra commented Feb 1, 2021

@christothes The scenario I had in mind was using DefaultAzureCredential.

So when I try to get an access token via a DefaultAzureCredential, it tries to get a token via EnvironmentCredential, ManagedIdentityCredential, SharedTokenCacheCredential, and so on.

private static TokenCredential[] GetDefaultAzureCredentialChain(DefaultAzureCredentialFactory factory, DefaultAzureCredentialOptions options)
{
    if (options is null)
    {
        return s_defaultCredentialChain;
    }

    int i = 0;
    TokenCredential[] chain = new TokenCredential[7];

    if (!options.ExcludeEnvironmentCredential)
    {
        chain[i++] = factory.CreateEnvironmentCredential();
    }

    if (!options.ExcludeManagedIdentityCredential)
    {
        chain[i++] = factory.CreateManagedIdentityCredential(options.ManagedIdentityClientId);
    }

    if (!options.ExcludeSharedTokenCacheCredential)
    {
        chain[i++] = factory.CreateSharedTokenCacheCredential(options.SharedTokenCacheTenantId, options.SharedTokenCacheUsername);
    }

    if (!options.ExcludeVisualStudioCredential)
    {
        chain[i++] = factory.CreateVisualStudioCredential(options.VisualStudioTenantId);
    }

    if (!options.ExcludeVisualStudioCodeCredential)
    {
        chain[i++] = factory.CreateVisualStudioCodeCredential(options.VisualStudioCodeTenantId);
    }

    if (!options.ExcludeAzureCliCredential)
    {
        chain[i++] = factory.CreateAzureCliCredential();
    }

    if (!options.ExcludeInteractiveBrowserCredential)
    {
        chain[i++] = factory.CreateInteractiveBrowserCredential(options.InteractiveBrowserTenantId);
    }

    if (i == 0)
    {
        throw new ArgumentException("At least one credential type must be included in the authentication flow.", nameof(options));
    }

    return chain;
}

If you now introduce a retry mechanism in ImdsManagedIdentitySource, the ManagedIdentityCredential.GetTokenAsync() method would block until the timeouts and retries of ImdsManagedIdentitySource are exhausted. And this would happen on every call, even if the IMDS endpoint will never become available, which is quite often the case when using DefaultAzureCredential.
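
(As an aside, callers who already know that IMDS will never be available can skip the managed identity step entirely via the options shown above; a minimal sketch:)

using Azure.Identity;

// Minimal sketch: skip ManagedIdentityCredential (and hence the IMDS probe)
// when the application knows it will never run with a managed identity.
var options = new DefaultAzureCredentialOptions
{
    ExcludeManagedIdentityCredential = true
};
var credential = new DefaultAzureCredential(options);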

@christothes
Member

@PSanetra Thanks for the clarification. I did a bit of research on this one and discovered that we've encountered this issue before and created an example pod spec that works around it without having to change the default behavior.

Would this work for your scenario? https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/identity/azure-identity/tests/pod-identity/test-pod-identity/templates/job.yaml#L23

@PSanetra
Author

PSanetra commented Feb 1, 2021

@christothes hmm, yes, I guess this would work, although we are currently using a different workaround: retrying the token request with new ManagedIdentityCredential instances.

I think this should be fixed in the library. The PR I have submitted should at least be a significant improvement over the current behavior, as it makes it possible to retry getting the access token with the same ManagedIdentityCredential instance.
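
For reference, the fresh-instance workaround looks roughly like this (a minimal sketch; the attempt count and delay are arbitrary, not a recommendation):

using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Identity;

// Sketch of our current workaround: because a failed ManagedIdentityCredential
// instance never recovers, create a fresh instance for every retry.
// The 10 attempts / 3 second delay are arbitrary values.
static async Task<AccessToken> GetTokenWithFreshCredentialAsync(
    TokenRequestContext context, CancellationToken ct = default)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            var credential = new ManagedIdentityCredential();
            return await credential.GetTokenAsync(context, ct).ConfigureAwait(false);
        }
        catch (CredentialUnavailableException) when (attempt < 10)
        {
            // IMDS is often not reachable for the first few seconds of the pod's life.
            await Task.Delay(TimeSpan.FromSeconds(3), ct).ConfigureAwait(false);
        }
    }
}

It can be called with, for example, new TokenRequestContext(new[] { "https://vault.azure.net/.default" }).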

@christothes
Member

The only concern with the fix proposed in the PR is that it could also introduce a retry for ImdsManagedIdentitySource on every GetToken request as part of a ChainedTokenCredential, which doesn't cache the first successful credential in the chain.

Would it be possible for you to try the proposed workaround?

@PSanetra
Author

PSanetra commented Feb 2, 2021

@christothes I guess the proposed workaround could work, but I prefer to continue using our current workaround approach.

I think retrying to get a token via ImdsManagedIdentitySource is actually what I would have expected from ChainedTokenCredential. I guess most of the time the check for the endpoint will not even wait for the timeout; instead it will probably receive a connection reset very quickly. Edit: It will always wait for the timeout if there is no host listening on the IP, of course.

@PSanetra
Author

PSanetra commented Feb 2, 2021

@christothes Maybe it would be a good idea to implement the retry strategy in the ManagedIdentityClient in such a way that it performs the next check only after the interval specified in https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#retry-guidance

After 5 retries it would never retry again. This way some of the first GetTokenAsync() calls may be slow, but after a while it would no longer check whether the IMDS endpoint is available, as it is very unlikely that the endpoint becomes available after such a long time.
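
A rough sketch of what I mean (purely illustrative; the class and member names are made up, thread safety is omitted, and the 1s/2s/4s intervals are just my reading of the linked guidance):

using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of the suggested behavior inside the client:
// probe IMDS again on later GetToken calls, but only after a growing delay,
// and cache the "unavailable" result permanently after a fixed number of probes.
class ImdsAvailabilityGate
{
    private const int MaxProbes = 5;                        // from the suggestion above
    private int _probeCount;
    private DateTimeOffset _nextProbeAllowed = DateTimeOffset.MinValue;
    private bool? _available;                               // null = still unknown

    public async Task<bool> IsAvailableAsync(Func<CancellationToken, Task<bool>> probe, CancellationToken ct)
    {
        if (_available.HasValue)
        {
            return _available.Value;                        // decision already cached
        }
        if (DateTimeOffset.UtcNow < _nextProbeAllowed)
        {
            return false;                                   // too early to probe again; fail fast
        }

        var reachable = await probe(ct).ConfigureAwait(false);
        if (reachable)
        {
            _available = true;                              // cache success permanently
        }
        else if (++_probeCount >= MaxProbes)
        {
            _available = false;                             // give up for the process lifetime
        }
        else
        {
            // exponential backoff between probes: 1s, 2s, 4s, ...
            _nextProbeAllowed = DateTimeOffset.UtcNow.AddSeconds(Math.Pow(2, _probeCount - 1));
        }
        return reachable;
    }
}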

@christothes
Member

@PSanetra - Initially, I had the same thought, but since the retry behavior has other implications in the context of chained credentials, I think it is safest to work around it either with the pod workaround or by retrying manually with a new instance of DefaultAzureCredential.

@bohlenc

bohlenc commented Feb 19, 2021

@PSanetra Thanks for the clarification. I did a bit of research on this one and discovered that we've encountered this issue before and created an example pod spec that works around it without having to change the default behavior.

Would this work for your scenario? https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/identity/azure-identity/tests/pod-identity/test-pod-identity/templates/job.yaml#L23

Hi,
in case anyone faces the same problem and wants or needs to implement the workaround for a number of deployments without touching each of them directly, I implemented a mutating webhook that can be configured accordingly: https://github.com/bohlenc/k8s-pod-mutator-webhook (see Examples)

@deyanp

deyanp commented Sep 27, 2021

Please note that the initContainers sample

https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/identity/azure-identity/tests/pod-identity/test-pod-identity/templates/job.yaml#L23

must be extended with

&mi_res_id=pod-identity-test

otherwise it will check for any managed identity (there might be others on the VM) ...

@github-actions github-actions bot locked and limited conversation to collaborators Mar 28, 2023