[BUG] ManagedIdentityCredential fails sometimes in AKS #18312

Closed
PSanetra opened this issue Feb 1, 2021 · 15 comments
Assignees
Labels

  • Azure.Identity
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@PSanetra

PSanetra commented Feb 1, 2021

Describe the bug
We are using the ManagedIdentityCredential class to get access tokens for managed identities. We are deploying the application to AKS, where we are using https://github.com/Azure/aad-pod-identity.

The instance of the ManagedIdentityCredential is a singleton.

Sometimes after starting a new pod, we get the exception Azure.Identity.CredentialUnavailableException: ManagedIdentityCredential authentication unavailable. No Managed Identity endpoint found. every time the pod tries to get an access token. If the pod manages to start without this exception, the exception is never observed during the lifetime of the pod.

The problem seems to be that there can be a delay after pod startup before the IMDS endpoint becomes available to the pod in AKS. When the pod tries to get an access token before the endpoint is available, the credential ends up in a bad state from which it can never recover.

The cause of the issue is probably this code:

private protected virtual async ValueTask<ManagedIdentitySource> GetManagedIdentitySourceAsync(bool async, CancellationToken cancellationToken)
{
    using var asyncLock = await _identitySourceAsyncLock.GetLockOrValueAsync(async, cancellationToken).ConfigureAwait(false);
    if (asyncLock.HasValue)
    {
        return asyncLock.Value;
    }

    ManagedIdentitySource identitySource = AppServiceV2017ManagedIdentitySource.TryCreate(_options) ??
        CloudShellManagedIdentitySource.TryCreate(_options) ??
        AzureArcManagedIdentitySource.TryCreate(_options) ??
        ServiceFabricManagedIdentitySource.TryCreate(_options) ??
        await ImdsManagedIdentitySource.TryCreateAsync(_options, async, cancellationToken).ConfigureAwait(false);

    asyncLock.SetValue(identitySource);
    return identitySource;
}

The ManagedIdentityClient tries several strategies to get a ManagedIdentitySource. If all of them fail, it caches null in _identitySourceAsyncLock and will therefore never try to resolve the ManagedIdentitySource again.

Expected behavior
The exception should not occur, or it should be possible to recover from this failed state once the IMDS endpoint becomes available.

Actual behavior
The exception occurs and it is not possible to recover from the failed state.

To Reproduce

  1. Deploy AKS
  2. Deploy the aad-pod-identity services
  3. Deploy a pod using the ManagedIdentityCredentials
  4. Pray for observing the exception
  5. Observe the exception

Environment:

  • AKS 1.17.11
  • AAD Pod Identity 1.7.1
  • Azure.Identity 1.2.3
@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Feb 1, 2021
PSanetra added a commit to PSanetra/azure-sdk-for-net that referenced this issue Feb 1, 2021
@jsquire jsquire added Azure.Identity Client This issue points to a problem in the data-plane of the library. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Feb 1, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Feb 1, 2021
@jsquire
Member

jsquire commented Feb 1, 2021

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@christothes
Member

Hi @PSanetra - In your testing, how long does it take the IMDS endpoint to become available? The reason I ask is that I am wondering if changing to a less aggressive timeout would be a better way to approach this issue.

We currently time out after 1 second; however, the recommendation is to retry in the event of a timeout.
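
(For context, the availability check is conceptually just an HTTP request to the well-known IMDS address with a short timeout. The sketch below is only an illustration of that idea, not the SDK's actual probe code:)

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: roughly what "is the IMDS endpoint reachable?" means.
// The real probe in Azure.Identity differs in detail (headers, status handling).
static async Task<bool> IsImdsReachableAsync()
{
    using var http = new HttpClient();
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(1)); // ~1 second probe timeout
    try
    {
        await http.GetAsync(
            "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01",
            cts.Token).ConfigureAwait(false);
        return true; // any response means something is listening on the IMDS address
    }
    catch (OperationCanceledException)
    {
        return false; // no response within the timeout
    }
    catch (HttpRequestException)
    {
        return false; // connection failed outright
    }
}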

@PSanetra
Author

PSanetra commented Feb 1, 2021

Hi @christothes, I am not sure how long it takes for the endpoint to become available in these problematic edge cases. My gut feeling is that the pods currently start in a correct state about 7 out of 10 times.

I think most of the time the first attempt to get an access token is triggered by a readiness probe, and our readiness probes are configured to run after an initial delay of 5 seconds.

@christothes
Member

It seems like either your initialization code would need to implement a retry or ImdsManagedIdentitySource could just handle that. I think my preference would be to have ImdsManagedIdentitySource handle the retry.

@PSanetra
Author

PSanetra commented Feb 1, 2021

@christothes I am not sure if a less aggressive timeout would help. I think it would be better to quickly throw an exception and let the application decide to retry.

It might cause some weird trouble with ChainedTokenCredential if ImdsManagedIdentitySource handled the retry, as it would block trying other TokenCredentials, wouldn't it?

I think for our services it would not be a big problem to retry, as they wouldn't report as ready until an access token could be obtained for the first time.

@christothes
Member

@christothes I am not sure if a less aggressive timeout would help. I think it would be better to quickly throw an exception and let the application decide to retry.

It might cause some weird trouble with ChainedTokenCredential if ImdsManagedIdentitySource handled the retry, as it would block trying other TokenCredentials, wouldn't it?

I think for our services it would not be a big problem to retry, as they wouldn't report as ready until an access token could be obtained for the first time.

Could you expand a bit more on the scenario where the retry policy would block trying other TokenCredentials? Since ImdsManagedIdentitySource only gets invoked when all other request mechanisms fail, and other code paths requiring ImdsManagedIdentitySource would be blocked either way, I'm wondering how this would occur. There may be a scenario here I am unfamiliar with.

ManagedIdentitySource identitySource = AppServiceV2017ManagedIdentitySource.TryCreate(_options) ??
    CloudShellManagedIdentitySource.TryCreate(_options) ??
    AzureArcManagedIdentitySource.TryCreate(_options) ??
    ServiceFabricManagedIdentitySource.TryCreate(_options) ??
    await ImdsManagedIdentitySource.TryCreateAsync(_options, async, cancellationToken).ConfigureAwait(false);

@PSanetra
Author

PSanetra commented Feb 1, 2021

@christothes The scenario I had in mind was using DefaultAzureCredential.

So when I try to get an access token via a DefaultAzureCredential, it tries to get a token via EnvironmentCredential, ManagedIdentityCredential, SharedTokenCacheCredential, and so on.

private static TokenCredential[] GetDefaultAzureCredentialChain(DefaultAzureCredentialFactory factory, DefaultAzureCredentialOptions options)
{
    if (options is null)
    {
        return s_defaultCredentialChain;
    }

    int i = 0;
    TokenCredential[] chain = new TokenCredential[7];

    if (!options.ExcludeEnvironmentCredential)
    {
        chain[i++] = factory.CreateEnvironmentCredential();
    }

    if (!options.ExcludeManagedIdentityCredential)
    {
        chain[i++] = factory.CreateManagedIdentityCredential(options.ManagedIdentityClientId);
    }

    if (!options.ExcludeSharedTokenCacheCredential)
    {
        chain[i++] = factory.CreateSharedTokenCacheCredential(options.SharedTokenCacheTenantId, options.SharedTokenCacheUsername);
    }

    if (!options.ExcludeVisualStudioCredential)
    {
        chain[i++] = factory.CreateVisualStudioCredential(options.VisualStudioTenantId);
    }

    if (!options.ExcludeVisualStudioCodeCredential)
    {
        chain[i++] = factory.CreateVisualStudioCodeCredential(options.VisualStudioCodeTenantId);
    }

    if (!options.ExcludeAzureCliCredential)
    {
        chain[i++] = factory.CreateAzureCliCredential();
    }

    if (!options.ExcludeInteractiveBrowserCredential)
    {
        chain[i++] = factory.CreateInteractiveBrowserCredential(options.InteractiveBrowserTenantId);
    }

    if (i == 0)
    {
        throw new ArgumentException("At least one credential type must be included in the authentication flow.", nameof(options));
    }

    return chain;
}

If you now introduce a retry mechanism in ImdsManagedIdentitySource, the ManagedIdentityCredential.GetTokenAsync() method would block until the timeouts and retries of ImdsManagedIdentitySource are exhausted. And this would happen on every call, even if the IMDS endpoint will never become available, which is quite often the case when using DefaultAzureCredential.
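
(As an aside, callers who already know that IMDS will never be available can skip the managed identity step entirely via the options shown above; a minimal sketch:)

using Azure.Identity;

// Minimal sketch: skip ManagedIdentityCredential (and hence the IMDS probe)
// when the application knows it will never run with a managed identity.
var options = new DefaultAzureCredentialOptions
{
    ExcludeManagedIdentityCredential = true
};
var credential = new DefaultAzureCredential(options);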

@christothes
Member

@PSanetra Thanks for the clarification. I did a bit of research on this one and discovered that we've encountered this issue before and created an example pod spec that works around it without having to change the default behavior.

Would this work for your scenario? https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/identity/azure-identity/tests/pod-identity/test-pod-identity/templates/job.yaml#L23

@PSanetra
Author

PSanetra commented Feb 1, 2021

@christothes hmm, yes, I guess this would work, although we are currently using a different workaround: retrying the token request with new ManagedIdentityCredential instances.

I think this should be fixed in the library. The PR I have submitted should at least be a significant improvement over the current behavior, as it makes it possible to retry getting the access token with the same ManagedIdentityCredential instance.
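
For reference, the fresh-instance workaround looks roughly like this (a minimal sketch; the attempt count and delay are arbitrary, not a recommendation):

using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Identity;

// Sketch of our current workaround: because a failed ManagedIdentityCredential
// instance never recovers, create a fresh instance for every retry.
// The 10 attempts / 3 second delay are arbitrary values.
static async Task<AccessToken> GetTokenWithFreshCredentialAsync(
    TokenRequestContext context, CancellationToken ct = default)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            var credential = new ManagedIdentityCredential();
            return await credential.GetTokenAsync(context, ct).ConfigureAwait(false);
        }
        catch (CredentialUnavailableException) when (attempt < 10)
        {
            // IMDS is often not reachable for the first few seconds of the pod's life.
            await Task.Delay(TimeSpan.FromSeconds(3), ct).ConfigureAwait(false);
        }
    }
}

It can be called with, for example, new TokenRequestContext(new[] { "https://vault.azure.net/.default" }).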

@christothes
Member

The only concern with the fix proposed in the PR is that it could also introduce a retry for ImdsManagedIdentitySource on every GetToken request as part of a ChainedTokenCredential, which doesn't cache the first successful credential in the chain.

Would it be possible for you to try the proposed workaround?

@PSanetra
Author

PSanetra commented Feb 2, 2021

@christothes I guess the proposed workaround could work, but I prefer to continue using our current workaround approach.

I think retrying to get a token via ImdsManagedIdentitySource is actually what I would have expected from ChainedTokenCredential. I guess most of the time the check for the endpoint will not even wait for the timeout; instead it will probably receive a connection reset very quickly. Edit: It will always wait for the timeout if there is no host listening on the IP, of course.

@PSanetra
Author

PSanetra commented Feb 2, 2021

@christothes Maybe it would be a good idea to implement the retry strategy in the ManagedIdentityClient in such a way that it performs the next check only after the interval specified in https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#retry-guidance

After 5 retries it would never retry again. This way some of the first GetTokenAsync() calls may be slow, but after a while it would no longer check whether the IMDS endpoint is available, as it is very unlikely that the endpoint becomes available after such a long time.
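
A rough sketch of what I mean (purely illustrative; the class and member names are made up, thread safety is omitted, and the 1s/2s/4s intervals are just my reading of the linked guidance):

using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of the suggested behavior inside the client:
// probe IMDS again on later GetToken calls, but only after a growing delay,
// and cache the "unavailable" result permanently after a fixed number of probes.
class ImdsAvailabilityGate
{
    private const int MaxProbes = 5;                        // from the suggestion above
    private int _probeCount;
    private DateTimeOffset _nextProbeAllowed = DateTimeOffset.MinValue;
    private bool? _available;                               // null = still unknown

    public async Task<bool> IsAvailableAsync(Func<CancellationToken, Task<bool>> probe, CancellationToken ct)
    {
        if (_available.HasValue)
        {
            return _available.Value;                        // decision already cached
        }
        if (DateTimeOffset.UtcNow < _nextProbeAllowed)
        {
            return false;                                   // too early to probe again; fail fast
        }

        var reachable = await probe(ct).ConfigureAwait(false);
        if (reachable)
        {
            _available = true;                              // cache success permanently
        }
        else if (++_probeCount >= MaxProbes)
        {
            _available = false;                             // give up for the process lifetime
        }
        else
        {
            // exponential backoff between probes: 1s, 2s, 4s, ...
            _nextProbeAllowed = DateTimeOffset.UtcNow.AddSeconds(Math.Pow(2, _probeCount - 1));
        }
        return reachable;
    }
}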

@christothes
Member

@PSanetra - Initially, I had the same thought, but since the retry behavior has other implications in the context of chained credentials, I think it is safest to work around it either with the pod workaround or by retrying manually with a new instance of DefaultAzureCredential.

@bohlenc

bohlenc commented Feb 19, 2021

@PSanetra Thanks for the clarification. I did a bit of research on this one and discovered that we've encountered this issue before and created an example pod spec that works around it without having to change the default behavior.

Would this work for your scenario? https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/identity/azure-identity/tests/pod-identity/test-pod-identity/templates/job.yaml#L23

Hi,
in case anyone faces the same problem and wants or needs to implement the workaround for a number of deployments without touching each of them directly, I implemented a mutating webhook that can be configured accordingly: https://github.com/bohlenc/k8s-pod-mutator-webhook (see Examples)

@deyanp

deyanp commented Sep 27, 2021

Please note that the initContainers sample

https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/identity/azure-identity/tests/pod-identity/test-pod-identity/templates/job.yaml#L23

must be extended with

&mi_res_id=pod-identity-test

otherwise it will check for any managed identity (there might be others on the VM) ...

@github-actions github-actions bot locked and limited conversation to collaborators Mar 28, 2023