[BUG] ManagedIdentityCredential fails sometimes in AKS #18312
Comments
Thank you for your feedback. Tagging and routing to the team member best able to assist. |
Hi @PSanetra - In your testing, how long does it take the IMDS endpoint to become available? I ask because I am wondering if a less aggressive timeout would be a better way to approach this issue. We are currently set to time out after 1 second. However, the recommendation is to retry in the event of a timeout. |
Hi @christothes, I am not sure how long the endpoint needs to become available in these problematic edge cases. By gut feeling I would say that the pods currently start in a correct state about 7 out of 10 times. I think most of the time the first attempt to get an access token is triggered by a readiness probe, and our readiness probes are configured to run after an initial delay of 5 seconds. |
It seems like either your initialization code would need to implement a retry or |
@christothes I am not sure if a less aggressive timeout would help. I think it would be better to quickly throw an exception and let the application decide to retry. It might cause some weird trouble with I think for our services it would not be a big problem to retry as they wouldn't report as ready until it was possible to get an access token for the first time. |
Could you expand a bit more on the scenario where the retry policy would block trying other TokenCredentials? Since azure-sdk-for-net/sdk/identity/Azure.Identity/src/ManagedIdentityClient.cs Lines 59 to 63 in f611aed
|
@christothes The scenario I had in mind was using So when I would try to get an access token via an azure-sdk-for-net/sdk/identity/Azure.Identity/src/DefaultAzureCredential.cs Lines 169 to 220 in a79bd10
When you are now introducing a retry mechanism in the |
@PSanetra Thanks for the clarification. I did a bit of research on this one and I discovered that we've encountered this issue before and created an example workaround pod that works around this issue without having to consider changes to the default behavior. Would this work for your scenario? https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/identity/azure-identity/tests/pod-identity/test-pod-identity/templates/job.yaml#L23 |
@christothes hmm, yes I guess this would work, although we are currently using a different workaround by retrying getting the access token with new I think this should be fixed in the library. The PR I have submitted should at least be a significant improvement over the current behavior, as it makes it possible to retry getting the access token with the same |
The only concern with the fix proposed in the PR is that it could also introduce a retry for Would it be possible for you to try the proposed workaround? |
@christothes I guess the proposed workaround could work, but I prefer to continue to use our current workaround approach. I think retrying to get a Token via |
@christothes Maybe it would be a good idea to implement the retry strategy in the After 5 retries it may never retry again. This way it may be possible that some of the first |
@PSanetra - Initially, I had the same thought, but since the retry behavior has other implications in the context of chained credentials, I think it is safest to work around it either with the pod workaround, or by retrying manually with a new instance of |
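For readers who want to see what the "retry manually with a new instance" workaround might look like, here is a minimal sketch. It assumes the credential being recreated is ManagedIdentityCredential (the class named in the issue description) and that the failed endpoint lookup is cached per instance, as the thread suggests. The helper name WaitForManagedIdentityAsync, the attempt count, and the delay are hypothetical and not part of Azure.Identity.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Identity;

public static class ManagedIdentityStartup
{
    // Hypothetical helper (not an Azure.Identity API): probes for a token
    // with a fresh credential on every attempt and returns the first
    // instance that succeeds.
    public static async Task<TokenCredential> WaitForManagedIdentityAsync(
        string[] scopes,
        int maxAttempts = 5,            // assumed value, tune for your cluster
        TimeSpan? delay = null,         // assumed default of 2 seconds below
        CancellationToken cancellationToken = default)
    {
        TimeSpan wait = delay ?? TimeSpan.FromSeconds(2);
        var context = new TokenRequestContext(scopes);

        for (int attempt = 1; ; attempt++)
        {
            // A new instance per attempt, so a failed endpoint lookup
            // remembered by an earlier instance cannot stick.
            var credential = new ManagedIdentityCredential();
            try
            {
                await credential.GetTokenAsync(context, cancellationToken);
                return credential; // safe to register this one as the singleton
            }
            catch (CredentialUnavailableException) when (attempt < maxAttempts)
            {
                // Endpoint not reachable yet: wait, then try a new instance.
                await Task.Delay(wait, cancellationToken);
            }
        }
    }
}
```

A service could call this once during startup, for example with a scope such as "https://management.azure.com/.default", and report ready only after it returns. On the last attempt the exception filter no longer matches, so the CredentialUnavailableException propagates to the caller.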
Hi, |
Please note that the initContainers sample must be extended with
otherwise it will check for any managed identity (there might be others on VM) ... |
Describe the bug
We are using the ManagedIdentityCredential class to get access tokens for managed identities. We are deploying the application to AKS, where we are using https://github.com/Azure/aad-pod-identity. The instance of the ManagedIdentityCredential is a singleton.
Sometimes after starting a new pod, we get the exception
Azure.Identity.CredentialUnavailableException: ManagedIdentityCredential authentication unavailable. No Managed Identity endpoint found.
every time the pod tries to get the access token. If the pod is able to start without this exception, the exception is never observed during the lifetime of the pod.
The problem seems to be that there may be a delay after pod startup before the IMDS endpoint becomes available to the pod in AKS. If the pod tries to get the access token before the endpoint is available, it enters a bad state from which it can never recover.
The cause of the issue is probably this code:
azure-sdk-for-net/sdk/identity/Azure.Identity/src/ManagedIdentityClient.cs
Lines 51 to 67 in 705f329
The ManagedIdentityClient tries several strategies to get a ManagedIdentitySource. If all of them fail, it sets the value of _identitySourceAsyncLock to null and will therefore never try to resolve the ManagedIdentitySource again.
Expected behavior
The exception should not occur, or it should be possible to recover from this failed state once the IMDS endpoint becomes available.
Actual behavior
The exception occurs and it is not possible to recover from the failed state.
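The sticky failure described above can be illustrated with a simplified model. This is not the Azure.Identity source: CachedSourceResolver and its probe delegate are hypothetical stand-ins, and the sketch only shows the general pattern of resolving a source once and caching the result even when the result is "nothing found".

```csharp
using System;

// Hypothetical model of a resolve-once cache; not the library's code.
public sealed class CachedSourceResolver
{
    private readonly Func<string> _probe; // returns null while no endpoint is reachable
    private bool _resolved;
    private string _source;

    public CachedSourceResolver(Func<string> probe) => _probe = probe;

    public string GetSource()
    {
        if (!_resolved)
        {
            _source = _probe(); // null if the endpoint is not up yet
            _resolved = true;   // the failed lookup is cached like a success
        }

        return _source
            ?? throw new InvalidOperationException("No Managed Identity endpoint found.");
    }
}

public static class Demo
{
    public static void Main()
    {
        bool endpointUp = false;
        var resolver = new CachedSourceResolver(() => endpointUp ? "imds" : null);

        // First call: the endpoint is not available yet, so resolution fails.
        try { resolver.GetSource(); }
        catch (InvalidOperationException e) { Console.WriteLine(e.Message); }

        endpointUp = true; // the endpoint becomes available later

        // Second call: still fails, because the probe is never run again.
        try { resolver.GetSource(); }
        catch (InvalidOperationException e) { Console.WriteLine(e.Message); }
    }
}
```

In this model both calls fail even though the endpoint comes up in between, which matches the reported symptom: a singleton that raced the endpoint at startup never recovers. Caching only successful lookups would let a later call probe again.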
To Reproduce
Environment: