Skip to content
This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

Sometimes mic does not assign identity in time #279

Closed
omerlh opened this issue Jul 11, 2019 · 13 comments
Closed

Sometimes mic does not assign identity in time #279

omerlh opened this issue Jul 11, 2019 · 13 comments
Labels
bug Something isn't working

Comments

@omerlh
Copy link

omerlh commented Jul 11, 2019

Describe the bug
I deployed an application that is using pod identity to access to Azure. Last night I noticed errors when the app tried to request access tokens:

"Timestamp":"2019-07-10T21:02:54.4265873+00:00",
"Exception": "
Microsoft.Azure.Services.AppAuthentication.AzureServiceTokenProviderException: 
Parameters: 
  Connection String: RunAs=App;AppId=<>, 
  Resource: https://vault.azure.net, Authority: https://login.windows.net/<>. 
Exception Message: Tried to get token using Managed Service Identity. Access token could not be acquired. Failed after 5 retries. 
MSI ResponseCode: Forbidden, Response: no AzureAssignedIdentity found for pod:<>/<>
  at Microsoft.Azure.Services.AppAuthentication.AzureServiceTokenProvider.GetAuthResultAsyncImpl(String authority, String resource, String scope, CancellationToken cancellationToken)
   at Microsoft.Azure.Services.AppAuthentication.AzureServiceTokenProvider.<get_KeyVaultTokenCallback>b__8_0(String authority, String resource, String scope)
   at Microsoft.Azure.KeyVault.KeyVaultCredential.PostAuthenticate(HttpResponseMessage response)
   at Microsoft.Azure.KeyVault.KeyVaultCredential.ProcessHttpRequestAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Microsoft.Azure.KeyVault.KeyVaultClient.GetKeysWithHttpMessagesAsync(String vaultBaseUrl, Nullable`1 maxresults, Dictionary`2 customHeaders, CancellationToken cancellationToken)
   at Microsoft.Azure.KeyVault.KeyVaultClientExtensions.GetKeysAsync(IKeyVaultClient operations, String vaultBaseUrl, Nullable`1 maxresults, CancellationToken cancellationToken)
   at <my code starts here>|

I looked at mic logs and I noticed the following:

I0710 21:02:58.287747       1 crd.go:174] Got id <same pod as my code> to assign
I0710 21:02:58.287771       1 crd.go:178] Creating assigned Id: <same pod as my code>
I0710 21:03:36.456848       1 event.go:218] Event(v1.ObjectReference{Kind:"AzureIdentityBinding", Namespace:"<>", Name:"<>", UID:"fc80c914-a2d8-11e9-b880-000d3a1bb773", APIVersion:"aadpodidentity.k8s.io/v1", ResourceVersion:"45967126", FieldPath:""}): type: 'Normal' reason: 'binding applied' Binding <> applied on node <> for pod <>

Look like mic took a bit more time to assign the identity - so I would like to ask:

  • Is this expected?
  • What would you recommend me to do? One solution could be an init container that wait for the assigned identity to create, what do you think about that?
    --

Steps To Reproduce
Deploy AAD pod identity, and an app to consume it.

Expected behavior
Identity assigned before the pod starts.

AAD Pod Identity version
mcr.microsoft.com/k8s/aad-pod-identity/mic:1.3

Kubernetes version
1.12.7
Additional context

@omerlh omerlh added the bug Something isn't working label Jul 11, 2019
@kkmsft
Copy link
Contributor

kkmsft commented Jul 11, 2019

@omerlh -Its not expected, specifically if have single identity in picture. How many identities are being handled in this scenario ?

Typical issues seen are due to cloud operations being slow. At end of every cycle, MIC prints the stats. Can you please post those stats to verify which part of the operation has taken more than expected time ?

Init container is a good approach. There are some customers who have retries in built into the code to retry for longer periods of time.

We are making more enhancements in the upcoming version to handle these scenarios

  • per node assignment and parallelization which reduces the overall assignment times
  • retry in NMI to wait for the assignment to complete before returning, to improve application resiliency.
  • handling cloud errors more pro actively to retry before waiting for long time.

@omerlh
Copy link
Author

omerlh commented Jul 11, 2019

That sounds good. I would expect this to be solved by the provider - e.g. either in NMI or in Azure AppAuthentication library itself. For some reason, I don't see the stats in the logs, but I'll look it up next time we'll see it.

@markusdresch
Copy link

i experience the very same issue most of the time when several pods are started at once.

@aramase
Copy link
Member

aramase commented Jul 20, 2019

@markusdresch I've a few questions -

  1. How many nodes do you have in your cluster?
  2. How many pods are being deployed at the same time?
  3. Are all these pods using the same azure identity and binding or does each pod have it's own identity and binding?
  4. How much delay are you observing after deployment?

The current releases process pod bursts serially which is the cause of delays. For the upcoming release, we have made enhancements to batch process burst jobs to reduce the delay. Cloud operations are typically time consuming. You can view the stats in the mic logs which will provide information about how long the cloud operations take.

@markusdresch
Copy link

@aramase:

  1. for now 3 to 5 nodes.
  2. when deploying a new cluster, about 30 or more pods are started at the same time. when developing on the cluster, maybe 2 to 4, sometimes more pods are started.
  3. all use the same binding.
  4. when i'm deploying a new cluster with 30 pods, it takes about one hour to have all pods up and running. the pods run different dotnet core apps that connect to a key vault on startup to get their config.

looking forward to the next release. any ETA?

@aramase
Copy link
Member

aramase commented Jul 22, 2019

@markusdresch Thank you for the information. We are planning for a release around July end/early August as we wrap up the project items in https://github.com/Azure/aad-pod-identity/milestone/1. We are planning to release an RC version from the latest master at the end of this week. Will update this thread with the information on running those images when they are available.

@markusdresch
Copy link

1.5-rc2 looks good so far.

@markusdresch
Copy link

markusdresch commented Jul 29, 2019

there's still the odd pod that does not get an identity on first try when creating multiple pods at once, but from what i've seen they are few and recover on first retry. i'm very happy with this version.

update: most of the pods that still crashlooped simply started before MIC was ready, so it's all good.

@kkmsft
Copy link
Contributor

kkmsft commented Jul 29, 2019

Thank you, @markusdresch . For checking the readiness you can rely on the contents of /healthz of the MIC & NMI pod. It will show “Active” on one of the MIC(the one which shows leader election) and on all NMI pods, when they are ready to take traffic. More details of how to do this(programmatic way) can be found in the e2e test here:

It("should pass liveness probe test", func() {

@kkmsft
Copy link
Contributor

kkmsft commented Aug 19, 2019

@markusdresch - Thank you for the update. Please let us know if 1.5 series (latest was 1.5.2) continues to work as you are expecting.

@omerlh - are you still facing this issue with 1.5.2 ?

@markusdresch
Copy link

@kkmsft looks good to me

@kkmsft
Copy link
Contributor

kkmsft commented Aug 19, 2019

Thank you - @markusdresch

@kkmsft
Copy link
Contributor

kkmsft commented Sep 4, 2019

With 1.5.2 and improvements on retrying token request calls, closing this issue. Please reopen if its still an issue.

@kkmsft kkmsft closed this as completed Sep 4, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants