
Application Default Credentials not available even if running on Google Compute Engine #1807

Closed
flagbug opened this issue Mar 23, 2021 · 15 comments

@flagbug

flagbug commented Mar 23, 2021

So this is a weird one. Our application is running in a managed instance group inside a Docker container on Google Compute Engine, on multiple machines, running in the region us-east4.

Sometimes when we are deploying and the managed instance group starts replacing our instances, we're receiving the following InvalidOperationException from the new instances.

One or more errors occurred. (The Application Default Credentials are not available. 
They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. 
See https://developers.google.com/accounts/docs/application-default-credentials for more information.) 
The Application Default Credentials are not available. They are available if running in Google Compute Engine. 
Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. 
See https://developers.google.com/accounts/docs/application-default-credentials for more information. 

Now the thing is that for some reason this issue starts to go away after a couple of minutes and the Google Cloud Client library suddenly starts finding the default credentials on the Compute Engine machine.

We're using Firebase authentication and the code for initializing the Firebase Admin SDK in our ASP.NET Core 5 application looks like this:

services.AddSingleton<FirebaseApp>(options =>
{
    return FirebaseApp.Create(new AppOptions
    {
        Credential = GoogleCredential.GetApplicationDefault(),
        ProjectId = "the_project_id"
    });
});
services.AddSingleton<FirebaseAuth>(options =>
{
    var defaultFirebaseApp = options.GetRequiredService<FirebaseApp>();
    return FirebaseAuth.GetAuth(defaultFirebaseApp);
});

When a controller is invoked that requires an instance of FirebaseAuth, we're getting an exception with a stack trace like this:

System.AggregateException:
   at System.Threading.Tasks.Task`1.GetResultCore (System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Google.Apis.Auth.OAuth2.GoogleCredential.GetApplicationDefault (Google.Apis.Auth, Version=1.50.0.0, Culture=neutral, PublicKeyToken=4b01fa6e34db77ab)
   at Alexandria.Startup+<>c.<ConfigureServices>b__9_0 (Alexandria, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null) in /app/src/Alexandria/Startup.cs:line 174
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitFactory (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitCache (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.Resolve (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.DynamicServiceProviderEngine+<>c__DisplayClass1_0.<RealizeService>b__0 (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetRequiredService (Microsoft.Extensions.DependencyInjection.Abstractions, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetRequiredService (Microsoft.Extensions.DependencyInjection.Abstractions, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Alexandria.Startup+<>c.<ConfigureServices>b__9_1 (Alexandria, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null) in /app/src/Alexandria/Startup.cs:line 177
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitFactory (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitCache (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitCache (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitScopeCache (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitCache (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitScopeCache (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.Resolve (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.DynamicServiceProviderEngine+<>c__DisplayClass1_0.<RealizeService>b__0 (Microsoft.Extensions.DependencyInjection, Version=5.0.0.1, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ActivatorUtilities.GetService (Microsoft.Extensions.DependencyInjection.Abstractions, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at lambda_method292 (Anonymously Hosted DynamicMethods Assembly, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null)
   at Microsoft.AspNetCore.Mvc.Controllers.ControllerFactoryProvider+<>c__DisplayClass5_0.<CreateControllerFactory>g__CreateController|0 (Microsoft.AspNetCore.Mvc.Core, Version=5.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next (Microsoft.AspNetCore.Mvc.Core, Version=5.0.0.0, 

--- *snip a bunch of middleware*---

Inner exception System.InvalidOperationException handled at System.Threading.Tasks.Task`1.GetResultCore:
   at Google.Apis.Auth.OAuth2.DefaultCredentialProvider+<CreateDefaultCredentialAsync>d__11.MoveNext (Google.Apis.Auth, Version=1.50.0.0, Culture=neutral, PublicKeyToken=4b01fa6e34db77ab)

To me it really doesn't make sense that the Compute Engine metadata isn't available right after starting the application but suddenly becomes available after a few minutes. I found an issue that looks very similar here: googleapis/google-cloud-dotnet#1394, but apparently that one was resolved inside the library itself, by switching from a URL to an IP address for the metadata server.

Is there anything dumb that I'm doing wrong here when initializing the GoogleCredential, or is this some kind of deeper issue? Anything I can do to help debug or resolve this problem? Unfortunately I have no idea how to reproduce it: it happens sporadically after deploying and then resolves itself after a short time.

@jskeet jskeet self-assigned this Mar 23, 2021
@jskeet jskeet added the type: question Request for information or clarification. Not an issue. label Mar 23, 2021
@jskeet
Collaborator

jskeet commented Mar 24, 2021

A few minutes is definitely longer than I'd expect... and no, you're not doing anything wrong in your initialization of the GoogleCredential as far as I can see.

We're aware that it can sometimes take a while for the metadata server to become available. We were at one point looking at making the timeout configurable, and we can feed your issue into the process of deciding whether to revisit that.

I'd like to hear more about how this becomes unblocked... I would expect that once the application default credentials are broken for any particular app, they'd stay broken. Are you saying that in your case, within the same deployment, they're broken and then start working? Or is redeployment happening automatically when it's broken, and that's what "fixes" it?

@flagbug
Author

flagbug commented Mar 24, 2021

A few minutes is definitely longer than I'd expect

I had a deeper look into our tracing, and what happens is that the call to GoogleCredential.GetApplicationDefault retries internally; it can take roughly half a minute until it gives up and throws the exception. See, for example, this trace:

[Screenshot: trace showing the GetApplicationDefault call retrying for roughly 30 seconds before throwing]

When I was reading our errors I didn't use a granular enough graph, and multiple instances were throwing the same error one after another, so it's not a couple of minutes; it just looked that way. Sorry about the confusion here.

I'd like to hear more about how this becomes unblocked... I would expect that once the application default credentials are broken for any particular app, they'd stay broken. Are you saying that in your case, within the same deployment, they're broken and then start working? Or is redeployment happening automatically when it's broken, and that's what "fixes" it?

Yeah, I'm pretty sure it resolves on the same instance even after it has failed once with an exception; we don't have any mechanism that does a redeployment when a request hitting this code path fails.

Looking at this again, it sounds like I just have to retry getting the credentials a bunch of times at application startup until the metadata server responds. Annoying and not what I'd expect from the metadata server, but at least there's a workaround 😄
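
For reference, here's a rough sketch of what that startup retry could look like (the helper name, attempt count and delay are arbitrary illustrative values; note also that, as discussed further down this thread, the library caches a failed credential lookup, so an in-process retry may not always recover):

// Sketch only: retry the Application Default Credentials lookup at startup
// until the metadata server responds. Attempt count and delay are illustrative.
private static async Task<GoogleCredential> GetCredentialWithRetryAsync(int maxAttempts = 5)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            // Probes the GCE metadata server when GOOGLE_APPLICATION_CREDENTIALS isn't set.
            return await GoogleCredential.GetApplicationDefaultAsync();
        }
        catch (InvalidOperationException) when (attempt < maxAttempts)
        {
            // Metadata server not reachable yet; wait a bit and try again.
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }
}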

@jskeet
Collaborator

jskeet commented Mar 24, 2021

Right - half a minute is more in line with what I've heard internally before, I think. (Still longer than I'd like, but...)

I don't understand how this is fixing itself, so I think I'll need to dig in a bit further... it's unlikely that I'll be able to do so until at least late next week I'm afraid, so I'd definitely suggest using your proposed workaround until I can get back to you. (If we need to change the auth library, that will also take a while to be designed, implemented, released etc.)

@willbryant

willbryant commented Mar 24, 2021

Quite possibly just a heck of a coincidence on timing, but our .NET application in australia-southeast1 suffered this today on roughly one third of our VMs, 10.5 hours ago - not too far off when @flagbug opened this ticket.

Our production system only migrated to GCE very recently, so we don't have any data about whether this is an ongoing issue for us. We haven't noticed it in our equivalent test systems over the last 6 months, but quite possibly that's just because we weren't exercising them enough to notice that some instances had failed to get credentials some of the time.

We have opened a support case for today's incident - we were wondering if there was a blip in the metadata system in our region today, but the support team were not immediately aware of anything.

I will do some more digging tomorrow, but our current understanding is that in our case today, this was not just a boot-time issue, as our incident was triggered when we did an IIS app pool recycle as a part of an application deployment; those instances had been running and working happily after a deployment earlier in the day, so they were definitely not freshly booted.

Our full trace looks a little different (ignoring the fact that our errors came from use with GCS rather than Firebase - I'm more interested in the repetition):

System.InvalidOperationException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Business2.FileStore.GoogleStorageException: Error uploading to Google Storage: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information. ---> System.InvalidOperationException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
at Google.Apis.Auth.OAuth2.DefaultCredentialProvider.<CreateDefaultCredentialAsync>d__11.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Google.Api.Gax.Rest.ScopedCredentialProvider.<CreateDefaultCredentialsUncached>d__5.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Google.Api.Gax.Rest.ClientBuilderBase`1.<GetHttpClientInitializerAsync>d__31.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Google.Api.Gax.Rest.ClientBuilderBase`1.<CreateServiceInitializerAsync>d__28.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Google.Cloud.Storage.V1.StorageClientBuilder.<BuildAsync>d__9.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Business2.FileStore.GoogleStorageDownloader.<ReadFromGoogleStorageAsync>d__2.MoveNext() in C:\BuildAgentB\work\bf7602ab896e2303\Business2\FileStore\GoogleStorageDownloader.cs:line 27
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Google.Api.Gax.TaskExtensions.WaitWithUnwrappedExceptions(Task task)
at Google.Api.Gax.Rest.ClientBuilderBase`1.GetHttpClientInitializer()
at Google.Cloud.Storage.V1.StorageClientBuilder.Build()
at Business2.FileStore.GoogleStorageUploader.SaveToGoogleStorage(Stream fileContents, Uri relativePath, String saveLocationPath, FileState state, Boolean normaliseRelativePath) in C:\BuildAgentB\work\bf7602ab896e2303\Business2\FileStore\GoogleStorageUploader.cs:line 59
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Google.Api.Gax.TaskExtensions.WaitWithUnwrappedExceptions(Task task)
at Google.Api.Gax.Rest.ClientBuilderBase`1.GetHttpClientInitializer()
at Google.Cloud.Storage.V1.StorageClientBuilder.Build()
at Business2.FileStore.GoogleStorageUploader.SaveToGoogleStorage(Stream fileContents, Uri relativePath, String saveLocationPath, FileState state, Boolean normaliseRelativePath) in C:\BuildAgentB\work\bf7602ab896e2303\Business2\FileStore\GoogleStorageUploader.cs:line 59
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Google.Api.Gax.TaskExtensions.WaitWithUnwrappedExceptions(Task task)
at Google.Api.Gax.Rest.ClientBuilderBase`1.GetHttpClientInitializer()
at Google.Cloud.Storage.V1.StorageClientBuilder.Build()
at Business2.FileStore.GoogleStorageUploader.SaveToGoogleStorage(Stream fileContents, Uri relativePath, String saveLocationPath, FileState state, Boolean normaliseRelativePath) in C:\BuildAgentB\work\bf7602ab896e2303\Business2\FileStore\GoogleStorageUploader.cs:line 59
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Google.Api.Gax.TaskExtensions.WaitWithUnwrappedExceptions(Task task)
at Google.Api.Gax.Rest.ClientBuilderBase`1.GetHttpClientInitializer()
at Google.Cloud.Storage.V1.StorageClientBuilder.Build()
at Business2.FileStore.GoogleStorageUploader.SaveToGoogleStorage(Stream fileContents, Uri relativePath, String saveLocationPath, FileState state, Boolean normaliseRelativePath) in C:\BuildAgentB\work\bf7602ab896e2303\Business2\FileStore\GoogleStorageUploader.cs:line 59
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Google.Api.Gax.TaskExtensions.WaitWithUnwrappedExceptions(Task task)
at Google.Api.Gax.Rest.ClientBuilderBase`1.GetHttpClientInitializer()
at Google.Cloud.Storage.V1.StorageClientBuilder.Build()
at Business2.FileStore.GoogleStorageUploader.SaveToGoogleStorage(Stream fileContents, Uri relativePath, String saveLocationPath, FileState state, Boolean normaliseRelativePath) in C:\BuildAgentB\work\bf7602ab896e2303\Business2\FileStore\GoogleStorageUploader.cs:line 59
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Google.Api.Gax.TaskExtensions.WaitWithUnwrappedExceptions(Task task)
at Google.Api.Gax.Rest.ClientBuilderBase`1.GetHttpClientInitializer()
at Google.Cloud.Storage.V1.StorageClientBuilder.Build()
at Business2.FileStore.GoogleStorageUploader.SaveToGoogleStorage(Stream fileContents, Uri relativePath, String saveLocationPath, FileState state, Boolean normaliseRelativePath) in C:\BuildAgentB\work\bf7602ab896e2303\Business2\FileStore\GoogleStorageUploader.cs:line 59
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Google.Api.Gax.TaskExtensions.WaitWithUnwrappedExceptions(Task task)
at Google.Api.Gax.Rest.ClientBuilderBase`1.GetHttpClientInitializer()
at Google.Cloud.Storage.V1.StorageClientBuilder.Build()
at Business2.FileStore.GoogleStorageUploader.SaveToGoogleStorage(Stream fileContents, Uri relativePath, String saveLocationPath, FileState state, Boolean normaliseRelativePath) in C:\BuildAgentB\work\bf7602ab896e2303\Business2\FileStore\GoogleStorageUploader.cs:line 59
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Google.Api.Gax.TaskExtensions.WaitWithUnwrappedExceptions(Task task)
at Google.Api.Gax.Rest.ClientBuilderBase`1.GetHttpClientInitializer()
at Google.Cloud.Storage.V1.StorageClientBuilder.Build()
at Business2.FileStore.GoogleStorageUploader.SaveToGoogleStorage(Stream fileContents, Uri relativePath, String saveLocationPath, FileState state, Boolean normaliseRelativePath) in C:\BuildAgentB\work\bf7602ab896e2303\Business2\FileStore\GoogleStorageUploader.cs:line 59
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Google.Api.Gax.TaskExtensions.WaitWithUnwrappedExceptions(Task task)
at Google.Api.Gax.Rest.ClientBuilderBase`1.GetHttpClientInitializer()
at Google.Cloud.Storage.V1.StorageClientBuilder.Build()
at Business2.FileStore.GoogleStorageUploader.SaveToGoogleStorage(Stream fileContents, Uri relativePath, String saveLocationPath, FileState state, Boolean normaliseRelativePath) in C:\BuildAgentB\work\bf7602ab896e2303\Business2\FileStore\GoogleStorageUploader.cs:line 59
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Google.Api.Gax.TaskExtensions.WaitWithUnwrappedExceptions(Task task)
at Google.Api.Gax.Rest.ClientBuilderBase`1.GetHttpClientInitializer()

I was wondering if the repeated sections are due to a retry mechanism; I started looking into the code tonight and found this ticket.

(Business2 is our application's code so ignore that bit.)

@willbryant

willbryant commented Mar 24, 2021

I should have said, ours didn't seem to auto-resolve, but we'll check that again too. TBH that would seem easier to understand, given:

            // 3. Then try the compute engine.
            Logger.Debug("Checking whether the application is running on ComputeEngine.");
            if (await ComputeCredential.IsRunningOnComputeEngine().ConfigureAwait(false))
            {
                Logger.Debug("ComputeEngine check passed. Using ComputeEngine Credentials.");
                return new GoogleCredential(new ComputeCredential());
            }

calling this:

        /// <summary>
        /// Detects if application is running on Google Compute Engine. This is achieved by attempting to contact
        /// GCE metadata server, that is only available on GCE. The check is only performed the first time you
        /// call this method, subsequent invocations used cached result of the first call.
        /// </summary>
        public static Task<bool> IsRunningOnComputeEngine()

NB. the retry policy on the task that populates that is 3 tries with 500ms timeout, so that doesn't seem enough to cope with a ~half a minute problem on boot (let alone whatever happened to our non-boot-time case today).

Also, my read of the code is that that really is just a timeout - it spins around and does the retry straight away if the “ping” HTTP request fails immediately? That doesn't seem right.

@flagbug
Author

flagbug commented Apr 8, 2021

So this problem didn't occur in any of our deployments for the past two weeks, but today one of our freshly created VMs in a managed instance group suddenly had this error again. This time it also didn't go away after a few seconds, but persisted until I had to manually delete the VM from the MIG.

@jskeet
Collaborator

jskeet commented Apr 9, 2021

@flagbug: I still don't understand how it could have fixed itself in the first report, to be honest. I wouldn't expect you to have to remove the VM, but I would expect restarting the app to work. Is it possible that in the first report, the app was restarting, and that in the recent incident it wasn't? Or do you think the app was restarting in both cases, but the VM itself had "gone bad" in some way, so that even restarting the app wouldn't help?

@flagbug
Author

flagbug commented Apr 9, 2021

I wouldn't expect you to have to remove the VM, but I would expect restarting the app to work.

So you mean just restarting the application process, not the entire VM, right? Just curious, why would that cause the Metadata server to suddenly be available?

Is it possible that in the first report, the app was restarting, and that in the recent incident it wasn't?

@jskeet Pretty unlikely, unless there's some underlying mechanism in Compute Engine that detects that the metadata server isn't available and triggers a restart. We don't have anything implemented that would restart our application when that happens; we just fail the entire request, and that wouldn't cause the application to restart.

@jskeet
Collaborator

jskeet commented Apr 9, 2021

So you mean just restarting the application process, not the entire VM, right? Just curious, why would that cause the Metadata server to suddenly be available?

It wouldn't cause the metadata server to be available - but it would cause the application to check for it again. The Application Default Credentials are cached, including any failure to acquire them.

The resulting task is stored in a Lazy<T> here - which is why I'd expect any failure for a particular app instance to be permanent. I don't understand how an app that reported that error would ever then start working.
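
To illustrate (this is just a toy sketch, not the library's actual code): a Lazy<Task<T>> runs its factory exactly once, so whatever the first attempt produced, including a failure, is what every later caller observes.

// Toy example: a faulted first attempt is cached by the Lazy<Task<T>> and
// re-observed by every subsequent await; the factory never runs again.
using System;
using System.Threading.Tasks;

class LazyFailureDemo
{
    private static readonly Lazy<Task<string>> Cached =
        new Lazy<Task<string>>(() => FetchAsync());

    private static Task<string> FetchAsync() =>
        Task.FromException<string>(new InvalidOperationException("metadata server unreachable"));

    static async Task Main()
    {
        for (int i = 0; i < 2; i++)
        {
            try { await Cached.Value; }
            catch (InvalidOperationException e)
            {
                // Both iterations hit the same cached failure.
                Console.WriteLine($"Attempt {i + 1}: {e.Message}");
            }
        }
    }
}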

@flagbug
Author

flagbug commented Apr 12, 2021

@jskeet You're right of course. I just had another deployment where one of our VMs didn't have a healthy metadata server, and for some reason - I'm still not sure why this happens - Compute Engine recreated the VM with the reason "instance's current action is NONE, but health status is TIMEOUT". That would explain why the issue often resolves itself.

At this point I'm wondering if failing to reach the metadata server is just a symptom and the underlying root cause is something else in our application, since the only reason the health check should fail is if our health endpoint times out...

@jskeet
Collaborator

jskeet commented Apr 12, 2021

@flagbug: Phew - I'm glad we've understood that part, at least.

The report from @willbryant about their app having the same issue at around the same time suggests it may have been a networking blip, but I'm afraid I don't have much more insight than that.

Does your application log health check requests that it receives? Did the app itself get traffic (and therefore there was some network connectivity)? It's all a bit of a mystery :(

@flagbug
Author

flagbug commented Apr 13, 2021

@jskeet So I dug a bit deeper into our monitoring and it looks like this is some kind of thread starvation issue on our end, where one of the symptoms is that the HTTP request to the Compute Engine metadata server doesn't complete successfully in time.

Still not sure why only some of our newly created machines have this issue intermittently at startup, but our monitoring definitely shows that the thread queue length rises rapidly while at the same time HTTP requests start to time out.
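
For anyone else chasing the same symptom, this is roughly the kind of thread-pool counter logging that makes it visible (a sketch only, not our actual monitoring; ThreadPool.ThreadCount and ThreadPool.PendingWorkItemCount are available from .NET Core 3.0 / .NET 5 onwards):

// Illustrative only: periodically log thread-pool pressure so that a rapidly
// growing work-item queue (a starvation symptom) shows up in monitoring.
using System;
using System.Threading;
using System.Threading.Tasks;

static class ThreadPoolMonitor
{
    public static async Task RunAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            Console.WriteLine(
                $"threads={ThreadPool.ThreadCount} queued={ThreadPool.PendingWorkItemCount} completed={ThreadPool.CompletedWorkItemCount}");
            await Task.Delay(TimeSpan.FromSeconds(5), token);
        }
    }
}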

Thanks for all of the help figuring this out @jskeet! I think this issue can be closed as I'm pretty sure the root cause isn't in this library.

@jskeet
Collaborator

jskeet commented Apr 13, 2021

Great - thanks so much for digging deeper :) Do let me know if there's anything I can do to help when you're further along in the thread starvation exploration.

@jskeet jskeet closed this as completed Apr 13, 2021
@amanda-tarafa
Contributor

@flagbug We had a similar issue a while back with someone using the Google.Cloud.Logging.V2 library; it just didn't ring any bells until you mentioned the thread starvation symptom. Unfortunately that issue was tracked in an internal system so I cannot point you there, but I'll do my best to "translate" what was happening there to this case. I also think I know how you could fix this issue: don't defer obtaining the credentials until your first request, but do it directly in Startup (you'll see the rationale for this after reading my "translation" of the other issue below).

But basically, I think that your code that looks like this:

services.AddSingleton<FirebaseApp>(options =>
{
    return FirebaseApp.Create(new AppOptions
    {
        Credential = GoogleCredential.GetApplicationDefault(),
        ProjectId = "the_project_id"
    });
});

should look like this to avoid this problem:

services.AddSingleton<FirebaseApp>(
    FirebaseApp.Create(new AppOptions
    {
        Credential = GoogleCredential.GetApplicationDefault(),
        ProjectId = "the_project_id"
    }));

Or even:

var firebaseApp = FirebaseApp.Create(new AppOptions
{
    Credential = GoogleCredential.GetApplicationDefault(),
    ProjectId = "the_project_id"
});
services.AddSingleton<FirebaseApp>(firebaseApp);
services.AddSingleton<FirebaseAuth>(FirebaseAuth.GetAuth(firebaseApp));

What I think is happening:

  • Soon after application startup, you start receiving requests that potentially try to obtain a FirebaseApp or a FirebaseAuth from DI.
  • When either is fetched from DI, all but one of the request threads block until the FirebaseApp singleton is ready, which it will be once the default credential is fetched. More requests keep coming in and keep blocking on either of the DI singletons. The thread pool starts to deplete.
  • The "lucky" thread that went on to fetch the credential gets to a point where it needs to check whether we are running on Compute Engine, and for that an HTTP request is made to the metadata server. This is an async operation that doesn't block the thread, which is freed to do other things (probably attend to yet another incoming request).
  • Still more requests keep coming in and more threads keep getting blocked waiting for the credential to be fetched, stressing the thread pool even more. The thread pool is probably growing at this point, but very slowly at first. Requests start being queued up, and some may start timing out.
  • When the async HTTP request to the metadata server ends, it needs a thread to resume on, and it's now competing for a thread from a starved thread pool. Basically, we need a thread to finish fetching the credential, while most threads are blocked waiting for the credential to be fetched.
  • Eventually the thread pool grows enough and the credential fetching can resume (if that happens within the timeout; otherwise you get the error you initially reported), and the rest of the threads unblock over time.

That's why creating the singletons in Startup, which means the credential is effectively fetched before requests start coming in and depleting the thread pool, should fix it for you. It also explains why it only happens sometimes: it all depends on how soon the credential is fetched and how lucky that operation is in finding a thread to return to, which in turn depends on the number of requests you are receiving right after startup.

If we had async DI, this wouldn't be a problem either.
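
As a variation on the same idea (a sketch only, not something discussed above; the hosting code is just the standard ASP.NET Core 5 generic-host shape), you could fetch the credential asynchronously in Program.Main before the host starts serving requests, which gives you the eager behaviour without a blocking call inside ConfigureServices:

// Sketch: resolve the credential asynchronously before the web host starts,
// then register the already-built FirebaseApp/FirebaseAuth as singleton instances.
public static async Task Main(string[] args)
{
    var credential = await GoogleCredential.GetApplicationDefaultAsync();
    var firebaseApp = FirebaseApp.Create(new AppOptions
    {
        Credential = credential,
        ProjectId = "the_project_id"
    });

    var host = Host.CreateDefaultBuilder(args)
        .ConfigureWebHostDefaults(web => web.UseStartup<Startup>())
        .ConfigureServices(services =>
        {
            services.AddSingleton(firebaseApp);
            services.AddSingleton(FirebaseAuth.GetAuth(firebaseApp));
        })
        .Build();

    await host.RunAsync();
}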

Hope this helps.

@flagbug
Author

flagbug commented Apr 13, 2021

@amanda-tarafa Amazing analysis, I think you're spot-on! I've just deployed a version that eagerly initializes the credentials and haven't observed any of our instances having this issue. Now this could be a coincidence of course, but I think that was actually the root cause here.

Thanks so much for your help, I'm absolutely amazed about the great and in-depth help I got from you all ❤️
