Application Default Credentials not available even if running on Google Compute Engine #1807
A few minutes is definitely longer than I'd expect... and no, you're not doing anything wrong in your initialization of the `GoogleCredential`. We're aware that it can sometimes take a while for the metadata server to become available. We were at one point looking at making the timeout configurable, and can feed your issue into the process of deciding whether to revisit that.

I'd like to hear more about how this becomes unblocked... I would expect that once the Application Default Credentials are broken for any particular app, they'd stay broken. Are you saying that in your case, within the same deployment, they're broken and then start working? Or is redeployment happening automatically when it's broken, and that's what "fixes" it?
I had a deeper look into our tracing, and what happens is that while the call to fetch the default credentials fails at first, it starts succeeding after roughly half a minute. When I was reading our errors I didn't use a granular enough graph, and multiple instances were throwing the same error one after another, so it's not a couple of minutes - it just looked like it. Sorry about the confusion here.
Yeah, I'm pretty sure it resolves on the same instance even after it failed once with an exception - there is no mechanism on our side that does a redeployment when a request hits this code path and fails. Looking at this again, it sounds like I just have to retry getting the credentials a bunch of times at application startup until the metadata server responds. Annoying and not what I'd expect from the metadata server, but at least there's a workaround 😄
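Something like this is what I have in mind - just a minimal sketch, with the retry count and delay picked arbitrarily:

```csharp
using System;
using System.Threading.Tasks;
using Google.Apis.Auth.OAuth2;

// Sketch of the startup workaround: keep retrying the credential fetch
// until the metadata server responds. The retry count and delay here are
// arbitrary choices, not recommendations from the library authors.
static async Task<GoogleCredential> GetCredentialWithRetryAsync()
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await GoogleCredential.GetApplicationDefaultAsync();
        }
        catch (InvalidOperationException) when (attempt < 10)
        {
            // Metadata server not reachable yet; back off and try again.
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }
}
```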
Right - half a minute is more in line with what I've heard internally before, I think. (Still longer than I'd like, but...) I don't understand how this is fixing itself, so I think I'll need to dig in a bit further... it's unlikely that I'll be able to do so until at least late next week, I'm afraid, so I'd definitely suggest using your proposed workaround until I can get back to you. (If we need to change the auth library, that will also take a while to be designed, implemented, released, etc.)
Quite possibly just a heck of a coincidence on timing, but our .NET application in production hit the same error today. Our production system only migrated to GCE very recently, so we don't have any data about whether this is an ongoing issue for us. We haven't noticed it in our equivalent test systems over the last 6 months, but quite possibly we just weren't exercising them all enough to notice that some instances had failed to get credentials some of the time.

We have opened a support case for today's incident - we were wondering if there was a blip in the metadata system in our region today, but the support team were not immediately aware of anything.

I will do some more digging tomorrow, but our current understanding is that in our case today, this was not just a boot-time issue: our incident was triggered when we did an IIS app pool recycle as part of an application deployment. Those instances had been running and working happily after a deployment earlier in the day, so they were definitely not freshly booted.

Our full trace looks a little different (ignoring the fact that our errors came from use with GCS rather than Firebase - I'm more interested in the repetition):
I was wondering if the repeated sections are because of a retry mechanism; I started looking into the code tonight and found this ticket. (Business2 is our application's code, so ignore that bit.)
I should have said, ours didn't seem to auto-resolve, but we'll check that again too. TBH that would seem easier to understand, given:
calling this:
NB: the retry policy on the task that populates that is 3 tries with a 500ms timeout, so that doesn't seem enough to cope with a ~half-a-minute problem on boot (let alone whatever happened in our non-boot-time case today). Also, my read of the code is that that really is just a timeout - it spins around and retries straight away if the "ping" HTTP request fails immediately? That doesn't seem right.
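To illustrate why that bothers me - this isn't the library's actual code, just a sketch of the pattern I think I'm seeing - three attempts with only a per-attempt timeout can all fail within milliseconds if the connection is refused outright:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch only (not the library's real code): if the "ping"
// fails immediately (e.g. connection refused), each attempt's 500ms value
// acts as a timeout, not a delay, so all three attempts can complete in a
// few milliseconds - nowhere near bridging a ~30s metadata outage.
static async Task<bool> ProbeMetadataServerAsync(HttpClient client)
{
    for (var attempt = 0; attempt < 3; attempt++)
    {
        using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(500));
        try
        {
            using var response = await client.GetAsync(
                "http://169.254.169.254", cts.Token);
            return true; // metadata server reachable
        }
        catch (Exception)
        {
            // No back-off here: an instant failure loops straight around.
        }
    }
    return false;
}
```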
So this problem didn't occur in any of our deployments for the past two weeks, but today suddenly one of our freshly created VMs in a managed instance group had this error again. Also, this time it didn't go away after a few seconds, but persisted until I had to manually delete the VM from the MIG.
@flagbug: I still don't understand how it could have fixed itself in the first report, to be honest. I wouldn't expect you to have to remove the VM, but I would expect restarting the app to work. Is it possible that in the first report, the app was restarting, and that in the recent incident it wasn't? Or do you think the app was restarting in both cases, but the VM itself had "gone bad" in some way, so that even restarting the app wouldn't help?
So you mean just restarting the application process, not the entire VM, right? Just curious, why would that cause the Metadata server to suddenly be available?
@jskeet Pretty unlikely, unless there's some underlying mechanism in Compute Engine that detects that the metadata server isn't available and triggers a restart. We don't have anything implemented that would restart our application when that happens; we just fail the entire request, and that wouldn't cause the application to restart.
It wouldn't cause the metadata server to be available - but it would cause the application to check for it again. The Application Default Credentials are cached, including any failure to acquire them. The resulting task is stored in a `Lazy<Task<GoogleCredential>>`, so once it has faulted, every later call observes the same failure until the process restarts.
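To sketch what that caching behaviour looks like (the names here are illustrative, not the library's actual code, but the shape matches what I described):

```csharp
using System;
using System.Threading.Tasks;
using Google.Apis.Auth.OAuth2;

// Illustrative sketch: once the Lazy materializes a faulted task, every
// later await observes the same cached exception. Only a process restart
// creates a fresh Lazy and re-runs the fetch.
static readonly Lazy<Task<GoogleCredential>> cachedCredential =
    new Lazy<Task<GoogleCredential>>(
        () => GoogleCredential.GetApplicationDefaultAsync());

static Task<GoogleCredential> GetCachedCredentialAsync() => cachedCredential.Value;
```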
@jskeet You're right of course. I just had another deployment where one of our VMs didn't have a healthy metadata server, and for some reason - I'm still not sure why this happens - Compute Engine recreated the VM, citing a failed health check. At this point I'm wondering if failing to reach the metadata server is just a symptom and the underlying root cause is something else in our application, since the only reason the health check should fail is if our health endpoint times out...
@flagbug: Phew - I'm glad we've understood that part, at least. The report from @willbryant about their app having the same issue at the same time suggests it may have been a networking blip, but I'm afraid I don't have any more insight than that. Does your application log the health check requests it receives? Did the app itself get traffic (and therefore have some network connectivity)? It's all a bit of a mystery :(
@jskeet So I dug a bit deeper into our monitoring, and it looks like this is some kind of thread starvation issue on our end, where one of the symptoms is that the HTTP request to the Compute Engine metadata server doesn't complete in time. I'm still not sure why only some of our newly created machines hit this intermittently at startup, but our monitoring definitely shows the thread queue length rising rapidly while, at the same time, HTTP requests start to time out.

Thanks for all of the help figuring this out @jskeet! I think this issue can be closed, as I'm pretty sure the root cause isn't in this library.
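In case it's useful to anyone else, this is roughly how we watched for the symptom - .NET Core 3.0+ exposes the thread pool counters directly (a rough sketch, not our actual monitoring code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Rough sketch: periodically dump thread pool stats. A rapidly growing
// PendingWorkItemCount while HTTP calls start timing out is the
// starvation signature described above.
static async Task MonitorThreadPoolAsync(CancellationToken token)
{
    while (!token.IsCancellationRequested)
    {
        Console.WriteLine(
            $"Pool threads: {ThreadPool.ThreadCount}, " +
            $"queued work items: {ThreadPool.PendingWorkItemCount}");
        await Task.Delay(TimeSpan.FromSeconds(5), token);
    }
}
```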
Great - thanks so much for digging deeper :) Do let me know if there's anything I can do to help when you're further along in the thread starvation exploration.
@flagbug We had a similar issue a while back with someone using the Google.Cloud.Logging.V2 library; it just didn't ring any bells until you mentioned the thread starvation symptom. Unfortunately that issue was tracked in an internal system so I can't point you there, but I'll do my best to "translate" what was happening there to this case. I also think I know how you could fix this, and that's by not deferring fetching the credentials until your first request, but doing it directly in Startup (you'll see the rationale for this after reading my "translation" from the other issue below). But basically, I think your code currently looks something like this:
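(I'm guessing at the exact shape here - the names don't matter; the deferred factory delegate is the point:)

```csharp
using FirebaseAdmin;
using FirebaseAdmin.Auth;
using Google.Apis.Auth.OAuth2;
using Microsoft.Extensions.DependencyInjection;

public void ConfigureServices(IServiceCollection services)
{
    // Guessed reconstruction: nothing executes here. The credential is only
    // fetched when the first request resolves FirebaseApp from DI - on a
    // thread pool thread, while traffic is already arriving.
    services.AddSingleton(provider =>
        FirebaseApp.Create(new AppOptions
        {
            Credential = GoogleCredential.GetApplicationDefault()
        }));
    services.AddSingleton(provider =>
        FirebaseAuth.GetAuth(provider.GetRequiredService<FirebaseApp>()));
}
```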
That should look something like this instead, to avoid the problem:
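(again just a sketch, mirroring the shape above:)

```csharp
// (same using directives as in the previous snippet)
public void ConfigureServices(IServiceCollection services)
{
    // Eager: the credential is fetched while ConfigureServices runs, before
    // any requests arrive to compete for thread pool threads.
    var firebaseApp = FirebaseApp.Create(new AppOptions
    {
        Credential = GoogleCredential.GetApplicationDefault()
    });
    services.AddSingleton(firebaseApp);
    services.AddSingleton(FirebaseAuth.GetAuth(firebaseApp));
}
```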
Or even:
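(also a sketch - fetching the credential asynchronously in `Main`, before the web host exists at all, avoids the sync-over-async block entirely; `CreateHostBuilder` is the standard ASP.NET Core template method:)

```csharp
// (same using directives as in the previous snippets)
public static async Task Main(string[] args)
{
    // Resolve the credential before the web host (and its traffic) exists,
    // and without blocking a thread pool thread on an async call.
    var credential = await GoogleCredential.GetApplicationDefaultAsync();
    FirebaseApp.Create(new AppOptions { Credential = credential });

    // Standard ASP.NET Core template host setup.
    CreateHostBuilder(args).Build().Run();
}
```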
What I think is happening:

- The credential is only fetched when the first request resolves `FirebaseApp` from DI, so the fetch runs on a thread pool thread while traffic is already arriving.
- `GoogleCredential.GetApplicationDefault()` blocks that thread while it waits for the underlying async call to the metadata server.
- The continuation of that async call needs another thread pool thread to complete on, but incoming requests are grabbing new threads faster than the pool grows.
- So the metadata server request times out, or completes far too late, even though the metadata server itself is perfectly healthy.
That's why creating the singletons in Startup - which means the credential is effectively fetched before requests start coming in and depleting the thread pool - should fix it for you. It also explains why it only happens sometimes: it all depends on how soon the credential is fetched and how lucky that operation is in finding a thread to return to, which in turn depends on the number of requests you're receiving right after startup. If we had async DI, this wouldn't be a problem either. Hope this helps.
@amanda-tarafa Amazing analysis - I think you're spot-on! I've just deployed a version that eagerly initializes the credentials and haven't observed any of our instances having this issue. This could be a coincidence of course, but I think that was actually the root cause. Thanks so much for your help; I'm absolutely amazed by the great and in-depth help I got from you all ❤️
So this is a weird one. Our application is running in a managed instance group inside a Docker container on Google Compute Engine, on multiple machines, in the region `us-east4`. Sometimes when we are deploying and the managed instance group starts replacing our instances, we're receiving the following `InvalidOperationException` from the new instances. Now the thing is that for some reason this issue starts to go away after a couple of minutes, and the Google Cloud client library suddenly starts finding the default credentials on the Compute Engine machine.
We're using Firebase authentication and the code for initializing the Firebase Admin SDK in our ASP.NET Core 5 application looks like this:
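(The snippet itself didn't survive here; going by the analysis in the comments above, it was presumably a deferred DI registration along these lines:)

```csharp
// Plausible reconstruction, not the verbatim original: the credential fetch
// is deferred until the first request resolves FirebaseApp from DI.
// Inside Startup.ConfigureServices(IServiceCollection services):
services.AddSingleton(provider =>
    FirebaseApp.Create(new AppOptions
    {
        Credential = GoogleCredential.GetApplicationDefault()
    }));
```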
When a controller is invoked that requires an instance of `FirebaseAuth`, we're getting an exception with a stack trace like this:

To me it really doesn't make any sense that the Compute Engine metadata isn't available right after starting the application and suddenly becomes available after a few minutes. I found an issue that looks very similar here: googleapis/google-cloud-dotnet#1394, but apparently that was resolved inside the library itself, by switching from a URL to an IP address for the metadata server.
Is there anything dumb that I'm doing wrong here when initializing the `GoogleCredential`, or is this some kind of deeper issue? Is there anything I can do to help debug or resolve this problem? Unfortunately I have no idea how to reproduce this; it happens sporadically after deploying and suddenly resolves after a short time.