
Skip VmMetadataApiHandler when running outside a VM #4187

Closed
marcopelegrini opened this issue Nov 15, 2023 · 24 comments · Fixed by #4568
Assignees: sourabh1007
Labels: customer-reported (Issue created by a customer), needs-investigation

Comments

@marcopelegrini

marcopelegrini commented Nov 15, 2023

Describe the bug

The Cosmos DB client tries to obtain VM metadata that only exists when the service is running on an Azure VM.

Scope: HttpMethod:GET Uri:http://169.254.169.254/metadata/instance?api-version=2020-06-01 {OriginalFormat}:HTTP {HttpMethod} {Uri}
2023-11-15 15:07:07.241 (err ) GET 169.254.169.254//metadata/instance 

0: Exception: Connection refused (169.254.169.254:80) (System.Net.Http.HttpRequestException)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.Logging.LoggingHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Microsoft.R9.Extensions.HttpClient.Logging.Internal.HttpLoggingHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
0:0: Exception: Connection refused (System.Net.Sockets.SocketException)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)

To Reproduce
Steps to reproduce the behavior. Code snippets or links to repositories containing a repro of the issue help us detect the scenario and speed up resolution.

Run an app outside of Azure with the Cosmos DB client enabled and logging configured.

Expected behavior
Either:

  • Automatically detect it and don't try to send a request
  • Offer a way to disable this handler when running the client in a different environment (e.g. locally)

Actual behavior
An exception is thrown and appears in the logs.

Environment summary
SDK Version: 3.33.0
OS Version: macOS 14.1.1 (23B81)

@sourabh1007
Contributor

@marcopelegrini do you see any impact due to this? This call happens only at initialization, and only once in the lifetime of a client instance.

@marcopelegrini
Author

@sourabh1007 I don't want to see an error every time I start my application. I follow a zero-warning/zero-error build and run methodology, and this is preventing my validations from passing.

@sourabh1007
Contributor

In the SDK, we log it at the Information level so it does not come up in the logs unnecessarily (ref).

DefaultTrace.TraceInformation($"Azure Environment metadata information not available. {0}", e.Message);

It looks like you are using a custom handler, and this exception is printed by Microsoft.R9.Extensions.HttpClient.Logging.Internal.HttpLoggingHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken). Please correct me if my understanding is not correct.
You need to handle it in this custom handler.
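
For reference, a minimal sketch of the kind of filtering such a custom handler could do. The handler name, the logger usage, and the choice to downgrade rather than suppress the log are illustrative assumptions, not part of the Cosmos DB SDK:

    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Extensions.Logging;

    // Sketch: a delegating handler that logs failed requests, but treats the one-time
    // instance-metadata probe (169.254.169.254) as expected noise outside Azure VMs.
    public sealed class ImdsAwareLoggingHandler : DelegatingHandler
    {
        private const string ImdsHost = "169.254.169.254";
        private readonly ILogger<ImdsAwareLoggingHandler> logger;

        public ImdsAwareLoggingHandler(ILogger<ImdsAwareLoggingHandler> logger) => this.logger = logger;

        protected override async Task<HttpResponseMessage> SendAsync(
            HttpRequestMessage request, CancellationToken cancellationToken)
        {
            try
            {
                return await base.SendAsync(request, cancellationToken);
            }
            catch (HttpRequestException ex) when (request.RequestUri?.Host == ImdsHost)
            {
                // Log quietly instead of as an error, then rethrow so the SDK's own
                // try/catch around the metadata call still handles the failure.
                this.logger.LogDebug(ex, "Instance metadata endpoint not reachable (expected outside Azure VMs).");
                throw;
            }
            catch (HttpRequestException ex)
            {
                this.logger.LogError(ex, "HTTP request to {Uri} failed.", request.RequestUri);
                throw;
            }
        }
    }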

sourabh1007 self-assigned this Nov 19, 2023
@marcopelegrini
Author

@sourabh1007 That's correct, my handler logs any unsuccessful request, and I'd like to keep it that way.
Adding logic to it to ignore this particular request would couple it to the Cosmos DB library, which is undesirable; that's why I'm asking for a way to disable this instead.

@sourabh1007
Contributor

As I mentioned above, the SDK makes sure this does not get logged as an error. If you are using HttpClientFactory and passing your custom HttpClient, then it is your HttpClient's responsibility to decide what to log and what not to log. I don't think anything can be fixed here.

@marcopelegrini
Author

marcopelegrini commented Nov 28, 2023

@sourabh1007 Logging is a side effect of this feature. I've never asked you not to log it; I've asked you to offer a way to not send a request that is invalid for the environment in which the SDK is being used.

You're asking me to couple my HTTP client to the Cosmos DB client and add a condition for a request that may or may not be invalid... can you imagine how architecturally incorrect that is?

I'm requesting a simple toggle or a way to override this behaviour, which I would have been able to do myself if it weren't implemented using static accessors.

@FabianMeiswinkel
Member

@marcopelegrini - I think there are two reasons why there is no intent to change the behavior as requested:

  • HTTP request failures will happen in any distributed system - in Cosmos DB they could happen during any regional outage, and there can always be a limited number of transient errors. Most of these HTTP errors would never bubble up to applications as exceptions, because transient errors are retried (within the region and/or in different regions). So when you use a custom HttpMessageHandler in the HttpClient that you inject into the Cosmos DB SDK, there is simply no way to guarantee that there will never be any failures. It is like capturing FirstChanceExceptions - it can certainly be useful at times, but it comes with the consequence that there will be false negatives (failures that are mitigated without the application ever noticing). The VM metadata request above is a single request for the lifetime of a CosmosClient instance (which usually is a singleton and long-lived). Given that other transient errors can always happen, this single HTTP request does not add any significant noise/overhead.
  • There is simply no better way to identify whether VM metadata can be retrieved than to attempt to get it. The information retrieved is very valuable for any latency/network investigation, so there is no plan to allow disabling it when there is no actual negative side effect.

@marcopelegrini
Author

@sourabh1007 I agree with your first point when considering a production environment, although I don't think it can be used as a reason to prevent the invalid request in controlled environments, which is my case.

I disagree with the "this is a normal exception" approach... no exception is normal, and if it can be avoided, it should be. That's why it's called an exception.

I'm not alone, by the way: https://stackoverflow.com/questions/73235961/azure-cosmos-db-client-throws-httprequestexception-attempt-was-made-to-access

A library should always give its consumers control over the behaviours that affect their environment, and in this case the SDK is not giving me that control so I can avoid an unwanted request.

There's virtually no cost to adding a flag to the config to skip this call in specific scenarios:

VmMetadataApiHandler.TryInitialize(this.httpClient);
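
Purely as an illustration of the request, the opt-out could look like the following. DisableVmMetadataRetrieval is a hypothetical option name and does not exist in the SDK today:

    // Hypothetical sketch of the requested opt-out; this option is NOT part of the SDK.
    CosmosClientOptions options = new CosmosClientOptions
    {
        // Hypothetical flag: skip the 169.254.169.254 instance-metadata probe entirely.
        DisableVmMetadataRetrieval = true
    };
    CosmosClient client = new CosmosClient("<account-endpoint>", "<auth-key>", options);

    // Inside the SDK, the guard could then be as simple as (again, hypothetical):
    // if (!clientOptions.DisableVmMetadataRetrieval) { VmMetadataApiHandler.TryInitialize(this.httpClient); }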

@sourabh1007
Contributor

It is not about the cost of implementation.
In the SDK we only expose options related to functionality. We don't provide flags to control non-functional calls like this one.

This call is not related to Cosmos DB. It is an Instance Metadata Service call to get Azure instance information (ref https://learn.microsoft.com/en-us/azure/virtual-machines/instance-metadata-service?tabs=linux). The SDK just makes this call to get VM information.

There are a few things you can do:

  1. You can add this exception to the ignore list of your monitoring agent/application. That way it won't bother you (a sketch follows this comment).
  2. Since you are using a custom HttpClientFactory, you have control over all HTTP calls, so you can write custom handling to ignore this exception.

I don't think there is any real impact even if you have to live with it, beyond one exception stack trace in your log that appears only once during application restart. It is neither noisy nor big enough to bloat your log file.

I don't see any reason to add a new config just to avoid it.
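
As an illustration of the first suggestion (ignoring the failure in the monitoring layer), a telemetry processor along these lines could drop the dependency record for the metadata endpoint. This is a sketch that assumes Application Insights is the monitoring agent; the class name and registration shown are not something the Cosmos DB SDK provides:

    using Microsoft.ApplicationInsights.Channel;
    using Microsoft.ApplicationInsights.DataContracts;
    using Microsoft.ApplicationInsights.Extensibility;

    // Sketch: filter out dependency telemetry for the instance-metadata probe so it
    // never reaches dashboards or dependency-failure alerts.
    public sealed class IgnoreImdsDependencyProcessor : ITelemetryProcessor
    {
        private readonly ITelemetryProcessor next;

        public IgnoreImdsDependencyProcessor(ITelemetryProcessor next) => this.next = next;

        public void Process(ITelemetry item)
        {
            if (item is DependencyTelemetry dependency &&
                dependency.Target != null &&
                dependency.Target.Contains("169.254.169.254"))
            {
                return; // drop the failed metadata dependency record
            }

            this.next.Process(item);
        }
    }

    // Registration in ASP.NET Core (assumes Microsoft.ApplicationInsights.AspNetCore):
    // services.AddApplicationInsightsTelemetryProcessor<IgnoreImdsDependencyProcessor>();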

@marcopelegrini
Author

@sourabh1007 We're going to have to agree to disagree here, as it looks like I won't convince you anyway.

I appreciate your time and suggestions, but I prefer to keep my architecture clean rather than work around shortcomings of my dependencies. I'll find another way.

@wahyuen
Contributor

wahyuen commented Jan 24, 2024

Stumbled upon this after a few hours of investigation, wondering why we were seemingly randomly getting exceptions during startup. This is much more of a concern because it is not easily apparent WHY we get these errors (a bunch of SocketExceptions/HttpExceptions).

Unfortunately, it appears to be yet another case of Microsoft developers not listening to the community. A number of other Azure libraries give the user the flexibility to configure and change things for local development (see Azure.Identity, DefaultAzureCredential and the exclusion of different methods of authentication). You could arguably say that has zero effect in a production environment where Managed Identities are used, yet it makes a HUGE difference in the developer experience on a local platform.

@jwyza-pi

jwyza-pi commented Mar 7, 2024

I wanted to come and say this is also impacting us in a negative way, specifically around diagnosing problems within services. Tracing systems, such as Datadog, report this as part of their flame graphs and show minute-long timeouts whenever the SDK decides it needs to make this call.

There is precedent in other libraries for making decisions about the internal workings of the code based on the environment in which they run. Why not simply check whether this call should happen, instead of blindly assuming it will work and leading developers off on wild goose chases trying to understand why they have random errors showing up in their logs?

@michaeldaw

michaeldaw commented Apr 14, 2024

From https://stackoverflow.com/a/76841727/3925707
I put this in an if (Debugger.IsAttached) block:

	// Reflection workaround: mark the SDK's internal VmMetadataApiHandler as already
	// initialized so the 169.254.169.254 probe is skipped. Requires System.Reflection.
	var assembly = Assembly.GetAssembly(typeof(CosmosClient));
	var type = assembly?.GetType("Microsoft.Azure.Cosmos.Telemetry.VmMetadataApiHandler");
	var field = type?.GetField("isInitialized", BindingFlags.Static | BindingFlags.NonPublic);
	field?.SetValue(type, true); // first argument is ignored for static fields

You probably shouldn't do this... but it bypasses the metadata call and the error doesn't occur.

@btull89

btull89 commented May 16, 2024

This has negatively impacted my team as well. Our dependency failure alerts started firing when 100% of these requests started failing. Unfortunately, we will have to update our alerting to ignore this endpoint. Microsoft should address this issue.

@thucnguyen77

thucnguyen77 commented Jun 6, 2024

I don't mind adding a comment to show the obvious impact of this issue (even though Microsoft has closed it, i.e. covered their ears?).

As you can see in the image below, the request could have completed in a matter of milliseconds, but the 21 seconds spent on that 'metadata' call is a ridiculously large impact.
[screenshot: trace showing roughly 21 seconds spent on the metadata call]

@AndreiSchillerChan

Same here. We got a 4-second delay on an endpoint due to this timeout.

@sourabh1007
Contributor

I am sorry you are facing this issue. Can you please explain how the latency of this API is affecting you in any way? If it is affecting your application performance, please open a new GitHub issue with as many details as possible and I will look into it.

For the alerting-related experience reported here, please use a singleton client instance; you should then see this failure only during initialization (and only on a non-Azure VM). If single (or a few, if coming from different instances) failures still bother you, you can ignore this failure in your monitoring system. There is a reason these monitoring systems offer an option to ignore certain failures.

We don't want to bypass this call (or offer an option to do so), as it gives Azure VM information to the SDK. I would welcome any other suggestion you can provide to avoid this call on non-Azure VMs.
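
For completeness, the singleton usage being recommended is typically a one-line registration at startup. A minimal sketch, assuming ASP.NET Core and a placeholder configuration key:

    // Program.cs - register a single CosmosClient for the lifetime of the process, so the
    // one-time VM metadata probe (and connection warm-up) does not repeat per request.
    // "CosmosDb:ConnectionString" is a placeholder key.
    using Microsoft.Azure.Cosmos;

    var builder = WebApplication.CreateBuilder(args);

    builder.Services.AddSingleton(_ =>
        new CosmosClient(builder.Configuration["CosmosDb:ConnectionString"]));

    var app = builder.Build();
    app.Run();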

@wahyuen
Contributor

wahyuen commented Jun 22, 2024

I feel the point made by multiple people in this thread is still being missed. I doubt any of us are opposed to the fact that this gives the SDK useful information on Azure VMs. The contention is that it produces exception traces that are NOT obvious to developers, who then have to spend time diagnosing the issue, determining what it is, and potentially performing workarounds to suppress the log messages.

If you and your team feel this is an essential feature that should not be bypassed, then so be it; we are asking for a solution where no exception is propagated back up in non-Azure VM scenarios (the majority of which will be localhost or remote execution on some other VM). How you choose to handle this I will leave up to the team. I would assert that this is surely not the ONLY service within the Azure ecosystem performing this check, yet to my current knowledge it is the only one where consumers of the SDK need to deal with the exception being thrown.

I would put this to you another way. Imagine you have purchased a car and are at the mercy of the manufacturer's setup and software. You start your car every morning to go to work, and a large error shows up on your dashboard with warning lights going off. You have no idea what it is about; you do some research, you ask other people, perhaps you bring the car in for a service to identify the issue. You are then told that the manufacturer requires some 'useful telemetry' which is failing in your particular scenario, and you just have to deal with it because it is important to them. Perhaps you implement a workaround; perhaps you just choose to ignore the message. Perhaps next time, when a REAL error appears, you ignore that too, because you are so used to dealing with the noise your manufacturer told you to 'deal with'.

I hope this does not come across as causing offence, and I apologize if it does. I'm attempting to illustrate the perspective of a consumer of your SDK and the effect it has on us as developers. As it stands, this scenario does nothing to empower developers (often the catch-cry at Microsoft developer conferences) but instead causes confusion and frustration in the community.

@marcopelegrini
Author

marcopelegrini commented Jun 22, 2024

I stand, bow to @wahyuen and applaud their comment. It is a very eloquent and polite way to ask the Azure Cosmos DB team to listen to the people they're (supposedly) working to serve... the developers, the ones who choose which Microsoft tools to use in their projects.

DSAT is a big deal in this field, and when we get slapped in the face because a decision was made to ignore developers asking for a very simple way to make our lives easier, it feels like we're not important.

I don't know about you, but I do care when my application has workarounds, hacks, or throws errors when it starts. I find it shameful when I open a browser developer console and see a bunch of errors that "are normal", and I do care if my monitors have exception rules when they shouldn't. I believe my services are capable of running exception-free and that my logs can be clean, concise, and trustworthy in telling me what my application is doing wrong. I value engineering excellence and quality in my code. I believe we're still capable of producing code with the quality we used to 20 years ago.

Please be empathetic and remember that your company should value diversity, including the diversity of thinking differently than you do. Cheers

@ealsur
Member

ealsur commented Jun 24, 2024

@wahyuen @marcopelegrini Thanks for your feedback. Since we are all developers, can we try to focus on the technical aspects of this issue?

Let's first try to understand how it is affecting your application:

  • The request happens in the background and is not inline with any of the actual operations you execute on the SDK, so whether it fails or has high latency, it cannot be impacting your operations. If it is, can you please share details of how your operations are failing or, if they have high latency, share the Diagnostics (a capture sketch follows this comment)?
  • The client SDK handles the failure if it fails; there is a try/catch that prevents the exception from reaching your application. The technical aspect is that Application Insights hooks itself into all HTTP requests automatically. It is completely understandable that this causes alerts in Application Insights, but the question here is: if this happens once during initialization, do you see a high number of these? If so, it would indicate that the application might not be following the singleton pattern.

Could you please share answers to the questions asked above, so we can better understand the context?
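
If it helps anyone gathering the Diagnostics mentioned above, capturing them per operation looks roughly like this. The container, logger, item id, partition key, and the one-second threshold are placeholders, not guidance from the SDK team:

    using System;
    using System.Threading.Tasks;
    using Microsoft.Azure.Cosmos;
    using Microsoft.Extensions.Logging;

    // Sketch: capture Cosmos DB diagnostics for a slow operation so they can be shared.
    static async Task ReadWithDiagnosticsAsync(Container container, ILogger logger, string id, string partitionKey)
    {
        ItemResponse<dynamic> response = await container.ReadItemAsync<dynamic>(id, new PartitionKey(partitionKey));
        if (response.Diagnostics.GetClientElapsedTime() > TimeSpan.FromSeconds(1))
        {
            logger.LogWarning("Slow Cosmos DB read: {Diagnostics}", response.Diagnostics.ToString());
        }
    }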

@jwyza-pi

I definitely agree with @wahyuen and others on this one. It feels like a very valid request to simply not make this call when the application is not on an Azure VM. It should be easy to detect what kind of environment the application is running in without resorting to an HTTP call (the Azure environment variables should contain some indication of this). Being forced to suppress this error in every single application, if we don't want it generating false positives, is not a great answer, especially when we may not be able to control whether calls get excluded or not.

Right now, what it means is that every developer looking at diagnostic information must know that this error will always show up and should be ignored. It's tribal knowledge that is not intuitive and slows the process down every time a new dev runs into it (or forgets and runs into it again six months later) and has to track down what ultimately ends up being a non-issue.

I think @ealsur your response is a perfect example of why this is an issue: developers see this data and (rightly) assume it's harming the application, because that's how it looks in Azure's App Insights (and also in many others, like Datadog). They don't understand why this is happening and think it's a bug (which, from our perspective, it is). Most likely the response times noted above aren't actually the application's response times; it only appears that way because of the non-inline nature of the call. This means that if you look at the P99 latency on calls where this happens, the P99 is going to look like 20+ seconds, even though the call likely returned in milliseconds. So not correcting this is ultimately going to keep developers coming back and opening tickets complaining about the performance/logging/diagnostic impact over and over again.

Consider: the NuGet package for this repo has been downloaded 90M times. If even 1% of those downloads corresponds to a unique installation, that's 900k installs. If even 1% of those installs causes a developer to spend 30 minutes tracking this problem down, that's 4,500 hours of wasted time ((900k * 0.01) * 30 min / 60 min). All wasted over a simple "don't make this call when you're not in a VM".

TL;DR:

  • This impacts diagnostics by:
    • artificially inflating the duration of impacted calls, implying the application is much slower than it actually is when these occur.
    • showing timeout errors in the application's flow even though they are benign.
    • sending developers on a wild goose chase over a non-issue.
  • This impacts maintainers, who have to field and reply to tickets opened about this issue.

@marcopelegrini
Author

To add to the above, it also impacts local development. I don't want to run workarounds on my local machine to ignore failures like this (and please don't ask me to change how I handle HTTP errors; that's my choice).

PS: I won't spend more time giving examples and explanations of why this is a problem; it's alarming how obvious it is and that we're still getting pushback. It's not as if we're asking you to rewrite the entire SDK.

@AndreiSchillerChan

Hear, hear, @jwyza-pi - that is exactly the problem we had. Datadog shows a 20-second error timeout repeatedly for our given endpoint, which led to an all-hands catch-up to try to decipher whether this was killing our response times.

@wahyuen
Contributor

wahyuen commented Jun 25, 2024

Adding another case of how this is impacting teams: I was personally pulled in on another delivery team at a client engagement site where this issue cropped up.

The client has certain requirements for remote work in their systems, and our team was asked to set up development environments on Virtual Machines the customer provided. Due to a number of networking issues, the team had numerous debugging sessions just to be able to start up the application they were developing and were completely stumped, as they saw these exception messages (after days of seeing other network/startup errors due to network configuration) and incorrectly assumed the networking was still not fixed. After back and forth with the client's IT teams this was escalated, and I was asked to step in to assist in diagnosis, as it was now taking an inordinate amount of time to stand up the environment and the customer was not pleased.

Having already gone through this pain, it was a relatively easy diagnosis once I could see the error on the screen. Given that the team had been experiencing a number of network-related setup issues during the week, and these errors appeared as soon as startup occurred, it was completely reasonable for them to assume there were still networking issues at play.

[screenshot: startup log filled with the metadata-call exception messages]

Once the situation was explained, I was asked a very pointed question by the customer as to why these errors are still displayed and why the team went around in circles with IT about them, to which my response was simply, 'this is a known issue with Microsoft and the team just has to ignore it'. To say the customer was not particularly pleased is an understatement.

So, does it actually materially impact operation in terms of the actual call being made? No, probably not.
Do we see a high number of these? It depends what you mean by a high number; there are enough log messages to completely fill my Visual Studio debug window, but since we use a singleton instance of CosmosClient, it is isolated to startup initialization only.

Does this continue to cause time impacts, and even customer-engagement impacts? Completely, yes.
