Skip VmMetadataApiHandler when running outside a VM #4187
Comments
@marcopelegrini do you see any impact due to this? This call happens only at initialization and only once in the lifetime of a client instance.
@sourabh1007 I don't want to see an error every time I start my application. I follow a zero-warning/error build-and-run methodology, and this is preventing my validations from passing.
In the SDK, we are logging it at azure-cosmos-dotnet-v3/Microsoft.Azure.Cosmos/src/Telemetry/VmMetadataApiHandler.cs, line 92 in c6918b4.
It looks like you are using a custom handler, and this exception is printed by that handler.
@sourabh1007 That's correct, my handler logs any unsuccessful request, and I'd like to keep it that way.
As I mentioned above, the SDK makes sure this does not get logged as an error. If you are using a custom handler, you can add a condition there to skip this request's failure.
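(For context, a minimal sketch of the kind of filtering being suggested here, assuming a custom DelegatingHandler is what logs the failures; the handler and its logging are illustrative, not part of the SDK:)

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative custom handler: logs failed requests, but lets the
// IMDS probe at 169.254.169.254 fail silently, since the SDK expects
// and handles that failure itself outside Azure.
public class FilteredLoggingHandler : DelegatingHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        bool isImdsProbe = request.RequestUri?.Host == "169.254.169.254";
        try
        {
            return await base.SendAsync(request, cancellationToken);
        }
        catch (HttpRequestException ex) when (!isImdsProbe)
        {
            // Log genuine failures only; IMDS probe exceptions bypass this
            // catch entirely thanks to the exception filter above.
            Console.Error.WriteLine($"Request failed: {request.RequestUri}: {ex.Message}");
            throw;
        }
    }
}
```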
@sourabh1007 Logging is a side effect of this feature. I've never asked you not to log it; I've asked you to offer a way to not send an invalid request for the environment in which the SDK is being used. You're asking me to couple my HTTP client to the CosmosDB client and add a condition for a request that may or may not be invalid... can you imagine how architecturally incorrect that is? I'm requesting a simple toggle or a way to override this behaviour, which, if it weren't implemented using static accessors, I would have been able to do myself.
@marcopelegrini - I think there are two reasons why there is no intent to change the behavior as you requested:

1. In production on Azure VMs, this call succeeds and provides instance information that is useful to the SDK.
2. Outside Azure, the failure is a normal, expected exception that the SDK itself handles.
@sourabh1007 I agree with your first point when considering a production environment, although I don't think that can be used as a reason to prevent the invalid request in controlled environments, which is my case. I disagree with the "this is a normal exception" approach... no exception is normal, and if it can be avoided, it should be. That's why it's called an exception. I'm not alone, by the way: https://stackoverflow.com/questions/73235961/azure-cosmos-db-client-throws-httprequestexception-attempt-was-made-to-access A library should always give its consumers control over the behaviours that influence their environment, and in this case the SDK is not giving me that control, so I cannot avoid an unwanted request. There's virtually no cost to adding a flag to the config that skips this call in specific scenarios.
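(For illustration, a sketch of the kind of toggle being requested; the `DisableVmMetadataRequest` property is hypothetical and does not exist in `CosmosClientOptions` today:)

```csharp
using Microsoft.Azure.Cosmos;

// Hypothetical flag - NOT part of the actual SDK; shown only to make the request concrete.
var options = new CosmosClientOptions
{
    // Proposed opt-out: skip the Azure Instance Metadata Service call at initialization.
    DisableVmMetadataRequest = true
};

var client = new CosmosClient("<account-endpoint>", "<account-key>", options);
```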
It is not about the cost of implementation. This call is not related to Cosmos DB; it is an Instance Metadata Service call to get Azure instance information (ref https://learn.microsoft.com/en-us/azure/virtual-machines/instance-metadata-service?tabs=linux). The SDK just makes this call to get VM information. There are a few things you can do: filter this request out in your custom handler, or ignore this endpoint's failure in your monitoring system.
I don't think there is any impact even if you have to live with it, apart from one exception stack trace in your log, which appears only once during application restart. It is neither noisy nor big enough to bloat your log file. I don't see any reason to add a new config option just to avoid it.
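(For reference, a sketch of the kind of IMDS request involved, based on the linked documentation; the exact request the SDK builds may differ:)

```csharp
using System;
using System.Net.Http;

// Query the Azure Instance Metadata Service. The link-local address below is
// only reachable from inside an Azure VM; anywhere else the call fails.
using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };
using var request = new HttpRequestMessage(
    HttpMethod.Get,
    "http://169.254.169.254/metadata/instance?api-version=2021-02-01");
request.Headers.Add("Metadata", "true"); // required header per the IMDS docs

HttpResponseMessage response = await http.SendAsync(request);
Console.WriteLine(await response.Content.ReadAsStringAsync()); // JSON instance metadata
```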
@sourabh1007 We're going to have to agree to disagree here, as it looks like I won't convince you anyway. I appreciate your time and suggestions, but I prefer to keep my architecture clean rather than work around the shortcomings of my dependencies. I'll find another way.
Stumbled upon this after a few hours of investigation, wondering why we were seemingly randomly getting exceptions during startup. This is much more of a concern because it is not easily apparent WHY we get these errors (a bunch of SocketExceptions/HttpExceptions). Unfortunately, it appears to be yet another case of Microsoft developers not listening to the community. A number of other Azure libraries give the user the flexibility to configure and change things for local development (see Azure.Identity, DefaultAzureCredential and the exclusion of different methods of authentication). You could arguably say that has zero effect in a production environment where Managed Identities are used, yet it makes a HUGE difference in developer experience on a local platform.
I wanted to come and say this was also impacting us in a negative way, specifically around diagnosing problems within services. Tracing systems such as Datadog report this as part of their flame graphs and show minute-long timeouts whenever the SDK decides it needs to make this call. There is precedent in other libraries for making decisions about the internal workings of the code based on the environment in which they run. Why not simply check whether this call should happen, rather than blindly assuming it will work and leading developers off on wild goose chases trying to understand why random errors are showing up in their logs?
From https://stackoverflow.com/a/76841727/3925707:

```csharp
using System.Reflection;
using Microsoft.Azure.Cosmos;

// Reflection workaround: mark the handler as already initialized so the
// SDK never issues the IMDS request. This relies on private SDK internals
// and can break on any SDK update.
var assembly = Assembly.GetAssembly(typeof(CosmosClient));
var type = assembly?.GetType("Microsoft.Azure.Cosmos.Telemetry.VmMetadataApiHandler");
var field = type?.GetField("isInitialized", BindingFlags.Static | BindingFlags.NonPublic);
field?.SetValue(null, true); // static field: the instance argument is ignored
```

You probably shouldn't do this... but it bypasses the metadata call and the error doesn't occur.
This has negatively impacted my team as well. Our dependency-failure alerts started firing when 100% of these requests started failing. Unfortunately, we will have to update our alerting to ignore this endpoint. Microsoft should address this issue.
Same here. We got a 4-second delay on an endpoint due to this timeout.
I am sorry you are facing this issue. Can you please explain how the latency of this API is affecting you? If it is affecting your application's performance, please open a new GitHub issue with as many details as possible, and I will look into it. For the alerting experience reported here, please use a singleton client instance; you should see this failure only during initialization (and only if you are on a non-Azure VM). If a single failure (or a few, if they come from different instances) still bothers you, you can ignore it in your monitoring system. There is a reason these monitoring systems have an option to ignore some failures. We don't want to (or provide an option to) bypass this call, as it gives the SDK Azure VM information. I would welcome any other suggestion you can provide to avoid this call on a non-Azure VM.
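(For completeness, a minimal sketch of the singleton registration being suggested, assuming Microsoft.Extensions.DependencyInjection is in use; endpoint and key are placeholders:)

```csharp
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// One CosmosClient per process: initialization work (including the one-time
// IMDS probe) happens once, not on every request.
services.AddSingleton(_ => new CosmosClient("<account-endpoint>", "<account-key>"));
```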
I feel as if the point made by multiple people in this thread is still being missed. I doubt any of us are opposed to the fact that this gives useful Azure VM information to the SDK. The contention is that this throws exception traces that are NOT obvious back to developers, who then have to spend time diagnosing, determining what the issue is, and potentially performing workarounds to suppress log messages. If you and your team feel this is an essential feature that should not be bypassed, then so be it; we are asking, then, for a solution where no exception is propagated back up in non-Azure-VM scenarios (the majority of which will be localhost or remote execution on some other VM). How you choose to handle this I will leave up to the team. I would assert that surely this is not the ONLY service within the Azure ecosystem performing this check, yet this, to my current knowledge, is the only case where consumers of the SDK need to deal with the exception being thrown.

I would put this to you another way. Imagine you have purchased a car and, as such, are at the mercy of the car manufacturer and must use their setup and software. Imagine you start your car every morning about to go to work, but a large error shows up on your dashboard and you have nothing but warning lights going off. You have no idea what this is about; you do some research, you ask other people, you perhaps bring your car in for a service to identify the issue. You are then told that your manufacturer requires some 'useful telemetry', but it is failing for your particular scenario. You just have to deal with it because it is important to them. Perhaps you implement a workaround; perhaps you just choose to ignore the message. Perhaps next time, when a REAL exception is thrown, you ignore that too, because you are so used to dealing with exception noise after your manufacturer told you to 'deal with it'.

I hope this does not come across as causing offence, and I apologize if it does. I'm attempting to illustrate the perspective of a consumer of your SDK and the effects it has on us as developers. As it stands, this scenario does nothing to empower developers (often the catch-cry of Microsoft at developer conferences) but instead causes confusion and frustration in the community.
I stand, bow to @wahyuen and applaud their comment. It is a very eloquent and polite way to ask the Azure CosmosDB team to listen to the people they're (supposedly) working to serve... the developers, the ones who choose which Microsoft tools to use in their projects. DSAT is a big deal in this field, and when we get slapped in the face because a decision was made to ignore developers asking for a very simple way to make our lives easier, it feels like we're not important. I don't know about you, but I do care when my application has workarounds, hacks, or throws errors when it starts. I find it shameful when I open a browser developer console and see a bunch of errors that "are normal", and I do care if my monitors have exception rules when they shouldn't. I do believe that my services are capable of running exception-free and that my logs can be clean, concise, and trustworthy in telling me what my application is doing wrong. I do value engineering excellence and quality in my code. I do believe that we're still capable of producing code with the quality we used to 20 years ago. Please be empathetic and remember that your company should value diversity, including the diversity of thinking differently than you. Cheers
@wahyuen @marcopelegrini Thanks for your feedback. Since we are all developers, we can try to focus on the technical aspects of this issue, correct? Let's first try to understand how it is affecting your application. Could you please share answers to the questions that were asked previously, so we can better understand the context?
I definitely agree with @wahyuen and others on this one. It feels like a very valid request to simply not make this call when the application is not on an Azure VM. It should be easily detectable what kind of environment the application is running in without having to resort to an HTTP call to find out (the Azure environment variables should contain some indication of this). Being forced to suppress this error in every single application if we don't want it generating false positives is not a great answer, especially when we may not be able to control whether calls get excluded.

Right now, what it means is that every developer who looks at diagnostic information must know that this error will always show up and should be ignored. It's tribal knowledge that is not intuitive, and it slows the process down every time a new dev runs into it (or forgets and runs into it again six months later) and has to track down what ultimately ends up being a non-issue.

I think @ealsur your response is a perfect example of why this is an issue: developers see this data and (rightly) assume it's harming the application, because that's how it looks in Azure's App Insights (and in many other tools, like Datadog). They don't understand why it is happening and think it's a bug (which, from our perspective, it is). Most likely the response times noted above aren't actually the application's response time; it only appears that way because of the non-inline nature of the call. What this means is that if you look at the P99 latency on calls where this happens, it's going to look like the P99 is 20+ seconds, even though the call likely returned in milliseconds. So not correcting this is ultimately going to keep developers coming back and opening tickets complaining about the performance/logging/diagnostic impact over and over again.

Consider: the NuGet package for this repo has been downloaded 90M times. If even 1% of those downloads corresponds to a unique installation, that's 900k installs. If even 1% of those installs causes a developer to spend 30 minutes tracking this problem down, that's 4,500 hours of wasted time.

TL;DR: detect the environment (see the sketch below) and skip the call when it clearly isn't an Azure VM, or at least stop surfacing the failure as an exception; at this scale the current behavior wastes an enormous amount of developer time.
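(A sketch of the environment check suggested above, with the caveat that the variables below are set by Azure App Service/Functions hosting rather than by bare Azure VMs, which is presumably why the SDK probes IMDS instead; the variable choice here is illustrative:)

```csharp
using System;

// Heuristic only: WEBSITE_INSTANCE_ID is set on Azure App Service and
// FUNCTIONS_WORKER_RUNTIME on Azure Functions. A bare Azure VM sets no
// comparable standard variable, so this cannot fully replace the IMDS probe.
static bool LooksLikeAzureHosting() =>
    Environment.GetEnvironmentVariable("WEBSITE_INSTANCE_ID") != null
    || Environment.GetEnvironmentVariable("FUNCTIONS_WORKER_RUNTIME") != null;

Console.WriteLine(LooksLikeAzureHosting()
    ? "Probably running on Azure hosting"
    : "Probably not Azure hosting");
```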
To add to the above, it also impacts local development. I don't want to run workarounds on my local machine to ignore failures like this. (And please don't ask me to change how I handle HTTP errors; that's my choice.) PS: I won't spend more time giving examples and explanations of why this is a problem; it's alarming how obvious it is and yet we're getting pushback. It's not like we're asking you to rewrite the entire SDK.
Hear, hear, @jwyza-pi - that is exactly the problem we had. Datadog repeatedly showed a 20-second error timeout for a given endpoint, which led to an all-hands catch-up to try to decipher whether this was killing our response times.
Adding another case of how this is impacting us: I was personally pulled in on another delivery team at a client engagement site where this issue cropped up. The client has certain requirements for remote work into their systems, and our team was asked to set up development environments on virtual machines the customer provided. Due to a number of networking issues, the team had numerous debugging sessions just to be able to start up the application they are developing and were completely stumped, as they saw these exception messages (after days of seeing other network/startup errors due to network configuration) and incorrectly assumed the networking was still not fixed. After back and forth with the client's IT teams, this was escalated, and I was asked to step in to assist in diagnosis, as it was now taking an inordinate amount of time to stand up the environment and the customer was not pleased.

Having already gone through this pain, it was a relatively easy diagnosis once I could see the error on the screen. Given that the team had been experiencing a number of network-related setup issues during the week, with these errors appearing as soon as startup occurred, it was completely reasonable for them to assume there were still networking issues at play. Once the situation was explained, I was asked a very pointed question by the customer as to why these errors still display and why the team went around in circles with IT about it, to which my response was simply: 'this is a known issue with Microsoft, and the team just has to ignore it'. To say the customer was not particularly pleased is an understatement.

So, does it actually materially impact operation in terms of the call it is making? No, probably not. Does it continue to cause time impacts and even customer-engagement impacts? Completely, yes.
Describe the bug
The CosmosDB client tries to obtain VM metadata that only exists when the service is running on an Azure VM.
Either:
- the SDK should detect that it is not running on an Azure VM and skip the call, or
- a configuration option should be provided to disable it.
To Reproduce
Steps to reproduce the behavior. If you can include code snippets or links to repositories containing a repro of the issue, that can help us detect the scenario and speed up the resolution.
Run an app with the CosmosDB client enabled and logging configured, outside of Azure.
Expected behavior
Either:
- the SDK detects that it is not running on an Azure VM and does not make the metadata call, or
- a configuration flag lets the consumer disable the call, so no exception surfaces in the logs.
Actual behavior
An exception is thrown and written to the logs.
Environment summary
SDK Version: 3.33.0
OS Version: macOS 14.1.1 (23B81)