Metrics stop exporting at seemingly random times every week or so #5729
Comments
Please see if you can get internal logs (Warning and above): https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry/README.md#self-diagnostics Are you missing all metrics from the server, or just a subset of metrics? (There are metric cardinality caps implemented, which could explain the behavior, but if every metric stops at the same time, that is unlikely to be the cause.) |
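For anyone following along, here is a minimal sketch of turning on self-diagnostics at Warning level. Normally you would just drop an `OTEL_DIAGNOSTICS.json` file into the process's working directory by hand; writing it from code at startup, as below, is only one option. The field names come from the self-diagnostics README linked above, while the `FileSize` and `LogDirectory` values are assumptions you should adjust.

```csharp
using System;
using System.IO;

// Self-diagnostics settings as described in the README linked above.
// FileSize is in KiB; the values here are illustrative assumptions.
const string config = """
{
    "LogDirectory": ".",
    "FileSize": 32768,
    "LogLevel": "Warning"
}
""";

// The SDK looks for this file in the process's current working directory.
var configPath = Path.Combine(Directory.GetCurrentDirectory(), "OTEL_DIAGNOSTICS.json");
File.WriteAllText(configPath, config);
Console.WriteLine($"Wrote {configPath}");
```

Deleting the file again should turn the internal logging back off.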
Okay, we found a newly failing server and the diagnostic log helped a great deal. It appears we are timing out:
Here's the full entry, repeated every minute:
So now I have two questions:
Thank you again for your help! |
For gRPC, there is not much option to customize. There are open issues/PRs for related settings that will allow exposing this, e.g. #2009
I don't think any queuing occurs today. If a batch is lost due to a gRPC timeout, it's not saved for retry later; instead, the next batch is tried. |
You can increase the timeout period by setting TimeoutMilliseconds
|
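A minimal sketch of what raising the timeout might look like when wiring up the OTLP metrics exporter. The meter name and endpoint are placeholders, not values from this thread, and the snippet assumes the `OpenTelemetry` and `OpenTelemetry.Exporter.OpenTelemetryProtocol` packages are referenced.

```csharp
using System;
using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Metrics;

// "MyCompany.MyApp" and the collector endpoint are hypothetical placeholders.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyApp")
    .AddOtlpExporter(exporterOptions =>
    {
        exporterOptions.Endpoint = new Uri("http://localhost:4317");
        exporterOptions.Protocol = OtlpExportProtocol.Grpc;

        // Raise the export timeout from its 10 s default to 30 s.
        exporterOptions.TimeoutMilliseconds = 30000;
    })
    .Build();
```

The same property can also be set through whatever options callback your application already uses, so there is no need to restructure existing setup code.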
Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself? |
Yes - it is used for setting the deadline time of a grpc call we make here. |
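To illustrate the idea (not the exporter's actual code), here is a small sketch of how a relative timeout becomes an absolute deadline for a call. The helper names are hypothetical and exist only for this example.

```csharp
using System;

// Sketch only: these helper names are hypothetical and are not the exporter's own code.
// Convert a relative timeout into the absolute UTC deadline a gRPC call is given.
static DateTime ComputeDeadline(DateTime nowUtc, int timeoutMilliseconds) =>
    nowUtc.AddMilliseconds(timeoutMilliseconds);

// A call still running at or after its deadline fails with DeadlineExceeded.
static bool IsDeadlineExceeded(DateTime nowUtc, DateTime deadlineUtc) =>
    nowUtc >= deadlineUtc;

var start = new DateTime(2024, 1, 1, 0, 0, 0, DateTimeKind.Utc);
var deadline = ComputeDeadline(start, 30000);

Console.WriteLine(IsDeadlineExceeded(start.AddSeconds(10), deadline)); // False
Console.WriteLine(IsDeadlineExceeded(start.AddSeconds(31), deadline)); // True
```

Because the deadline is fixed when the call starts, everything the call has to do (connecting, sending, waiting for the response) must fit inside that one window.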
Thanks! It looks like #1735 is still open, which states that we don't really enforce the timeouts, but I could be wrong (or it's only for traces!). |
Great info! However, it looks like the default is 10s, so it worries me that we're exceeding that -- especially if each call only includes the latest metrics. I could increase it to 20s or 30s, but I wonder if I'm just doing something wrong. Do you have any suggestions for diagnosing why my sends are exceeding 10s, or just how much data I'm sending? Or is this more likely a network issue between the servers and the collector? Thanks again, and feel free to close this issue if you feel you've provided all the info you can! |
This is only true if using Delta. If using Cumulative, then everything from start will always be exported... Are you using Delta or Cumulative? |
We are indeed using Delta mode: reader.TemporalityPreference = MetricReaderTemporalityPreference.Delta; We have a handful of observables. You're suggesting that the time it takes to observe those metrics must be accounted for in the gRPC deadline? That's interesting. We've tried to make those calls quick, but it's certainly something we could take a closer look at -- that could also explain why our servers never recover from this condition. Any other ideas are most welcome, and thank you again for all the help! |
@ladenedge - Just to confirm, you don't have retries enabled, correct? It's odd that once the server hits DeadlineExceeded, it is not able to recover and continues to throw that error until restarted. |
I assume you're talking about retries via the HttpClient? If so, then no, I'm using the default factory. |
Also, to follow up on the observables: are observables actually queried during the exporter's network call? Looking over our handful of observable counters, they appear quick (eg. |
No. (If you were using Prometheus scraping, the observable callbacks would run in response to the scrape request, so they would contribute to the response time of the scrape itself.) In short, the observable callbacks are not at play in your case, as you are using a push exporter. (Sorry I confused you with the mention of observables :( ) |
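If you still want to confirm that your observable callbacks are quick, one option is to time them in isolation with a plain `MeterListener` from `System.Diagnostics.Metrics`, with no exporter involved. This is only a sketch: the meter and instrument names are hypothetical, and the callback body stands in for whatever your real observables do.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical meter and instrument names, for illustration only.
using var meter = new Meter("Example.Meter");
var callbackInvocations = 0;
meter.CreateObservableCounter<long>("example.queue.length", () =>
{
    callbackInvocations++; // stand-in for the real observation work
    return 42L;
});

using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Subscribe only to instruments from the meter under test.
    if (instrument.Meter.Name == "Example.Meter")
    {
        l.EnableMeasurementEvents(instrument);
    }
};

long lastObserved = 0;
listener.SetMeasurementEventCallback<long>(
    (instrument, value, tags, state) => lastObserved = value);
listener.Start();

// Invoke every enabled observable callback once and time the whole pass.
var stopwatch = Stopwatch.StartNew();
listener.RecordObservableInstruments();
stopwatch.Stop();

Console.WriteLine($"Observed {lastObserved} in {stopwatch.ElapsedMilliseconds} ms");
```

A pass that takes more than a few milliseconds would be worth a closer look, though as noted above it should not affect the gRPC deadline with a push exporter.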
I'm facing the same issue. |
For what it's worth, we are increasing our timeout to 30s to see if that makes any difference. (But this change won't be deployed for a week or so, so.. not very helpful yet, heh.) |
I'd like to try and resurrect this issue because we continue to face this problem. To recap, we have a pool of 15 servers that live in three datacenters and AWS. Regardless of the location, they will sometimes stop writing metrics, and they are unable to recover. Restarting our app (a Windows service) resolves the problem -- for a while. OTEL diagnostics show a DeadlineExceeded error, repeated ad nauseam:
This occurs seemingly at random, and not necessarily at high-traffic periods. Here, for example, is a recent look at a metric that stopped writing during a fairly low-traffic period:
And here is a shot of another server from the same datacenter over the same period:
Also of interest: whatever is happening does not impact traces or logs. Here is a shot of the faulty server's traces over the same period:
Likewise, logs come in without issue over the same period, though we are using Serilog's OTEL Sink, so it's fairly separate.
Since last I checked in, we have raised our timeout to 30s: exporterOptions.TimeoutMilliseconds = 30000; This timeout change does not seem to have made any difference.
Some conclusions:
Does anyone have any other suggestions for debugging or working around this issue? |
What is the question?
We're using metrics extensively in our .NET 8 application which is deployed to a dozen or so servers. Once in a while -- say, once a week -- a single server will stop exporting to our OTel Collector. The other servers continue to work fine, and restarting our application fixes the problem.
Thank you for any help!
Additional context
Application is .NET 8 with the following OTel packages: