Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics stop exporting at seemingly random times every week or so #5729

Open
ladenedge opened this issue Jun 29, 2024 · 17 comments
Open

Metrics stop exporting at seemingly random times every week or so #5729

ladenedge opened this issue Jun 29, 2024 · 17 comments
Labels
metrics Metrics signal related question Further information is requested

Comments

@ladenedge
Copy link

What is the question?

We're using metrics extensively in our .NET 8 application which is deployed to a dozen or so servers. Once in a while -- say, once a week -- a single server will stop exporting to our OTel Collector. The other servers continue to work fine, and restarting our application fixes the problem.

  • What might be going wrong here?
  • Are there any configuration options that might help?
  • How can I debug/diagnose this problem?

Thank you for any help!

Additional context

Application is .NET 8 with the following OTel packages:

<PackageReference Include="OpenTelemetry.Contrib.Instrumentation.AWS" Version="1.0.2" />
<PackageReference Include="OpenTelemetry.Exporter.Console" Version="1.8.1" />
<PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.8.1" />
<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.8.1" />
<PackageReference Include="OpenTelemetry.Instrumentation.EventCounters" Version="1.5.1-alpha.1" />
<PackageReference Include="OpenTelemetry.Instrumentation.GrpcNetClient" Version="1.6.0-beta.2" />
<PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.8.1" />
@ladenedge ladenedge added the question Further information is requested label Jun 29, 2024
@cijothomas
Copy link
Member

Please see if you can get internal logs (Warning and above) https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry/README.md#self-diagnostics

Are you missing all metrics from the server, or just a subset of metrics? (There is metric cardinality caps implemented which can explain the behavior, but if it is every metric stopping at the same time, unlikely to be cause).
Is metrics missing in Console too?

@cijothomas cijothomas added the metrics Metrics signal related label Jul 1, 2024
@ladenedge
Copy link
Author

Okay, we found a newly failing server and the diagnostic log helped a great deal. It appears we are timing out:

Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")

Here's the full entry, repeated every minute:

2024-07-04T12:43:18.7570236Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")
   at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at OpenTelemetry.Proto.Collector.Metrics.V1.MetricsService.MetricsServiceClient.Export(ExportMetricsServiceRequest request, CallOptions options)
   at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcMetricsExportClient.SendExportRequest(ExportMetricsServiceRequest request, DateTime deadlineUtc, CancellationToken cancellationToken)}

So now I have two questions:

  1. Is there some way to modify the timeout on the internal OTel gRPC calls?
  2. Why might this be happening after a week or so? Once we do timeout, is this likely to get worse due to the queuing of new metrics? (Ie. is there any chance of recovering from this condition?)

Thank you again for your help!

@cijothomas
Copy link
Member

Is there some way to modify the timeout on the internal OTel gRPC calls?

For gRPC, not much option to customize. There are open issues/prs for related settings that will allow exposing this. Eg: #2009

Why might this be happening after a week or so? Once we do timeout, is this likely to get worse due to the queuing of new metrics? (Ie. is there any chance of recovering from this condition?)

I don't think any "queue" up occurs today. If a batch is lost due to grpc timeout, its not saved for retry later, instead the next batch is tried.
(@vishweshbankwar to keep me honest here.)

@vishweshbankwar
Copy link
Member

vishweshbankwar commented Jul 8, 2024

Is there some way to modify the timeout on the internal OTel gRPC calls?

You can increase the timeout period by setting TimeoutMilliseconds

If a batch is lost due to grpc timeout, its not saved for retry later, instead the next batch is tried.

This is correct, even if the retries are enabled. #5436

@cijothomas
Copy link
Member

Is there some way to modify the timeout on the internal OTel gRPC calls?

You can increase the timeout period by setting TimeoutMilliseconds

Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?

@vishweshbankwar
Copy link
Member

Is there some way to modify the timeout on the internal OTel gRPC calls?
You can increase the timeout period by setting TimeoutMilliseconds

Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?

Yes - it is used for setting the deadline time of a grpc call we make here.

@cijothomas
Copy link
Member

Thanks!. Looks like #1735 is still open which state we don't really enforce the timeouts, but I could be wrong. (or its only for traces!)

@ladenedge
Copy link
Author

You can increase the timeout period by setting TimeoutMilliseconds

Great info! However, it looks like the default is 10s, so it worries me that we're exceeding that -- especially if each call only includes the latest metrics. I could increase it to 20s or 30s, but I wonder if I'm just doing something wrong. Do you have any suggestions for diagnosing why my sends are exceeding 10s, or just how much data I'm sending? Or is this more likely a network issue between the servers and the collector?

Thanks again, and feel free to close this issue if you feel you've provided all the info you can!

@cijothomas
Copy link
Member

especially if each call only includes the latest metrics.

This is only true if using Delta. If using Cumulative, then everything from start will always be exported... Are you using Delta or Cumulative?
Also, are there many Observable instruments with callbacks potentially taking a lot of time?

@ladenedge
Copy link
Author

We are indeed using Delta mode:

reader.TemporalityPreference = MetricReaderTemporalityPreference.Delta;

We have a handful of observables. You're suggesting that the time it takes to observe those metrics must be accounted for in the gRPC deadline? That's interesting. We've tried to make those calls quick, but it's certainly something we could take a closer look at -- that could also explain why our servers never recover from this condition.

Any other ideas are most welcome, and thank you again for all the help!

@vishweshbankwar
Copy link
Member

@ladenedge - Just to confirm, you don't have retries enabled, correct?

it's odd that once the server hits DeadlineExceeded, it is not able to recover from that and continues to throw that error until re-started.

@ladenedge
Copy link
Author

I assume you're talking about retries via the HttpClient? If so, then no, I'm using the default factory.

@ladenedge
Copy link
Author

Also, to follow up on the observables: are observables actually queried during the exporter's network call? Looking over our handful of observable counters, they appear quick (eg. Dictionary.Count) -- but is it even possible that they contribute to the missed deadline?

@cijothomas
Copy link
Member

are observables actually queried during the exporter's network call?

No. (If you were using Prometheus scraping, then the observables callbacks are done in response to scrape request, so they do contribute to response time of the scrape itself.) In short - observable callback is not at play in your case, as you are doing push exporter. (Sorry I confused you with the mention of observables :( )

@g3david405
Copy link

I face same issue

@ladenedge
Copy link
Author

For what it's worth, we are increasing our timeout to 30s to see if that makes any difference. (But this change won't be deployed for a week or so, so.. not very helpful yet, heh.)

@ladenedge
Copy link
Author

ladenedge commented Nov 26, 2024

I'd like to try and resurrect this issue because we continue to face this problem.

To recap, we have a pool of 15 servers that live in three datacenters and AWS. Regardless of the location, they will sometimes stop writing metrics, and they are unable to recover. Restarting our app (a Windows service) resolves the problem -- for a while.

OTEL diagnostics show a DeadlineExceeded error, repeated as nauseum:

2024-11-22T14:37:16.7591358Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")
   at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at OpenTelemetry.Proto.Collector.Metrics.V1.MetricsService.MetricsServiceClient.Export(ExportMetricsServiceRequest request, CallOptions options)
   at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcMetricsExportClient.SendExportRequest(ExportMetricsServiceRequest request, DateTime deadlineUtc, CancellationToken cancellationToken)}

This occurs seemingly at random, and not necessarily at high-traffic periods. Here, for example, is a recent look at a metric that stopped writing during a fairly low-traffic period:

image

And here is a shot of another server from the same datacenter over the same period:

image

Also of interest: whatever is happening does not impact traces or logs. Here is a shot of the faulty server's traces over the same period:

image

Likewise, logs come in without issue over the same period, though we are using Serilog's OTEL Sink, so it's fairly separate.

Since last I checked in, we have raised our timeout to 30s:

exporterOptions.TimeoutMilliseconds = 30000;

This timeout change does not seem to have made any difference.

Some conclusions:

  • We are writing to an OTEL Collector in AWS, but resetting the Collector does not resolve the problem. (Only resetting the Windows service seems to resolve it.)
  • It does not appear to be related to traffic or load, since the fault sometimes occurs at low traffic/load.
  • It does not appear to be related to our network, since other servers on the same network continue to work fine. Moreover, traces and logs from the faulty server also continue to work fine.

Does anyone have any other suggestions for debugging or working around this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
metrics Metrics signal related question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants