googlecloud monitoring exporter drops data for transient failures: "Exporting failed. Dropping data" #31033
Comments
Unfortunately, it isn't safe to retry failed requests to CreateTimeSeries, as the API isn't idempotent. Retrying those requests will often result in additional errors because the timeseries already exists. The retry policy is determined by the client library here: https://github.com/googleapis/google-cloud-go/blob/5bfee69e5e6b46c99fb04df2c7f6de560abe0655/monitoring/apiv3/metric_client.go#L138. If you are seeing context deadline exceeded errors in particular, I would recommend increasing the timeout to ~45s. I am curious about the Authentication Backend Unavailable error. I haven't seen that one before. Is there anything unusual about your auth setup?
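To illustrate the non-idempotency concern in the comment above, here is a minimal toy sketch. `FakeMetricStore` and `write_with_blind_retry` are hypothetical names invented for this illustration; they are not the Cloud Monitoring API or the exporter's code. The sketch assumes a write that is applied server-side while the response is lost, so the client only sees a timeout.

```python
class DuplicatePointError(Exception):
    """Raised when a point for the same series and timestamp already exists."""


class FakeMetricStore:
    """Toy stand-in (hypothetical) for a write API that rejects duplicate points."""

    def __init__(self):
        self.points = {}
        self.drop_next_response = False

    def create_time_series(self, series, timestamp, value):
        key = (series, timestamp)
        if key in self.points:
            raise DuplicatePointError(f"point already written for {key}")
        self.points[key] = value  # the write is applied on the "server"
        if self.drop_next_response:
            self.drop_next_response = False
            # The client never sees the success, only a timeout.
            raise TimeoutError("context deadline exceeded")


def write_with_blind_retry(store, series, timestamp, value, attempts=2):
    """Retry on any error; return the names of the errors observed."""
    errors = []
    for _ in range(attempts):
        try:
            store.create_time_series(series, timestamp, value)
            break
        except (TimeoutError, DuplicatePointError) as exc:
            errors.append(type(exc).__name__)
    return errors


store = FakeMetricStore()
store.drop_next_response = True
observed = write_with_blind_retry(store, "cpu/usage", 1000, 0.42)
print(observed)  # ['TimeoutError', 'DuplicatePointError']
```

In this toy model the point is stored exactly once, yet the retry turns one transient error into two logged errors, which is the pattern the comment describes.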
Which shows that a
I note that in #19203 and #25900 retry_on_failure was removed from GMP and GCM because, according to #208, "retry was handled by the client libraries", but this was only the case for traces, not metrics (see comment). Could it be an oversight that retries were not enabled in the metrics client libraries when they were in Logging and Tracing?
While I understand that some failed requests should not be retried, there are some that should be: specifically the ones that say "Please try again"! For example the error
This is not trivial as there does not seem to be a config parameter to do this, so would involve editing the source code and compiling my own version... In any case, for a collector running in GCP, exporting to GCM, it
Not at all: running on GKE with workload identity, using a custom service account with appropriate permissions. If there were retries on Unavailable or Deadline Exceeded, this would not be an issue of course.
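The selective retry requested here (retry only on Unavailable or Deadline Exceeded) could be sketched as follows. This is an illustration of the idea only, not the exporter's or the client library's behaviour; `ExportError`, `export_with_retry`, and the `send` callable are hypothetical, and the status names are the standard gRPC code names. Note that the first comment in this thread warns that retrying CreateTimeSeries is not safe because the API is not idempotent.

```python
import time

# Status codes this sketch treats as transient (an assumption for illustration).
RETRYABLE = {"UNAVAILABLE", "DEADLINE_EXCEEDED"}


class ExportError(Exception):
    """Hypothetical export failure carrying a gRPC-style status code name."""

    def __init__(self, code, message=""):
        super().__init__(f"{code}: {message}")
        self.code = code


def export_with_retry(send, max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Call send(); retry with exponential backoff only on retryable codes."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ExportError as exc:
            # Give up on non-transient codes or when attempts are exhausted.
            if exc.code not in RETRYABLE or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"flaky": 0, "invalid": 0}


def flaky_send():
    """Fails twice with UNAVAILABLE, then succeeds."""
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise ExportError("UNAVAILABLE", "Please try again")
    return "ok"


def invalid_send():
    """Always fails with a non-retryable code."""
    calls["invalid"] += 1
    raise ExportError("INVALID_ARGUMENT", "bad request")


result = export_with_retry(flaky_send)

try:
    export_with_retry(invalid_send)
    invalid_raised = False
except ExportError:
    invalid_raised = True
```

The transient failure is retried until it succeeds on the third attempt, while the non-retryable error is raised after a single attempt.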
No. This was very intentional. It was always wrong to enable retry_on_failure for metrics when using the GCP exporter, and resulted in many complaints about log spam, since a retried request nearly always fails on subsequent requests as well.
Use:
    exporters:
      googlecloud:
        timeout: 45s
Sorry, it looks like that option isn't documented. We use the standard TimeoutSettings (https://github.com/open-telemetry/opentelemetry-collector/blob/f5a7315cf88e10c0bce0166b35d9227727deaa61/exporter/exporterhelper/timeout_sender.go#L13) in the exporter.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping Pinging code owners: See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Hi @dashpole (Created separate issue as well) I am experiencing transient otel-collector failures for exporting Trace batches. e.g.: I have:
I have tried increasing the timeout to 45 seconds, as described here, and decreasing the batch size from 200 to 100, as suggested here. Neither approach has produced any statistically significant improvement. Stacktrace:
Any ideas on what to do?
Component(s)
exporter/googlecloud
What happened?
Description
When Google Cloud Monitoring exporter fails to export metrics to Google Cloud Monitoring, it drops the data. This occurs even for transient errors where the attempt should be retried.
Steps to Reproduce
Configure collector, export demo metrics.
Expected Result
Metrics are reliably exported to Google Cloud Monitoring
Actual Result
Metrics are dropped for transient errors (such as "Authentication unavailable", when the auth cookie expires and needs to be refreshed).
Collector version
0.93.0
Environment information
Environment
GKE
OpenTelemetry Collector configuration
Log output
Additional context
No response