googlecloud monitoring exporter drops data for transient failures: "Exporting failed. Dropping data" #31033

nielm · 2024-02-05T13:57:12Z

Component(s)

exporter/googlecloud

What happened?

Description

When the Google Cloud Monitoring exporter fails to export metrics to Google Cloud Monitoring, it drops the data. This occurs even for transient errors where the attempt should be retried.

Steps to Reproduce

Configure collector, export demo metrics.

Expected Result

Metrics are reliably exported to Google Cloud Monitoring

Actual Result

Metrics are dropped for transient errors (such as "Authentication unavailable", when the auth cookie expires and needs to be refreshed).

Collector version

0.93.0

Environment information

Environment

GKE

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  resourcedetection:
    detectors: [gcp]
    timeout: 10s
    override: false

  k8sattributes:
  k8sattributes/2:
      auth_type: "serviceAccount"
      passthrough: false
      extract:
        metadata:
          - k8s.pod.name
          - k8s.namespace.name
          - k8s.container.name
        labels:
          - tag_name: app.label.component
            key: app.kubernetes.io/component
            from: pod
      pod_association:
        - sources:
            - from: resource_attribute
              name: k8s.pod.ip
        - sources:
            - from: connection


  batch:
    # batch metrics before sending to reduce API usage
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 5s

  memory_limiter:
    # drop metrics if memory usage gets too high
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20

exporters:
  debug:
    verbosity: basic
  googlecloud:
    metric:
      instrumentation_library_labels: false
      service_resource_labels: false

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [k8sattributes, batch, memory_limiter, resourcedetection]
      exporters: [googlecloud]

Log output

2024-02-02T19:49:38.434Z	error	exporterhelper/common.go:95	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "googlecloud", "error": "rpc error: code = Aborted desc = Errors during metric descriptor creation: {(metric: workload.googleapis.com/cloudspannerecosystem/autoscaler/scaler/scaling-failed, error: Too many concurrent edits to the project configuration. Please try again.)}.", "dropped_items": 4}

2024-02-02T20:24:44.897Z	error	exporterhelper/common.go:95	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "googlecloud", "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded", "dropped_items": 12}

2024-02-05T07:43:53.416Z	error	exporterhelper/common.go:95	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "googlecloud", "error": "rpc error: code = Unavailable desc = Authentication backend unavailable.", "dropped_items": 17}

Additional context

No response

github-actions · 2024-02-05T13:57:29Z

Pinging code owners:

exporter/googlecloud: @aabmass @dashpole @jsuereth @punya @damemi @psx95

See Adding Labels via Comments if you do not have permissions to add labels yourself.

dashpole · 2024-02-05T14:27:53Z

Unfortunately, it isn't safe to retry failed requests to CreateTimeSeries, as the API isn't idempotent. Retrying those requests will often result in additional errors because the timeseries already exists. The retry policy is determined by the client library here: https://github.com/googleapis/google-cloud-go/blob/5bfee69e5e6b46c99fb04df2c7f6de560abe0655/monitoring/apiv3/metric_client.go#L138.

If you are seeing context deadline exceeded errors in particular, I would recommend increasing the timeout to ~45s.

I am curious about the Authentication Backend Unavailable error. I haven't seen that one before. Is there anything unusual about your auth setup?

nielm · 2024-02-05T16:09:50Z

The retry policy is determined by the client library

Which shows that a CreateTimeSeries RPC is never retried for any condition.

I note that in #19203 and #25900 retry_on_failure was removed from GMP and GCM, because according to #208 "retry was handled by the client libraries", but this was only the case for traces, not metrics. (see comment)

Could this be an oversight that retries were not enabled in metrics client libraries when they were in Logging and Tracing?

While I understand that some failed requests should not be retried, there are some that should be: specifically ones that say "Please try again"!

For example, the error Too many concurrent edits to the project configuration. Please try again always happens when a counter is used for the first time in a project, or when a new attribute is added - it seems that GCM cannot cope with a CreateTimeSeries which updates a metric.

If you are seeing context deadline exceeded errors in particular, I would recommend increasing the timeout to ~45s.

This is not trivial, as there does not seem to be a config parameter to do this, so it would involve editing the source code and compiling my own version... In any case, for a collector running in GCP, exporting to GCM, it

Authentication Backend Unavailable error: Is there anything unusual about your auth setup?

Not at all: running on GKE with workload identity, using a custom service account with appropriate permissions.

If there were retries on Unavailable or Deadline Exceeded, this would not be an issue of course.

dashpole · 2024-02-05T16:49:27Z

Could this be an oversight that retries were not enabled in metrics client libraries when they were in Logging and Tracing?

No. This was very intentional. It was always wrong to enable retry_on_failure for metrics when using the GCP exporter, and it resulted in many complaints about log spam, since a retried request nearly always fails on subsequent attempts as well.

For example, the error Too many concurrent edits to the project configuration. Please try again always happens when a counter is used for the first time in a project, or when a new attribute is added - it seems that GCM cannot cope with a CreateTimeSeries which updates a metric.

The Too many concurrent edits to the project configuration. error is actually an error from CreateMetricDescriptor, and will be retried next time a metric with that name is exported. It does not affect the delivery of timeseries information, and is only needed to populate the unit and description.

To increase the timeout, use:

exporters:
  googlecloud:
    timeout: 45s

Sorry, it looks like that option isn't documented. We use the standard TimeoutSettings: https://github.com/open-telemetry/opentelemetry-collector/blob/f5a7315cf88e10c0bce0166b35d9227727deaa61/exporter/exporterhelper/timeout_sender.go#L13 in the exporter.
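
For context, here is a minimal sketch of how that setting might sit alongside the exporter configuration from the original report. The `metric` block is copied from the issue; only `timeout` is new, and 45s is the value suggested above rather than a tuned recommendation.

```yaml
exporters:
  googlecloud:
    # Per-request export timeout, from the standard exporterhelper
    # TimeoutSettings referenced above. 45s is the value suggested in
    # this thread, not a measured optimum.
    timeout: 45s
    metric:
      instrumentation_library_labels: false
      service_resource_labels: false
```

As stated earlier in the thread, this only gives slow requests longer to complete; it does not make the exporter retry failed CreateTimeSeries calls.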

github-actions · 2024-04-08T03:29:37Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

exporter/googlecloud: @aabmass @dashpole @jsuereth @punya @damemi @psx95

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions · 2024-06-07T05:19:57Z

This issue has been closed as inactive because it has been stale for 120 days with no activity.

AkselAllas · 2024-08-30T06:30:59Z

Hi @dashpole (Created separate issue as well)

I am experiencing transient otel-collector failures when exporting trace batches.

I have:

    traces/2:
      receivers: [ otlp ]
      processors: [ tail_sampling, batch ]
      exporters: [ googlecloud ]

I have tried increasing the timeout to 45 seconds, as described here, and decreasing the batch size from 200 to 100, as suggested here. Neither approach has given any observable, statistically significant improvement.
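
For reference, those two changes might look roughly like the following. The actual configuration is not shown in this comment, so this is only a sketch: the `send_batch_max_size` line and the batch `timeout` value are assumptions carried over from the configuration in the original report.

```yaml
processors:
  batch:
    # Batch size reduced from 200 to 100, as described above.
    send_batch_size: 100
    # Assumed: the hard cap is lowered to match the new batch size.
    send_batch_max_size: 100
    # Assumed: flush interval left at the value from the original report.
    timeout: 5s

exporters:
  googlecloud:
    # Export timeout raised to 45s, as suggested earlier in the thread.
    timeout: 45s
```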

Stacktrace:

"caller":"exporterhelper/queue_sender.go:101", "data_type":"traces", "dropped_items":200, "error":"context deadline exceeded", "kind":"exporter", "level":"error", "msg":"Exporting failed. Dropping data.", "name":"googlecloud", "stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
	go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/queue_sender.go:101
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
	go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1

Any ideas on what to do?

nielm added bug Something isn't working needs triage New item requiring triage labels Feb 5, 2024

github-actions bot added the exporter/googlecloud label Feb 5, 2024

dashpole removed the needs triage New item requiring triage label Feb 5, 2024

dashpole self-assigned this Feb 5, 2024

github-actions bot mentioned this issue Feb 6, 2024

Weekly Report: 2024-01-30 - 2024-02-06 #31055

Closed

github-actions bot added the Stale label Apr 8, 2024

github-actions bot added the closed as inactive label Jun 7, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 7, 2024

AkselAllas mentioned this issue Sep 2, 2024

Improve transient errors in googlecloud trace exporter batch write spans. #34957

Open
