
[exporter/elasticsearchexporter] Push failures not reported in metrics #32302

Open
jleloup opened this issue Apr 10, 2024 · 18 comments
Labels
bug · exporter/elasticsearch · good first issue · help wanted

Comments

@jleloup

jleloup commented Apr 10, 2024

Component(s)

exporter/elasticsearch

What happened?

Description

I think the Elasticsearch exporter does not properly report failures in the otelcol_exporter_send_failed_log_records metric.

I am testing the behaviour of the Elasticsearch exporter under various failure scenarios, e.g.:

  • mapping conflicts when pushing log records to an ES index
  • missing permissions from the ES client to create new indices

During these tests I can see errors being reported in the logs, e.g. for a mapping conflict (see the log output below for an example).

However, I can't see these failures reflected in the otelcol_exporter_send_failed_log_records metric:

[Screenshot from 2024-04-10: the otelcol_exporter_send_failed_log_records graph shows no increase]

I saw the very same behavior with another type of error: missing Elasticsearch permissions to create indices, which prevents the exporter from creating a new index on push when that index does not exist.

Steps to Reproduce

Trigger an Elasticsearch push failure:

  • Push a log record that conflicts with the Elasticsearch index mapping.
    As an example: map the Body field as a long while pushing a typical log record from any receiver such as filelog (a minimal repro sketch follows this list).
  • Push a log record to a non-existent index while the Elasticsearch role used by the OTel Collector lacks permission to create indices.
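
For the mapping-conflict case, here is a minimal repro sketch (not part of the original report; the endpoint, index name and credentials are placeholders): pre-create the target index with Body mapped as long before starting the collector, so that typical log records are rejected with a mapper_parsing_exception on bulk indexing.

// repro sketch (hypothetical endpoint, index name and credentials):
// pre-create an index whose "Body" field is mapped as long so that typical
// log records trigger a mapper_parsing_exception on bulk indexing.
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	mapping := `{"mappings": {"properties": {"Body": {"type": "long"}}}}`

	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:9200/otel-logs-conflict", // placeholder ES endpoint + index
		strings.NewReader(mapping))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.SetBasicAuth("elastic", "changeme") // placeholder credentials

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("create index:", resp.Status)
}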

Expected Result

The otelcol_exporter_send_failed_log_records counter increases by the number of failed log records.

Actual Result

The otelcol_exporter_send_failed_log_records counter does not increase.

Collector version

v0.97.0

Environment information

Environment

Kubernetes 1.28

OpenTelemetry Collector configuration

exporters:
  elasticsearch/logs:
    endpoints: [--redacted--]
    user: --redacted--
    password: --redacted--

    logs_index: --redacted--
    logs_dynamic_index:
      enabled: true

    mapping:
      mode: none
      dedup: true
      dedot: true

    sending_queue:
      enabled: true
      num_consumers: 20
      queue_size: 1000

pipelines:
  logs:
    receivers: [--redacted--]
    processors:
      - memory_limiter
      - resourcedetection
      - resource
      - k8sattributes
      - batch
      - groupbyattrs/compaction
    exporters:
      - elasticsearch/logs

Log output

{"level":"error","ts":1712773343.8512416,"caller":"elasticsearchexporter@v0.97.0/elasticsearch_bulk.go:219","msg":"Drop docs: failed to index: struct { Type string \"json:\\\"type\\\"\"; Reason string \"json:\\\"reason\\\"\"; Cause struct { Type string \"json:\\\"type\\\"\"; Reason string \"json:\\\"reason\\\"\" } \"json:\\\"caused_by\\\"\" }{Type:\"mapper_parsing_exception\", Reason:\"object mapping for [Body] tried to parse field [Body] as object, but found a concrete value\", Cause:struct { Type string \"json:\\\"type\\\"\"; Reason string \"json:\\\"reason\\\"\" }{Type:\"\", Reason:\"\"}}","kind":"exporter","data_type":"logs","name":"elasticsearch/logs","attempt":1,"status":400,"stacktrace":"github.com/open-telemetry/opentelemetry-collector-contrib/exporter/elasticsearchexporter.pushDocuments.func1\n\tgithub.com/open-telemetry/opentelemetry-collector-contrib/exporter/elasticsearchexporter@v0.97.0/elasticsearch_bulk.go:219\ngithub.com/elastic/go-elasticsearch/v7/esutil.(*worker).flush\n\tgithub.com/elastic/go-elasticsearch/v7@v7.17.10/esutil/bulk_indexer.go:599\ngithub.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).init.func1\n\tgithub.com/elastic/go-elasticsearch/v7@v7.17.10/esutil/bulk_indexer.go:321"}

Additional context

No response

@jleloup added the bug and needs triage labels on Apr 10, 2024
@jleloup changed the title from [exporter/elasticsearchexporter] to [exporter/elasticsearchexporter] Push failures not reported in metrics on Apr 10, 2024

Pinging code owners for exporter/elasticsearch: @JaredTan95 @ycombinator. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@JaredTan95
Member

Nice catch, it's indeed a bug.

@JaredTan95 added the help wanted and good first issue labels and removed the needs triage label on Apr 10, 2024
@jleloup
Author

jleloup commented Apr 11, 2024

I'm still quite a beginner in both Go and the OTel codebase, but I'm trying to get to the bottom of this and fix it.

I assume those cases in elasticsearch_bulk.go should return the error so it bubbles up.
Then I assume it would be caught by a higher-level component that wraps exporters and increments this metric, is that correct?

@ycombinator
Contributor

I assume those cases in elasticsearch_bulk.go should return the error so it bubbles up. Then I assume it would be caught by a higher-level component that wraps exporters and increments this metric, is that correct?

@jleloup You're on the right path here. The trick is going to be figuring out how to make pushDocuments return the error from those cases, since the item.OnFailure callback function is called asynchronously from the call to pushDocuments.
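
One pattern that could work here (a sketch only, with hypothetical types and names rather than the exporter's real API) is to collect the asynchronous per-item failures under a mutex and fold them into the error returned once the bulk flush has completed:

// sketch: collecting asynchronous per-item failures so a synchronous push call
// can still return an error. Types and names here are hypothetical, not the
// exporter's real API.
package example

import (
	"errors"
	"fmt"
	"sync"
)

// failureCollector accumulates errors reported from async callbacks.
type failureCollector struct {
	mu   sync.Mutex
	errs []error
}

func (c *failureCollector) onItemFailure(docID, reason string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.errs = append(c.errs, fmt.Errorf("doc %s: %s", docID, reason))
}

// err joins everything collected so far; nil if no failures were reported.
func (c *failureCollector) err() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return errors.Join(c.errs...)
}

// pushDocuments is a stand-in for the exporter's push function: it registers the
// collector's callback, waits for the bulk flush to finish, then returns the
// aggregated error so the failed records can be counted upstream.
func pushDocuments(flushAndWait func(onFailure func(docID, reason string)) error) error {
	var fc failureCollector
	if err := flushAndWait(fc.onItemFailure); err != nil {
		return err
	}
	return fc.err()
}

Whether this is feasible depends on being able to block until the indexer has actually flushed, which is exactly the difficulty discussed in the following comments.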

@jleloup
Author

jleloup commented Apr 17, 2024

Sadly I won't have much time to work on this issue.

@jleloup
Author

jleloup commented May 2, 2024

So we tried an approach to fix this using a WaitGroup, but we basically get context deadline exceeded on every occurrence.

The thing is, we seem to have a conflicting situation here: on one hand, OTel expects errors to be returned synchronously from each call to pushLogsData(), and on the other hand, the ES library does everything it can to build its own batch of documents and bulk them all at once.
Forcing all those documents to be batched and flushed synchronously feels like a bad idea overall and would hurt networking costs as well as performance.

Is there a way to get a handle on the core exporter metric itself so we could update it directly in the BulkIndexer OnError() closure?
WDYT?
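
For reference, the WaitGroup approach described above looks roughly like this (hypothetical names, not the actual exporter code); Wait() only completes once the bulk indexer decides to flush, which easily overruns the per-request context deadline when the flush is driven by the indexer's own interval or batch size:

// sketch of the WaitGroup approach described above (hypothetical names).
// The problem: Wait() only returns once the bulk indexer flushes, which can be
// much later than the per-request context deadline.
package example

import (
	"context"
	"sync"
)

func pushWithWaitGroup(ctx context.Context, docs []string,
	addDoc func(doc string, done func())) error {

	var wg sync.WaitGroup
	for _, d := range docs {
		wg.Add(1)
		// done is expected to be called from the indexer's async
		// OnSuccess/OnFailure callbacks once the doc has been flushed.
		addDoc(d, wg.Done)
	}

	flushed := make(chan struct{})
	go func() {
		wg.Wait()
		close(flushed)
	}()

	select {
	case <-flushed:
		return nil
	case <-ctx.Done():
		// This is the "context deadline exceeded" observed above: the bulk
		// indexer has not flushed yet, so the callbacks never ran in time.
		return ctx.Err()
	}
}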

@jleloup
Author

jleloup commented May 2, 2024

Well, actually I just saw PR #32359 and it seems it would change the game quite a lot, especially since it goes for flushing data on each pushLogsData() call.

@sumwun1

sumwun1 commented May 31, 2024

First, does this issue still need work? It looks pretty old.

I've never contributed to an open source project before (besides fixing typos in documentation), but I've read this repository's contributing.md. Is there anything I still need to do before someone can assign the issue to me?

@jleloup
Author

jleloup commented Jun 12, 2024

AFAIK this issue still stands; we are regularly hitting mapping issues with no impact on the metric.

Last time I checked, a refactoring of the underlying library used to bulk items into Elasticsearch was in progress. Now that it has been merged, I assume it should allow flushing items on demand in the exporter, thus capturing the number of failed events during the bulk action and reporting it to OTel.

I don't have time allocated to this right now, though; could someone have a look at it?

@lalith47

Hey, I can give this issue an attempt. I am new to the OTel Collector but I understand the issue.
I'll ask for help if I get stuck.

@lalith47

lalith47 commented Jul 2, 2024

PR #32632 addresses the actual problem.

@carsonip
Contributor

carsonip commented Jul 2, 2024

@lalith47 thanks for the link. I agree that this may be the same issue as #32377, and after merging #32632 I'll come back and check if it fixes this issue (#32302).

@lalith47

lalith47 commented Jul 2, 2024

Sure @carsonip. Let me know once the PR is merged. I can help test this, as I am able to replicate the issue.

Contributor

github-actions bot commented Sep 2, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Sep 2, 2024
@carsonip
Contributor

carsonip commented Sep 2, 2024

after merging #32632 I'll come back and check if it fixes this issue (#32302)

#32632 is superseded by #34238. Now that #34238 has been merged, I will test whether it fixes this issue.

github-actions bot removed the Stale label on Sep 3, 2024
@awangc

awangc commented Oct 31, 2024

Any updates? This would be really useful for alerting.

@carsonip
Contributor

I tested with commit b14f1d4 and managed to reproduce this issue with a mapping conflict (HTTP 4xx), even with the batcher unset, enabled, or disabled. However, this issue is now partially addressed after #34238: HTTP 5xx errors do increment otelcol_exporter_send_failed_log_records when the batcher is enabled or disabled and the sending queue is disabled. If the sending queue is enabled, otelcol_exporter_send_failed_log_records is always 0.

The reason for the difference between 4xx and 5xx behavior is that failed documents with HTTP 4xx (except 429) are considered non-retriable, including permission errors and mapping conflicts. They are not considered an "error" of a flush, and are not propagated upstream to avoid unnecessary retries.
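
In other words (my own illustrative sketch, not the exporter's actual code), the per-document classification is roughly: 429 and 5xx are retriable, every other 4xx is a permanent failure whose documents are dropped, and it is those dropped documents that currently never reach the failed-records metric:

// illustration of the status classification described above; not the
// exporter's actual implementation.
package example

import "net/http"

// retriable reports whether a per-document bulk response status is transient.
// 429 (Too Many Requests) and 5xx are retried; other 4xx responses (e.g. 400
// mapping conflicts, 403 permission errors) are permanent failures.
func retriable(status int) bool {
	return status == http.StatusTooManyRequests || status >= 500
}

// dropped counts the permanently failed documents in a bulk response, i.e. the
// ones that would need to be surfaced for
// otelcol_exporter_send_failed_log_records to move.
func dropped(statuses []int) int {
	n := 0
	for _, s := range statuses {
		if s >= 400 && !retriable(s) {
			n++
		}
	}
	return n
}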
