
[exporter/kafkaexporter] Messages Above Producer.MaxMessageBytes Will Be Retried Instead Of Dropped #30275

Closed
rjduffner opened this issue Jan 3, 2024 · 6 comments

Comments

@rjduffner

Component(s)

exporter/kafka

What happened?

Description

We are noticing that when a log record larger than Producer.MaxMessageBytes is created, the Kafka exporter fails to send it and then retries.

2024-01-03T16:17:49.375Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "kafka", "error": "Failed to deliver 1 messages due to kafka: invalid configuration (Attempt to produce message larger than configured Producer.MaxMessageBytes: 1126653 > 1000000)", "interval": "5.817752834s"}

We are not sure why it's retrying, since this message will never be deliverable (it's above Producer.MaxMessageBytes).

Is there any clarity on this choice? I understand we could use the on_error: drop feature (and we probably will), but I am curious why the behavior isn't already to log and drop.

Steps to Reproduce

Create a message larger than Producer.MaxMessageBytes and attempt to export it via the kafka exporter.
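
For reference, here is a minimal reproduction sketch at the Sarama client level (outside the collector), assuming a local broker on localhost:9092, an illustrative topic name, and the github.com/IBM/sarama import path; Sarama rejects the oversized record client-side with the same configuration error that appears in the exporter's log:

```go
package main

import (
	"fmt"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Return.Successes = true  // required by the SyncProducer
	cfg.Producer.MaxMessageBytes = 1000000 // Sarama's default, matching the 1000000 in the log line

	// Broker address and topic name are placeholders for this sketch.
	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		panic(err)
	}
	defer producer.Close()

	// A payload just over MaxMessageBytes, like the ~1126653-byte record in the report.
	oversized := make([]byte, 1_126_653)
	_, _, err = producer.SendMessage(&sarama.ProducerMessage{
		Topic: "otlp_logs",
		Value: sarama.ByteEncoder(oversized),
	})
	fmt.Println(err)
	// Prints a kafka: invalid configuration error
	// (Attempt to produce message larger than configured Producer.MaxMessageBytes).
}
```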

Expected Result

The error is logged and the message is dropped.

Actual Result

The error is logged and the message is retried.

Collector version

0.87.0

Environment information

Environment

EKS
Amazon Standard AMI
Splunk OTEL Collector Helm Chart

OpenTelemetry Collector configuration

No response

Log output

2024-01-03T16:17:49.375Z	info	exporterhelper/retry_sender.go:177	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "kafka", "error": "Failed to deliver 1 messages due to kafka: invalid configuration (Attempt to produce message larger than configured Producer.MaxMessageBytes: 1126653 > 1000000)", "interval": "5.817752834s"}

Additional context

No response

@rjduffner added the bug and needs triage labels on Jan 3, 2024

github-actions bot commented Jan 3, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1

Hello @rjduffner, thanks for filing this issue! It looks like the problem you're facing is the result of a lack of granularity in the errors returned by Sarama, combined with the exporter's retry functionality.

The Kafka exporter uses Sarama as its Kafka client. Sarama fails with the error message you've shown and returns it, as expected. On the collector side, however, as long as an error isn't marked permanent, the collector will keep retrying the export. This error is not considered permanent by either OTel or Sarama.

I think the best option may be to add logic to the exporter that detects which kind of error was hit and upgrades it to a permanent error, so the data is dropped instead of retried indefinitely. It looks like there's an open issue against Sarama that discusses detecting this kind of error; we could add similar logic to the exporter.
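
A rough sketch of what that logic could look like, assuming the exporter inspects the sarama.ProducerErrors returned by the SyncProducer and uses the collector's consumererror.NewPermanent to stop retries. The helper name and wiring here are hypothetical, not the exporter's actual code:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/IBM/sarama"
	"go.opentelemetry.io/collector/consumer/consumererror"
)

// wrapKafkaProducerError (hypothetical helper) upgrades Sarama configuration
// errors, such as exceeding Producer.MaxMessageBytes, to permanent errors so
// the collector's retry sender drops the data instead of retrying forever.
func wrapKafkaProducerError(err error) error {
	if err == nil {
		return nil
	}
	var prodErrs sarama.ProducerErrors
	if errors.As(err, &prodErrs) {
		for _, pErr := range prodErrs {
			var confErr sarama.ConfigurationError
			if errors.As(pErr.Err, &confErr) {
				// A configuration error can never succeed on retry.
				return consumererror.NewPermanent(err)
			}
		}
	}
	return err
}

func main() {
	// Simulate the failure from this issue: Sarama refuses to produce a record
	// above Producer.MaxMessageBytes and reports it as a ConfigurationError.
	simulated := sarama.ProducerErrors{
		&sarama.ProducerError{
			Msg: &sarama.ProducerMessage{Topic: "otlp_logs"},
			Err: sarama.ConfigurationError("Attempt to produce message larger than configured Producer.MaxMessageBytes"),
		},
	}
	wrapped := wrapKafkaProducerError(simulated)
	fmt.Println(consumererror.IsPermanent(wrapped)) // true -> the retry sender would drop instead of retrying
}
```

Whether the exporter should split a batch and drop only the oversized records, rather than marking the whole batch permanent, is a separate design question.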

@crobert-1 removed the needs triage label on Jan 3, 2024
@rjduffner

Thanks for the explanation @crobert-1.

github-actions bot commented Mar 11, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Mar 11, 2024
@crobert-1 removed the Stale label on Mar 11, 2024
github-actions bot commented May 13, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on May 13, 2024
github-actions bot commented Jul 12, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned on Jul 12, 2024