[Filebeat] [S3 input] Failed processing of object leads to duplicated events #19901
Comments
Pinging @elastic/integrations-platforms (Team:Platforms)
Hi @kwinstonix, thanks for creating this issue. Do you know if there is a way to reproduce this problem? Looking at the code, we set the eventID based on the objectHash and offset (https://github.com/elastic/beats/blob/master/x-pack/filebeat/input/s3/input.go#L638). When a file fails to process, the SQS message should go back to the queue and the whole/same file would be re-processed later. That way, in theory, the objectHash and the offset should both be the same as last time; this is how we are thinking of solving the duplicate-events issue. With what you are seeing, does that mean either the objectHash or the offset value for an individual log entry changed during the second processing of the same file? TIA!!
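For anyone following along, here is a minimal sketch of the idea described above (the function and inputs are illustrative, not the actual Filebeat implementation): a deterministic event ID derived from the object hash and byte offset means that a re-processed line produces the same _id again, so Elasticsearch overwrites instead of duplicating.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// eventID builds a deterministic ID from an S3 object identity and the byte
// offset of a log line inside it. If the same object is re-processed after a
// failure, every line yields the same ID again, so any consumer that
// deduplicates on _id will overwrite rather than duplicate the document.
func eventID(objectHash string, offset int64) string {
	h := sha256.Sum256([]byte(fmt.Sprintf("%s-%d", objectHash, offset)))
	return hex.EncodeToString(h[:])
}

func main() {
	// The same (objectHash, offset) pair always yields the same ID.
	fmt.Println(eventID("my-bucket/my-key-etag123", 0))
	fmt.Println(eventID("my-bucket/my-key-etag123", 0))
}
```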
Thanks for the reply. I see the event _id in the Filebeat event, but I use the Kafka output instead of Elasticsearch and I cannot control the Kafka consumer's Logstash config, so I can't set the doc _id in Elasticsearch. On the other hand, using a self-generated doc _id impacts Elasticsearch performance. A hash _id is indeed a solution; I just wonder whether there is a way to set the offset of the S3 object 🤔
The offset position is only recorded in the registry when the input is a local file. Is that right?
Ahh, I think I found a bug in the offset code. Will fix that soon!
This should be fixed as a part of #19962, which is still in progress and needs more testing.
Hi @kwinstonix, #19962 is merged into the master branch. I think that will fix this issue. I will close this issue for now; if you get a chance to test it and are still seeing duplicate events, please feel free to open a new issue or reopen this one! Thank you!!
More fixes for the offset: #20370
Describe the enhancement:
Sometimes processing of an object fails (there are error logs for the failed object) and the SQS message is put back on the SQS queue. By then, some lines of the object have already been forwarded to the output, so when the object is processed again there are duplicated docs in Elasticsearch. Is there any way to keep track of the object offset? When the object is processed multiple times, processing could start from the last offset to avoid duplicated docs in Elasticsearch, just like the behavior when reading log files.
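To illustrate the requested behavior, here is a rough sketch (the processFrom helper and forward callback are hypothetical, not Filebeat code): remember the offset of the last successfully forwarded line, and on re-processing discard everything before it.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// processFrom reads the object body line by line, skips everything before
// lastOffset (already forwarded in a previous attempt), and returns the new
// offset reached. forward stands in for shipping the line to the output.
func processFrom(body io.Reader, lastOffset int64, forward func(string) error) (int64, error) {
	// Skip the bytes that were already processed successfully.
	if _, err := io.CopyN(io.Discard, body, lastOffset); err != nil && err != io.EOF {
		return lastOffset, err
	}
	offset := lastOffset
	r := bufio.NewReader(body)
	for {
		line, err := r.ReadString('\n')
		if line != "" {
			if ferr := forward(strings.TrimRight(line, "\n")); ferr != nil {
				return offset, ferr // resume from `offset` on the next attempt
			}
			offset += int64(len(line))
		}
		if err == io.EOF {
			return offset, nil
		}
		if err != nil {
			return offset, err
		}
	}
}

func main() {
	body := strings.NewReader("line1\nline2\nline3\n")
	off, _ := processFrom(body, 6, func(l string) error { // 6 == len("line1\n")
		fmt.Println("forwarded:", l)
		return nil
	})
	fmt.Println("new offset:", off)
}
```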
Describe a specific use case for the enhancement or feature:
WHEN: processing fails or Filebeat shuts down
TO DO: keep track of the object offset to avoid duplicated docs in Elasticsearch
Some feasible solutions:

- Store the last successfully processed offset as S3 object user metadata, e.g. `x-amz-meta-filebeat-offset=$successed_processing_offset`, where `$successed_processing_offset` is the offset up to which lines have already been forwarded to the output (see the sketch below).
- Record the processed `position` of each object somewhere shared, similar to the registry for log files. It is important to keep track of the S3 object offset that has been processed successfully, because the same SQS message could be processed by multiple Filebeat instances.

Filebeat doc:
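As a rough sketch of the first idea above, the offset could be stored as S3 user metadata with the AWS SDK for Go. The metadata key name and helper functions here are hypothetical, and error handling is minimal.

```go
package main

import (
	"fmt"
	"strconv"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

const offsetMetaKey = "filebeat-offset" // stored as x-amz-meta-filebeat-offset

// readOffset returns the last successfully processed offset recorded on the
// object, or 0 if none is present.
func readOffset(svc *s3.S3, bucket, key string) (int64, error) {
	head, err := svc.HeadObject(&s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return 0, err
	}
	// Note: the SDK may canonicalize metadata key casing; a real
	// implementation should look the key up case-insensitively.
	if v, ok := head.Metadata["Filebeat-Offset"]; ok {
		return strconv.ParseInt(aws.StringValue(v), 10, 64)
	}
	return 0, nil
}

// writeOffset records the offset by copying the object onto itself with
// replaced metadata (S3 cannot update metadata in place).
func writeOffset(svc *s3.S3, bucket, key string, offset int64) error {
	_, err := svc.CopyObject(&s3.CopyObjectInput{
		Bucket:            aws.String(bucket),
		Key:               aws.String(key),
		CopySource:        aws.String(bucket + "/" + key),
		MetadataDirective: aws.String(s3.MetadataDirectiveReplace),
		Metadata: map[string]*string{
			offsetMetaKey: aws.String(strconv.FormatInt(offset, 10)),
		},
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)
	if err := writeOffset(svc, "my-bucket", "my-log.gz", 1024); err != nil {
		fmt.Println("write offset:", err)
		return
	}
	off, err := readOffset(svc, "my-bucket", "my-log.gz")
	fmt.Println(off, err)
}
```

Since S3 metadata can only be replaced by copying the object onto itself, and several Filebeat instances may handle the same SQS message concurrently, some coordination (the second point above) would still be needed.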