aws-s3 input workers shouldn't wait for objects to be fully ingested before starting the next object #39414
A major performance bottleneck in ingesting SQS/S3 data is that each input worker fetches an S3 object, then waits for it to be fully acknowledged upstream before fetching the next object. When individual objects are small this can block the pipeline, especially if `queue.mem.flush.timeout` is high: the output is waiting for more input data at the same time as the input is waiting for the output to fully process the current queued data.

Instead, workers should fetch and publish objects as fast as they're able to process them, and acknowledgments and cleanup should be handled asynchronously without blocking ingestion.
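As a rough illustration of the proposed flow, here is a minimal Go sketch in which the last per-event ACK triggers SQS message deletion, so the worker never blocks between objects. All names (`publishAsync`, `s3Object`, `deleteMessage`) are invented for this sketch; this is not the actual aws-s3 input API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// s3Object is a stand-in for one checked-out SQS message and the events
// parsed from the S3 object it points to.
type s3Object struct {
	key           string
	events        []string
	receiptHandle string
}

// publishAsync publishes every event with a shared countdown; the final ACK
// triggers SQS deletion. The function returns immediately, so the worker can
// move on to the next object without waiting for acknowledgment.
func publishAsync(obj s3Object, publish func(event string, onACK func()), deleteMessage func(receiptHandle string)) {
	remaining := int64(len(obj.events))
	for _, ev := range obj.events {
		publish(ev, func() {
			if atomic.AddInt64(&remaining, -1) == 0 {
				// All events for this object are ACKed upstream:
				// only now is it safe to delete the SQS message.
				deleteMessage(obj.receiptHandle)
			}
		})
	}
}

func main() {
	var wg sync.WaitGroup
	// Fake pipeline that ACKs instantly; a real queue ACKs after output delivery.
	publish := func(event string, onACK func()) {
		wg.Add(1)
		go func() { defer wg.Done(); onACK() }()
	}
	deleteMsg := func(rh string) { fmt.Println("deleted SQS message", rh) }
	publishAsync(s3Object{key: "logs/a.json", events: []string{"e1", "e2"}, receiptHandle: "rh-1"}, publish, deleteMsg)
	wg.Wait()
}
```

Comments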
Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
I agree ACK & cleanup should be async. I'm just wondering if we should make sure we have a way to force the more synchronous option. If, for example, you could check out 200 SQS messages, process them all without ACKing yet, and then something like an OOM kills Filebeat, eventually all 200 of those SQS messages become visible in the SQS queue again and the objects get re-processed, resulting in duplicate events. I think making sure we can specify one SQS message checkout at a time is sufficient for this. Does that sound right?
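For illustration, a configurable checkout cap could look roughly like the sketch below; `processMessages` and `maxInflight` are made-up names, not the aws-s3 input's actual code. With `maxInflight` set to 1, a crash before ACK can re-deliver at most one message.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// processMessages caps how many SQS messages are checked out (received but
// not yet ACKed and deleted) at once, bounding the reprocessing that can
// happen if the process dies before ACKing.
func processMessages(ctx context.Context, msgs <-chan string, maxInflight int) {
	slots := make(chan struct{}, maxInflight) // counting semaphore
	var wg sync.WaitGroup
	for m := range msgs {
		select {
		case slots <- struct{}{}: // acquire a checkout slot
		case <-ctx.Done():
			return
		}
		wg.Add(1)
		go func(m string) {
			defer wg.Done()
			defer func() { <-slots }() // release only after full ACK + cleanup
			// ... fetch the S3 object, publish events, wait for ACKs,
			// then delete the SQS message ...
			fmt.Println("fully processed", m)
		}(m)
	}
	wg.Wait()
}

func main() {
	msgs := make(chan string, 3)
	for _, m := range []string{"msg-1", "msg-2", "msg-3"} {
		msgs <- m
	}
	close(msgs)
	processMessages(context.Background(), msgs, 1) // one checkout at a time
}
```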
Well, one SQS message checkout at a time interacts disastrously with default ingestion settings if the number of events per message is less than `queue.mem.flush.min_events`: the queue holds those events until `queue.mem.flush.timeout` expires before handing them to the output, so a fully synchronous worker pays that whole timeout on every message.
I've been thinking about this more, and I think we can make a general statement: we want the input to perform well both when one SQS message results in fewer than `queue.mem.flush.min_events` events and when one SQS message results in many thousands of events. What you mentioned is the first part, and we have had issues of the second type, where tens of SQS messages were checked out and each message pointed to thousands of events. The input tried to process all of them at the same time and made very little progress on each checked-out message, which caused timeout errors and made it look like nothing was happening as far as the SQS queue was concerned.
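The timeout errors in that second scenario arise when a message's SQS visibility timeout lapses while it is still being processed, so it reappears in the queue. One standard mitigation (not necessarily what the input does today) is a visibility heartbeat; below is a rough sketch using the AWS SDK for Go v2, with a placeholder queue URL, receipt handle, and renewal interval.

```go
package main

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)

// keepVisible periodically extends a checked-out message's visibility
// timeout so it doesn't reappear in the queue (and get re-delivered) while
// its thousands of events are still being processed.
func keepVisible(ctx context.Context, client *sqs.Client, queueURL, receiptHandle string) {
	ticker := time.NewTicker(4 * time.Minute) // renew before a 5-minute timeout lapses
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_, _ = client.ChangeMessageVisibility(ctx, &sqs.ChangeMessageVisibilityInput{
				QueueUrl:          aws.String(queueURL),
				ReceiptHandle:     aws.String(receiptHandle),
				VisibilityTimeout: 300, // seconds: hide the message for another 5 minutes
			})
		}
	}
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		panic(err)
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // stop the heartbeat once the message is processed and deleted
	go keepVisible(ctx, sqs.NewFromConfig(cfg), "https://sqs.example.com/queue", "receipt-handle")
	// ... process the object's events here, then delete the SQS message ...
}
```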