aws-s3 input workers shouldn't wait for objects to be fully ingested before starting the next object #39414

Closed
faec opened this issue May 6, 2024 · 5 comments · Fixed by #40699
Labels
bug, enhancement, Team:Elastic-Agent-Data-Plane, Team:obs-ds-hosted-services

Comments

@faec
Contributor

faec commented May 6, 2024

A major performance bottleneck in ingesting SQS/S3 data is that each input worker fetches an S3 object, then waits for it to be fully acknowledged upstream before fetching the next object. When individual objects are small this can block the pipeline, especially if queue.mem.flush.timeout is high: the output is waiting for more input data at the same time as the input is waiting for the output to fully process the current queued data.

Instead, workers should fetch and publish objects as fast as they're able to process them, and acknowledgments and cleanup should be handled asynchronously without blocking ingestion.
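As a rough illustration of the asynchronous approach described above (not the actual Filebeat implementation), the sketch below has a worker publish every event of every object without waiting, while a per-object tracker runs cleanup once the last event is acknowledged. All names here (`object`, `ackTracker`, `runWorker`, `ingest`) are hypothetical, and the channel of callbacks is only a stand-in for the queue and output.

```go
package main

import (
	"fmt"
	"sync"
)

// object stands in for an S3 object referenced by an SQS message.
// Hypothetical type, not a Filebeat type.
type object struct {
	key    string
	events int // number of events the object yields
}

// ackTracker counts outstanding events for one object and runs cleanup
// exactly once, when the last event has been acknowledged.
type ackTracker struct {
	mu      sync.Mutex
	pending int
	cleanup func()
}

func (t *ackTracker) ack() {
	t.mu.Lock()
	t.pending--
	done := t.pending == 0
	t.mu.Unlock()
	if done {
		t.cleanup()
	}
}

// runWorker publishes all events of all objects without waiting for acks.
// Cleanup (for example deleting the SQS message) happens in onDone,
// which is triggered from the ack path rather than the worker loop.
func runWorker(objects []object, publish func(ack func()), onDone func(key string)) {
	for _, obj := range objects {
		key := obj.key
		if obj.events == 0 {
			onDone(key) // nothing to wait for
			continue
		}
		t := &ackTracker{pending: obj.events, cleanup: func() { onDone(key) }}
		for i := 0; i < obj.events; i++ {
			publish(t.ack)
		}
	}
}

// ingest wires the worker to a simulated output and returns the keys
// in the order their cleanup ran.
func ingest(objects []object) []string {
	acks := make(chan func(), 1024) // stand-in for queue + output
	var mu sync.Mutex
	var cleaned []string
	var wg sync.WaitGroup
	wg.Add(len(objects))

	// Simulated output: acknowledges events some time after publish.
	go func() {
		for ack := range acks {
			ack()
		}
	}()

	runWorker(objects, func(ack func()) { acks <- ack }, func(key string) {
		mu.Lock()
		cleaned = append(cleaned, key)
		mu.Unlock()
		wg.Done()
	})
	close(acks)
	wg.Wait() // wait for every object's cleanup to finish
	return cleaned
}

func main() {
	fmt.Println(ingest([]object{{"a", 3}, {"b", 1}, {"c", 2}}))
}
```

The point of the sketch is that the worker loop never blocks on an acknowledgment; only the cleanup callback depends on it.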

faec added the bug, enhancement, Team:obs-ds-hosted-services, and Team:Elastic-Agent-Data-Plane labels May 6, 2024
faec self-assigned this May 6, 2024
@elasticmachine
Collaborator

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@leehinman
Contributor

> Instead, workers should fetch and publish objects as fast as they're able to process them, and acknowledgments and cleanup should be handled asynchronously without blocking ingestion.

I agree ACK & cleanup should be async.

I'm just wondering if we should make sure we have a way to force the more synchronous option. For example, if you check out 200 SQS messages and process them all but haven't ACKed them yet, and then something happens such as Filebeat running out of memory, all 200 of those SQS messages eventually become visible in the SQS queue again and the objects get reprocessed, resulting in duplicate events.

I think making sure we can specify one SQS message checkout at a time is sufficient for this. Does that sound right?
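One way to picture this kind of limit is a cap on how many messages may be checked out but not yet ACKed. The sketch below is only an illustration using a buffered channel as a counting semaphore; `inFlightLimiter` and its methods are hypothetical names, not an existing Filebeat setting or type. With a limit of 1 it behaves like the synchronous option, and with a larger limit it bounds how many messages could become visible again after a crash.

```go
package main

import "fmt"

// inFlightLimiter caps how many SQS messages may be checked out but not
// yet acknowledged. Hypothetical type, not part of Filebeat.
type inFlightLimiter struct {
	slots chan struct{}
}

func newInFlightLimiter(max int) *inFlightLimiter {
	return &inFlightLimiter{slots: make(chan struct{}, max)}
}

// acquire blocks until a slot is free, then reserves it for one message.
func (l *inFlightLimiter) acquire() { l.slots <- struct{}{} }

// tryAcquire reserves a slot only if one is free right now.
func (l *inFlightLimiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot; it would be called from the asynchronous ack
// callback once a message has been fully acknowledged and deleted.
func (l *inFlightLimiter) release() { <-l.slots }

// inFlight reports how many messages currently hold a slot.
func (l *inFlightLimiter) inFlight() int { return len(l.slots) }

func main() {
	lim := newInFlightLimiter(1) // one message checked out at a time
	fmt.Println(lim.tryAcquire()) // first message gets the slot
	fmt.Println(lim.tryAcquire()) // second must wait for the ack
	lim.release()                 // ack arrives asynchronously
	fmt.Println(lim.tryAcquire()) // next message may proceed
}
```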

@faec
Contributor Author

faec commented May 7, 2024

> I think making sure we can specify one SQS message checkout at a time is sufficient for this. Does that sound right?

Well, one SQS message checkout at a time interacts disastrously with default ingestion settings if the number of events per message is less than bulk_max_size. So I'm hoping we can address this in a way that doesn't require that, ideally by keeping better state about in-progress objects. But I'm just starting on the cleanup of the AWS API accessors, so I'm not sure yet what will be the most practical... should know more soon.

@leehinman
Contributor

I've been thinking about this more and I think we can make a general statement. We want the input to perform well when one SQS message results in fewer than bulk_max_size events (for example, one event), and we want it to perform well when one SQS message results in many times bulk_max_size events (for example, 512k events).

What you mentioned is the first case. We have also had issues of the second type, where tens of SQS messages were checked out and each message pointed to thousands of events. The input tried to process all of them at the same time and made very little progress on each checked-out SQS message, which caused timeout errors and made it look as if nothing was happening as far as the SQS queue was concerned.
