aws-s3 input workers shouldn't wait for objects to be fully ingested before starting the next object #39414
A major performance bottleneck in ingesting SQS/S3 data is that each input worker fetches an S3 object, then waits for it to be fully acknowledged upstream before fetching the next object. When individual objects are small this can block the pipeline, especially if `queue.mem.flush.timeout` is high: the output is waiting for more input data at the same time as the input is waiting for the output to fully process the current queued data.

Instead, workers should fetch and publish objects as fast as they're able to process them, and acknowledgments and cleanup should be handled asynchronously without blocking ingestion.
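As a rough illustration of the proposed flow, here is a minimal Go sketch in which the last per-event ACK triggers SQS message deletion, so the worker never blocks between objects. All names (`publishAsync`, `s3Object`, `deleteMessage`) are invented for this sketch; this is not the actual aws-s3 input API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// s3Object is a stand-in for one checked-out SQS message and the events
// parsed from the S3 object it points to.
type s3Object struct {
	key           string
	events        []string
	receiptHandle string
}

// publishAsync publishes every event with a shared countdown; the final ACK
// triggers SQS deletion. The function returns immediately, so the worker can
// move on to the next object without waiting for acknowledgment.
func publishAsync(obj s3Object, publish func(event string, onACK func()), deleteMessage func(receiptHandle string)) {
	remaining := int64(len(obj.events))
	for _, ev := range obj.events {
		publish(ev, func() {
			if atomic.AddInt64(&remaining, -1) == 0 {
				// All events for this object are ACKed upstream:
				// only now is it safe to delete the SQS message.
				deleteMessage(obj.receiptHandle)
			}
		})
	}
}

func main() {
	var wg sync.WaitGroup
	// Fake pipeline that ACKs instantly; a real queue ACKs after output delivery.
	publish := func(event string, onACK func()) {
		wg.Add(1)
		go func() { defer wg.Done(); onACK() }()
	}
	deleteMsg := func(rh string) { fmt.Println("deleted SQS message", rh) }
	publishAsync(s3Object{key: "logs/a.json", events: []string{"e1", "e2"}, receiptHandle: "rh-1"}, publish, deleteMsg)
	wg.Wait()
}
```

Comments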
Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
I agree ACK & cleanup should be async. I'm just wondering if we should make sure we have a way to force the more synchronous option. If, for example, you could check out 200 SQS messages, process them all without ACKing yet, and then something like an OOM kills Filebeat, eventually all 200 of those SQS messages become visible in the SQS queue again and the objects get re-processed, resulting in duplicate events. I think making sure we can specify one SQS message checkout at a time is sufficient for this. Does that sound right?
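For illustration, a configurable checkout cap could look roughly like the sketch below; `processMessages` and `maxInflight` are made-up names, not the aws-s3 input's actual code. With `maxInflight` set to 1, a crash before ACK can re-deliver at most one message.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// processMessages caps how many SQS messages are checked out (received but
// not yet ACKed and deleted) at once, bounding the reprocessing that can
// happen if the process dies before ACKing.
func processMessages(ctx context.Context, msgs <-chan string, maxInflight int) {
	slots := make(chan struct{}, maxInflight) // counting semaphore
	var wg sync.WaitGroup
	for m := range msgs {
		select {
		case slots <- struct{}{}: // acquire a checkout slot
		case <-ctx.Done():
			return
		}
		wg.Add(1)
		go func(m string) {
			defer wg.Done()
			defer func() { <-slots }() // release only after full ACK + cleanup
			// ... fetch the S3 object, publish events, wait for ACKs,
			// then delete the SQS message ...
			fmt.Println("fully processed", m)
		}(m)
	}
	wg.Wait()
}

func main() {
	msgs := make(chan string, 3)
	for _, m := range []string{"msg-1", "msg-2", "msg-3"} {
		msgs <- m
	}
	close(msgs)
	processMessages(context.Background(), msgs, 1) // one checkout at a time
}
```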
Well, one SQS message checkout at a time interacts disastrously with default ingestion settings if the number of events per message is less than `queue.mem.flush.min_events`: the queue holds those events until `queue.mem.flush.timeout` expires before handing them to the output, so a fully synchronous worker pays that whole timeout on every message.
I've been thinking about this more, and I think we can make a general statement: we want the input to perform well both when one SQS message results in fewer than `queue.mem.flush.min_events` events and when one SQS message results in many thousands of events. What you mentioned is the first part, and we have had issues of the second type, where tens of SQS messages were checked out and each message pointed to thousands of events. The input tried to process all of them at the same time and made very little progress on each checked-out message, which caused timeout errors and made it look like nothing was happening as far as the SQS queue was concerned.
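The timeout errors in that second scenario arise when a message's SQS visibility timeout lapses while it is still being processed, so it reappears in the queue. One standard mitigation (not necessarily what the input does today) is a visibility heartbeat; below is a rough sketch using the AWS SDK for Go v2, with a placeholder queue URL, receipt handle, and renewal interval.

```go
package main

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)

// keepVisible periodically extends a checked-out message's visibility
// timeout so it doesn't reappear in the queue (and get re-delivered) while
// its thousands of events are still being processed.
func keepVisible(ctx context.Context, client *sqs.Client, queueURL, receiptHandle string) {
	ticker := time.NewTicker(4 * time.Minute) // renew before a 5-minute timeout lapses
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_, _ = client.ChangeMessageVisibility(ctx, &sqs.ChangeMessageVisibilityInput{
				QueueUrl:          aws.String(queueURL),
				ReceiptHandle:     aws.String(receiptHandle),
				VisibilityTimeout: 300, // seconds: hide the message for another 5 minutes
			})
		}
	}
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		panic(err)
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // stop the heartbeat once the message is processed and deleted
	go keepVisible(ctx, sqs.NewFromConfig(cfg), "https://sqs.example.com/queue", "receipt-handle")
	// ... process the object's events here, then delete the SQS message ...
}
```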