Beats pipeline doesn't respect configured batch sizes on startup under agent #34703

Closed · Tracked by #16 · Fixed by #34741
faec opened this issue Feb 28, 2023 · 2 comments
Labels: bug, Team:Elastic-Agent (Label for the Agent team)

Comments

@faec (Contributor) commented Feb 28, 2023

When an output worker is created, it specifies the maximum size of the event batches it should receive from the pipeline. This value is ultimately propagated back to `eventConsumer`, the routine that assembles batches for the output workers, which uses it in its queue requests. Most outputs expose this batch size as a configuration parameter, e.g. `bulk_max_size`.
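
As a rough illustration of that flow, here is a minimal Go sketch using hypothetical names (this is not the real libbeat API): the maximum batch size reported by the output worker becomes the size the consumer passes to the queue.

```go
// A minimal sketch with hypothetical names (not the real libbeat API):
// the maximum batch size reported by an output worker becomes the "n"
// in the consumer's queue requests.
package main

import "fmt"

// outputWorker reports the largest batch it can accept, e.g. the
// configured value of bulk_max_size.
type outputWorker struct{ bulkMaxSize int }

// eventConsumer assembles batches for the output workers.
type eventConsumer struct{ batchSize int }

// connectOutput mimics the batch size being propagated back to the
// consumer when an output worker is created.
func (c *eventConsumer) connectOutput(w outputWorker) {
	c.batchSize = w.bulkMaxSize
}

// requestSize is the size the consumer would pass to the queue when
// asking for the next batch.
func (c *eventConsumer) requestSize() int { return c.batchSize }

func main() {
	var c eventConsumer
	c.connectOutput(outputWorker{bulkMaxSize: 50})
	fmt.Println(c.requestSize()) // 50
}
```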

Under Elastic Agent, Beat startup is more complicated, since Agent sends the Beat configuration in multiple stages and there will generally not be an output on the first initialization. Currently, this leads to `eventConsumer` receiving four separate calls to update the batch size (in each Beat) -- three setting it to zero, and one setting it to the actual value requested by the output.

While the final value is correct, the inputs may have already started up by that point. Since a value of 0 tells the queue to send as many events as are available, this can prime the pipeline with batches containing thousands of events before the output is initialized, even if the output itself requests a relatively small value (e.g. the shipper output defaults to a batch size of 50).
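
The following self-contained sketch, again with illustrative names rather than the real queue API, shows why the zero-valued window matters: while the consumer's target is still 0, a single request drains everything the inputs have buffered.

```go
// An illustrative sketch of the hazard (names are hypothetical, not
// the real queue API): a request size of 0 means "send everything
// available", so a consumer that has only seen the zero-valued startup
// updates drains the whole queue into one oversized batch.
package main

import "fmt"

// get models the queue's batch request: n == 0 means "as many events
// as are available".
func get(buffered, n int) int {
	if n == 0 || n > buffered {
		return buffered
	}
	return n
}

func main() {
	buffered := 5000 // events published by inputs before the output is ready

	fmt.Println(get(buffered, 0))  // 5000: one giant batch, primed at startup
	fmt.Println(get(buffered, 50)) // 50: what e.g. the shipper output asked for
}
```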

This is notably a problem for the Elasticsearch and shipper outputs (and possibly others), which can have upstream caps on batch size, causing them to either drop the entire batch or enter a retry loop that stalls the ingestion pipeline (#29778, #34695).

We need to correct the initialization process so `eventConsumer` doesn't begin creating batches until a valid output is configured. This will still allow incoming data to accumulate in the queue, but no explicit batches will be created until we know what the output workers can accept.
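
As a hedged sketch of that behavior (not the actual implementation), the consumer could track whether a valid output configuration has arrived and decline to issue queue requests until it has:

```go
// A sketch of the proposed behavior, not the actual implementation:
// the consumer tracks whether a real output configuration has arrived
// and declines to issue queue requests until it has. Events still
// accumulate in the queue; only batch creation is deferred.
package main

import "fmt"

type eventConsumer struct {
	outputReady bool // set once a valid output configuration arrives
	batchSize   int
}

// maybeRequestBatch issues a queue request only when an output is
// configured; ok == false means "keep buffering, don't build batches".
func (c *eventConsumer) maybeRequestBatch() (n int, ok bool) {
	if !c.outputReady {
		return 0, false
	}
	return c.batchSize, true
}

func main() {
	var c eventConsumer
	if _, ok := c.maybeRequestBatch(); !ok {
		fmt.Println("no output yet: buffering, not batching")
	}
	c.outputReady, c.batchSize = true, 50
	if n, ok := c.maybeRequestBatch(); ok {
		fmt.Println("requesting batches of", n)
	}
}
```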

(This issue currently causes repeatable pipeline deadlocks for me when targeting the shipper.)

faec added the bug and Team:Elastic-Agent labels on Feb 28, 2023
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@faec (Contributor, Author) commented Feb 28, 2023

After talking over Agent/Beats initialization with @fearful-symmetry, this may not actually be unique to the Agent startup process. It might be present but unnoticed in vanilla Beats, since what we're doing with Agent is particularly sensitive to batch size -- needs follow-up to determine the full scope and cause.
