When an output worker is created, it specifies the maximum size of event batches it should receive from the pipeline. This value is ultimately propagated back to `eventConsumer`, the routine that assembles batches for the output workers, which uses it for its queue requests. Most outputs accept this batch size as a configuration parameter, e.g. `bulk_max_size`.
Under Elastic Agent, the Beats startup is more complicated, since Agent sends the Beat configuration in multiple stages and there will generally not be an output on the first initialization. Currently, this leads to `eventConsumer` receiving four separate calls to update the batch size (in each Beat) -- three setting it to zero, and one setting it to the actual value requested by the output.
While the final value is correct, the inputs may have already started up by that point. Since a value of 0 tells the queue to send as many events as are available, the pipeline can be primed with batches containing thousands of events before the output is initialized, even if the output itself requests a relatively small value (e.g. the shipper output defaults to a batch size of 50).
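To make the failure mode concrete, here is a minimal sketch (hypothetical types and names, not the actual pipeline code) of how a requested size of 0 turns into an effectively unbounded batch once events have accumulated:

```go
package main

import "fmt"

type event struct{ id int }

type queue struct{ buffered []event }

// get returns up to max events, or everything buffered when max <= 0,
// mirroring the "0 means as many as are available" convention described above.
func (q *queue) get(max int) []event {
	n := len(q.buffered)
	if max > 0 && max < n {
		n = max
	}
	batch := q.buffered[:n]
	q.buffered = q.buffered[n:]
	return batch
}

func main() {
	q := &queue{}
	// Inputs start up and publish before the output is configured.
	for i := 0; i < 5000; i++ {
		q.buffered = append(q.buffered, event{id: i})
	}
	// The consumer still holds the placeholder batch size of 0 from the
	// partial Agent configuration, so its first request drains the queue.
	first := q.get(0)
	fmt.Println("first batch size:", len(first)) // 5000, far above a typical bulk_max_size
}
```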
This is notably a problem for the Elasticsearch and Shipper outputs (and possibly others), which can have upstream caps on batch size, causing them to either drop the entire batch or to enter a retry loop that stalls the ingestion pipeline (#29778, #34695).
We need to correct the initialization process so `eventConsumer` doesn't begin creating batches until a valid output is configured; this will still allow incoming data to accumulate in the queue, but no explicit batches should be created until we know what the output workers can accept.
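One possible shape for that gating, as a rough sketch under the same simplified model (again with hypothetical names, not the real `eventConsumer` implementation): batch-size updates of 0 are ignored, and no queue request is issued until an output has supplied a usable size.

```go
package main

import "fmt"

type batch []int // stand-in for a batch of pipeline events

type sliceQueue struct{ events []int }

// get returns at most max events (max > 0 is guaranteed by the consumer below).
func (q *sliceQueue) get(max int) batch {
	n := len(q.events)
	if max < n {
		n = max
	}
	b := batch(q.events[:n])
	q.events = q.events[n:]
	return b
}

type consumer struct {
	queue            *sliceQueue
	batchSizeUpdates chan int      // outputs push their requested batch size here
	out              chan batch    // assembled batches destined for output workers
	done             chan struct{}
}

func (c *consumer) run() {
	targetSize := 0
	for {
		// Gate: block until a valid (nonzero) batch size arrives. Incoming
		// events keep accumulating in the queue, but no batches are assembled.
		for targetSize <= 0 {
			select {
			case targetSize = <-c.batchSizeUpdates:
			case <-c.done:
				return
			}
		}
		// Pick up any size change from a reconfigured output, then request
		// at most targetSize events from the queue.
		select {
		case targetSize = <-c.batchSizeUpdates:
			continue
		case <-c.done:
			return
		default:
		}
		c.out <- c.queue.get(targetSize)
	}
}

func main() {
	q := &sliceQueue{}
	for i := 0; i < 5000; i++ { // inputs publish before any output exists
		q.events = append(q.events, i)
	}

	c := &consumer{
		queue:            q,
		batchSizeUpdates: make(chan int, 4),
		out:              make(chan batch, 1),
		done:             make(chan struct{}),
	}
	go c.run()

	// The staged Agent configuration sets the size to 0 three times before
	// the real output connects; the gate above simply ignores those updates.
	c.batchSizeUpdates <- 0
	c.batchSizeUpdates <- 0
	c.batchSizeUpdates <- 0
	c.batchSizeUpdates <- 50 // e.g. the shipper output's default batch size

	fmt.Println("first batch size:", len(<-c.out)) // 50, not 5000
	close(c.done)
}
```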
(This issue currently causes repeatable pipeline deadlocks for me when targeting the shipper.)
After talking over agent/beats initialization with @fearful-symmetry, this may not really be unique to the Agent startup process. It might be present but unnoticed in vanilla beats, since we're doing things with agent that are particularly sensitive to batch size -- needs follow-up to determine the full scope/cause.