Implement more efficient output tuning parameters to manage throughput #28
@nimarezainia did you have a chance to work on the requirements for this?
I'm still working on defining these.
Interested to see the list of what we want to support, but we also need to consider that we need to avoid any breaking changes here. Today, we allow the user to use any Elasticsearch output setting from the UI, kibana.yml configuration, and API (though this isn't GA yet). Migrating this would be quite painful, mostly because we need to send a valid configuration to any agents that are not running the shipper (we support any agent >= 7.17.0 with any 8.x version of the Stack). Otherwise, Kibana does have the ability to run migrations during upgrades, which would allow us to transform the user's YAML. This would also need to be done in the API and kibana.yml configuration code.
Thinking about breaking changes some more, I'm curious if we need to consider the following when switching from beats outputs to the shipper:
This is just one example, there could be other related configs like
@joshdover yes, that is definitely a possible problem when enabling the shipper: users may need to retune their worker and bulk_max_size configurations if they were using them before. Even if we tried to apply the same configuration as before, it may not behave equivalently, as the data flowing through each worker will have changed. Filebeat workers would likely only write to
There is no way to configure the underlying beat queue from an agent policy right now, so that at least isn't a concern.
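For context, these are the kinds of settings in question. A hypothetical Elasticsearch output section of an agent policy tuned with the existing beats-style parameters might look like this (hosts and values are illustrative, not recommendations):

```yaml
outputs:
  default:
    type: elasticsearch
    hosts: ["https://elasticsearch.example.com:9200"]
    # Existing beats-style output tuning parameters:
    worker: 4           # number of concurrent workers publishing to the output
    bulk_max_size: 1600 # maximum number of events per bulk request (a count, not bytes)
```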
@nimarezainia Do we have a requirements doc for this? Otherwise it is going to be hard to design.
@jlind23 I'll share the requirements doc shortly.
I've updated the description here to reflect the proposed changes to the output configuration, which I believe are the most impactful. We will likely want follow-up issues about:
We will also need to consider how to handle existing agent policies that specify the existing worker and bulk_max_size parameters as advanced YAML configuration. We will likely need to handle both the old and new sets of parameters. Fleet could migrate the policy for us, but that won't help standalone agents.
Given this will affect the agent policy and the Fleet UI, we should probably convert this into (or create) a cross-team feature issue for this work. We will likely want to break each of the changes in the proposal into individual issues so they can be investigated and implemented incrementally.
If we were to switch to using the go-elasticsearch client's BulkIndexer we would get this change essentially for free. BulkIndexer allows specifying a flush threshold in bytes and a minimum flush duration. https://pkg.go.dev/github.com/elastic/go-elasticsearch/v8/esutil#BulkIndexerConfig
@cmacknz shouldn't we then switch to the go-elasticsearch client for good?
Yes, I have prioritized the switch with #14 as the next task for the shipper.
@alexsapran - put this one on your radar
I would close this once we have proven that the go-elasticsearch client behaves the way we want and that no additional changes will be required. I'll also have to confirm that the Fleet UI changes are tracked separately, since they are mentioned here.
@cmacknz @leehinman shall we keep this one in the next sprint, or have we had enough time to double-check that the go-elasticsearch behaviour was as expected?
Beats have many tuning knobs that allow the user to modify output-related parameters in order to increase throughput. These parameters are convoluted and sometimes contradict one another. With the new shipper design we have the opportunity to simplify them and create more meaningful parameters for users.
Performance Tuning Proposal
1. maximum_batch_size: the maximum size, in bytes, of a batch sent to the output
a. Bytes are easier to mentally consume
b. It's also easier to map bytes to data seen on the wire
c. On the Elasticsearch ingest side, the maximum document size is configured in bytes
2. output_queue_flush_timeout: how long the output queue waits before flushing
a. Upon expiry the output queue is flushed and data is written to the output
b. Users can lower this timeout to reduce the delay in collecting data
In summary, for tuning the output we will now have 2 variables: maximum_batch_size and output_queue_flush_timeout
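To make the proposal concrete, a hypothetical agent policy output section using the two proposed parameters could look like the sketch below. The key names come from this proposal; their exact spelling, placement, and value formats are assumptions and not final.

```yaml
outputs:
  default:
    type: elasticsearch
    hosts: ["https://elasticsearch.example.com:9200"]
    maximum_batch_size: 5MB        # flush a batch once it reaches this many bytes
    output_queue_flush_timeout: 5s # flush whatever is queued after this long
```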