
[Feature Request] An automation tool to help identify the optimal bulk/batch size for ingestion #508

Closed
chishui opened this issue Apr 7, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@chishui
Contributor

chishui commented Apr 7, 2024

Is your feature request related to a problem? Please describe.

In opensearch-project/OpenSearch#12457, we proposed a batch ingestion feature that can accelerate ingestion with neural search processors. It introduces an additional parameter, "batch size", so that texts from different documents can be combined and sent to the ML server in one request. Since users may have different data sets and different ML servers with different resources, they would need to experiment with different batch size values to achieve optimal performance. To offload this burden from users, we'd like an automation tool that finds the optimal batch size automatically.
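For illustration only, here is a minimal sketch of how the proposed parameter might be exercised against the _bulk API; the `batch_size` query parameter name, the index name, and the endpoint below are assumptions based on the proposal, not a confirmed API:

```python
import json
import requests

# Hypothetical sketch: send a _bulk request with a proposed "batch_size"
# query parameter so the ingest pipeline can batch texts sent to the ML server.
OPENSEARCH_URL = "http://localhost:9200"   # assumed local test cluster
INDEX = "my-neural-index"                  # hypothetical index backed by a neural-search pipeline

docs = [{"text": f"sample document {i}"} for i in range(100)]

# Build the NDJSON bulk body: one action line followed by one source line per document.
lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": INDEX}}))
    lines.append(json.dumps(doc))
body = "\n".join(lines) + "\n"

response = requests.post(
    f"{OPENSEARCH_URL}/_bulk",
    params={"batch_size": 10},  # proposed parameter; the value users would need to tune
    headers={"Content-Type": "application/x-ndjson"},
    data=body,
)
print(response.json().get("errors"))
```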

Describe the solution you'd like

The automation tool would run the _bulk API with different batch sizes to see which one leads to optimal performance (high throughput, low latency, and no errors). OpenSearch Benchmark already provides rich benchmarking features that we can utilize for this automation: we can invoke the benchmark with different parameters, collect, evaluate, and even visualize the results, and then provide a recommendation.

The tool can also be used to help select the bulk size and the number of clients, which we can support gradually.
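As a rough sketch of the automation loop described above (the workload name, workload-param key, and CLI flags below are assumptions for illustration, not a committed interface):

```python
import subprocess

# Hypothetical sweep: invoke OpenSearch Benchmark once per candidate batch size,
# writing each run's results to its own file so the runs can be compared afterwards.
batch_sizes = [1, 10, 50, 100]

for batch_size in batch_sizes:
    subprocess.run(
        [
            "opensearch-benchmark", "execute-test",
            "--workload=vectorsearch",                       # assumed workload
            f"--workload-params=batch_size:{batch_size}",    # assumed parameter name
            f"--results-file=results-batch-{batch_size}.md",
        ],
        check=True,
    )

# A real tool would then parse throughput, latency, and error-rate metrics from each
# run (e.g. from the results files or an external metrics store), discard runs with
# errors, and recommend the batch size with the best throughput/latency trade-off.
# The same loop could later be extended to sweep bulk size and client count as well.
```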

Describe alternatives you've considered

N/A

Additional context

This issue was originally created in the OpenSearch core repo, but it was suggested to repost it in OSB: opensearch-project/OpenSearch#13009

@chishui
Contributor Author

chishui commented Apr 10, 2024

I put the example code in this repo: https://github.com/chishui/opensearch-ingest-param-tuning-tool. Please evaluate it to see if we can put the functionality in OSB.

chishui added a commit to chishui/opensearch-benchmark that referenced this issue Apr 16, 2024
chishui added a commit to chishui/opensearch-benchmark that referenced this issue Apr 28, 2024
@IanHoang
Collaborator

IanHoang commented May 14, 2024

Thank you for suggesting this @chishui. After our offline discussion and after inspecting your changes, I have some concerns about this feature and its user experience.

OpenSearch performance is complex in nature and is influenced by several factors, such as cluster configurations, hardware, and workload types. OSB was designed to be a flexible tool that encourages users to engage in the experimentation process in order to develop a better understanding of their cluster's performance and make strategic decisions based on the data collected.

Some of my concerns for this feature:

  • A subcommand like tuning that automatically recommends optimal parameters might not account for all factors that influence OpenSearch performance and can potentially mislead users. It could also obscure important details and overall reduce the user’s understanding of these underlying factors.
  • Users also might become overly reliant on this feature and assume that it always provides the best possible configuration without testing and evaluating results themselves. This conflicts with the core purpose of OSB, which is to serve as a flexible tool that encourages users to engage in the experimentation process to understand cluster performance.
  • The tuning subcommand only runs a single test for each mix of parameters and then moves on to the next test with a different mix of the same parameters. OpenSearch performance can fluctuate, and one-time test results may not hold true over time. To combat this, we always recommend that users run several tests with the same parameters to ensure they get repeatable results before deciding which configuration is best.
  • Users often customize their tests to meet their unique requirements. This feature has no insight into those unique requirements and is solely focused on comparing the performance of configurations from one-time tests.
  • Although the compare subcommand might be similar in terms of how it compares different test execution results and shows the percent differences, it is a fairly lightweight operation and leaves the interpretation up to the user.

Alternative solution

To address some of these concerns, we can take an alternative approach and have studies performed in the open that show how variables like indexing / search clients, bulk size, and batch size influence a cluster's performance. These studies would serve as a learning resource that's publicly available and reproducible (similar to the nightly runs at opensearch.org/benchmarks). We can also add more documentation on performance testing and tuning variables. These resources would set users up for long-term success by equipping them with a better understanding of these variables and empowering them to use this knowledge, paired with OSB, to better assess their cluster's performance.

Again, although I see the use case, I'm not certain that it fits within the design and purpose of OSB. The current implementation of the tuning subcommand is essentially a wrapper around OSB's core action of running tests and would be more suitable as a separate tool. We can discuss this further if you'd like and see how we can incorporate it elsewhere, perhaps in a performance tool suite. Also, tagging other maintainers -- @gkamat @rishabh6788 @cgchinmay @beaioun -- to see if they have any feedback they'd like to add.

@IanHoang
Collaborator

IanHoang commented May 24, 2024

@chishui Closing this issue for now as there has been no activity. We can sync and collaborate to find a suitable solution for your needs.
