
[Feature Request] An automation tool to help identify the optimal bulk/batch size for ingestion #13009

Closed
chishui opened this issue Apr 2, 2024 · 8 comments
Labels
enhancement, Indexing:Performance, untriaged

Comments

@chishui
Contributor

chishui commented Apr 2, 2024

Is your feature request related to a problem? Please describe

In the batch ingestion RFC, we proposed a batch ingestion feature which could accelerate ingestion with neural search processors. It introduces an additional parameter, "batch size", so that texts from different documents can be combined and sent to the ML server in one request. Since users can have different data sets and different ML servers with different resources, they would need to experiment with different batch size values to achieve the best performance. To offload this burden from users, we'd like to have an automation tool which could find the optimal batch size automatically.

Describe the solution you'd like

The automation tool would run the _bulk API with different batch sizes to see which one leads to optimal performance (high throughput, low latency, and no errors). The OpenSearch Benchmark (OSB) tool already provides rich benchmarking features which we could leverage for this automation. We can invoke the benchmark with different parameters, collect, evaluate, and even visualize the results, and then provide a recommendation.

The tool can also be used to help select bulk size and client count, which we can support gradually.
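
As a rough illustration, a minimal sketch of the automation: it sweeps candidate batch sizes through OSB and picks the best one. The workload name, the `batch_size` workload parameter, and the metric extraction below are assumptions for discussion, not a finished design; OSB flag names should be verified against the OSB docs.

```python
# Sketch of the proposed automation: sweep candidate batch sizes through
# OpenSearch Benchmark (OSB) and recommend the best-performing one.
# The workload name, the "batch_size" workload parameter, and the metric
# extraction are illustrative assumptions, not a finished OSB integration.
import subprocess

CANDIDATE_BATCH_SIZES = [1, 2, 4, 8, 16, 32]


def run_osb(batch_size: int) -> None:
    """Run one OSB test with the given batch size (workload name is hypothetical)."""
    subprocess.run(
        [
            "opensearch-benchmark", "execute-test",
            "--workload=neural_ingest",                    # hypothetical workload
            f"--workload-params=batch_size:{batch_size}",
            f"--results-file=results-batch-{batch_size}.md",
        ],
        check=True,
    )


def read_metrics(batch_size: int) -> dict:
    """Placeholder: pull throughput / latency / error rate for one run
    from wherever OSB was configured to store its results."""
    raise NotImplementedError


def recommend() -> int:
    for bs in CANDIDATE_BATCH_SIZES:
        run_osb(bs)
    metrics = {bs: read_metrics(bs) for bs in CANDIDATE_BATCH_SIZES}
    # Recommend the error-free run with the highest mean throughput.
    return max(
        (bs for bs, m in metrics.items() if m["error_rate"] == 0),
        key=lambda bs: metrics[bs]["mean_throughput"],
    )
```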

Related component

Indexing:Performance

Describe alternatives you've considered

N/A

Additional context

RFC for batch ingestion: #12457

@shwetathareja
Member

@chishui : The proposal here is to build an automation over the OSB benchmark which can find the optimal size for bulk. This doesn't require any changes in OpenSearch core. Do you think we can track this proposal in the OSB repo?

@chishui
Contributor Author

chishui commented Apr 2, 2024

@shwetathareja thanks for the comment. Although the proposal will not require changes in OpenSearch core, it's a solution to address a concern in RFC #12457, which proposes the batch ingestion feature in core. The tool itself doesn't need to reside in the OSB codebase; it can be a standalone package that just depends on OSB. So I kind of feel this might be a better place for it. WDYT?

@mgodwan
Member

mgodwan commented Apr 2, 2024

The tool can also be used to help select bulk size and client count, which we can support gradually.

Today, using OSB, you can configure both the number of clients and the bulk size. Is the proposal around executing a set of runs and providing a collated report for users to analyse better? Let me know if I may be missing something.

I've run some benchmarks for features and have written small custom scripts on top of OSB to generate visualizations from OSB reports (see example #4489 (comment)), which may prove helpful for such comparisons, and I had been thinking of contributing similar ideas to OSB.
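
For reference, the kind of comparison plot I mean is roughly the following sketch; it assumes the throughput numbers have already been extracted from OSB results, and the numbers here are placeholders rather than real benchmark output.

```python
# Sketch: plot mean ingestion throughput per batch size from a set of OSB runs.
# The numbers are illustrative placeholders, not real benchmark output.
import matplotlib.pyplot as plt

batch_sizes = [1, 2, 4, 8, 16, 32]
mean_throughput = [120, 210, 350, 410, 430, 415]  # docs/s, hypothetical

plt.plot(batch_sizes, mean_throughput, marker="o")
plt.xscale("log", base=2)
plt.xlabel("batch_size")
plt.ylabel("mean throughput (docs/s)")
plt.title("Ingestion throughput vs. batch size")
plt.savefig("throughput_vs_batch_size.png")
```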

The tool itself doesn't need to reside in the OSB codebase; it can be a standalone package that just depends on OSB. So I kind of feel this might be a better place for it. WDYT?

Since this is mostly about benchmarking scenarios, any changes that are needed would likely be in OSB itself or depend on it. I also think the issue may be better placed in the OSB repository to discuss performance measurement solutions.

@reta
Collaborator

reta commented Apr 2, 2024

@chishui I think the intent is good, but the problem space is not fully understood. As it stands now, this is an operational issue: the optimal bulk size depends heavily on the operational state of the cluster (primarily the available heap) at a given moment (please check out the Elasticsearch docs on the subject [1]).

Yes, a tool will probably help to give at least some level of confidence, but it won't address the root of the problem.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#multiple-workers-threads

@chishui
Contributor Author

chishui commented Apr 2, 2024

Today, using OSB, you can configure both the number of clients and the bulk size. Is the proposal around executing a set of runs and providing a collated report for users to analyse better? Let me know if I may be missing something.

Yes, that's the idea!

I've run some benchmarks for features and have written small custom scripts on top of OSB to generate visualizations from OSB reports (see example #4489 (comment)), which may prove helpful for such comparisons, and I had been thinking of contributing similar ideas to OSB.

That's great, it seems like the tool could have a much wider use case.

Since this is mostly about benchmarking scenarios, any changes that are needed would likely be in OSB itself or depend on it. I also think the issue may be better placed in the OSB repository to discuss performance measurement solutions.

I just feel the tool doesn't need to be part of OSB, and it requires almost no changes to OSB, at least for now.

@chishui
Contributor Author

chishui commented Apr 2, 2024

@reta the idea is to try many combinations of the variables which could impact ingestion performance, such as bulk size, client count, batch size, etc., and, based on the benchmark results, identify the combination which leads to the optimal ingestion performance.

It's true that the "optimal" benchmark result could happen simply because the OpenSearch node had more CPU and memory available at that moment, and we are not able to run every benchmark in exactly the same environment. If that's the concern, we could run the tool multiple times so the results are statistically meaningful. Actually, we can let users configure a retry count for each test and average the benchmark results before comparing them, to reduce the impact of outliers.
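
A rough sketch of what I mean; the `measure_throughput` hook is hypothetical and would wrap a single benchmark run (e.g. an OSB invocation):

```python
# Sketch: sweep combinations of bulk size, client count, and batch size,
# repeat each configuration a few times, and average the results to reduce
# run-to-run variance. measure_throughput is a hypothetical hook that would
# wrap a single OSB run and return docs/s.
from itertools import product
from statistics import mean

BULK_SIZES = [100, 500, 1000]
CLIENT_COUNTS = [1, 2, 4]
BATCH_SIZES = [1, 8, 32]
RETRIES = 3  # user-configurable repetitions per configuration


def measure_throughput(bulk_size: int, clients: int, batch_size: int) -> float:
    """Placeholder for one benchmark run (e.g. an OSB invocation)."""
    raise NotImplementedError


def sweep() -> tuple:
    results = {}
    for combo in product(BULK_SIZES, CLIENT_COUNTS, BATCH_SIZES):
        runs = [measure_throughput(*combo) for _ in range(RETRIES)]
        results[combo] = mean(runs)  # averaging smooths out outliers
    return max(results, key=results.get)
```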

@shwetathareja
Copy link
Member

I just feel the tool doesn't need to be part of OSB, and it requires almost no changes to OSB, at least for now.

@chishui : This could be an enhancement to OSB where it can be used for optimal tuning of different configurations.

@peternied
Member

[Triage]
@chishui Thanks for creating this issue. This is an interesting idea; however, it is outside the scope of OpenSearch core. Consider re-creating this issue in the OpenSearch Benchmark repository, where this functionality might better align.
