
[Feature Request] An automation tool to help identify the optimal bulk/batch size for ingestion #13009

Closed
chishui opened this issue Apr 2, 2024 · 8 comments
Labels
enhancement, Indexing:Performance, untriaged

Comments

@chishui
Contributor

chishui commented Apr 2, 2024

Is your feature request related to a problem? Please describe

In the batch ingestion RFC, we proposed a batch ingestion feature which could accelerate ingestion with neural search processors. It introduces an additional parameter, "batch size", so that texts from different documents can be combined and sent to the ML server in one request. Since users can have different data sets and different ML servers with different resources, they would need to experiment with different batch size values to achieve the best performance. To offload this burden from users, we'd like to have an automation tool which could find the optimal batch size automatically.

Describe the solution you'd like

The automation tool would run the _bulk API with different batch sizes to see which one leads to optimal performance (high throughput, low latency, and no errors). The OpenSearch Benchmark (OSB) tool already provides rich benchmarking features which we could leverage for this automation. We can invoke the benchmark with different parameters, collect, evaluate, and even visualize the results, and then provide a recommendation.

The tool can also be used to help select bulk size and client count, which we can support gradually.
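
As a rough illustration, a minimal sketch of the automation: it sweeps candidate batch sizes through OSB and picks the best one. The workload name, the `batch_size` workload parameter, and the metric extraction below are assumptions for discussion, not a finished design; OSB flag names should be verified against the OSB docs.

```python
# Sketch of the proposed automation: sweep candidate batch sizes through
# OpenSearch Benchmark (OSB) and recommend the best-performing one.
# The workload name, the "batch_size" workload parameter, and the metric
# extraction are illustrative assumptions, not a finished OSB integration.
import subprocess

CANDIDATE_BATCH_SIZES = [1, 2, 4, 8, 16, 32]


def run_osb(batch_size: int) -> None:
    """Run one OSB test with the given batch size (workload name is hypothetical)."""
    subprocess.run(
        [
            "opensearch-benchmark", "execute-test",
            "--workload=neural_ingest",                    # hypothetical workload
            f"--workload-params=batch_size:{batch_size}",
            f"--results-file=results-batch-{batch_size}.md",
        ],
        check=True,
    )


def read_metrics(batch_size: int) -> dict:
    """Placeholder: pull throughput / latency / error rate for one run
    from wherever OSB was configured to store its results."""
    raise NotImplementedError


def recommend() -> int:
    for bs in CANDIDATE_BATCH_SIZES:
        run_osb(bs)
    metrics = {bs: read_metrics(bs) for bs in CANDIDATE_BATCH_SIZES}
    # Recommend the error-free run with the highest mean throughput.
    return max(
        (bs for bs, m in metrics.items() if m["error_rate"] == 0),
        key=lambda bs: metrics[bs]["mean_throughput"],
    )
```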

Related component

Indexing:Performance

Describe alternatives you've considered

N/A

Additional context

RFC for batch ingestion: #12457

@shwetathareja
Member

@chishui : The proposal here is to build an automation over the OSB benchmark which can find the optimal size for bulk. This doesn't require any changes in OpenSearch core. Do you think we can track this proposal in the OSB repo?

@chishui
Contributor Author

chishui commented Apr 2, 2024

@shwetathareja thanks for the comment. Although the proposal will not require changes in OpenSearch core, it's a solution to address a concern in RFC #12457, which proposes the batch ingestion feature in core. The tool itself doesn't need to reside in the OSB codebase; it can be a standalone package that just depends on OSB. So I kind of feel this might be a better place for it. WDYT?

@mgodwan
Member

mgodwan commented Apr 2, 2024

The tool can also be used to help select bulk size and client count, which we can support gradually.

Today, using OSB, you can configure both the number of clients and the bulk size. Is the proposal around executing a set of runs and providing a collated report for users to analyse better? Let me know if I may be missing something.

I've run some benchmarks for features and have written small custom scripts on top of OSB to generate visualizations from OSB reports (see example #4489 (comment)), which may prove helpful for such comparisons, and I had been thinking of contributing similar ideas to OSB.
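
For reference, the kind of comparison plot I mean is roughly the following sketch; it assumes the throughput numbers have already been extracted from OSB results, and the numbers here are placeholders rather than real benchmark output.

```python
# Sketch: plot mean ingestion throughput per batch size from a set of OSB runs.
# The numbers are illustrative placeholders, not real benchmark output.
import matplotlib.pyplot as plt

batch_sizes = [1, 2, 4, 8, 16, 32]
mean_throughput = [120, 210, 350, 410, 430, 415]  # docs/s, hypothetical

plt.plot(batch_sizes, mean_throughput, marker="o")
plt.xscale("log", base=2)
plt.xlabel("batch_size")
plt.ylabel("mean throughput (docs/s)")
plt.title("Ingestion throughput vs. batch size")
plt.savefig("throughput_vs_batch_size.png")
```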

The tool itself doesn't need to reside in the OSB codebase; it can be a standalone package that just depends on OSB. So I kind of feel this might be a better place for it. WDYT?

Since this is mostly about benchmarking scenarios, any changes that are needed would likely be in OSB itself or depend on it. I also think the issue may be better placed in the OSB repository to discuss performance measurement solutions.

@reta
Collaborator

reta commented Apr 2, 2024

@chishui I think the intent is good, but the problem space is not fully understood. As it stands now, this is an operational issue: the optimal bulk size depends heavily on the operational state of the cluster (primarily the available heap) at a given moment (please check out the Elasticsearch docs on the subject [1]).

Yes, a tool will probably help to give at least some level of confidence, but it won't address the root of the problem.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#multiple-workers-threads

@chishui
Contributor Author

chishui commented Apr 2, 2024

Today, using OSB, you can configure both the number of clients and the bulk size. Is the proposal around executing a set of runs and providing a collated report for users to analyse better? Let me know if I may be missing something.

Yes, that's the idea!

I've run some benchmarks for features and have written small custom scripts on top of OSB to generate visualizations from OSB reports (see example #4489 (comment)), which may prove helpful for such comparisons, and I had been thinking of contributing similar ideas to OSB.

That's great, it seems like the tool could have a much wider use case.

Since this is mostly about benchmarking scenarios, any changes that are needed would likely be in OSB itself or depend on it. I also think the issue may be better placed in the OSB repository to discuss performance measurement solutions.

I just feel the tool doesn't need to be part of OSB, and it requires almost no changes to OSB, at least for now.

@chishui
Contributor Author

chishui commented Apr 2, 2024

@reta the idea is to try many combinations of the variables which could impact ingestion performance, such as bulk size, client count, batch size, etc., and, based on the benchmark results, identify the combination which leads to the optimal ingestion performance.

It's true that the "optimal" benchmark result could happen simply because the OpenSearch node had more CPU and memory available at that moment, and we are not able to run every benchmark in exactly the same environment. If that's the concern, we could run the tool multiple times so the results are statistically meaningful. Actually, we can let users configure a retry count for each test and average the benchmark results before comparing them, to reduce the impact of outliers.
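
A rough sketch of what I mean; the `measure_throughput` hook is hypothetical and would wrap a single benchmark run (e.g. an OSB invocation):

```python
# Sketch: sweep combinations of bulk size, client count, and batch size,
# repeat each configuration a few times, and average the results to reduce
# run-to-run variance. measure_throughput is a hypothetical hook that would
# wrap a single OSB run and return docs/s.
from itertools import product
from statistics import mean

BULK_SIZES = [100, 500, 1000]
CLIENT_COUNTS = [1, 2, 4]
BATCH_SIZES = [1, 8, 32]
RETRIES = 3  # user-configurable repetitions per configuration


def measure_throughput(bulk_size: int, clients: int, batch_size: int) -> float:
    """Placeholder for one benchmark run (e.g. an OSB invocation)."""
    raise NotImplementedError


def sweep() -> tuple:
    results = {}
    for combo in product(BULK_SIZES, CLIENT_COUNTS, BATCH_SIZES):
        runs = [measure_throughput(*combo) for _ in range(RETRIES)]
        results[combo] = mean(runs)  # averaging smooths out outliers
    return max(results, key=results.get)
```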

@shwetathareja
Copy link
Member

I just feel the tool doesn't need to be part of OSB, and it requires almost no changes to OSB, at least for now.

@chishui : This could be an enhancement to OSB where it can be used for optimal tuning of different configurations.

@peternied
Member

[Triage]
@chishui Thanks for creating this issue. This is an interesting idea; however, it is outside the scope of OpenSearch core. Consider re-creating this issue in the OpenSearch Benchmark repository, where this functionality might better align.
