[Feature Request] An automation tool to help identify the optimal bulk/batch size for ingestion #13009
Comments
@chishui : The proposal here is to build an automation over OSB benchmark which can find the optimal size for bulk. This doesn't require any changes in OpenSearch core. Do you think we can track this proposal in the OSB repo?
@shwetathareja thanks for the comment. Although the proposal will not require changes in OpenSearch core, it's a solution to a concern raised in RFC #12457, which proposes a batch ingestion feature in core. The tool itself doesn't need to reside in the OSB codebase; it can be a standalone package that just depends on OSB. So I feel this might be the better place. WDYT?
Today, using OSB, you can configure both the number of clients and the bulk size. Is the proposal around executing a set of runs and providing a collated report for users to analyse better? Let me know if I may be missing something. I've run some benchmarks for features and have written small custom scripts on top of OSB to generate visualizations from its reports (see example #4489 (comment)), which may prove helpful for such comparisons; I had been thinking of contributing similar ideas to OSB.
Since this is mostly about benchmarking scenarios, any changes needed would likely be in OSB itself or depend on it. I also think the issue may be better placed in the OSB repository, where performance measurement solutions are discussed.
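A hedged sketch of the kind of comparison script mentioned above (the numbers, field names, and `best_batch_size` helper are all illustrative placeholders, not real OSB output; in practice the values would be parsed from OSB's results files):

```python
# Illustrative results for three runs with different batch sizes.
# These numbers are made up for the example; a real script would
# parse them out of the per-run OSB results files.
results = {
    1: {"throughput": 9500, "p90_latency_ms": 42},
    4: {"throughput": 12800, "p90_latency_ms": 55},
    8: {"throughput": 13100, "p90_latency_ms": 81},
}

def best_batch_size(results, max_latency_ms=60):
    """Pick the highest-throughput batch size that meets a latency budget."""
    eligible = {b: r for b, r in results.items()
                if r["p90_latency_ms"] <= max_latency_ms}
    return max(eligible, key=lambda b: eligible[b]["throughput"])
```

The latency budget reflects the issue's definition of "optimal" as high throughput *and* low latency, rather than throughput alone.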
@chishui I think the intent is good but the problem space is not fully understood. As it stands now, this is an operational issue: the optimal bulk size depends heavily on the operational state of the cluster (primarily available heap) at a given moment (please check out the Elasticsearch docs on the subject [1]). Yes, a tool will probably help to give at least some level of confidence, but it won't address the root of the problem.
Yes, that's the idea!
That's great, seems like the tool could have a much wider use case.
I just feel the tool doesn't need to be part of OSB, and it requires almost no changes to OSB, at least for now.
@reta the idea is to try many combinations of the variables which could impact ingestion performance, such as bulk size, number of clients, batch size, etc., and, based on the benchmark results, identify the combination that leads to the optimal ingestion performance. It's true that the "optimal" benchmark result could occur when the OS node happens to have more CPU or memory available, and we can't guarantee the exact same environment for every benchmark run. If that's the concern, we can run the tool multiple times so the results are statistically meaningful. In fact, we can let the user configure a retry count for each test and average the benchmark results before comparing, to reduce the impact of outliers.
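A minimal sketch of this sweep-and-average idea (everything here is illustrative: `run_fn` stands in for whatever executes one OSB run and returns its measured throughput, and the scoring is a simple mean over retries):

```python
import itertools
import statistics

def find_optimal(run_fn, bulk_sizes, client_counts, batch_sizes, retries=3):
    """Try every parameter combination, average `retries` runs of each,
    and return the best combination with its mean throughput.

    `run_fn(bulk_size, clients, batch_size)` is caller-supplied and would,
    in practice, shell out to opensearch-benchmark and parse its results.
    """
    best_combo, best_score = None, float("-inf")
    for combo in itertools.product(bulk_sizes, client_counts, batch_sizes):
        # Average several runs per combination to dampen outliers,
        # as suggested in the comment above.
        score = statistics.mean(run_fn(*combo) for _ in range(retries))
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score
```

Averaging over repeated runs only reduces noise; as noted in the thread, it cannot remove systematic drift in the cluster's operational state between runs.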
@chishui : This could be an enhancement to OSB, where it can be used for optimal tuning of different configurations.
Is your feature request related to a problem? Please describe
In the batch ingestion RFC, we proposed a batch ingestion feature which could accelerate ingestion with neural search processors. It introduces an additional parameter, "batch size", so that texts from different documents can be combined and sent to the ML server in one request. Since users may have different data sets and different ML servers with different resources, they would need to experiment with different batch size values to achieve the best performance. To offload this burden from users, we'd like to have an automation tool which can find the optimal batch size automatically.
Describe the solution you'd like
The automation tool would run the _bulk API with different batch sizes to see which one leads to optimal performance (high throughput, low latency, and no errors). The opensearch-benchmark tool already provides rich benchmarking features which we can utilize for this automation: we can invoke it with different parameters, then collect, evaluate, and even visualize the results before providing a recommendation.
The tool can also be used to help select bulk size and client count, which we can support incrementally.
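The tool's interaction with OSB could look roughly like the following sketch. The workload name is a placeholder, and `batch_size` is the new workload parameter proposed in RFC #12457, not an existing one; only `execute-test`, `--workload`, `--workload-params`, and `--results-file` are assumed from the OSB CLI.

```python
import subprocess

def osb_command(batch_size, bulk_size=1000, workload="my_neural_workload"):
    """Build the OSB invocation for one (bulk_size, batch_size) combination.

    `my_neural_workload` is a hypothetical workload; `batch_size` is the
    parameter this RFC proposes, so it would only work once supported.
    """
    return [
        "opensearch-benchmark", "execute-test",
        f"--workload={workload}",
        f"--workload-params=bulk_size:{bulk_size},batch_size:{batch_size}",
        f"--results-file=results_batch_{batch_size}.json",
    ]

def run_one(batch_size):
    # Shell out to OSB; the tool would then parse the results file
    # (throughput, latency, error rate) before comparing combinations.
    subprocess.run(osb_command(batch_size), check=True)
```

Driving OSB as a subprocess, rather than importing it, is what lets the tool live outside the OSB codebase while still depending on it, as discussed above.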
Related component
Indexing:Performance
Describe alternatives you've considered
N/A
Additional context
RFC for batch ingestion: #12457