[Feature Request] An automation tool to help identify the optimal bulk/batch size for ingestion #508
Comments
I put the example code in this repo: https://github.com/chishui/opensearch-ingest-param-tuning-tool. Please evaluate whether we can incorporate the functionality into OSB.
Thank you for suggesting this @chishui. After our discussion offline and inspecting your changes, I have some concerns about this feature and its user experience. OpenSearch performance is complex in nature and is influenced by several factors, such as cluster configuration, hardware, and workload type. OSB was designed to be a flexible tool that encourages users to engage in the experimentation process in order to develop a better understanding of their cluster's performance and make strategic decisions based on the collected data. Some of my concerns with this feature:
Alternative solution
To address some of these concerns, we can take an alternative approach and have studies performed in the open that show how variables like indexing/search clients, bulk size, and batch size influence a cluster's performance. These studies would serve as a learning resource that's publicly available and reproducible (similar to the nightly runs at opensearch.org/benchmarks). We can also add more documentation on performance testing and tuning variables. These resources would set users up for long-term success by equipping them with a better understanding of these variables and empowering them to use this knowledge, paired with OSB, to better assess their cluster's performance. Again, although I see the use case, I'm not certain that it fits within the design and purpose of OSB. The current implementation of the …
@chishui Closing this issue for now due to inactivity. We can sync and collaborate to find a suitable solution for your needs.
Is your feature request related to a problem? Please describe.
In opensearch-project/OpenSearch#12457, we proposed a batch ingestion feature that could accelerate ingestion with neural search processors. It introduces an additional parameter, "batch size", so that texts from different documents can be combined and sent to the ML server in one request. Since users may have different data sets and different ML servers with different resources, they would need to experiment with different batch size values to achieve optimal performance. To offload this burden from users, we'd like to have an automation tool that finds the optimal batch size automatically.
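To illustrate the idea (not the actual plugin code), here is a minimal sketch of what batching means here: texts from several documents are grouped so that each group can be sent to the ML server in a single inference request. The function name and document shape are hypothetical.

```python
from typing import Dict, List


def combine_for_inference(docs: List[Dict[str, str]], batch_size: int) -> List[List[str]]:
    """Group document texts into batches; each batch would be sent to the
    ML server as one inference request instead of one request per document.

    `docs` and the "text" field are illustrative assumptions, not the
    real ingest-processor data model.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    texts = [doc["text"] for doc in docs]
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]


# With batch_size=1 (the status quo), 5 documents mean 5 ML requests;
# with batch_size=2 they collapse into 3 requests.
batches = combine_for_inference([{"text": f"doc {i}"} for i in range(5)], batch_size=2)
```

A larger batch size reduces per-request overhead on the ML server, but the best value depends on the model, hardware, and data, which is exactly why it needs tuning.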
Describe the solution you'd like
The automation tool would run the _bulk API with different batch sizes to see which one leads to optimal performance (high throughput, low latency, and no errors). OpenSearch Benchmark already provides rich benchmarking features that we could utilize for this automation. We can invoke the benchmark with different parameters, then collect, evaluate, and even visualize the results and provide a recommendation.
The tool can also be used to help select bulk size and client count, which we can support gradually.
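The selection loop described above could be sketched as follows. This is a hypothetical outline, not OSB code: `run_benchmark` stands in for whatever wraps an OpenSearch Benchmark invocation for one candidate batch size, and the metric names (`throughput`, `p90_latency_ms`, `errors`) are assumptions about what that wrapper would report.

```python
from typing import Callable, Dict, List, Optional


def pick_optimal_batch_size(
    candidates: List[int],
    run_benchmark: Callable[[int], Dict[str, float]],
    max_latency_ms: float,
) -> Optional[int]:
    """Benchmark each candidate batch size and return the one with the
    highest throughput among runs that reported zero errors and stayed
    under the latency bound. Returns None if no candidate qualifies.
    """
    best_size: Optional[int] = None
    best_throughput = float("-inf")
    for batch_size in candidates:
        metrics = run_benchmark(batch_size)  # hypothetical OSB wrapper
        # Disqualify runs with errors or unacceptable tail latency.
        if metrics["errors"] > 0 or metrics["p90_latency_ms"] > max_latency_ms:
            continue
        if metrics["throughput"] > best_throughput:
            best_size, best_throughput = batch_size, metrics["throughput"]
    return best_size
```

The same loop generalizes to bulk size or client count by swapping the candidate list and the parameter the wrapper varies, which is how the gradual extension mentioned above could work.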
Describe alternatives you've considered
N/A
Additional context
This issue was originally created in the OpenSearch core repo, but it was suggested that we repost it in OSB: opensearch-project/OpenSearch#13009