An OpenSearch ingestion variable tuning tool

When ingesting data to OpenSearch using bulk API, using different variables could result in different ingestion performance. For example, the amount of document in bulk API, how many OpenSearch clients are used to send requests etc. It's not easy for user to experiment with all the combinations of the variables and find the option which could lead to optimal ingestion performance. In OpenSearch-2.15.0, a new parameter "batch size" was introduced in bulk API which could significantly reduce ingestion time when using with text_embedding processor and sparse_encoding processor. However, this additional impactor could make the variable tuning even more difficult.

This tool is to help dealing with the pain point of tuning these variables which could impact ingestion performance and automatically find the optimal combination of the variables. It utilizes the OpenSearch-Benchmark, uses different varible combination to run benchmark, collects their outputs, analyzes and visualizes the results. It relies on OpenSearch-Benchmark tool to run benchmark with different variables, records the benchmark results, visualizes and recommends the combination which could lead to optimal ingestion result.

There are three variables that you can test against: bulk size, OS client number, batch size. If you already have a perferred value for certain variable, you can simply set it to a fixed value and only test other variables. For example, you always use 1 client to run bulk API with bulk size equals to 100, then you can only try different batch size to see which batch size can have the best performance.

It can be run through command line or through Jupyter Notebook.

Install

python3 -m venv tuning

source ./tuning/bin/activate

pip3 install opensearch-benchmark

If you want to run the tool through Jupyter Notebook

pip3 install notebook

Run

Example command:

python main.py run --target-hosts="{{opensearch host}}" --client-options="{{timeout, username, password etc}}" --workload-path="{{local path of workload}}"

Testing Schedule for variables

All three variables have two different parameters, one to set fixed value and one to set a testing schedule. The schedule has two patterns:

set starting value, end value, step size and trend, separated by :, e.g. "10:100:1:10" means we should test with "10, 20, 30, 40, 50, 60, 70, 80, 90, 100". "20:100:-1:20" means we should test reversely with "100, 80, 60, 40, 20"。
configure testing values manually by adding a prefix symbol @ and still separate values using : e.g. "@10:20:50" means we only test with 10, 20, 50.

Parameters shared with OpenSearch-Benchmark

We reuse these parameters with OpenSearch-Benchmark execute-test:

WORKLOAD_PATH same as "--workload-path" in OSB execute-test, "Define the path to a workload"
TARGET_HOSTS same as "--target-hosts" in OSB execute-test, "Define a comma-separated list of host:port pairs which should be targeted if using the pipeline 'benchmark-only' (default: localhost:9200)."
CLIENT_OPTIONS same as "--client-options" in OSB execute-test, "Define a comma-separated list of client options to use. The options will be passed to the OpenSearch Python client (default: timeout:60)."

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
tools		tools
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

An OpenSearch ingestion variable tuning tool

Install

Run

Testing Schedule for variables

Parameters shared with OpenSearch-Benchmark

About

Releases

Packages

Languages

chishui/opensearch-ingest-param-tuning-tool

Folders and files

Latest commit

History

Repository files navigation

An OpenSearch ingestion variable tuning tool

Install

Run

Testing Schedule for variables

Parameters shared with OpenSearch-Benchmark

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages