Add vector search workload with no train procedure as default (opensearch-project#144)

* Add knnvector as new workload

Create new workload to benchmark performance of knn_vector field type. Added unit test and procedure for notrain.

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Update README

Update readme to include how to execute this workload.

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Add new param file for faiss engine

Added new param file to index/search vector search using faiss as engine type

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Rename knnvector to vectorsearch

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Add lucene engine

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

* fix code review comments

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

---------

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
harshavamsi · Mar 5, 2024 · 775e8bc
1 parent 0551583
Showing 13 changed files with 529 additions and 0 deletions.
diff --git a/vectorsearch/README.md b/vectorsearch/README.md
@@ -0,0 +1,178 @@
+# Vector Search Workload
+
+This workload benchmarks the indexing and search performance of the OpenSearch vector engine.
+
+## Datasets
+
+This workload currently supports datasets in either HDF5 or Big-ANN format.
+You can download datasets from [here](http://corpus-texmex.irisa.fr/) to benchmark the quality of the approximate k-NN algorithm in
+OpenSearch.
+
+### Running a benchmark
+
+Before running a benchmark, ensure that the load generation host is able to access your cluster endpoint and that the 
+appropriate dataset is available on the host.
+
+Currently, the vector search workload supports only one test procedure. It is named no-train-test and does not include the steps required to train the model being used.
+This test procedure indexes a data set of vectors into an OpenSearch cluster and then runs a set of queries against the generated index.
+
+Because this workload offers many parameters, we recommend creating a parameter file that specifies the desired workload
+parameters instead of listing them all on the OSB command line. You can use the example param files,
+`faiss-sift-128-l2.json`, `nmslib-sift-128-l2.json`, or `lucene-sift-128-l2.json` in `/params`, as references. These files
+are named using the format `<Vector Engine Type>-<Dataset Name>-<Number of Dimensions>-<Space Type>.json`.
+
+To run the workload, invoke the following command with the params file.
+
+```
+# OpenSearch Cluster End point url with hostname and port
+export ENDPOINT=  
+# Absolute file path of Workload param file
+export PARAMS_FILE=
+
+opensearch-benchmark execute-test \
+    --target-hosts $ENDPOINT \
+    --workload vectorsearch \
+    --workload-params ${PARAMS_FILE} \
+    --pipeline benchmark-only \
+    --kill-running-processes
+```
+
+## Current Procedures
+
+### No Train Test
+
+The No Train Test procedure is used to test vector search indices that require no training.
+You can define the underlying configuration of the vector search algorithm, such as the engine and space type, as a
+method definition. See [vector search method definitions](https://opensearch.org/docs/latest/search-plugins/knn/knn-index/#method-definitions)
+for more details.
+
+#### Parameters
+
+This workload allows the following parameters to be specified using `--workload-params`:
+
+| Name                                    | Description                                                              |
+|-----------------------------------------|--------------------------------------------------------------------------|
+| target_index_name                       | Name of index to add vectors to                                          |
+| target_field_name                       | Name of field to add vectors to                                          |
+| target_index_body                       | Path to target index definition                                          |
+| target_index_primary_shards             | Target index primary shards                                              |
+| target_index_replica_shards             | Target index replica shards                                              |
+| target_index_dimension                  | Dimension of target index                                                |
+| target_index_space_type                 | Target index space type                                                  |
+| target_index_bulk_size                  | Target index bulk size                                                   |
+| target_index_bulk_index_data_set_format | Format of vector data set                                                |
+| target_index_bulk_index_data_set_path   | Path to vector data set                                                  |
+| target_index_bulk_index_clients         | Clients to be used for bulk ingestion (must be divisor of data set size) |
+| target_index_max_num_segments           | Number of segments to merge target index down to before beginning search |
+| target_index_force_merge_timeout        | Timeout for force merge requests in seconds                              |
+| hnsw_ef_search                          | HNSW ef search parameter                                                 |
+| hnsw_ef_construction                    | HNSW ef construction parameter                                           |
+| hnsw_m                                  | HNSW m parameter                                                         |
+| query_k                                 | The number of neighbors to return for the search                         |
+| query_clients                           | Number of clients to use for running queries                             |
+| query_data_set_format                   | Format of vector data set for queries                                    |
+| query_data_set_path                     | Path to vector data set for queries                                      |
+| query_count                             | Number of queries for search operation                                   |
+
+
+
+#### Sample Output
+
+The output of a sample test run is provided below. Metrics are captured in the results data store as usual, which can be configured to be
+either in-memory or an external OpenSearch cluster.
+
+```
+------------------------------------------------------
+    _______             __   _____
+   / ____(_)___  ____ _/ /  / ___/_________  ________
+  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
+ / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
+/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
+------------------------------------------------------
+            
+|                                                         Metric |               Task |       Value |   Unit |
+|---------------------------------------------------------------:|-------------------:|------------:|-------:|
+|                     Cumulative indexing time of primary shards |                    |  0.00946667 |    min |
+|             Min cumulative indexing time across primary shards |                    |           0 |    min |
+|          Median cumulative indexing time across primary shards |                    |  0.00298333 |    min |
+|             Max cumulative indexing time across primary shards |                    |  0.00336667 |    min |
+|            Cumulative indexing throttle time of primary shards |                    |           0 |    min |
+|    Min cumulative indexing throttle time across primary shards |                    |           0 |    min |
+| Median cumulative indexing throttle time across primary shards |                    |           0 |    min |
+|    Max cumulative indexing throttle time across primary shards |                    |           0 |    min |
+|                        Cumulative merge time of primary shards |                    |           0 |    min |
+|                       Cumulative merge count of primary shards |                    |           0 |        |
+|                Min cumulative merge time across primary shards |                    |           0 |    min |
+|             Median cumulative merge time across primary shards |                    |           0 |    min |
+|                Max cumulative merge time across primary shards |                    |           0 |    min |
+|               Cumulative merge throttle time of primary shards |                    |           0 |    min |
+|       Min cumulative merge throttle time across primary shards |                    |           0 |    min |
+|    Median cumulative merge throttle time across primary shards |                    |           0 |    min |
+|       Max cumulative merge throttle time across primary shards |                    |           0 |    min |
+|                      Cumulative refresh time of primary shards |                    |  0.00861667 |    min |
+|                     Cumulative refresh count of primary shards |                    |          33 |        |
+|              Min cumulative refresh time across primary shards |                    |           0 |    min |
+|           Median cumulative refresh time across primary shards |                    |  0.00268333 |    min |
+|              Max cumulative refresh time across primary shards |                    |  0.00291667 |    min |
+|                        Cumulative flush time of primary shards |                    | 0.000183333 |    min |
+|                       Cumulative flush count of primary shards |                    |           2 |        |
+|                Min cumulative flush time across primary shards |                    |           0 |    min |
+|             Median cumulative flush time across primary shards |                    |           0 |    min |
+|                Max cumulative flush time across primary shards |                    | 0.000183333 |    min |
+|                                        Total Young Gen GC time |                    |       0.075 |      s |
+|                                       Total Young Gen GC count |                    |          17 |        |
+|                                          Total Old Gen GC time |                    |           0 |      s |
+|                                         Total Old Gen GC count |                    |           0 |        |
+|                                                     Store size |                    |  0.00869293 |     GB |
+|                                                  Translog size |                    | 2.56114e-07 |     GB |
+|                                         Heap used for segments |                    |           0 |     MB |
+|                                       Heap used for doc values |                    |           0 |     MB |
+|                                            Heap used for terms |                    |           0 |     MB |
+|                                            Heap used for norms |                    |           0 |     MB |
+|                                           Heap used for points |                    |           0 |     MB |
+|                                    Heap used for stored fields |                    |           0 |     MB |
+|                                                  Segment count |                    |           9 |        |
+|                                                 Min Throughput | custom-vector-bulk |       25527 | docs/s |
+|                                                Mean Throughput | custom-vector-bulk |       25527 | docs/s |
+|                                              Median Throughput | custom-vector-bulk |       25527 | docs/s |
+|                                                 Max Throughput | custom-vector-bulk |       25527 | docs/s |
+|                                        50th percentile latency | custom-vector-bulk |     36.3095 |     ms |
+|                                        90th percentile latency | custom-vector-bulk |     52.2662 |     ms |
+|                                       100th percentile latency | custom-vector-bulk |     68.6513 |     ms |
+|                                   50th percentile service time | custom-vector-bulk |     36.3095 |     ms |
+|                                   90th percentile service time | custom-vector-bulk |     52.2662 |     ms |
+|                                  100th percentile service time | custom-vector-bulk |     68.6513 |     ms |
+|                                                     error rate | custom-vector-bulk |           0 |      % |
+|                                                 Min Throughput |       prod-queries |      211.26 |  ops/s |
+|                                                Mean Throughput |       prod-queries |      213.85 |  ops/s |
+|                                              Median Throughput |       prod-queries |      213.48 |  ops/s |
+|                                                 Max Throughput |       prod-queries |      216.49 |  ops/s |
+|                                        50th percentile latency |       prod-queries |     3.43393 |     ms |
+|                                        90th percentile latency |       prod-queries |     4.01881 |     ms |
+|                                        99th percentile latency |       prod-queries |     5.56238 |     ms |
+|                                      99.9th percentile latency |       prod-queries |     9.95666 |     ms |
+|                                     99.99th percentile latency |       prod-queries |     39.7922 |     ms |
+|                                       100th percentile latency |       prod-queries |      62.415 |     ms |
+|                                   50th percentile service time |       prod-queries |     3.43405 |     ms |
+|                                   90th percentile service time |       prod-queries |      4.0191 |     ms |
+|                                   99th percentile service time |       prod-queries |     5.56316 |     ms |
+|                                 99.9th percentile service time |       prod-queries |     9.95666 |     ms |
+|                                99.99th percentile service time |       prod-queries |     39.7922 |     ms |
+|                                  100th percentile service time |       prod-queries |      62.415 |     ms |
+|                                                     error rate |       prod-queries |           0 |      % |
+
+
+---------------------------------
+[INFO] SUCCESS (took 119 seconds)
+---------------------------------
+
+```
+
+
+### Custom Runners
+
+Currently, there is only one custom runner defined in [runners.py](runners.py).
+
+| Syntax             | Description                                         | Parameters                                                                                                   |
+|--------------------|-----------------------------------------------------|:-------------------------------------------------------------------------------------------------------------|
+| warmup-knn-indices | Warm up knn indices with retry until success.       | 1. index - name of index to warmup                                                                           |
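The README describes the `warmup-knn-indices` runner as warming up k-NN indices "with retry until success". The sketch below illustrates that retry pattern only; it is not the code in `runners.py`. The function `warmup_with_retry`, its `warmup_call` argument, and the choice of `TimeoutError` as the retryable failure are all assumptions made for illustration.

```python
import time


def warmup_with_retry(warmup_call, index, retries=3, delay_seconds=0.0):
    """Call warmup_call(index) until it succeeds or the retries run out.

    warmup_call is a hypothetical stand-in for whatever sends the warmup
    request; it is expected to raise TimeoutError on a transient failure.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return warmup_call(index)
        except TimeoutError as error:
            last_error = error
            if attempt < retries:
                time.sleep(delay_seconds)  # pause before the next attempt
    raise RuntimeError(
        f"warmup of {index!r} failed after {retries} attempts"
    ) from last_error


# Illustrative stand-in that fails twice and then succeeds.
calls = []


def flaky_warmup(index):
    calls.append(index)
    if len(calls) < 3:
        raise TimeoutError("shards still loading")
    return {"index": index, "attempts": len(calls)}


result = warmup_with_retry(flaky_warmup, "target_index", retries=5)
print(result)
```

Passing the request function in as an argument keeps the retry logic testable without a running cluster.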
diff --git a/vectorsearch/__init__.py b/vectorsearch/__init__.py
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+#
+# The OpenSearch Contributors require contributions made to
+# this file be licensed under the Apache-2.0 license or a
+# compatible open source license.
diff --git a/vectorsearch/indices/faiss-index.json b/vectorsearch/indices/faiss-index.json
@@ -0,0 +1,41 @@
+{
+    "settings": {
+      "index": {
+        "knn": true
+        {%- if target_index_primary_shards is defined and target_index_primary_shards %}
+        ,"number_of_shards": {{ target_index_primary_shards }}
+        {%- endif %}
+        {%- if target_index_replica_shards is defined and target_index_replica_shards %}
+        ,"number_of_replicas": {{ target_index_replica_shards }}
+        {%- endif %}
+        {%- if hnsw_ef_search is defined and hnsw_ef_search %}
+        ,"knn.algo_param.ef_search": {{ hnsw_ef_search }}
+        {%- endif %}
+      }
+    },
+    "mappings": {
+      "dynamic": "strict",
+      "properties": {
+        "target_field": {
+          "type": "knn_vector",
+          "dimension": {{ target_index_dimension }},
+          "method": {
+            "name": "hnsw",
+            "space_type": "{{ target_index_space_type }}",
+            "engine": "faiss",
+            "parameters": {
+            {%- if hnsw_ef_construction is defined and hnsw_ef_construction %}
+            "ef_construction": {{ hnsw_ef_construction }}
+            {%- endif %}
+            {%- if hnsw_m is defined and hnsw_m %}
+            {%- if hnsw_ef_construction is defined and hnsw_ef_construction %}
+            ,
+            {%- endif %}
+            "m": {{ hnsw_m }}
+            {%- endif %}
+            }
+          }
+        }
+      }
+    }
+  }
diff --git a/vectorsearch/indices/lucene-index.json b/vectorsearch/indices/lucene-index.json
@@ -0,0 +1,41 @@
+{
+    "settings": {
+      "index": {
+        "knn": true
+        {%- if target_index_primary_shards is defined and target_index_primary_shards %}
+        ,"number_of_shards": {{ target_index_primary_shards }}
+        {%- endif %}
+        {%- if target_index_replica_shards is defined and target_index_replica_shards %}
+        ,"number_of_replicas": {{ target_index_replica_shards }}
+        {%- endif %}
+        {%- if hnsw_ef_search is defined and hnsw_ef_search %}
+        ,"knn.algo_param.ef_search": {{ hnsw_ef_search }}
+        {%- endif %}
+      }
+    },
+    "mappings": {
+      "dynamic": "strict",
+      "properties": {
+        "target_field": {
+          "type": "knn_vector",
+          "dimension": {{ target_index_dimension }},
+          "method": {
+            "name": "hnsw",
+            "space_type": "{{ target_index_space_type }}",
+            "engine": "lucene",
+            "parameters": {
+            {%- if hnsw_ef_construction is defined and hnsw_ef_construction %}
+            "ef_construction": {{ hnsw_ef_construction }}
+            {%- endif %}
+            {%- if hnsw_m is defined and hnsw_m %}
+            {%- if hnsw_ef_construction is defined and hnsw_ef_construction %}
+            ,
+            {%- endif %}
+            "m": {{ hnsw_m }}
+            {%- endif %}
+            }
+          }
+        }
+      }
+    }
+  }
diff --git a/vectorsearch/indices/nmslib-index.json b/vectorsearch/indices/nmslib-index.json
@@ -0,0 +1,41 @@
+{
+    "settings": {
+      "index": {
+        "knn": true
+        {%- if target_index_primary_shards is defined and target_index_primary_shards %}
+        ,"number_of_shards": {{ target_index_primary_shards }}
+        {%- endif %}
+        {%- if target_index_replica_shards is defined and target_index_replica_shards %}
+        ,"number_of_replicas": {{ target_index_replica_shards }}
+        {%- endif %}
+        {%- if hnsw_ef_search is defined and hnsw_ef_search %}
+        ,"knn.algo_param.ef_search": {{ hnsw_ef_search }}
+        {%- endif %}
+      }
+    },
+    "mappings": {
+      "dynamic": "strict",
+      "properties": {
+        "target_field": {
+          "type": "knn_vector",
+          "dimension": {{ target_index_dimension }},
+          "method": {
+            "name": "hnsw",
+            "space_type": "{{ target_index_space_type }}",
+            "engine": "nmslib",
+            "parameters": {
+            {%- if hnsw_ef_construction is defined and hnsw_ef_construction %}
+            "ef_construction": {{ hnsw_ef_construction }}
+            {%- endif %}
+            {%- if hnsw_m is defined and hnsw_m %}
+            {%- if hnsw_ef_construction is defined and hnsw_ef_construction %}
+            ,
+            {%- endif %}
+            "m": {{ hnsw_m }}
+            {%- endif %}
+            }
+          }
+        }
+      }
+    }
+  }
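The three index templates above differ only in the `engine` value, and each emits its optional settings and HNSW parameters only when the corresponding workload parameter is defined and truthy. As a rough illustration of what the rendered body looks like, the following plain-Python sketch builds the same structure without Jinja. The function `build_index_body` and its argument names are hypothetical and do not appear in the workload.

```python
import json


def build_index_body(engine, dimension, space_type, primary_shards=None,
                     replica_shards=None, ef_search=None,
                     ef_construction=None, m=None):
    """Mirror the templates: optional keys appear only when set and truthy."""
    index_settings = {"knn": True}
    if primary_shards:
        index_settings["number_of_shards"] = primary_shards
    if replica_shards:
        index_settings["number_of_replicas"] = replica_shards
    if ef_search:
        index_settings["knn.algo_param.ef_search"] = ef_search

    # A dict sidesteps the comma handling the template needs between
    # "ef_construction" and "m".
    method_parameters = {}
    if ef_construction:
        method_parameters["ef_construction"] = ef_construction
    if m:
        method_parameters["m"] = m

    return {
        "settings": {"index": index_settings},
        "mappings": {
            "dynamic": "strict",
            "properties": {
                "target_field": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": space_type,
                        "engine": engine,
                        "parameters": method_parameters,
                    },
                }
            },
        },
    }


body = build_index_body("faiss", 128, "l2", primary_shards=1,
                        ef_search=100, ef_construction=100)
print(json.dumps(body, indent=2))
```

Because the templates test truthiness, a value of `0` (for example, zero replica shards) is treated the same as an unset parameter and the key is left out; the sketch reproduces that behavior.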
diff --git a/vectorsearch/operations/default.json b/vectorsearch/operations/default.json
@@ -0,0 +1,21 @@
+{
+    "name": "warmup-indices",
+    "operation-type": "warmup-knn-indices",
+    "index": "{{ target_index_name | default('target_index') }}",
+    "include-in-results_publishing": false
+},
+{
+    "name": "force-merge",
+    "operation-type": "force-merge",
+    "request-timeout": {{ target_index_force_merge_timeout | default(7200) }},
+    "index": "{{ target_index_name | default('target_index') }}",
+    "mode": "polling",
+    "max-num-segments": {{ target_index_max_num_segments | default(1) }},
+    "include-in-results_publishing": false
+},
+{
+    "name": "refresh-target-index",
+    "operation-type": "refresh",
+    "retries": 100,
+    "index": "{{ target_index_name | default('target_index') }}"
+}
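The operations above rely on Jinja `default()` filters, so the index name, force merge timeout, and segment count fall back to `target_index`, `7200`, and `1` when the workload parameters omit them. The sketch below resolves the same values in plain Python to show what the fragment renders to. `resolve_operations` is a hypothetical helper, not part of the workload.

```python
def resolve_operations(params):
    """Apply the same fallbacks the template's default() filters use."""
    index = params.get("target_index_name", "target_index")
    return [
        {
            "name": "warmup-indices",
            "operation-type": "warmup-knn-indices",
            "index": index,
            "include-in-results_publishing": False,
        },
        {
            "name": "force-merge",
            "operation-type": "force-merge",
            "request-timeout": params.get("target_index_force_merge_timeout", 7200),
            "index": index,
            "mode": "polling",
            "max-num-segments": params.get("target_index_max_num_segments", 1),
            "include-in-results_publishing": False,
        },
        {
            "name": "refresh-target-index",
            "operation-type": "refresh",
            "retries": 100,
            "index": index,
        },
    ]


# With no parameters, every fallback applies.
defaults = resolve_operations({})

# With the values from params/faiss-sift-128-l2.json, the overrides win.
overridden = resolve_operations({
    "target_index_name": "target_index",
    "target_index_max_num_segments": 10,
    "target_index_force_merge_timeout": 45.0,
})
print(defaults[1]["request-timeout"], overridden[1]["request-timeout"])
```

Like `default()`, `dict.get` falls back only when the key is absent, so an explicitly supplied value is always passed through.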
diff --git a/vectorsearch/params/faiss-sift-128-l2.json b/vectorsearch/params/faiss-sift-128-l2.json
@@ -0,0 +1,23 @@
+{
+    "target_index_name": "target_index",
+    "target_field_name": "target_field",
+    "target_index_body": "indices/faiss-index.json",
+    "target_index_primary_shards": 1,
+    "target_index_dimension": 128,
+    "target_index_space_type": "l2",
+
+    "target_index_bulk_size": 100,
+    "target_index_bulk_index_data_set_format": "hdf5",
+    "target_index_bulk_index_data_set_path": "/tmp/sift-128-euclidean.hdf5",
+    "target_index_bulk_indexing_clients": 10,
+
+    "target_index_max_num_segments": 10,
+    "target_index_force_merge_timeout": 45.0,
+    "hnsw_ef_search": 100,
+    "hnsw_ef_construction": 100,
+    "query_k": 100,
+
+    "query_data_set_format": "hdf5",
+    "query_data_set_path":"/tmp/sift-128-euclidean.hdf5",
+    "query_count": 100
+  }
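The README names parameter files as `<Vector Engine Type>-<Dataset Name>-<No of Dimension>-<Space Type>.json` and states that the number of bulk ingestion clients must be a divisor of the data set size. The following sketch cross-checks a parameter file against both rules. `check_param_file` is a hypothetical helper, the `data_set_size` of 1,000,000 is an assumed value for illustration, and the check that the index body path contains `<engine>-index.json` is inferred from the example files. The README table spells the client parameter `target_index_bulk_index_clients` while this file uses `target_index_bulk_indexing_clients`, so the sketch accepts either.

```python
from pathlib import PurePath


def check_param_file(file_name, params, data_set_size=None):
    """Return a list of problems; an empty list means none were detected."""
    problems = []

    # Split the file name per the README's naming convention.
    parts = PurePath(file_name).stem.split("-")
    engine, space_type = parts[0], parts[-1]
    dimension = int(parts[-2])

    if f"{engine}-index.json" not in params.get("target_index_body", ""):
        problems.append(f"index body does not match engine {engine!r}")
    if params.get("target_index_dimension") != dimension:
        problems.append(f"dimension does not match {dimension}")
    if params.get("target_index_space_type") != space_type:
        problems.append(f"space type does not match {space_type!r}")

    # Both spellings of the client parameter appear in this commit.
    clients = params.get(
        "target_index_bulk_indexing_clients",
        params.get("target_index_bulk_index_clients"),
    )
    if data_set_size is not None and clients and data_set_size % clients != 0:
        problems.append(
            f"{clients} clients do not evenly divide {data_set_size} vectors"
        )
    return problems


params = {
    "target_index_body": "indices/faiss-index.json",
    "target_index_dimension": 128,
    "target_index_space_type": "l2",
    "target_index_bulk_indexing_clients": 10,
}
# data_set_size is an assumed figure, not taken from the commit.
problems = check_param_file("faiss-sift-128-l2.json", params, data_set_size=1_000_000)
print(problems)
```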