Add vector search workload with no train procedure as default #144

VijayanB · 2023-11-22T23:00:12Z

Description

Add vector search workload to benchmark performance of indexing and search using knn_vector as field type.

Issues Resolved

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

VijayanB · 2023-11-22T23:54:07Z

This PR contains the benchmark workload that was previously added to the k-NN repository. It includes only the indexing and search components; other features, such as training, will be added in a subsequent PR.

VijayanB · 2023-11-23T00:02:03Z

@rishabh6788 Running this workload requires libraries such as numpy and h5py. Should these dependencies be declared in this workload or in the opensearch-benchmark repository? Keeping them in this repository works, provided that the requirements are also installed when the repository is checked out.
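
To make the dependency question concrete, a pre-flight check could report which required packages are not importable before the workload starts. The sketch below is only an illustration: the helper name `missing_dependencies` is hypothetical and is not part of this PR or of opensearch-benchmark. Only the package names numpy and h5py come from the comment above.

```python
import importlib.util


def missing_dependencies(names):
    """Return the top-level package names that cannot be imported."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # numpy and h5py are the libraries mentioned for this workload.
    missing = missing_dependencies(["numpy", "h5py"])
    if missing:
        print("Install before running the workload: " + ", ".join(missing))
    else:
        print("All workload dependencies are importable.")
```

A check like this only detects the problem; where the requirements are declared and installed is still the open question raised above.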

knnvector/test_procedures/default.json

knnvector/README.md

knnvector/operations/default.json

knnvector/params/nmslib-sift-128-l2.json

knnvector/params_sources.py

knnvector/runners.py

VijayanB · 2023-11-30T21:21:35Z

[INFO] Executing test with workload [knnvector], test_procedure [no-train-test] and provision_config_instance ['external'] with version [2.10.0].

[WARNING] indexing_total_time is 568 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 517 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 11 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-target-index                                                    [100% done]
Running create-target-index                                                    [100% done]
Running wait-for-cluster-to-be-green                                           [100% done]
Running custom-vector-bulk                                                     [100% done]
Running force-merge-segments                                                   [100% done]
Running refresh-target-index                                                   [100% done]
Running warmup-indices                                                         [100% done]
Running prod-queries                                                           [100% done]

(venv) ➜  opensearch-benchmark git:(main) ✗ curl http://localhost:9200/_cat/indices\?v
health status index                     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   target_index              OZ6FQEjHQhafoE8CTFvVPA   3   1     100000            0    146.6mb        146.6mb
green  open   .plugins-ml-config        fgiAUaxuQPa7ShZFMnw6Vg   1   0          1            0      3.9kb          3.9kb
green  open   .opensearch-observability k7ccOjOhRtaS-Y9Yx1kdmA   1   0          0            0       208b           208b

With the default number of segments:

(venv) ➜  opensearch-benchmark git:(main) ✗ curl http://localhost:9200/_cat/segments\?v
index              shard prirep ip         segment generation docs.count docs.deleted   size size.memory committed searchable version compound
target_index       0     p      172.17.0.2 _5               5      33162            0 48.6mb           0 true      true       9.7.0   false
target_index       1     p      172.17.0.2 _4               4      33324            0 48.9mb           0 true      true       9.7.0   false
target_index       2     p      172.17.0.2 _3               3      33514            0 49.1mb           0 true      true       9.7.0   false
.plugins-ml-config 0     p      172.17.0.2 _0               0          1            0  3.6kb           0 true      true       9.7.0   true
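
As a sanity check, the per-shard segment counts above should add up to the docs.count reported by _cat/indices. The following is a minimal sketch that parses the whitespace-separated _cat/segments output pasted above; the function name `docs_per_index` is hypothetical, and the parsing assumes no column value contains spaces.

```python
SEGMENTS_OUTPUT = """\
index              shard prirep ip         segment generation docs.count docs.deleted   size size.memory committed searchable version compound
target_index       0     p      172.17.0.2 _5               5      33162            0 48.6mb           0 true      true       9.7.0   false
target_index       1     p      172.17.0.2 _4               4      33324            0 48.9mb           0 true      true       9.7.0   false
target_index       2     p      172.17.0.2 _3               3      33514            0 49.1mb           0 true      true       9.7.0   false
.plugins-ml-config 0     p      172.17.0.2 _0               0          1            0  3.6kb           0 true      true       9.7.0   true
"""


def docs_per_index(cat_output):
    """Sum the docs.count column of _cat/segments output for each index."""
    lines = [line for line in cat_output.splitlines() if line.strip()]
    header = lines[0].split()
    index_col = header.index("index")
    docs_col = header.index("docs.count")
    totals = {}
    for line in lines[1:]:
        fields = line.split()
        name = fields[index_col]
        totals[name] = totals.get(name, 0) + int(fields[docs_col])
    return totals


if __name__ == "__main__":
    for name, count in docs_per_index(SEGMENTS_OUTPUT).items():
        print(f"{name}: {count} docs")
```

For target_index, the three segments sum to 100000 documents, which matches the docs.count shown in the _cat/indices output.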

(venv) ➜  opensearch-benchmark git:(main) ✗ curl http://localhost:9200/_plugins/_knn/stats\?pretty              
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "docker-cluster",
  "circuit_breaker_triggered" : false,
  "model_index_status" : null,
  "nodes" : {
    "PrEm8YY6QQ6CNN9HkMUOJQ" : {
      "graph_memory_usage_percentage" : 1.8404406,
      "graph_query_requests" : 960000,
      "graph_memory_usage" : 65308,
      "cache_capacity_reached" : false,
      "load_success_count" : 96,
      "training_memory_usage" : 0,
      "indices_in_cache" : {
        "target_index" : {
          "graph_memory_usage_percentage" : 1.8404406,
          "graph_memory_usage" : 65308,
          "graph_count" : 3
        }
      },
      "script_query_errors" : 0,
      "hit_count" : 960000,
      "knn_query_requests" : 120000,
      "total_load_time" : 224553968,
      "miss_count" : 96,
      "knn_query_with_filter_requests" : 0,
      "training_memory_usage_percentage" : 0.0,
      "lucene_initialized" : false,
      "graph_index_requests" : 108,
      "faiss_initialized" : false,
      "load_exception_count" : 0,
      "training_errors" : 0,
      "eviction_count" : 0,
      "nmslib_initialized" : true,
      "script_compilations" : 0,
      "script_query_requests" : 0,
      "graph_query_errors" : 0,
      "indexing_from_model_degraded" : false,
      "graph_index_errors" : 0,
      "training_requests" : 0,
      "script_compilation_errors" : 0
    }
  }
}
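
The hit_count and miss_count fields in the stats above can be combined into a cache hit ratio per node. This is a rough sketch that uses only a subset of the response shown above; the helper names `cache_hit_ratio` and `hit_ratios` are hypothetical and are not part of the k-NN plugin or of this workload.

```python
import json

# Subset of the _plugins/_knn/stats response pasted above.
STATS_JSON = """
{
  "nodes": {
    "PrEm8YY6QQ6CNN9HkMUOJQ": {
      "hit_count": 960000,
      "miss_count": 96,
      "load_success_count": 96,
      "eviction_count": 0
    }
  }
}
"""


def cache_hit_ratio(node_stats):
    """Return hits / (hits + misses), or 0.0 when there were no lookups."""
    hits = node_stats["hit_count"]
    misses = node_stats["miss_count"]
    total = hits + misses
    return hits / total if total else 0.0


def hit_ratios(stats):
    """Map each node id in a k-NN stats response to its cache hit ratio."""
    return {node_id: cache_hit_ratio(node) for node_id, node in stats["nodes"].items()}


if __name__ == "__main__":
    for node_id, ratio in hit_ratios(json.loads(STATS_JSON)).items():
        print(f"{node_id}: {ratio:.4%}")
```

With 960000 hits against 96 misses, the ratio for this run is about 99.99%.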

knnvector/params_sources.py

knnvector/runners.py

knnvector/test_procedures/default.json

jmazanec15

LGTM!

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

IanHoang · 2024-01-09T22:56:44Z

There is no coupling between the params file name and the workload. I named it in such a way that it hints at the selected param values.

@VijayanB I understand now. To clarify this, could you add a section in the README stating that the files faiss-sift-128-l2 and nmslib-sift-128-l2 are sample params files that can be used with the workload and can serve as references for users who want to create their own custom params file?
EDIT: I see that it has been added now 👍🏻

VijayanB · 2024-01-10T00:33:19Z

@IanHoang Are there any pending comments that need to be addressed?

VijayanB · 2024-01-12T19:47:20Z

@gkamat can you take a look at this PR? Thanks

vectorsearch/README.md

gkamat · 2024-01-17T00:49:41Z

vectorsearch/README.md

+Currently, we support one test procedures for the vector search workload: 
+no-train-test that does not have steps to train a model included in the 


support only one test procedure for the vector search workload. This is named no-train-test and does not include the steps required to train the model being used.

Please indicate how the training steps are supposed to be carried out. Or if the expectation is that the workload is to be run on an untrained system, please clarify.

@gkamat I will update the text, and in the future I will add a new procedure that can use a model. Do you recommend mentioning this future work in the README?

Yes, that would be ideal. Subsequently, you can update the writeup when the new procedure gets added.

vectorsearch/README.md

vectorsearch/runners.py

vectorsearch/test_procedures/default.json

vectorsearch/workload.json

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

gkamat · 2024-01-18T22:16:35Z

Please confirm this is intended for backport to both the 1 and 2 branches.

gkamat

Please confirm the backport labels are set correctly before merging. Thanks.

VijayanB · 2024-01-18T23:47:31Z

@gkamat Yes, this is supported for both OpenSearch 1.x and 2.x.

* Add knnvector as new workload
  Create new workload to benchmark performance of knn_vector field type. Added unit test and procedure for notrain.
  Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
* Update README
  Update readme to include how to execute this workload.
  Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
* Add new param file faiss engine
  Added new param file to index/search vector search using faiss as engine type
  Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
* Rename knnvector to vectorsearch
  Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
* Add lucene engine
  Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
* fix code review comments
  Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
---------
Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
(cherry picked from commit bdbd4bb)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

…156)
* Add knnvector as new workload
  Create new workload to benchmark performance of knn_vector field type. Added unit test and procedure for notrain.
* Update README
  Update readme to include how to execute this workload.
* Add new param file faiss engine
  Added new param file to index/search vector search using faiss as engine type
* Rename knnvector to vectorsearch
* Add lucene engine
* fix code review comments
---------
(cherry picked from commit bdbd4bb)
Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

…157)
* Add knnvector as new workload
  Create new workload to benchmark performance of knn_vector field type. Added unit test and procedure for notrain.
* Update README
  Update readme to include how to execute this workload.
* Add new param file faiss engine
  Added new param file to index/search vector search using faiss as engine type
* Rename knnvector to vectorsearch
* Add lucene engine
* fix code review comments
---------
(cherry picked from commit bdbd4bb)
Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

…arch-project#144)
* Add knnvector as new workload
  Create new workload to benchmark performance of knn_vector field type. Added unit test and procedure for notrain.
  Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
* Update README
  Update readme to include how to execute this workload.
  Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
* Add new param file faiss engine
  Added new param file to index/search vector search using faiss as engine type
  Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
* Rename knnvector to vectorsearch
  Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
* Add lucene engine
  Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
* fix code review comments
  Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
---------
Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

VijayanB requested review from IanHoang and gkamat as code owners November 22, 2023 23:00

VijayanB force-pushed the add-knn-vector-workload branch 3 times, most recently from 4cbd141 to d40f7d4 · November 22, 2023 23:52

VijayanB marked this pull request as draft November 23, 2023 00:06

VijayanB force-pushed the add-knn-vector-workload branch 2 times, most recently from 116f071 to 73425c6 · November 23, 2023 00:11