Add OpenSearch Benchmark index workload for k-NN #364
Conversation
Adds a basic k-NN workload for OSB. The workload simply deletes and then creates a k-NN index. Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds custom runners and parameter sources to be able to index from a data set that is not necessarily in JSON form. The code for this is lifted from the k-NN perf tool. Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds support for multiple clients on indexing and partitions the data set amongst those clients. The condition is that the data set size must be divisible by the number of clients. Signed-off-by: John Mazanec <jmazane@amazon.com>
Separates custom runners and parameter sources into a separate module so that they can be shared across different tracks. Signed-off-by: John Mazanec <jmazane@amazon.com>
Parametrizes the test so that one file contains all of the variables that need to be set. Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds a requirements file for easy pip install. Adds documentation for OpenSearch Benchmark as well as the specific test. Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds functionality for the training-index workflow to the OSB logic. Parametrizes operations into their own file. Adds additional runners and parameter sources. Signed-off-by: John Mazanec <jmazane@amazon.com>
Codecov Report
@@            Coverage Diff            @@
##               main     #364   +/-   ##
=========================================
  Coverage     84.01%   84.01%
  Complexity      911      911
=========================================
  Files           130      130
  Lines          3879     3879
  Branches        359      359
=========================================
  Hits           3259     3259
  Misses          458      458
  Partials        162      162

Continue to review the full report at Codecov.
This is great! I would love to see this incorporated into the OpenSearch Benchmark Workloads repo once these workloads are finalized.
benchmarks/osb/extensions/runners.py
Outdated

            return
        except:
            pass
Shall we capture the error and log it? Also, we should know whether it is a transient error or an actual error; we should only retry transient errors.
Sure, I will catch the connection timeout error and log a message with it.
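For illustration, a minimal sketch of what that narrowed handling could look like, assuming the async opensearch-py client and the k-NN model GET endpoint; the helper name and logger wiring are placeholders, not the PR's actual code:

import logging

from opensearchpy.exceptions import ConnectionTimeout

logger = logging.getLogger(__name__)

async def get_model_state(opensearch, model_id):
    # Fetch the model state once; treat a timeout as transient and retryable.
    try:
        response = await opensearch.transport.perform_request(
            "GET", f"/_plugins/_knn/models/{model_id}"
        )
        return response.get("state")
    except ConnectionTimeout as e:
        # Transient error: log it and let the caller retry on the next poll.
        logger.warning("Timed out polling model %s, will retry: %s", model_id, e)
        return None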
async def __call__(self, opensearch, params):
    # Train a model and wait for training to complete
    body = params["body"]
    timeout = parse_int_parameter("timeout", params)
This looks like a retry rather than a timeout. If we want a timeout, we should start a timer and check that its value is less than the timeout before each iteration.
Not sure I understand the proposal. I set it as a timeout because the call returns before training finishes. So once the call is submitted, I check every second whether the model state is created.
At line 73, one second elapses, but you are not accounting for the time it takes to execute lines 74 through 83; I believe more than a second is spent inside the loop. First, I would check whether it is possible to pass the timeout to the request itself and wait for the response (or a timeout exception from the API itself). If that is not possible, we need a stopwatch at line 71, and the loop condition should check whether the stopwatch has reached the expected number of seconds.
I think the request itself isn't timing out; we are basically waiting on training to finish. I can set up the stopwatch in the loop.
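A rough sketch of the stopwatch-bounded polling loop being discussed, assuming the same k-NN model GET endpoint; the helper name and the one-second poll interval are illustrative:

import asyncio
import time

async def wait_for_model_created(opensearch, model_id, timeout_s):
    # Poll until the model state is 'created' or the time budget is spent.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        response = await opensearch.transport.perform_request(
            "GET", f"/_plugins/_knn/models/{model_id}"
        )
        if response.get("state") == "created":
            return True
        # Checking monotonic time against the deadline accounts for time
        # spent in the request itself, not just the sleeps.
        await asyncio.sleep(1)
    return False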
Looks good, thank you
Thanks for addressing the review comments. LGTM.
Description
This is the first PR in a series that will integrate the k-NN benchmarks with the OpenSearch Benchmark framework. It focuses on adding tests for indexing with indices that require models and ones that don't (refer to the procedures subdirectory).
To support this, several custom runners and parameter sources had to be configured. In particular, the functionality to bulk index from a data set in the HDF5 format used by ann-benchmarks or the BIGANN format used by big-ann-benchmarks was ported over from our current perf tool, and the functionality to train a model was ported over as well. For reviewers unfamiliar with OpenSearch Benchmark, custom functionality can be configured in a workload.py file that sits next to the workload.json definition.
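For context, OSB picks up such components through a register(registry) hook in workload.py; a hedged sketch of the wiring, where the operation names and classes are placeholders rather than the exact ones in this PR:

# workload.py, next to workload.json
from extensions.param_sources import VectorsFromDataSetParamSource
from extensions.runners import BulkVectorDataSetRunner, TrainModelRunner

def register(registry):
    # The string names are what operations in workload.json refer to.
    registry.register_param_source("bulk-from-data-set", VectorsFromDataSetParamSource)
    registry.register_runner("custom-vector-bulk", BulkVectorDataSetRunner(), async_runner=True)
    registry.register_runner("train-model", TrainModelRunner(), async_runner=True)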
One of the major benefits of this PR is that we can now use multiple clients to index data from a data set into the cluster.
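Mechanically, this relies on OSB's partition() hook on parameter sources; a simplified sketch of the slicing logic under the divisibility condition from the commits (class and attribute names are illustrative):

import copy

class VectorsFromDataSetParamSource:
    def __init__(self, num_vectors):
        self.num_vectors = num_vectors
        self.offset = 0          # index of the first vector this client reads
        self.size = num_vectors  # number of vectors this client reads

    def partition(self, partition_index, total_partitions):
        # OSB calls this once per client; each client gets a contiguous,
        # equally sized slice, hence the divisibility requirement.
        if self.num_vectors % total_partitions != 0:
            raise ValueError("Data set size must be divisible by the number of clients")
        client_slice = copy.copy(self)
        client_slice.size = self.num_vectors // total_partitions
        client_slice.offset = partition_index * client_slice.size
        return client_slice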
In the future, we will look to add more functionality around querying.
For reviewers, a good place to start reviewing is the README. This contains details about the benchmarks as well as how to run them.
Issues Resolved
#341
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.