Add OpenSearch Benchmark index workload for k-NN #364
Conversation
Adds a basic k-NN workload for OSB. The workload simply deletes and then creates a k-NN index. Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds custom runners and parameter sources to be able to index from a data set that is not necessarily in JSON form. The code for this is lifted from the k-NN perf tool. Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds support for multiple clients on indexing and partitions the data set amongst those clients. The condition is that the data set size must be divisible by the number of clients. Signed-off-by: John Mazanec <jmazane@amazon.com>
Separates custom runners and parameter sources into a separate module so that they can be shared across different tracks. Signed-off-by: John Mazanec <jmazane@amazon.com>
Parametrizes the test so that one file contains all of the variables that need to be set. Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds a requirements file for easy pip install. Adds documentation for OpenSearch Benchmark as well as the specific test. Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds functionality for the training-index workflow to the OSB logic. Parametrizes operations into their own file. Adds additional runners and parameter sources. Signed-off-by: John Mazanec <jmazane@amazon.com>
Codecov Report
@@            Coverage Diff            @@
##               main     #364   +/-   ##
=========================================
  Coverage     84.01%   84.01%
  Complexity      911      911
=========================================
  Files           130      130
  Lines          3879     3879
  Branches        359      359
=========================================
  Hits           3259     3259
  Misses          458      458
  Partials        162      162

Continue to review the full report at Codecov.
This is great! I would love to see this incorporated into the OpenSearch Benchmark Workloads repo once these workloads are finalized.
benchmarks/osb/extensions/runners.py
Outdated

            return
        except:
            pass
Shall we capture the error and log it? Also, we should know whether it is a transient error or an actual error; we should only retry transient errors.
Sure, I will catch the connection timeout error and log a message with it.
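For illustration, a minimal sketch of what that narrowed handling could look like, assuming the async opensearch-py client and the k-NN model GET endpoint; the helper name and logger wiring are placeholders, not the PR's actual code:

import logging

from opensearchpy.exceptions import ConnectionTimeout

logger = logging.getLogger(__name__)

async def get_model_state(opensearch, model_id):
    # Fetch the model state once; treat a timeout as transient and retryable.
    try:
        response = await opensearch.transport.perform_request(
            "GET", f"/_plugins/_knn/models/{model_id}"
        )
        return response.get("state")
    except ConnectionTimeout as e:
        # Transient error: log it and let the caller retry on the next poll.
        logger.warning("Timed out polling model %s, will retry: %s", model_id, e)
        return None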
async def __call__(self, opensearch, params):
    # Train a model and wait for training to complete
    body = params["body"]
    timeout = parse_int_parameter("timeout", params)
This looks like a retry rather than a timeout. If we want a timeout, we should start a timer and check that its value is less than the timeout before each iteration.
Not sure I understand the proposal. I set it as a timeout because the call returns before training finishes. So once the call is submitted, I check every second whether the model state is created.
At line 73, one second elapses, but you are not accounting for the time it takes to execute lines 74 through 83; I believe more than a second is spent inside the loop. First, I would check whether it is possible to pass the timeout to the request itself and wait for the response (or a timeout exception from the API itself). If that is not possible, we need a stopwatch at line 71, and the loop condition should check whether the stopwatch has reached the expected number of seconds.
I think the request itself isn't timing out; we are basically waiting on training to finish. I can set up the stopwatch in the loop.
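A rough sketch of the stopwatch-bounded polling loop being discussed, assuming the same k-NN model GET endpoint; the helper name and the one-second poll interval are illustrative:

import asyncio
import time

async def wait_for_model_created(opensearch, model_id, timeout_s):
    # Poll until the model state is 'created' or the time budget is spent.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        response = await opensearch.transport.perform_request(
            "GET", f"/_plugins/_knn/models/{model_id}"
        )
        if response.get("state") == "created":
            return True
        # Checking monotonic time against the deadline accounts for time
        # spent in the request itself, not just the sleeps.
        await asyncio.sleep(1)
    return False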
Looks good, thank you
Thanks for addressing the review comments. LGTM.
Description
This is the first PR in a series that will integrate the k-NN benchmarks with the OpenSearch Benchmark framework. It focuses on adding tests for indexing with indices that require models and ones that don't (refer to the procedures subdirectory).
To support this, several custom runners and parameter sources had to be configured. In particular, the functionality to bulk index from a data set in the HDF5 format used by ann-benchmarks or the BIGANN format used by big-ann-benchmarks was ported over from our current perf tool, and the functionality to train a model was ported over as well. For reviewers unfamiliar with OpenSearch Benchmark, custom functionality can be configured in a workload.py file that sits next to the workload.json definition.
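For context, OSB picks up such components through a register(registry) hook in workload.py; a hedged sketch of the wiring, where the operation names and classes are placeholders rather than the exact ones in this PR:

# workload.py, next to workload.json
from extensions.param_sources import VectorsFromDataSetParamSource
from extensions.runners import BulkVectorDataSetRunner, TrainModelRunner

def register(registry):
    # The string names are what operations in workload.json refer to.
    registry.register_param_source("bulk-from-data-set", VectorsFromDataSetParamSource)
    registry.register_runner("custom-vector-bulk", BulkVectorDataSetRunner(), async_runner=True)
    registry.register_runner("train-model", TrainModelRunner(), async_runner=True)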
One of the major benefits of this PR is that we can now use multiple clients to index data from a data set into the cluster.
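Mechanically, this relies on OSB's partition() hook on parameter sources; a simplified sketch of the slicing logic under the divisibility condition from the commits (class and attribute names are illustrative):

import copy

class VectorsFromDataSetParamSource:
    def __init__(self, num_vectors):
        self.num_vectors = num_vectors
        self.offset = 0          # index of the first vector this client reads
        self.size = num_vectors  # number of vectors this client reads

    def partition(self, partition_index, total_partitions):
        # OSB calls this once per client; each client gets a contiguous,
        # equally sized slice, hence the divisibility requirement.
        if self.num_vectors % total_partitions != 0:
            raise ValueError("Data set size must be divisible by the number of clients")
        client_slice = copy.copy(self)
        client_slice.size = self.num_vectors // total_partitions
        client_slice.offset = partition_index * client_slice.size
        return client_slice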
In the future, we will look to add more functionality around querying.
For reviewers, a good place to start reviewing is the README. This contains details about the benchmarks as well as how to run them.
Issues Resolved
#341
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.