Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OpenSearch Benchmark index workload for k-NN #364

Merged
merged 23 commits into from
Apr 25, 2022

Conversation

jmazanec15
Copy link
Member

Description

This PR is the initial PR in a series that will integrate k-NN benchmarks with the OpenSearch Benchmark framework. This PR focuses on adding tests for indexing with indices that require models and ones that dont (refer to procedures subdirectory).

To support this, several custom runners and parameter sources had to be configured. In particular, functionality to bulk index from a data set in the HDF5 format of ann benchmarks or the BIGANN format of bigann benchmarks was ported over from our current tool. Additionally, functionality to train a model was ported over as well. For reviewers unfamiliar with OpenSearch benchmarks, custom functionality can be configured in a workload.py file that sits next to the workload.json definition.

One of the major benefits of this particular PR is that now we can use multiple clients to index data from a data set into the cluster.

In the future, we will look to add more functionality around querying.

For reviewers, a good place to start reviewing is the README. This contains details about the benchmarks as well as how to run them.

Issues Resolved

#341

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Adds a basic k-NN workload for OSB. Workload simply deletes and then
creates a k-NN index.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds custom runners and parameter sources to be able to index from a
data set that is not necessarily in JSON form. Code for this is lifted
from the k-NN perf tool.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds support for multiple clients on indexing. Partitions data set
amongst those clients. Condition is that the data set needs to be
divisible by number of clients.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Separates custom runners and parameter sources into a separate module so
that they can be shared across different tracks.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Parametrizes test so that one file contains all of the different
variables that need to be set.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds requirements file for easy pip install. Adds documentation for
OpenSearch benchmarks as well as specific test.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds functionality for training index workflow to OSB logic.
Parametrizes operations into their own file. Adds additional runners and
parameter sources.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
@jmazanec15 jmazanec15 added the Infrastructure Changes to infrastructure, testing, CI/CD, pipelines, etc. label Apr 15, 2022
@jmazanec15 jmazanec15 requested a review from a team April 15, 2022 18:33
@codecov-commenter
Copy link

codecov-commenter commented Apr 15, 2022

Codecov Report

Merging #364 (dec66f1) into main (81b30b1) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main     #364   +/-   ##
=========================================
  Coverage     84.01%   84.01%           
  Complexity      911      911           
=========================================
  Files           130      130           
  Lines          3879     3879           
  Branches        359      359           
=========================================
  Hits           3259     3259           
  Misses          458      458           
  Partials        162      162           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81b30b1...dec66f1. Read the comment docs.

travisbenedict
travisbenedict previously approved these changes Apr 18, 2022
Copy link

@travisbenedict travisbenedict left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I would love to see this incorporated into the OpenSearch Benchmark Workloads repo once these workloads are more finalized.

benchmarks/osb/README.md Outdated Show resolved Hide resolved
benchmarks/osb/extensions/runners.py Outdated Show resolved Hide resolved
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
@jmazanec15 jmazanec15 requested a review from VijayanB April 19, 2022 22:24
Signed-off-by: John Mazanec <jmazane@amazon.com>

return
except:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we capture the error and log. Also, we should know whether is it transient error or actual error. We should only retry transient error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will catch connection timeout error and log a message with it.

async def __call__(self, opensearch, params):
# Train a model and wait for it training to complete
body = params["body"]
timeout = parse_int_parameter("timeout", params)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like retry than timeout. If we want timeout, then we should start timer and check value less than time out before iterating.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the proposal. I set it as timeout because the call returns before training finishes. So once the call is submitted, I check every second if the model state is created.

Copy link
Member

@VijayanB VijayanB Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At line 73, 1 second is elapsed, but you are not accounting number of seconds it takes to execute from line 74 till 83. I believe it is more than a second that it is spent inside the loop. First, i will check is it possible to pass timeout to request itself and wait for the response ( or timeout exception from API itself). If that is not possible, we need a stop watch at line 71, and checks whether stop watch reached expected seconds as loop condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the request itself isnt timing out. We are basically waiting on training to finish. I can setup the stop watch in the loop.

benchmarks/osb/extensions/data_set.py Outdated Show resolved Hide resolved
benchmarks/osb/extensions/data_set.py Outdated Show resolved Hide resolved
benchmarks/osb/extensions/param_sources.py Outdated Show resolved Hide resolved
benchmarks/osb/extensions/runners.py Show resolved Hide resolved
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Copy link
Member

@martin-gaievski martin-gaievski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you

Copy link
Member

@VijayanB VijayanB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing review comments. LGTM.

@jmazanec15 jmazanec15 merged commit e5745de into opensearch-project:main Apr 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Infrastructure Changes to infrastructure, testing, CI/CD, pipelines, etc. v2.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants