-
Notifications
You must be signed in to change notification settings - Fork 125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor benchmark dataset format and add big ann benchmark format #265
Refactor benchmark dataset format and add big ann benchmark format #265
Conversation
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
|
||
if dataset_format == 'bigann' and context == Context.NEIGHBORS: | ||
return BigANNNeighborDataSet(dataset_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the behavior in case dataset is in some unsupported format or in a wrong context, do we need to validate and maybe throw some error in such case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Behavior is undefined. I will add some validation.
|
||
if dataset_format == 'bigann' and context == Context.NEIGHBORS: | ||
return BigANNNeighborDataSet(dataset_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: BigANNNeighborDataset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt like that looked better as well but PyCharm was giving me spelling errors so I went with DataSet. Here is a post on it: https://english.stackexchange.com/questions/2120/which-is-correct-dataset-or-data-set. I dont really have a preference so I went with the one that wasnt giving me spelling errors.
@@ -341,10 +354,13 @@ def get_body(vec): | |||
|
|||
results = {} | |||
query_responses = [] | |||
for v in self.dataset.test: | |||
while True: | |||
query = self.dataset.read(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: constant with name why it is 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 because it reads 1 vector from the dataset at a time.
Signed-off-by: John Mazanec <jmazane@amazon.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…pensearch-project#265) Signed-off-by: John Mazanec <jmazane@amazon.com>
…pensearch-project#265) Signed-off-by: John Mazanec <jmazane@amazon.com> Signed-off-by: Martin Gaievski <gaievski@amazon.com>
…pensearch-project#265) Signed-off-by: John Mazanec <jmazane@amazon.com>
Description
This change refactors how the benchmarking tool works with datasets. It creates a DataSet interface and implements the interface for the existing HDF5 format as well as the formats specified in Big ANN Benchmarks. It then introduces these changes in the steps that need datasets.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.