Support popular vector search datasets like sift and gist as corpora that can be downloaded from a public repository #442
Comments
@VijayanB Can you show what using the workload with and without the corpus would look like to an end user?
The vector search workload requires three external inputs: train (vectors to index), test (vectors to search), and neighbors (ground truth). That works well for advanced users who want to benchmark performance against a specific dataset. However, for casual users or nightly runs, we can use some of the popular datasets to measure performance consistently across runs. In that case it would be an easier and better user experience if vectorsearch could run the workload the way nyctaxi does, instead of requiring those standard inputs to be downloaded manually every time.
@VijayanB Makes sense. Could you show in the issue what users have to do now vs. what they would do once the change is added (i.e. CLI commands, etc.)? Just want to make sure I understand the change in experience.
@jmazanec15 Sure, I was about to raise a PR in the workload repo. Let me share how the new param file will look.
As a prerequisite, the corresponding corpus should be added to the workload.json file, similar to other workloads. A rough illustration is shown below.
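Something along these lines, as a sketch only: the corpus name, base-url, file name, and document count below are placeholders, and the layout simply mirrors the corpora entries used by other OSB workloads.

```json
{
  "corpora": [
    {
      "name": "sift-128-euclidean",
      "base-url": "https://example-public-repository.example.com/vectorsearch",
      "documents": [
        {
          "source-file": "sift-128-euclidean.hdf5",
          "document-count": 1000000
        }
      ]
    }
  ]
}
```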
@VijayanB Thanks, I see. So the corpus, for the most part, will just represent a file and abstract away its location, correct?
@jmazanec15 That's correct. It can represent more than one file, but at this moment we don't support a folder or multiple files as input, so I added a check to ensure that a corpus with more than one document (or file) is rejected.
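For illustration only, a minimal sketch of what such a check could look like; the function name, corpus structure, and error message here are hypothetical, not the actual implementation.

```python
def validate_single_document_corpus(corpus):
    # Hypothetical sketch: the vectorsearch workload currently expects a corpus
    # to resolve to exactly one document file, so reject folders/multiple files.
    documents = corpus.get("documents", [])
    if len(documents) != 1:
        raise ValueError(
            f"Corpus '{corpus.get('name')}' must contain exactly one document file; "
            f"found {len(documents)}. Multiple files or folders are not supported yet."
        )
    return documents[0]
```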
Is your feature request related to a problem? Please describe.
Similar to the nyctaxi and geonames corpora, OpenSearch Benchmark can support some of the popular vector search datasets (e.g. sift, gist) that can be downloaded as a corpus and used in the vector search workload, instead of having to download them manually every time for standard use cases.
This can be added as part of nightly runs too.
While setting up a private corpus repository for the vector search workload, I get an exception at:
opensearch-benchmark/osbenchmark/utils/io.py, line 578 (commit 9ffbec0)
Describe the solution you'd like
The prepare_file_offset_table method was introduced to optimize disk reads for large files by creating an offset table that maps line numbers to file offsets. This is not required for vectorsearch datasets, since we don't need the offset table for bulk ingestion at this moment. We should make the creation of the file offset table optional, to extend the corpus eligibility criteria and support multiple formats.
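A minimal sketch of how this could be made optional, assuming the existing helper in osbenchmark/utils/io.py roughly follows this shape; the create_offset_table flag and the 50,000-line sampling interval are illustrative, not the actual code.

```python
def prepare_file_offset_table(data_file_path, create_offset_table=True):
    """Write <data_file_path>.offset mapping line numbers to byte offsets.

    Sketch only: callers such as a vectorsearch corpus loader could pass
    create_offset_table=False for formats (e.g. HDF5 vector datasets) that
    are not read line by line and therefore do not need the table.
    """
    if not create_offset_table:
        return None
    lines_read = 0
    with open(data_file_path, mode="rb") as data_file, \
            open(f"{data_file_path}.offset", mode="w", encoding="utf-8") as offset_file:
        while True:
            line = data_file.readline()
            if not line:
                break
            lines_read += 1
            # Record only every 50,000th line to keep the offset table small.
            if lines_read % 50000 == 0:
                offset_file.write(f"{lines_read};{data_file.tell()}\n")
    return lines_read
```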
Describe alternatives you've considered
Manually download those files into a temp directory and update the input file path to point to the downloaded location.