
Support popular vector search datasets like sift, gist as corpora that can be downloaded from a public repository #442

Open
VijayanB opened this issue Jan 19, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@VijayanB
Member

Is your feature request related to a problem? Please describe.
Similar to the nyctaxi and geonames corpora, OpenSearch Benchmark can support some of the popular vector search datasets as corpora that can be downloaded and used in the vectorsearch workload, instead of being downloaded manually every time for standard use cases.
This can be added as part of nightly runs too.

While setting up a private corpus repository for the vectorsearch workload, I get an exception at

line = data_file.readline()

This is expected, since vector search datasets are not standard UTF-8 files. This is a blocker for adding a custom dataset as a corpus to the vectorsearch workload.
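For illustration, a minimal repro of that failure mode (assuming an ann-benchmarks style HDF5 file; the filename here is just an example):

# HDF5 is a binary container format, not line-oriented text, so reading it
# line-by-line as UTF-8 fails almost immediately.
with open("sift-128-euclidean.hdf5", "r", encoding="utf-8") as data_file:
    line = data_file.readline()  # raises UnicodeDecodeError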

Describe the solution you'd like

The prepare_file_offset_table method was required to optimize disk reads for large files by creating an offset table that maps line numbers to file offsets. This is not required for vectorsearch datasets, since we don't need the offset table for bulk ingestion at this moment. We should make creation of the file offset table optional, to extend the corpus eligibility criteria to support multiple formats.
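A rough sketch of what making it optional could look like (hypothetical signature and offset-file layout; the actual helper in OSB may differ):

def prepare_file_offset_table(data_file_path, create_offset_table=True):
    # Builds a line-number -> byte-offset table so bulk-indexing clients can
    # seek() straight to their slice of a large newline-delimited JSON corpus.
    # Binary formats such as HDF5 are not line-oriented, so callers would pass
    # create_offset_table=False and read the file through its own API instead.
    if not create_offset_table:
        return 0
    lines = 0
    offset = 0
    with open(data_file_path, "rb") as data_file, \
            open(data_file_path + ".offset", "w") as offset_file:
        for line in data_file:
            if lines % 1000 == 0:
                offset_file.write(f"{lines};{offset}\n")
            offset += len(line)
            lines += 1
    return lines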

Describe alternatives you've considered

Manually download those files into a temp directory and update the input file path to point to the downloaded location.

Additional context


@jmazanec15
Member

@VijayanB Can you show what using the workload with and without the corpus would look like to an end user?

@VijayanB
Member Author

VijayanB commented Feb 8, 2024

The vectorsearch workload requires three external inputs: train (vectors to index), test (vectors to search), and neighbors (ground truth). This works well for average/advanced users who want to benchmark performance against a specific dataset. However, for simple users or nightly runs, we can use some of the popular datasets to consistently measure performance across runs. In that case it would be an easier and better user experience if vectorsearch could just run the workload, similar to nyctaxi, instead of users downloading those standard inputs every time.
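To make the three inputs concrete, here is how such a dataset is typically laid out (assuming an ann-benchmarks style HDF5 file; h5py usage shown purely for illustration):

import h5py

# The ann-benchmarks datasets ship as a single HDF5 file containing the
# datasets below; this is what users currently download and wire up by hand.
with h5py.File("sift-128-euclidean.hdf5", "r") as f:
    train = f["train"]          # vectors to index
    test = f["test"]            # query vectors to search with
    neighbors = f["neighbors"]  # ground-truth neighbor ids for each query
    print(train.shape, test.shape, neighbors.shape)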

@jmazanec15
Member

@VijayanB Makes sense. Could you show in the issue what users have to do now vs. what they could do once the change is added? i.e. CLI commands, etc. Just want to make sure I understand the change in experience.

@VijayanB
Member Author

VijayanB commented Feb 8, 2024

@jmazanec15 Sure. I was about to raise a PR in the workload repo. Let me share how the new param file will look:

{
    "target_index_name": "target_index",
    "target_field_name": "target_field",
    "target_index_body": "indices/nmslib-index.json",
    "target_index_primary_shards": 1,
    "target_index_dimension": 128,
    "target_index_space_type": "l2",

    "target_index_bulk_size": 100,
    "target_index_bulk_index_data_set_format": "hdf5",
    "target_index_bulk_index_data_set_corpus": "sift-128-euclidean-train",
    "target_index_bulk_indexing_clients": 10,

    "target_index_max_num_segments": 10,
    "target_index_force_merge_timeout": 45.0,
    "hnsw_ef_search": 100,
    "hnsw_ef_construction": 100,
    "query_k": 100,

    "query_data_set_format": "hdf5",
    "query_data_set_corpus": "sift-128-euclidean-test",
    "neighbors_data_set_format": "hdf5",
    "neighbors_data_set_corpus": "sift-128-euclidean-neighbors",
    "query_count": 100
}
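For reference, running the workload with this param file would look something like the following (illustrative invocation of OpenSearch Benchmark's CLI; exact flags depend on the setup):

opensearch-benchmark execute-test \
    --workload=vectorsearch \
    --workload-params=params.json \
    --target-hosts=localhost:9200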

As a prerequisite, the corresponding corpus should be added to the workload.json file, similar to other workloads (see the illustrative snippet below).
This doesn't break existing behavior where users can still provide a file path.
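For example, the corpora entry in workload.json could look roughly like this (field values are illustrative; document-count and URLs depend on the dataset):

"corpora": [
    {
        "name": "sift-128-euclidean-train",
        "documents": [
            {
                "base-url": "http://ann-benchmarks.com",
                "source-file": "sift-128-euclidean.hdf5",
                "document-count": 1000000
            }
        ]
    }
]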

@jmazanec15
Member

@VijayanB thanks, I see. So the corpus, for the most part, will just represent a file that abstracts away its location, correct?

@VijayanB
Member Author

VijayanB commented Feb 8, 2024

@jmazanec15 That's correct. It can represent more than one file, but at this moment we don't support a folder or multiple files as input, hence I added a check to make sure that no more than one document (file) is provided.
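A minimal sketch of that guard (hypothetical names; the actual check lives in the workload's parameter sources):

def validate_corpus(corpus):
    # The corpus model allows several document files, but the vectorsearch
    # workload currently consumes exactly one, so anything else is rejected.
    if len(corpus.documents) != 1:
        raise ValueError(
            f"Corpus [{corpus.name}] must contain exactly one document file, "
            f"found {len(corpus.documents)}."
        )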
