
Support popular vector search datasets like sift, gist as corpora that can be downloaded from a public repository #442

Open
VijayanB opened this issue Jan 19, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@VijayanB
Member

Is your feature request related to a problem? Please describe.
Similar to the nyctaxi and geonames corpora, OpenSearch Benchmark can support some of the popular vector search datasets as corpora that can be downloaded and used in the vectorsearch workload, instead of being downloaded manually every time for standard use cases.
This can be added as part of nightly runs too.

While setting up a private corpus repository for the vectorsearch workload, I get an exception at

line = data_file.readline()

This is expected, since vector search datasets are not standard UTF-8 files. This is a blocker for adding a custom dataset as a corpus to the vectorsearch workload.
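For illustration, a minimal repro of that failure mode (assuming an ann-benchmarks style HDF5 file; the filename here is just an example):

# HDF5 is a binary container format, not line-oriented text, so reading it
# line-by-line as UTF-8 fails almost immediately.
with open("sift-128-euclidean.hdf5", "r", encoding="utf-8") as data_file:
    line = data_file.readline()  # raises UnicodeDecodeError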

Describe the solution you'd like

The prepare_file_offset_table method was required to optimize disk reads for large files by creating an offset table that maps line numbers to file offsets. This is not required for vectorsearch datasets, since we don't need the offset table for bulk ingestion at this moment. We should make creation of the file offset table optional, to extend the corpus eligibility criteria to support multiple formats.
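A rough sketch of what making it optional could look like (hypothetical signature and offset-file layout; the actual helper in OSB may differ):

def prepare_file_offset_table(data_file_path, create_offset_table=True):
    # Builds a line-number -> byte-offset table so bulk-indexing clients can
    # seek() straight to their slice of a large newline-delimited JSON corpus.
    # Binary formats such as HDF5 are not line-oriented, so callers would pass
    # create_offset_table=False and read the file through its own API instead.
    if not create_offset_table:
        return 0
    lines = 0
    offset = 0
    with open(data_file_path, "rb") as data_file, \
            open(data_file_path + ".offset", "w") as offset_file:
        for line in data_file:
            if lines % 1000 == 0:
                offset_file.write(f"{lines};{offset}\n")
            offset += len(line)
            lines += 1
    return lines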

Describe alternatives you've considered

Manually download those files into a temp directory and update the input file path to point to the downloaded location.

Additional context


@jmazanec15
Member

@VijayanB Can you show what using the workload with and without the corpus would look like to an end user?

@VijayanB
Member Author

VijayanB commented Feb 8, 2024

The vectorsearch workload requires three external inputs: train (vectors to index), test (vectors to search), and neighbors (ground truth). This works well for average/advanced users who want to benchmark performance against a specific dataset. However, for simple users or nightly runs, we can use some of the popular datasets to consistently measure performance across runs. In that case it would be an easier and better user experience if vectorsearch could just run the workload, similar to nyctaxi, instead of users downloading those standard inputs every time.
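To make the three inputs concrete, here is how such a dataset is typically laid out (assuming an ann-benchmarks style HDF5 file; h5py usage shown purely for illustration):

import h5py

# The ann-benchmarks datasets ship as a single HDF5 file containing the
# datasets below; this is what users currently download and wire up by hand.
with h5py.File("sift-128-euclidean.hdf5", "r") as f:
    train = f["train"]          # vectors to index
    test = f["test"]            # query vectors to search with
    neighbors = f["neighbors"]  # ground-truth neighbor ids for each query
    print(train.shape, test.shape, neighbors.shape)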

@jmazanec15
Member

@VijayanB Makes sense. Could you show in the issue what users have to do now vs. what they could do once the change is added? i.e. CLI commands, etc. Just want to make sure I understand the change in experience.

@VijayanB
Member Author

VijayanB commented Feb 8, 2024

@jmazanec15 Sure. I was about to raise a PR in the workload repo. Let me share how the new param file will look:

{
    "target_index_name": "target_index",
    "target_field_name": "target_field",
    "target_index_body": "indices/nmslib-index.json",
    "target_index_primary_shards": 1,
    "target_index_dimension": 128,
    "target_index_space_type": "l2",

    "target_index_bulk_size": 100,
    "target_index_bulk_index_data_set_format": "hdf5",
    "target_index_bulk_index_data_set_corpus": "sift-128-euclidean-train",
    "target_index_bulk_indexing_clients": 10,

    "target_index_max_num_segments": 10,
    "target_index_force_merge_timeout": 45.0,
    "hnsw_ef_search": 100,
    "hnsw_ef_construction": 100,
    "query_k": 100,

    "query_data_set_format": "hdf5",
    "query_data_set_corpus": "sift-128-euclidean-test",
    "neighbors_data_set_format": "hdf5",
    "neighbors_data_set_corpus": "sift-128-euclidean-neighbors",
    "query_count": 100
}
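For reference, running the workload with this param file would look something like the following (illustrative invocation of OpenSearch Benchmark's CLI; exact flags depend on the setup):

opensearch-benchmark execute-test \
    --workload=vectorsearch \
    --workload-params=params.json \
    --target-hosts=localhost:9200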

As a prerequisite, the corresponding corpus should be added to the workload.json file, similar to other workloads (see the illustrative snippet below).
This doesn't break existing behavior where users can still provide a file path.
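For example, the corpora entry in workload.json could look roughly like this (field values are illustrative; document-count and URLs depend on the dataset):

"corpora": [
    {
        "name": "sift-128-euclidean-train",
        "documents": [
            {
                "base-url": "http://ann-benchmarks.com",
                "source-file": "sift-128-euclidean.hdf5",
                "document-count": 1000000
            }
        ]
    }
]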

@jmazanec15
Member

@VijayanB thanks, I see. So the corpus, for the most part, will just represent a file that abstracts away its location, correct?

@VijayanB
Member Author

VijayanB commented Feb 8, 2024

@jmazanec15 That's correct. It can represent more than one file, but at this moment we don't support a folder or multiple files as input, hence I added a check to make sure that no more than one document (file) is provided.
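A minimal sketch of that guard (hypothetical names; the actual check lives in the workload's parameter sources):

def validate_corpus(corpus):
    # The corpus model allows several document files, but the vectorsearch
    # workload currently consumes exactly one, so anything else is rejected.
    if len(corpus.documents) != 1:
        raise ValueError(
            f"Corpus [{corpus.name}] must contain exactly one document file, "
            f"found {len(corpus.documents)}."
        )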
