Faiss to Parquet Conversion #2631

Merged

valamuri2020
Member

get_faiss_indexes.sh - downloads BEIR collections in Faiss format

faiss_to_parquet.py - converts a Faiss collection to Parquet format

Usage:

python src/main/python/parquet/faiss_to_parquet.py --input /path/to/faiss/collection/ --output /path/to/output/ --overwrite

Pass --overwrite to overwrite the output directory. Each file in the output directory holds up to 1M vectors.
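The "up to 1M vectors per file" behaviour can be sketched roughly as below. This is only an illustration, not the code in this PR: the input layout (a Faiss file named `index` next to a `docid` text file), the column names `docid` and `vector`, the `part-00000.parquet` naming, and the function names `shard_ranges` and `convert` are all assumptions. `reconstruct_n` also only works for index types that store vectors in recoverable form, such as flat indexes.

```python
from pathlib import Path
from typing import Iterator, Tuple

SHARD_SIZE = 1_000_000  # "up to 1M vectors per file", per the PR description


def shard_ranges(total: int, shard_size: int = SHARD_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) row ranges that together cover `total` vectors."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    for start in range(0, total, shard_size):
        yield start, min(start + shard_size, total)


def convert(input_dir: str, output_dir: str, overwrite: bool = False) -> None:
    """Hypothetical conversion loop; file and column names are assumptions."""
    # Imported lazily so shard_ranges() stays usable without these packages.
    import shutil

    import faiss
    import pyarrow as pa
    import pyarrow.parquet as pq

    out = Path(output_dir)
    if out.exists():
        if not overwrite:
            raise FileExistsError(f"{out} exists; pass overwrite=True to replace it")
        shutil.rmtree(out)
    out.mkdir(parents=True)

    # Assumed layout: a Faiss index file "index" and one docid per line in "docid".
    index = faiss.read_index(str(Path(input_dir) / "index"))
    docids = (Path(input_dir) / "docid").read_text().splitlines()

    for shard, (start, end) in enumerate(shard_ranges(index.ntotal)):
        # Returns an (end - start, dim) float32 array for reconstructible indexes.
        vectors = index.reconstruct_n(start, end - start)
        table = pa.table({"docid": docids[start:end], "vector": vectors.tolist()})
        pq.write_table(table, str(out / f"part-{shard:05d}.parquet"))
```

With this sketch, a collection of 2.5M vectors would produce three files: two with 1M rows each and one with 500K. Calling `vectors.tolist()` is simple but memory-hungry; a fixed-size list column would be a leaner choice for large shards.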

run_conversions.sh - converts collections to Parquet, then runs indexing, search, and eval on the newly converted Parquet data. It was written as part of validation, to ensure embedding quality matched the JSON to Parquet conversion, and was kept afterwards.

Usage:

To use the default base directory:
./run_conversions.sh --all
To specify a custom base directory:
./run_conversions.sh /path/to/custom/dir --all
To process specific subdirectories in a custom directory:
./run_conversions.sh /path/to/custom/dir subdir1 subdir2
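The three invocation forms above can be read as "optional base directory, then either --all or a list of subdirectories". The sketch below is one way to interpret them, written in Python for illustration only; the script itself is a shell script and may parse its arguments differently. `DEFAULT_BASE_DIR` is a hypothetical placeholder, since the real default is not stated here, and `parse_args` is an illustrative name.

```python
from typing import List, Optional, Tuple

# Hypothetical placeholder; the script's actual default base directory is not stated.
DEFAULT_BASE_DIR = "/path/to/default/dir"

USAGE = "usage: run_conversions.sh [base_dir] (--all | subdir...)"


def parse_args(argv: List[str]) -> Tuple[str, Optional[List[str]]]:
    """Return (base_dir, subdirs); subdirs is None when --all is requested."""
    if not argv:
        raise SystemExit(USAGE)
    if argv[0] == "--all":
        # Form 1: default base directory, every subdirectory.
        if len(argv) > 1:
            raise SystemExit(USAGE)
        return DEFAULT_BASE_DIR, None
    base_dir, rest = argv[0], argv[1:]
    if rest == ["--all"]:
        # Form 2: custom base directory, every subdirectory.
        return base_dir, None
    if not rest or "--all" in rest:
        raise SystemExit(USAGE)
    # Form 3: custom base directory, only the named subdirectories.
    return base_dir, rest
```

Under this reading, `parse_args(["/data", "subdir1", "subdir2"])` selects only those two subdirectories of `/data`, while `parse_args(["--all"])` falls back to the placeholder default.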

valamuri2020 changed the title from "Faiss to Parquet conversion" to "Faiss to Parquet Conversion" on Nov 22, 2024
Member

lintool left a comment

Please update submodule tools/; your current PR is going backwards to an earlier commit state.

get_faiss_indexes.sh (review thread: outdated, resolved)
lintool merged commit 8bd8ca8 into castorini:master on Nov 22, 2024
1 check passed
2 participants