A Plug & Play example is given in `pdxearch_simple.py`. This example creates a random collection of vectors with scikit-learn. The rest of the examples read vector data from a `.hdf5` file.
Our Python bindings expect NumPy matrices as input. However, the provided examples read vectors from `.hdf5` files, which are expected to be located at `/benchmarks/datasets/downloaded`. These datasets follow the convention used in the ANN-Benchmarks project: one `.hdf5` file with two datasets, `train` and `test`. We have a few ways in which you can download the data we used:
- Download and unzip ALL the `.hdf5` datasets manually from here (~25GB zipped and ~40GB unzipped).
- Download datasets individually from here.
- Run the script `/benchmarks/python_scripts/setup_data.py` from the root folder with the script flag `DOWNLOAD = True`. This will download and unzip ALL the `.hdf5` datasets (~25GB zipped and ~40GB unzipped). Make sure you set all the other flags to `False` and comment out the elements inside the `ALGORITHMS` array.
You can, of course, change each example to read your data.
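As a starting point for reading your own data, here is a minimal sketch of loading an ANN-Benchmarks-style `.hdf5` file into NumPy matrices with `h5py`. The helper name `load_train_test` is hypothetical, and the conversion to contiguous `float32` is an assumption about what is convenient to pass on, not a documented requirement of the bindings.

```python
import h5py
import numpy as np


def load_train_test(path):
    """Read the `train` and `test` datasets from one .hdf5 file.

    Hypothetical helper: returns two 2-D NumPy matrices
    (one vector per row). float32 is an assumption.
    """
    with h5py.File(path, "r") as f:
        # `[:]` copies the on-disk dataset into memory as a NumPy array.
        train = np.ascontiguousarray(f["train"][:], dtype=np.float32)
        test = np.ascontiguousarray(f["test"][:], dtype=np.float32)
    return train, test
```

Here `train` would hold the collection to be indexed and `test` the query vectors, following the ANN-Benchmarks convention described above.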
This collection of examples uses the algorithms exposed in our Python bindings.
- `pdx_brute.py`: PDX kernels (without pruning). The full scan on vertical kernels shines when D is high, as the tight loops of the kernel avoid additional LOAD/STORE operations and are free of dependencies. Refer to Figure 3 in our publication.
- `pdxearch_simple.py`: PDXearch (pruned search) + ADSampling with an IVF index (built with FAISS). Plug & Play example that uses a random collection of vectors.
- `pdxearch_exact.py`: PDXearch (pruned search) + ADSampling on the entire collection (no index). This produces virtually exact results. In our experiments, the recall loss due to ADSampling hypothesis testing was never higher than 0.001.
- `pdxearch_exact_bond.py`: PDXearch (pruned search) + BOND on the entire collection (no index). This produces exact results.
- `pdxearch_ivf.py`: PDXearch (pruned search) + ADSampling with an IVF index (built with FAISS). The recall is controlled with the `nprobe` parameter.
- `pdxearch_ivf_exhaustive.py`: Exact search using PDXearch (pruned search) + ADSampling with an IVF index (built with FAISS). We can do an exact search by exploring all the buckets. This lets the pruning strategy shine and get up to 13x speedup. This produces virtually exact results. In our experiments, the recall loss due to ADSampling hypothesis testing was never higher than 0.001.
- `pdxearch_ivf_exhaustive_bond.py`: Exact search using PDXearch (pruned search) + BOND with an IVF index (built with FAISS). We can do an exact search by exploring all the buckets. This produces exact results.
- `pdxearch_persist.py`: Example to store the PDX index and the metadata of ADSampling in a file to use later.
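The recall figures mentioned in the list above compare the neighbors returned by a search against exact ground truth. The following is a minimal NumPy-only sketch of how such a measurement could be set up. It does not use the PDX bindings, and `brute_force_knn` and `recall_at_k` are hypothetical helper names introduced only for illustration.

```python
import numpy as np


def brute_force_knn(data, queries, k):
    """Exact k nearest neighbors (squared Euclidean) by full scan."""
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, computed for all pairs.
    d2 = (
        (queries ** 2).sum(axis=1)[:, None]
        - 2.0 * queries @ data.T
        + (data ** 2).sum(axis=1)[None, :]
    )
    # Select the k smallest distances per query, then sort those k.
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    rows = np.arange(len(queries))[:, None]
    order = np.argsort(d2[rows, idx], axis=1)
    return idx[rows, order]


def recall_at_k(found, truth):
    """Fraction of true neighbor ids that also appear in `found`."""
    hits = sum(len(set(f) & set(t)) for f, t in zip(found, truth))
    return hits / truth.size
```

With ground truth from `brute_force_knn`, a recall loss of 0.001 would correspond to `recall_at_k` returning 0.999 for the ids produced by the search under test.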
Note
As part of our research, we also ran benchmarks of PDXearch against the pruning algorithms ADSampling and BSA (renamed to DDC) on the N-ary/horizontal layout. These are not available to use directly in our Python bindings. Refer to BENCHMARKING.md.