This project is a fork of the original ann-benchmarks repository. We make significant use of their codebase and extend it to handle dynamically updating algorithms. Please see the original project for more information. Attempts have been made to keep the original functionality of the benchmarks accessible, but this project is mainly concerned with dynamically updating algorithms.
The following methods are those evaluated in the original project, with annotations concerning their support for dynamic updates. We have also added `dci-knn` and `pydci`.
Method | Dynamic updates? | Notes on dynamic updates |
---|---|---|
Annoy | ❌ | "in particular you can not add more items once the tree has been created" |
FLANN | ✔️ | "Working with dynamic point clouds without a need to rebuild entire kd-tree index" |
scikit-learn: LSHForest, KDTree, BallTree | ❌ | Requires rebuilding the tree each time |
PANNS | ❌ | Can be compared to old indices but needs to be recreated |
NearPy | ❌ | Rehashes with each new query vector |
KGraph | ❌ | The index file has to be rebuilt with each new dataset. |
NMSLIB (Non-Metric Space Library): SWGraph, HNSW, BallTree, MPLSH | ❌ | Static data only |
hnswlib (part of the nmslib project) | ✔️ | Updatable module |
RPForest | ❌ | Tree structures have to be rebuilt |
FAISS | ❌ | The x_i's are assumed to be fixed |
DolphinnPy | ❌ | Hyperplane LSH family model |
Datasketch | ❌ | Rebuild index each time |
PyNNDescent | ❌ | Rebuild index each time |
MRPT | ❌ | Rebuild index each time |
NGT: ONNG, PANNG, QG | ❌ | Project has evolved into VALD, which supports dynamic updating |
SPTAG | ✔️ | "Fresh update: Support online vector deletion and insertion" |
PUFFINN | ❌ | Can insert points but needs to be rebuilt after insertion |
N2 | ❌ | Rebuild each time, but can be split over several threads |
ScaNN | ❌ | Works well with large data, but does not support dynamic updating |
Elastiknn | ✔️ | "Implementation based on standard Elasticsearch and Lucene primitives, entirely in the JVM. Indexing and querying scale horizontally with Elasticsearch." |
OpenSearch KNN | ❌ | High latency in large dimensional vectors |
DiskANN: Vamana, Vamana-PQ | ✔️ | Updates when points are added |
Vespa | ✔️ | "Vespa is self-repairing and dynamic" |
scipy: cKDTree | ❌ | Tree structure needs to be rebuilt |
vald | ✔️ | "Vald uses distributed index graphs so it continues to work during indexing" |
dci-knn | ❌ | |
pydci | ✔️ | |
The datasets used for this project were provided to us by Siemens and are not included in this repository. Before running any tests on this data, please move the following files into the `data` directory: `AS.csv`, `OLHC.csv`, `SHERPA.csv`, and `SHERPA_100000.csv`.
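Assuming the CSV files sit in the repository root, a move like the following should work (the `data` directory may need to be created first):

```bash
mkdir -p data
mv AS.csv OLHC.csv SHERPA.csv SHERPA_100000.csv data/
```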
Note that the data included with the original project should be adaptable for dynamic tests. See the function `ann_benchmarks.datasets.write_dynamic_output` for more information and `ann_benchmarks.datasets.siemens_dynamic` for an example.
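Conceptually, adapting a static dataset just means designating part of it as the initial index and streaming in the rest as insertions interleaved with queries. The snippet below is only an illustrative sketch of that split; the helper name and parameters are hypothetical, and the actual logic lives in `write_dynamic_output` and `siemens_dynamic`:

```python
import numpy as np

def split_for_dynamic_test(X, initial_fraction=0.25, seed=0):
    """Hypothetical helper: split a static dataset into an initial index set
    and a stream of points to be inserted while querying."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))       # shuffle so the stream is not ordered
    n_initial = int(initial_fraction * len(X))
    initial = X[order[:n_initial]]        # indexed before any queries run
    stream = X[order[n_initial:]]         # inserted incrementally during the run
    return initial, stream

# Random data standing in for one of the original static datasets:
X = np.random.rand(10_000, 32).astype(np.float32)
initial, stream = split_for_dynamic_test(X)
```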
The only prerequisites are Python (tested with 3.6) and Docker.
- Clone the repo.
- Run `pip install -r requirements.txt`.
- Run `python install.py` to build all the libraries inside Docker containers (this can take a while, roughly 10-30 minutes).
In order to replicate the results summarized in the report, try using the script `run_script.sh`. As currently set up, it will take a very long time to run, so running only one dataset or algorithm at a time may be preferable. This can be done with the command

```bash
python run.py --dataset $DATASET --count -1 --runs 5 --algorithm $ALGORITHM
```

where `$DATASET` can be any of

- `siemens-sherpa`
- `siemens-big-sherpa`
- `siemens-olhc`
- `siemens-as`

and `$ALGORITHM` can be any of

- `bruteforce`
- `dciknn`
- `flann`
- `hnswlib`
- `pydci`
- `sptag`
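For example, to benchmark hnswlib on the SHERPA dataset with five runs:

```bash
python run.py --dataset siemens-sherpa --count -1 --runs 5 --algorithm hnswlib
```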
See `python run.py --help` for further explanation and command line arguments.
In order to change the values tested in the parameter sweep for each method, change the `args` and `query-args` entries under the desired methods in `algos.yaml`.
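For instance, an entry following the upstream ann-benchmarks layout looks roughly like the sketch below. The exact keys and values for each method in this fork's `algos.yaml` may differ (the tag name and parameter values here are assumptions), so treat this only as an illustration of where `args` and `query-args` live:

```yaml
hnswlib:
  docker-tag: ann-benchmarks-hnswlib        # assumed tag name
  module: ann_benchmarks.algorithms.hnswlib
  constructor: HnswLib
  run-groups:
    M-16:
      args: [[16], [100, 200]]              # build-time parameters swept over
      query-args: [[10, 50, 100, 200]]      # query-time parameters swept over
```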
Since the testing workflow is somewhat complex, a detailed walkthrough is provided in `docs/dynamic_workflow_outline.md`.
To replicate the majority of plots in the report, run `plot_script.sh` after running `run_script.sh`. This script will plot results for a few combinations of algorithms on all datasets. See `python plot_dynamic.py --help` for more information, and see the commands in `plot_script.sh` for examples. Note that specifying metrics in the `--best_metric` argument will find the run with the best behavior in that metric over the last 10% of queries. These runs will also be cached separately so that replotting is much faster. Using the `--force` option will recompute these best runs rather than using their cached versions.
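As a purely hypothetical example (flags other than `--best_metric`, `--force`, and `--algorithms` are assumptions, as is the metric name; check `python plot_dynamic.py --help` for the options this fork actually accepts), a best-recall plot might be requested with something like:

```bash
python plot_dynamic.py --dataset siemens-sherpa --algorithms hnswlib pydci \
    --best_metric recall --force
```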
In addition, see `custom_plot_dynamic.py` for plotting individual runs rather than averages of all runs or the runs best in a certain metric. The main difference is that the `--algorithms` argument takes the specific HDF5 files of the runs to be plotted rather than algorithm names.
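For illustration only (the results paths shown here are hypothetical; pass whichever HDF5 files your runs actually produced), such an invocation might look like:

```bash
python custom_plot_dynamic.py \
    --algorithms results/siemens-sherpa/hnswlib/run_0.hdf5 \
                 results/siemens-sherpa/pydci/run_0.hdf5
```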
- Add your algorithm into `ann_benchmarks/algorithms` by providing a small Python wrapper (a hedged sketch of such a wrapper follows this list).
- Add a Dockerfile for it in `install/`.
- Add it to `algos.yaml`.
- Make everything easy to replicate, including installing and preparing the datasets.
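The sketch below follows the upstream ann-benchmarks convention of a wrapper class with `fit`, `set_query_arguments`, `query`, and `__str__` methods. The exact base class and any dynamic-update hooks this fork expects may differ, so treat it as a minimal illustration (backed here by an exact brute-force search purely so the example runs):

```python
import numpy as np


class MyWrapper:
    """Illustrative wrapper sketch. A real wrapper would delegate to the
    library being benchmarked instead of the exact search used here."""

    def __init__(self, metric):
        self.metric = metric          # build-time parameter (from `args`)
        self.data = None

    def fit(self, X):
        # Build the index over the initial data; here we just store the vectors.
        self.data = np.asarray(X, dtype=np.float32)

    def set_query_arguments(self, n_candidates):
        # Query-time parameter (from `query-args`); unused by this exact search.
        self.n_candidates = n_candidates

    def query(self, q, n):
        # Return the indices of the n nearest neighbors of q.
        dists = np.linalg.norm(self.data - np.asarray(q, dtype=np.float32), axis=1)
        return np.argsort(dists)[:n]

    def __str__(self):
        return f"MyWrapper(metric={self.metric})"
```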
- Single queries are used by default. ANN-Benchmarks enforces that only one CPU is saturated during experimentation, i.e., no multi-threading.
- We mainly support CPU-based ANN algorithms. GPU support exists for FAISS, but it has to be compiled with GPU support locally and experiments must be run using the flags `--local --batch`.
- Set similarity datasets are treated as sparse, so we pass a sorted array of integers to algorithms to represent each user's set.
Original project built by Erik Bernhardsson with significant contributions from Martin Aumüller and Alexander Faithfull.
The modifications for this project were primarily made by Craig Gross with support from Daniel Ejsmont, Cullen Haselby, and Jim Lewis at Michigan State University.
The following publication details design principles behind the original benchmarking framework:
- M. Aumüller, E. Bernhardsson, A. Faithfull: ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. Information Systems 2019. DOI: 10.1016/j.is.2019.02.006
- big-ann-benchmarks is a benchmarking effort for billion-scale approximate nearest neighbor search as part of the NeurIPS'21 Competition track.