Dynamic benchmarking of nearest neighbors

Note

This project is a fork of the original ann-benchmarks repository. We make significant use of that codebase and extend it to handle dynamically updating algorithms; please see the original project for more information. We have tried to keep the original benchmarking functionality accessible, but this project is mainly concerned with dynamic updates.

Evaluated

The following methods are those evaluated in the original project, with annotations concerning their support for dynamic updates. We have also added dci-knn and pydci.

| Method | Dynamic updates? | Notes on dynamic updates |
| --- | --- | --- |
| Annoy | | "in particular you can not add more items once the tree has been created" |
| FLANN | ✔️ | "Working with dynamic point clouds without a need to rebuild entire kd-tree index" |
| scikit-learn: LSHForest, KDTree, BallTree | | Requires rebuilding the tree each time |
| PANNS | | Can be compared to old indices but needs to be recreated |
| NearPy | | Rehashes with each new query vector |
| KGraph | | The index file has to be rebuilt with each new dataset |
| NMSLIB (Non-Metric Space Library): SWGraph, HNSW, BallTree, MPLSH | | Static data only |
| hnswlib (a part of nmslib project) | ✔️ | Updatable module |
| RPForest | | Tree structures have to be rebuilt |
| FAISS | | The x_i's are assumed to be fixed |
| DolphinnPy | | Hyperplane LSH family model |
| Datasketch | | Rebuild index each time |
| PyNNDescent | | Rebuild index each time |
| MRPT | | Rebuild index each time |
| NGT: ONNG, PANNG, QG | | Project has evolved into VALD, which supports dynamic updating |
| SPTAG | ✔️ | "Fresh update: Support online vector deletion and insertion" |
| PUFFINN | | Can insert points but needs to be rebuilt after insertion |
| N2 | | Rebuild each time, but can be split over several threads |
| ScaNN | | Works well with large data, but does not support dynamic updating |
| Elastiknn | ✔️ | "Implementation based on standard Elasticsearch and Lucene primitives, entirely in the JVM. Indexing and querying scale horizontally with Elasticsearch." |
| OpenSearch KNN | | High latency with high-dimensional vectors |
| DiskANN: Vamana, Vamana-PQ | ✔️ | Updates when points are added |
| Vespa | ✔️ | "Vespa is self-repairing and dynamic" |
| scipy: cKDTree | | Tree structure needs to be rebuilt |
| vald | ✔️ | "Vald uses distributed index graphs so it continues to work during indexing" |
| dci-knn | | |
| pydci | ✔️ | |

Data sets

The datasets used for this project were provided to us by Siemens and are not included in this repository. Before running any tests on this data, please move the following files to the data directory: AS.csv, OLHC.csv, SHERPA.csv, SHERPA_100000.csv.
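For example, assuming the data directory sits at the repository root and the Siemens CSVs were delivered to a directory of your choosing (the source path below is a hypothetical placeholder):

mkdir -p data
cp /path/to/siemens/AS.csv /path/to/siemens/OLHC.csv /path/to/siemens/SHERPA.csv /path/to/siemens/SHERPA_100000.csv data/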

Note that the data included with the original project should be adaptable for dynamic tests. See the function ann_benchmarks.datasets.write_dynamic_output for more information and ann_benchmarks.datasets.siemens_dynamic for an example.

Install

The only prerequisites are Python (tested with 3.6) and Docker.

  1. Clone the repo.
  2. Run pip install -r requirements.txt.
  3. Run python install.py to build all the libraries inside Docker containers (this can take a while, roughly 10-30 minutes).
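Put together, the install steps might look like the following, assuming the fork lives at github.com/cgross95/ann-benchmarks (inferred from the repository name):

git clone https://github.com/cgross95/ann-benchmarks.git
cd ann-benchmarks
pip install -r requirements.txt
python install.py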

Running

In order to replicate the results summarized in the report, try using the script run_script.sh. As currently set up, it will take a very long time to run, so it may be preferable to run only one dataset or algorithm at a time. This can be done with a command of the form

python run.py --dataset $DATASET --count -1 --runs 5 --algorithm $ALGORITHM

where $DATASET can be any of

  • siemens-sherpa
  • siemens-big-sherpa
  • siemens-olhc
  • siemens-as

and $ALGORITHM can be any of

  • bruteforce
  • dciknn
  • flann
  • hnswlib
  • pydci
  • sptag
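
For example, to benchmark hnswlib on the siemens-sherpa dataset:

python run.py --dataset siemens-sherpa --count -1 --runs 5 --algorithm hnswlib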

See python run.py --help for further explanation and command line arguments.

To change the values tested in the parameter sweep for each method, edit the args and query-args entries under the desired methods in algos.yaml.

Since the testing workflow is somewhat complex, a detailed walkthrough is provided in docs/dynamic_workflow_outline.md.

Plotting

To replicate the majority of plots in the report, run plot_script.sh after running run_script.sh. This script plots results for a few combinations of algorithms on all datasets. See python plot_dynamic.py --help for more information, and see the commands in plot_script.sh for examples. Note that specifying metrics in the --best_metric argument will find the run with the best behavior in that metric over the last 10% of queries. These best runs are also cached separately so that replotting is much faster; the --force option recomputes them rather than using the cached versions.
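As a rough illustration, a single invocation might look like the line below; apart from --best_metric and --force, the flag names and the metric value are assumptions, so check python plot_dynamic.py --help and the commands in plot_script.sh for the actual interface.

python plot_dynamic.py --dataset siemens-sherpa --algorithms hnswlib flann --best_metric recall --force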

In addition, see custom_plot_dynamic.py for plotting individual runs rather than averages of all runs or the runs that are best in a certain metric. The main difference is that the --algorithms argument takes the specific HDF5 files of the runs to be plotted rather than algorithm names.
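For instance, a sketch of such a call (the HDF5 paths are purely hypothetical placeholders for real result files):

python custom_plot_dynamic.py --algorithms path/to/run_1.hdf5 path/to/run_2.hdf5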

Including your algorithm

  1. Add your algorithm to ann_benchmarks/algorithms by providing a small Python wrapper (a sketch is shown after this list).
  2. Add a Dockerfile for it in install/.
  3. Add it to algos.yaml.
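As a rough sketch of step 1, assuming the fork keeps the BaseANN interface from the original project's ann_benchmarks/algorithms/base.py (the dynamic benchmarks may expect additional insert/update hooks, so compare with an existing wrapper such as the brute-force one before relying on this):

    import numpy as np

    from ann_benchmarks.algorithms.base import BaseANN


    class MyBruteForce(BaseANN):
        """Toy wrapper: exact search with numpy, used only to illustrate the interface."""

        def __init__(self, metric):
            if metric != "euclidean":
                raise NotImplementedError("this sketch only handles euclidean distance")
            self.name = "MyBruteForce()"

        def fit(self, X):
            # A real algorithm would build its index here; we just store the points.
            self._data = np.asarray(X, dtype=np.float32)

        def query(self, v, n):
            # Return the indices of the n stored points closest to v.
            distances = np.linalg.norm(self._data - np.asarray(v, dtype=np.float32), axis=1)
            return np.argsort(distances)[:n]

In the original framework, the algos.yaml entry and the Dockerfile then point at this module and constructor so that run.py can build and query it inside its container.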

Principles

  • Make everything easy to replicate, including installing and preparing the datasets.
  • Single queries are used by default. ANN-Benchmarks enforces that only one CPU is saturated during experimentation, i.e., no multi-threading.
  • We mainly support CPU-based ANN algorithms. GPU support exists for FAISS, but it has to be compiled with GPU support locally and experiments must be run using the flags --local --batch.
  • Note that set similarity datasets are treated as sparse; we therefore pass each algorithm a sorted array of integers representing each user's set.

Authors

Original project built by Erik Bernhardsson with significant contributions from Martin Aumüller and Alexander Faithfull.

The modifications for this project were primarily made by Craig Gross with support from Daniel Ejsmont, Cullen Haselby, and Jim Lewis at Michigan State University.

Related Publication

The following publication details design principles behind the original benchmarking framework:

M. Aumüller, E. Bernhardsson, A. Faithfull: ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. Information Systems 87 (2020).

