Nearest Neighbor Descent (nndescent) is a C++ implementation of the nearest neighbor descent algorithm, designed for efficient and accurate approximate nearest neighbor search. With seamless integration into Python, it offers a powerful solution for constructing k-nearest neighbor graphs. This algorithm is based on the pynndescent library.
- Seamless integration into Python and effortless installation using
pip
. - The handling of nndescent is very similar to that of pynndescent.
- Pure C++11 implementation utilizing OpenMP for parallel computation. No other external libraries are needed.
- Currently tested only on Linux.
- Both dense and sparse matrices are supported.
- Implementation of multiple distance functions, i.e.
- Bray-Curtis
- Canberra
- Chebyshev
- Circular Kantorovich (no sparse verion)
- Correlation
- Cosine
- Dice
- Dot
- Euclidean
- Hamming
- Haversine
- Hellinger
- Hellinger
- Jaccard
- Jensen-Shannon
- Kulsinski
- Manhattan
- Matching
- Minkowski
- Rogers-Tanimoto
- Russell-Rao
- Sokal-Michener
- Sokal-Sneath
- Spearman's Rank Correlation (no sparse version)
- Symmetric KL Divergence
- TSSS
- True Angular
- Wasserstein 1D (no sparse version)
- Yule
Please note that not all distances have undergone thorough testing. Therefore, it is advised to use them with caution and at your own discretion.
You can install nndescent directly from PyPI using pip:
pip install nndescent
If you want to run the examples in tests
, additional packages are needed. You can install them manually or install nndescent with the full option:
pip install nndescent[full]
- Clone the repository:
git clone https://github.com/brj0/nndescent.git
cd nndescent
- Build and install the package:
pip install .
If you want to run the examples in tests
, additional packages are needed. You can install them manually or install nndescent with the full option:
pip install .[full]
- To run the examples in
tests
you must first download the datasets:
python tests/make_test_data.py
In Python you can utilize the nndescent library in the following way:
import numpy as np
import nndescent
# Data must be a 2D numpy array of dtype 'float32'.
data = np.random.randint(50, size=(20,3)).astype(np.float32)
# Run NND algorithm
nnd = nndescent.NNDescent(data, n_neighbors=4)
# Get result
nn_indices, nn_distances = nnd.neighbor_graph
# Query data must be a 2D numpy array of dtype 'float32'.
query_data = np.random.randint(50, size=(5,3)).astype(np.float32)
# Calculate nearest neighbors for each query point
nn_query_indices, nn_query_distances = nnd.query(query_data, k=6)
To compile and run the C++ examples use the following commands within the project folder:
mkdir build
cd build
cmake ..
make
./simple
For detailed usage in C++ and for further Python/C++ examples please refer to the examples provided in the tests
directory of the repository and the code documentation.
On my computer, the training phase of nndescent is approximately 15% faster than pynndescent for dense matrices, and 75% faster for sparse matrices. Furthermore, the search query phase shows a significant improvement, with >70% faster execution time. Below is the output obtained from running tests/benchmark.py
, an ad hoc benchmark test. In this test, both nndescent and pynndescent were executed with the same parameters using either 'euclidean' or 'dot' as metric:
Data set | py train [ms] | c train [ms] | ratio | py vs c match | py test [ms] | c test [ms] | ratio | py accuracy | c accuracy |
---|---|---|---|---|---|---|---|---|---|
faces | 149.8 | 145.9 | 0.974 | 1.000 | 1663.7 | 18.4 | 0.011 | 1.000 | 0.999 |
fmnist | 11959.2 | 10768.7 | 0.900 | 0.997 | 5754.8 | 1334.1 | 0.232 | 0.978 | 0.978 |
glove25 | 149754.2 | 101864.0 | 0.680 | 0.964 | 98740.6 | 9907.4 | 0.100 | 0.796 | 0.808 |
glove50 | 192965.8 | 137171.8 | 0.711 | 0.885 | 99750.8 | 10647.7 | 0.107 | 0.705 | 0.743 |
glove100 | 218202.9 | 180088.4 | 0.825 | 0.815 | 98770.2 | 12080.4 | 0.122 | 0.651 | 0.731 |
glove200 | 287206.6 | 243466.6 | 0.848 | 0.772 | 101639.4 | 17615.6 | 0.173 | 0.622 | 0.773 |
mnist | 11319.7 | 10188.1 | 0.900 | 0.997 | 5725.9 | 1273.8 | 0.222 | 0.969 | 0.968 |
nytimes | 63323.8 | 55638.1 | 0.879 | 0.814 | 23632.1 | 7108.9 | 0.301 | 0.614 | 0.811 |
sift | 131711.4 | 105826.0 | 0.803 | 0.974 | 82503.7 | 7957.9 | 0.096 | 0.838 | 0.839 |
20newsgroups | 107339.0 | 28339.7 | 0.264 | 0.922 | 67518.6 | 22513.1 | 0.333 | 0.858 | 0.929 |
The compilation time and the lengthy numba loading time during runtime and import for 'pynndescent' are not considered in this ad hoc benchmark test. An Ann-Benchmarks wrapper is planned for the future.
The theoretical background of NND is based on the following paper:
- Dong, Wei, Charikar Moses, and Kai Li. "Efficient k-nearest neighbor graph construction for generic similarity measures." Proceedings of the 20th International Conference on World Wide Web. 2011.
In addition, the algorithm utilizes random projection trees for initializing the nearest neighbor graph. The nndescent algorithm constructs a tree by randomly selecting two points and splitting the data along a hyperplane passing through their midpoint. For a more theoretical background, please refer to:
- DASGUPTA, Sanjoy; FREUND, Yoav. Random projection trees and low dimensional manifolds. In: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing. 2008.
Contributions are welcome! If you have any bug reports, feature requests, or suggestions, please open an issue or submit a pull request.
This project is licensed under the BSD-2-Clause license.
This implementation is based on the original pynndescent library by Leland McInnes. I would like to acknowledge and appreciate his work as a source of inspiration for this project.
For more information, visit the pynndescent GitHub repository.