Commit 841635f

Author: Danilo Poccia

First release.
0 parents  commit 841635f

5 files changed: +704 -0 lines changed

.gitignore

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License Copyright (c) 2024 Danilo Poccia

Permission is hereby granted, free
of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to the
following conditions:

The above copyright notice and this permission notice
(including the next paragraph) shall be included in all copies or substantial
portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO
EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

README.md

Lines changed: 176 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,176 @@
# KNN Search Algorithm Comparison

This project compares the performance of different K-Nearest Neighbors (KNN) search algorithms across various dataset sizes and dimensions. The algorithms compared are:

1. KD-Tree
2. Ball Tree
3. Brute Force (Full KNN)
4. HNSW (Hierarchical Navigable Small World)

## Algorithm Explanations

1. KD-Tree (K-Dimensional Tree):
   - A space-partitioning data structure for organizing points in a k-dimensional space.
   - Builds a binary tree by recursively splitting the space along different dimensions.
   - Efficient for low-dimensional spaces (typically < 20 dimensions).
   - Average time complexity for search: O(log n), where n is the number of points.
   - Less effective in high-dimensional spaces due to the "curse of dimensionality", where search cost can degrade toward O(n).

   Example: In a 2D space, a KD-Tree might split the plane vertically, then horizontally, alternating at each level:
   ```
     y
     |
   4 | C
     | A   D
   2 |     B
     |__________
     0   2   4  x
   ```
   Points: A(1,3), B(3,2), C(1,4), D(3,3)
   Tree structure: Root (split at x=2) -> Left subtree {A, C} (split at y=3.5), Right subtree {B, D} (split at y=2.5)
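The recursive splitting and pruned search described for the KD-Tree can be sketched in plain Python. This is a minimal textbook variant that stores the median point at each node; the helper names `build_kdtree` and `nearest` and the four sample points are assumptions made for illustration, and library implementations such as those in SciPy or scikit-learn are organized differently.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a KD-tree as nested dicts, alternating the split axis."""
    if not points:
        return None
    axis = depth % len(points[0])            # x at even depths, y at odd depths (2D)
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # the median point becomes this node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) of the nearest neighbor, pruning far subtrees."""
    if node is None:
        return best
    dist = math.dist(node["point"], query)
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Cross the splitting plane only if a closer point could lie beyond it.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

points = [(1, 3), (3, 2), (1, 4), (3, 3)]    # illustrative points only
tree = build_kdtree(points)
print(nearest(tree, (2, 2)))
```

The pruning test `abs(diff) < best[0]` is what keeps the average search close to O(log n) in low dimensions: whole subtrees are skipped when the splitting plane is already farther away than the best match found so far.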

2. Ball Tree:
   - A binary tree data structure that partitions points into nested hyperspheres.
   - Each node represents a ball (hypersphere) containing a subset of the points.
   - More effective than KD-Tree for high-dimensional spaces.
   - Average time complexity for search: O(log n), but with higher constant factors than KD-Tree.
   - Generally performs better than KD-Tree when dimensions > 20.

   Example: In a 2D space, a Ball Tree might create nested circles:
   ```
     y
     |
   4 |(C)
     |(A) (D)
   2 |    (B)
     |__________
     0   2   4  x
   ```
   Outer circle contains all points, inner circles divide subsets.
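The nested-ball idea can be sketched as follows. Each node stores a center and a radius, and the search skips any ball whose closest possible point is already farther than the best match. The helper names `build_ball` and `nearest`, the median split along the widest dimension, and the sample points are assumptions made for this sketch, not the strategy of any particular library.

```python
import math

def build_ball(points):
    """Build a ball-tree node: a center, a radius, and up to two child balls."""
    dim = len(points[0])
    center = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
    radius = max(math.dist(center, p) for p in points)
    node = {"center": center, "radius": radius, "points": points, "children": []}
    if len(points) > 1:
        # Split at the median along the dimension with the largest spread.
        axis = max(range(dim),
                   key=lambda i: max(p[i] for p in points) - min(p[i] for p in points))
        ordered = sorted(points, key=lambda p: p[axis])
        mid = len(ordered) // 2
        node["children"] = [build_ball(ordered[:mid]), build_ball(ordered[mid:])]
    return node

def nearest(node, query, best=(math.inf, None)):
    """Exact nearest neighbor; skips balls that cannot contain a closer point."""
    # Triangle inequality: every point in the ball is at least this far away.
    lower_bound = max(0.0, math.dist(node["center"], query) - node["radius"])
    if lower_bound >= best[0]:
        return best
    if not node["children"]:                 # leaf ball holds exactly one point
        point = node["points"][0]
        dist = math.dist(point, query)
        return (dist, point) if dist < best[0] else best
    # Visit the closer child first so the bound tightens early.
    for child in sorted(node["children"], key=lambda c: math.dist(c["center"], query)):
        best = nearest(child, query, best)
    return best

points = [(1, 3), (3, 2), (1, 4), (3, 3)]    # illustrative points only
root = build_ball(points)
print(nearest(root, (2, 2)))
```

Because the bound depends only on distances to ball centers, it does not rely on axis-aligned splits, which is one reason ball trees tend to hold up better than KD-Trees as dimensionality grows.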

3. Full KNN (Brute Force):
   - Computes distances from the query point to every point in the dataset.
   - Simple to implement but computationally expensive for large datasets.
   - Time complexity: O(n * d) per query, where n is the number of points and d is the number of dimensions.
   - Becomes inefficient as the dataset size or dimensionality increases.
   - Guaranteed to find the exact nearest neighbors.

   Example: For a query point Q(2,2) and K=2:
   ```
     y
     |
   4 | C
     | A   D
   2 |---Q-B
     |__________
     0   2   4  x
   ```
   Points: A(1,3), B(3,2), C(1,4), D(3,3)
   Calculate distances: QA=1.41, QB=1.00, QC=2.24, QD=1.41
   Result: Nearest 2 neighbors are B and A (or D, which ties with A)
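The brute-force calculation is short enough to write out directly. The sketch below uses only the standard library; the function name `brute_force_knn` is hypothetical, and the coordinates are chosen as an assumption so that they reproduce the distances quoted in the example for Q(2,2).

```python
import heapq
import math

def brute_force_knn(points, query, k):
    """Return the k (distance, label) pairs closest to query, nearest first."""
    # One distance per stored point: O(n * d) work for every query.
    distances = [(math.dist(coords, query), label) for label, coords in points.items()]
    return heapq.nsmallest(k, distances)

points = {"A": (1, 3), "B": (3, 2), "C": (1, 4), "D": (3, 3)}
for dist, label in brute_force_knn(points, (2, 2), k=2):
    print(f"{label}: {dist:.2f}")
```

A and D are equally distant from Q, so which of the two is reported second depends on tie-breaking; here the tuple comparison falls back to the label and picks A.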

4. HNSW (Hierarchical Navigable Small World):
   - An approximate nearest neighbor search algorithm.
   - Builds a multi-layer graph structure for efficient navigation.
   - Provides a trade-off between search speed and accuracy.
   - Performs well in high-dimensional spaces and with large datasets.
   - Search cost scales approximately as O(log n) on average, and in high dimensions it is typically much lower than for tree-based methods.
   - Allows for faster searches by sacrificing some accuracy.

   Example: A simplified 2D representation of HNSW layers:
   ```
   Layer 2:  A --------- C
             |
   Layer 1:  A --- B --- C
             | \   | \   |
   Layer 0:  A --- B --- C --- D --- E
   ```
   Search starts at a fixed entry point in the top layer and descends,
   greedily exploring neighbors at each level until reaching the bottom layer.

The choice between these algorithms depends on the dataset size, dimensionality, required accuracy, and query speed.
KD-Tree and Ball Tree provide exact results and are efficient for low to moderate dimensions.
Full KNN is simple but becomes slow for large datasets.
HNSW offers a good balance between speed and accuracy, especially for high-dimensional data or large datasets.
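The layered descent that HNSW relies on can be illustrated with a hand-built graph. This is a deliberately simplified sketch: the coordinates, the adjacency lists, and the function name `greedy_search` are assumptions made for illustration, and the search keeps only a single current node. Real implementations such as hnswlib build the graph automatically and track a list of candidates in the bottom layer to return several neighbors.

```python
import math

# Hypothetical layout: five points on a line, purely for illustration.
coords = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (3, 0), "E": (4, 0)}

# One adjacency map per layer, top layer first. Upper layers hold fewer nodes.
layers = [
    {"A": ["C"], "C": ["A"]},                                    # layer 2
    {"A": ["B"], "B": ["A", "C"], "C": ["B"]},                   # layer 1
    {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
     "D": ["C", "E"], "E": ["D"]},                               # layer 0
]

def greedy_search(query, entry="A"):
    """Descend layer by layer, moving to the closest neighbor while it improves."""
    current = entry
    for graph in layers:
        while True:
            closest = min(graph[current], key=lambda n: math.dist(coords[n], query))
            if math.dist(coords[closest], query) >= math.dist(coords[current], query):
                break                        # local minimum in this layer
            current = closest
        # The node reached here becomes the entry point for the next layer down.
    return current

print(greedy_search((3.2, 0)))
```

The sparse top layer covers long distances in a few hops, and the dense bottom layer refines the result. Since each step only looks at local neighbors, the search can stop at a node that is close but not the closest, which is where the approximation comes from.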

## Installation

1. Clone this repository:
   ```
   git clone https://github.com/yourusername/knn-search-comparison.git
   cd knn-search-comparison
   ```

2. Create a virtual environment (optional but recommended):
   ```
   python -m venv venv
   source venv/bin/activate # On Windows, use `venv\Scripts\activate`
   ```

3. Install the required dependencies:
   ```
   pip install -r requirements.txt
   ```

This will install all necessary packages listed in the `requirements.txt` file, including numpy, scipy, scikit-learn, hnswlib, tabulate, and tqdm.

## Usage

To run the comparison tests with default parameters:

```
python app.py
```

You can also customize the test parameters using command-line arguments:

```
python app.py --vectors 1000 10000 100000 --dimensions 4 16 256 --num-tests 5 --k 5
```

Available arguments:
- `--vectors`: List of vector counts to test (default: 1000 2000 5000 10000 20000 50000 100000 200000)
- `--dimensions`: List of dimensions to test (default: 4 16 256 1024)
- `--num-tests`: Number of tests to run for each combination (default: 10)
- `--k`: Number of nearest neighbors to search for (default: 10)

The script will display a progress bar during execution, giving you an estimate of the remaining time.

The script can be interrupted at any time by pressing Ctrl+C. It will attempt to exit gracefully, even during time-consuming operations like building the HNSW index.
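For reference, the four command-line options listed above could be parsed with the standard `argparse` module roughly as follows. The contents of `app.py` are not shown here, so this is only a sketch of one way to accept those options; the function name `parse_args` and the help strings are assumptions, while the option names and defaults come from the argument list.

```python
import argparse

def parse_args(argv=None):
    """Parse the benchmark options described in the argument list."""
    parser = argparse.ArgumentParser(description="Compare KNN search algorithms")
    parser.add_argument("--vectors", type=int, nargs="+",
                        default=[1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000],
                        help="vector counts to test")
    parser.add_argument("--dimensions", type=int, nargs="+",
                        default=[4, 16, 256, 1024],
                        help="dimensions to test")
    parser.add_argument("--num-tests", type=int, default=10,
                        help="number of tests to run for each combination")
    parser.add_argument("--k", type=int, default=10,
                        help="number of nearest neighbors to search for")
    return parser.parse_args(argv)

# Passing an explicit list makes the parser easy to try without a shell.
args = parse_args(["--vectors", "1000", "10000", "--k", "5"])
print(args.vectors, args.dimensions, args.num_tests, args.k)
```

`nargs="+"` is what lets `--vectors` and `--dimensions` take several space-separated values, and `argparse` exposes `--num-tests` as the attribute `num_tests`.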

## Output

The script will display progress and results in the console. After completion, you'll see:

1. A summary of results for each combination of vector count and dimensions, including:
   - Build times for KD-Tree, Ball Tree, and HNSW index
   - Average search times for each algorithm
2. A table of all results
3. The location of the CSV file containing detailed results

Example output for a single combination:

```
Results for 10000 vectors with 256 dimensions:
KD-Tree build time: 0.123456 seconds
Ball Tree build time: 0.234567 seconds
HNSW build time: 0.345678 seconds
KD-Tree search time: 0.001234 seconds
Ball Tree search time: 0.002345 seconds
Brute Force search time: 0.012345 seconds
HNSW search time: 0.000123 seconds
```

The final results table and CSV file will include both build times and search times for each algorithm, allowing for a comprehensive comparison of performance across different vector counts and dimensions.
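Build and search times like these are typically collected with a wall-clock timer around each call. The sketch below shows one way to do that and to write a row of results as CSV. It uses a stand-in "index" (a sorted list) rather than a real KNN structure, and the helper names and CSV column names are assumptions, since the actual layout of the results file is not described here.

```python
import csv
import io
import time

def time_call(func, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

def average_search_time(search, queries):
    """Average wall-clock time of search(query) over all queries."""
    total = sum(time_call(search, q)[1] for q in queries)
    return total / len(queries)

# Stand-in "index": a sorted list scanned linearly (not a real KNN index).
data = list(range(10_000))
index, build_time = time_call(sorted, data)
search_time = average_search_time(
    lambda q: min(index, key=lambda v: abs(v - q)), [10, 500, 9_999])

# Hypothetical column names, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["vectors", "build_time", "search_time"])
writer.writeheader()
writer.writerow({"vectors": len(data),
                 "build_time": f"{build_time:.6f}",
                 "search_time": f"{search_time:.6f}"})
print(buffer.getvalue())
```

Averaging over several queries, as the `--num-tests` option suggests, smooths out noise from individual measurements, which matters most for the sub-millisecond search times.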

## Customization

You can modify the following variables in `app.py` to adjust the test parameters:

- `NUM_VECTORS_LIST`: List of vector counts to test
- `NUM_DIMENSIONS_LIST`: List of dimensions to test
- `NUM_TESTS`: Number of tests to run for each combination
- `K`: Number of nearest neighbors to search for

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is open source and available under the [MIT License](LICENSE).
