# KNN Search Algorithm Comparison

This project compares the performance of different K-Nearest Neighbors (KNN) search algorithms across various dataset sizes and dimensions. The algorithms compared are:

1. KD-Tree
2. Ball Tree
3. Brute Force (Full KNN)
4. HNSW (Hierarchical Navigable Small World)

## Algorithm Explanations

1. KD-Tree (K-Dimensional Tree):
   - A space-partitioning data structure for organizing points in a k-dimensional space.
   - Builds a binary tree by recursively splitting the space along different dimensions.
   - Efficient in low-dimensional spaces (typically < 20 dimensions).
   - Average search time complexity: O(log n), where n is the number of points.
   - Less effective in high-dimensional spaces due to the "curse of dimensionality".
   Example: In a 2D space, a KD-Tree might split the plane vertically, then horizontally, alternating at each level:
   ```
    y
    |
  4 |
  3 |  A   D C
  2 |
  1 |      B
    |___________
     0 1 2 3 4  x
   ```
   Points: A(1,3), B(3,1), C(4,3), D(3,3)
   Possible tree structure: the root splits on x=2 (A to the left; B, C, D to the right); the right subtree splits on y=2 (B below; C and D above), then on x=3.

2. Ball Tree:
   - A binary tree data structure that partitions points into nested hyperspheres.
   - Each node represents a ball (hypersphere) containing a subset of the points.
   - More effective than a KD-Tree in high-dimensional spaces.
   - Average search time complexity: O(log n), but with higher constant factors than a KD-Tree.
   - Generally outperforms a KD-Tree when the number of dimensions exceeds about 20.
   Example: In a 2D space, a Ball Tree might create nested circles:
   ```
    y
    |
  4 |
  3 | (A)  (D)(C)
  2 |
  1 |      (B)
    |___________
     0 1 2 3 4  x
   ```
   The outer ball contains all points; inner balls partition the subsets.

3. Full KNN (Brute Force):
   - Computes distances from the query point to all other points in the dataset.
   - Simple to implement but computationally expensive for large datasets.
   - Time complexity per query: O(n * d), where n is the number of points and d is the number of dimensions.
   - Becomes inefficient as the dataset size or dimensionality increases.
   - Guaranteed to find the exact nearest neighbors.
   Example: For a query point Q(2,2) and K=2:
   ```
    y
    |
  4 |
  3 |  A   D C
  2 |    Q
  1 |      B
    |___________
     0 1 2 3 4  x
   ```
   Distances: QA ≈ 1.41, QB ≈ 1.41, QC ≈ 2.24, QD ≈ 1.41
   Result: A, B, and D all tie at distance √2 ≈ 1.41, so the 2 nearest neighbors are any two of them; C is excluded.

4. HNSW (Hierarchical Navigable Small World):
   - An approximate nearest neighbor search algorithm.
   - Builds a multi-layer graph structure for efficient navigation.
   - Provides a trade-off between search speed and accuracy.
   - Performs well in high-dimensional spaces and on large datasets.
   - Average search time complexity: O(log n), with better constants than tree-based methods.
   - Trades a small amount of accuracy for much faster searches (see the sketch after this list).
   Example: A simplified 2D representation of HNSW layers:
   ```
   Layer 2: A --------- C
            |           |
   Layer 1: A --- B --- C
            |  \  |  \  |
   Layer 0: A --- B --- C --- D --- E
   ```
   Search starts at an entry point in the top layer and descends layer by layer,
   greedily exploring neighbors at each level until it reaches the bottom.

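
To make the HNSW description concrete, here is a minimal, self-contained sketch using the `hnswlib` package from `requirements.txt`. The data is random, and the parameter values (`M`, `ef_construction`, `ef`) are illustrative assumptions, not the settings used by `app.py`:

```
import numpy as np
import hnswlib

rng = np.random.default_rng(0)
data = rng.random((10_000, 64)).astype(np.float32)  # 10k vectors, 64 dimensions

# Build the multi-layer graph. M controls graph connectivity and
# ef_construction controls build-time quality; both trade build time for recall.
index = hnswlib.Index(space="l2", dim=64)
index.init_index(max_elements=data.shape[0], ef_construction=200, M=16)
index.add_items(data)

# ef is the search-time speed/accuracy knob; it must be at least k.
index.set_ef(50)
labels, distances = index.knn_query(data[:1], k=10)  # "l2" returns squared distances
print(labels[0])  # indices of the (approximate) 10 nearest neighbors
```

Raising `ef` (and, at build time, `M` or `ef_construction`) improves recall at the cost of speed, which is exactly the trade-off described above.
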
The choice between these algorithms depends on dataset size, dimensionality, required accuracy, and query speed.
KD-Tree and Ball Tree return exact results and are efficient in low to moderate dimensions.
Full KNN is simple but becomes slow on large datasets.
HNSW offers a good balance between speed and accuracy, especially for high-dimensional data or large datasets; the scikit-learn sketch below shows the three exact methods side by side.
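
For the three exact methods, scikit-learn (also in `requirements.txt`) provides directly comparable interfaces. This is a minimal sketch on random data, not the benchmarking code in `app.py`:

```
import numpy as np
from sklearn.neighbors import KDTree, BallTree, NearestNeighbors

rng = np.random.default_rng(0)
data = rng.random((5_000, 16))  # 5k vectors, 16 dimensions
query = data[:1]
k = 10

# KD-Tree and Ball Tree: build an index once, then answer queries from it.
kd_dist, kd_idx = KDTree(data).query(query, k=k)
ball_dist, ball_idx = BallTree(data).query(query, k=k)

# Brute force: no index to build; every query scans all n points.
brute = NearestNeighbors(n_neighbors=k, algorithm="brute").fit(data)
brute_dist, brute_idx = brute.kneighbors(query)

# All three are exact, so they agree on the neighbor set.
assert set(kd_idx[0]) == set(ball_idx[0]) == set(brute_idx[0])
```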

## Installation

1. Clone this repository:
   ```
   git clone https://github.com/yourusername/knn-search-comparison.git
   cd knn-search-comparison
   ```

2. Create a virtual environment (optional but recommended):
   ```
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

3. Install the required dependencies:
   ```
   pip install -r requirements.txt
   ```

   This installs all packages listed in `requirements.txt`, including numpy, scipy, scikit-learn, hnswlib, tabulate, and tqdm.
## Usage

To run the comparison tests with default parameters:

```
python app.py
```

You can also customize the test parameters using command-line arguments:

```
python app.py --vectors 1000 10000 100000 --dimensions 4 16 256 --num-tests 5 --k 5
```

Available arguments:
- `--vectors`: List of vector counts to test (default: 1000 2000 5000 10000 20000 50000 100000 200000)
- `--dimensions`: List of dimensions to test (default: 4 16 256 1024)
- `--num-tests`: Number of tests to run for each combination (default: 10)
- `--k`: Number of nearest neighbors to search for (default: 10)
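
For reference, the flags above could be wired up with `argparse` along these lines; this is a hypothetical reconstruction, and the actual definitions in `app.py` may differ:

```
import argparse

parser = argparse.ArgumentParser(description="Compare KNN search algorithms.")
parser.add_argument("--vectors", type=int, nargs="+",
                    default=[1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000],
                    help="List of vector counts to test")
parser.add_argument("--dimensions", type=int, nargs="+",
                    default=[4, 16, 256, 1024],
                    help="List of dimensions to test")
parser.add_argument("--num-tests", type=int, default=10,
                    help="Number of tests per combination")
parser.add_argument("--k", type=int, default=10,
                    help="Number of nearest neighbors to search for")
args = parser.parse_args()  # e.g. args.vectors, args.dimensions, args.num_tests, args.k
```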

The script displays a progress bar during execution, giving you an estimate of the remaining time.

The script can be interrupted at any time with Ctrl+C. It will attempt to exit gracefully, even during time-consuming operations such as building the HNSW index.
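
A minimal pattern for combining a `tqdm` progress bar with graceful Ctrl+C handling looks like the sketch below; it is illustrative, not the exact logic in `app.py`:

```
import time
from tqdm import tqdm

combinations = [(n, d) for n in (1000, 10000) for d in (4, 16)]
try:
    for n_vectors, n_dims in tqdm(combinations, desc="Running benchmarks"):
        time.sleep(0.1)  # stand-in for building indexes and timing searches
except KeyboardInterrupt:
    print("\nInterrupted; exiting gracefully.")
```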

## Output

The script displays progress and results in the console. After completion, you'll see:

1. A summary of results for each combination of vector count and dimensions, including:
   - Build times for the KD-Tree, Ball Tree, and HNSW index
   - Average search times for each algorithm
2. A table of all results
3. The location of the CSV file containing detailed results

Example output for a single combination:

```
Results for 10000 vectors with 256 dimensions:
KD-Tree build time: 0.123456 seconds
Ball Tree build time: 0.234567 seconds
HNSW build time: 0.345678 seconds
KD-Tree search time: 0.001234 seconds
Ball Tree search time: 0.002345 seconds
Brute Force search time: 0.012345 seconds
HNSW search time: 0.000123 seconds
```

The final results table and CSV file include both build times and search times for each algorithm, allowing a comprehensive comparison of performance across vector counts and dimensions.
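
To analyze the detailed results yourself, the CSV can be loaded with the standard library. The filename and column names here are assumptions; check the path printed by the script and the file's header row:

```
import csv

with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row)  # one dict per (vector count, dimensions) combination
```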

## Customization

You can modify the following variables in `app.py` to adjust the test parameters:

- `NUM_VECTORS_LIST`: List of vector counts to test
- `NUM_DIMENSIONS_LIST`: List of dimensions to test
- `NUM_TESTS`: Number of tests to run for each combination
- `K`: Number of nearest neighbors to search for
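
For example, near the top of `app.py` you might set (the values here are illustrative):

```
NUM_VECTORS_LIST = [1000, 10000, 100000]  # vector counts to test
NUM_DIMENSIONS_LIST = [4, 16, 256]        # dimensionalities to test
NUM_TESTS = 5                             # repetitions per combination
K = 5                                     # neighbors per query
```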

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is open source and available under the [MIT License](LICENSE).