Add: Add index_gt::merge() #572
base: main
Conversation
This PR focuses only on `index_gt`; `index_dense_gt` is out of scope. If we reach consensus on the implementation approach, we'll be able to implement `index_dense_gt::merge()` too. This adds a mutable `memory_mapped_file_t`, and you can create a mutable memory-mapped index with it. You can merge multiple indexes into the mutable memory-mapped index without allocating all the data in RAM.
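For context, here is a minimal usage sketch of what the proposed workflow could look like. The default template arguments, the writable-file construction, and the exact `view()`/`merge()` wiring are my assumptions for illustration, not necessarily the code in this PR.

```cpp
#include <usearch/index.hpp>

using namespace unum::usearch;

int main() {
    // Assumption: a memory_mapped_file_t can be opened as writable and grown on demand.
    memory_mapped_file_t file("merged.usearch"); // hypothetical: opened in writable mode

    // Target index whose nodes live in the memory-mapped file instead of RAM.
    index_gt<> merged;
    merged.view(std::move(file)); // mutable memory-mapped index, per my reading of this PR

    // Independently built source indexes.
    index_gt<> part_a, part_b;
    // ... add vectors to part_a and part_b ...

    // Merge both parts into the memory-mapped index without holding everything in RAM.
    merged.merge(part_a);
    merged.merge(part_b);
    return 0;
}
```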
@@ -3272,7 +3449,6 @@ class index_gt {

// We are loading an empty index, no more work to do
if (!header.size) {
    reset();
This is not related to this PR, but I added it here because I need a similar change in `view()`. If this should be a separate PR, I'll open one.

We don't need to call `reset()` here because we don't change anything after the `reset()` above.
serialization_result_t stream_result = load_from_stream(
    [&](void* buffer, std::size_t length) {
        if (offset + length > file.size())
            return false;
        std::memcpy(buffer, file.data() + offset, length);
        offset += length;
        return true;
    },
    std::forward<progress_at>(progress));

return stream_result;
is_mutable_ = true;
return {};
} else {
    serialization_result_t io_result = file.open_if_not();
    if (!io_result)
        return io_result;

    serialization_result_t stream_result = load_from_stream(
        [&](void* buffer, std::size_t length) {
            if (offset + length > file.size())
                return false;
            std::memcpy(buffer, file.data() + offset, length);
            offset += length;
            return true;
        },
        std::forward<progress_at>(progress));
    return stream_result;
}
The indentation is only changed because this code was moved into the `else` branch.
if (!header.size) {
    reset();
We don't need to call `reset()` here because we don't change anything after the `reset()` above.
include/usearch/index.hpp (Outdated)
// Add all values in `index` to this index.
add_result_t merge_result;
for (const auto& member : index) {
    auto& value = get_value(member);
If we want to filter the target values as suggested in #84 (comment), we can do it here.
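For illustration, here is a sketch of what such filtering could look like at this point in the loop. The `filter` predicate, the `merge_if` name, and the elided insertion step are hypothetical, not part of this PR.

```cpp
// Hypothetical sketch: skip members of the source index that a caller-supplied
// predicate rejects, as suggested in #84. All names here are for illustration only.
template <typename other_index_at, typename predicate_at>
add_result_t merge_if(other_index_at const& index, predicate_at&& filter) {
    add_result_t merge_result;
    for (auto const& member : index) {
        auto const& value = get_value(member);
        if (!filter(member.key, value))
            continue; // filtered out: don't copy this member into the target index
        // ... same insertion path as the unfiltered merge() ...
    }
    return merge_result;
}
```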
Thank you, @kou! That's a good start! I think our key goal for the "merge" operation is to reduce the asymptotic complexity from ~$O(N \log N)$ to something closer to …
OK. Should we work on the key goal in this PR? Or can we work on it as a follow-up task? For …
@ashvardanian Do you have any preference on this? I prefer the follow-up approach. If we use the follow-up approach, we can focus primarily on mutable …
It's important to nail the algorithm before making other changes. Would you like to try and design it?
@kou, I think it's a valid idea! Once implemented, we can benchmark it on various datasets, and if recall/performance is preserved, merge!
OK. I'll implement it!
@ashvardanian I've implemented the idea. Do you have any idea how to benchmark it? (I'll update the PR description after we confirm that the idea works well.)
@kou, I think it's worth testing the idea in two modes: balanced and unbalanced. In the balanced case, we should take a large dataset, split it into halves, construct the indexes separately, and then measure the merge time. Then we should also build a solid index at once and compare both the recall and throughput curves. For the unbalanced setup we should do the same, but probably with more of a 10:1 size distribution. Does this make sense?
It makes sense. Do you have a suggested dataset for this?
BENCHMARKS.md links several datasets worth trying. As for code, yes, we can extend the existing suite.
@ashvardanian Thanks. I've added support for benchmarking it. Here is a summary of the benchmark results with the "Unum UForm Wiki" dataset:
It seems that splitting into halves isn't bad (less indexing time and only a small recall loss), but splitting into 10 chunks is bad (less indexing time but too much recall loss) with this idea. Here are the details:

index_gt (no merge) (this is the base score)

$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin --index-gt-api
- Dataset:
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index:
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark in-memory
------------
Indexing. 1000000 items
100 % completed, 2466 vectors/s, 405.5 s
Search 10. 100000 items
100 % completed, 3628 vectors/s, 27.6 s
Recall@1 98.79 %
Recall 98.84 %
Will benchmark an on-disk view
------------
Search 10. 100000 items
100 % completed, 3567 vectors/s, 28.0 s
Recall@1 98.79 %
Recall 98.84 %

index_gt (chunks to merge: 2)

$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin --index-gt-api --chunks-to-merge 2
- Dataset:
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index:
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark merge: 2
------------
Indexing. 500000 items
100 % completed, 2983 vectors/s, 167.6 s
Indexing. 500000 items
100 % completed, 2955 vectors/s, 169.2 s
Merging. 1000000 items
100 % completed, 43323 vectors/s, 23.1 s
------------
Search 10. 100000 items
100 % completed, 3314 vectors/s, 30.2 s
Recall@1 97.21 %
Recall 97.26 %

index_gt (chunks to merge: 10)

$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin --index-gt-api --chunks-to-merge 10
- Dataset:
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index:
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark merge: 10
------------
Indexing. 100000 items
100 % completed, 5074 vectors/s, 19.7 s
Indexing. 100000 items
100 % completed, 5066 vectors/s, 19.7 s
Indexing. 100000 items
100 % completed, 4969 vectors/s, 20.1 s
Indexing. 100000 items
100 % completed, 4845 vectors/s, 20.6 s
Indexing. 100000 items
100 % completed, 4992 vectors/s, 20.0 s
Indexing. 100000 items
100 % completed, 5076 vectors/s, 19.7 s
Indexing. 100000 items
100 % completed, 5093 vectors/s, 19.6 s
Indexing. 100000 items
100 % completed, 5141 vectors/s, 19.5 s
Indexing. 100000 items
100 % completed, 5073 vectors/s, 19.7 s
Indexing. 100000 items
100 % completed, 4819 vectors/s, 20.7 s
Merging. 1000000 items
100 % completed, 25104 vectors/s, 39.8 s
------------
Search 10. 100000 items
100 % completed, 3088 vectors/s, 32.4 s
Recall@1 73.95 %
Recall 73.99 %

index_dense_gt (just for reference)

$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin
- Dataset:
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index:
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark in-memory
------------
Indexing. 1000000 items
100 % completed, 2481 vectors/s, 403.1 s
Search 10. 100000 items
100 % completed, 3531 vectors/s, 28.3 s
Recall@1 98.79 %
Recall 98.84 %
Join. 1 items
100 % completed, 0 vectors/s, 41.6 s
Recall Joins 98.78 %
Unmatched 0.94 % (9351 items)
Proposals 94.60 / man (94604263 total)
------------
Will benchmark an on-disk view
------------
Search 10. 100000 items
100 % completed, 3644 vectors/s, 27.4 s
Recall@1 98.79 %
Recall 98.84 %
Join. 1 items
100 % completed, 0 vectors/s, 42.6 s
Recall Joins 98.78 %
Unmatched 0.94 % (9351 items)
Proposals 94.60 / man (94604263 total)
------------
FYI: With OpenMP:
The merge cases aren't faster than the no-merge case with OpenMP because the merge phase isn't multithreaded.

index_gt (no merge) (this is the base score)

$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin --index-gt-api
- OpenMP threads: 24
- Dataset:
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index:
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark in-memory
------------
Indexing. 1000000 items
100 % completed, 18529 vectors/s, 54.0 s
Search 10. 100000 items
100 % completed, 27345 vectors/s, 3.7 s
Recall@1 97.75 %
Recall 97.79 %

index_gt (chunks to merge: 2)

$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin --index-gt-api --chunks-to-merge 2
- OpenMP threads: 24
- Dataset:
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index:
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark merge: 2
------------
Indexing. 500000 items
100 % completed, 21720 vectors/s, 23.0 s
Indexing. 500000 items
100 % completed, 22820 vectors/s, 21.9 s
Merging. 1000000 items
100 % completed, 47776 vectors/s, 20.9 s
------------
Search 10. 100000 items
100 % completed, 24934 vectors/s, 4.0 s
Recall@1 95.44 %
Recall 95.49 %

index_gt (chunks to merge: 10)

$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin --index-gt-api --chunks-to-merge 10
- OpenMP threads: 24
- Dataset:
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index:
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark merge: 10
------------
Indexing. 100000 items
100 % completed, 37969 vectors/s, 2.6 s
Indexing. 100000 items
100 % completed, 36041 vectors/s, 2.8 s
Indexing. 100000 items
100 % completed, 33451 vectors/s, 3.0 s
Indexing. 100000 items
100 % completed, 34187 vectors/s, 2.9 s
Indexing. 100000 items
100 % completed, 36278 vectors/s, 2.8 s
Indexing. 100000 items
100 % completed, 34528 vectors/s, 2.9 s
Indexing. 100000 items
100 % completed, 36183 vectors/s, 2.8 s
Indexing. 100000 items
100 % completed, 36425 vectors/s, 2.7 s
Indexing. 100000 items
100 % completed, 35838 vectors/s, 2.8 s
Indexing. 100000 items
100 % completed, 36505 vectors/s, 2.7 s
Merging. 1000000 items
100 % completed, 24843 vectors/s, 40.3 s
------------
Search 10. 100000 items
100 % completed, 21528 vectors/s, 4.6 s
Recall@1 74.34 %
Recall 74.37 %

index_dense_gt (just for reference)

$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin
- OpenMP threads: 24
- Dataset:
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index:
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark in-memory
------------
Indexing. 1000000 items
100 % completed, 19388 vectors/s, 51.6 s
Search 10. 100000 items
100 % completed, 26497 vectors/s, 3.8 s
Recall@1 98.05 %
Recall 98.10 %
Join. 1 items
100 % completed, 0 vectors/s, 43.7 s
Recall Joins 98.00 %
Unmatched 1.69 % (16939 items)
Proposals 104.84 / man (104844832 total)
------------
Will benchmark an on-disk view
------------
Search 10. 100000 items
100 % completed, 27072 vectors/s, 3.7 s
Recall@1 98.05 %
Recall 98.10 %
Join. 1 items
100 % completed, 0 vectors/s, 39.5 s
Recall Joins 98.00 %
Unmatched 1.69 % (16939 items)
Proposals 104.84 / man (104844832 total)
------------
@ashvardanian What do you think about the numbers? How should we proceed from here?
Hi @kou! Thanks for sharing the results! The accuracy drop for 10 chunks is too significant. Seems like we need a better algorithm. Do you have ideas on how to improve it?
@ashvardanian I think the accuracy drop is caused by dropping neighbors when we merge neighbor lists (the first part of this idea, #572 (comment)). We have a limited neighbor space that is configured by …. How about re-computing neighbors only when we need to drop N (…
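To make that proposal concrete, here is a rough sketch of the heuristic as I read the (truncated) comment: concatenate the two neighbor lists when they fit, and only re-run neighbor selection when entries would have to be dropped. The `neighbors_ref_t` usage and the `append_neighbors`/`reselect_neighbors` helpers are hypothetical, not functions from this PR.

```cpp
// Rough sketch of the proposed heuristic; all names are hypothetical and for
// illustration only.
void merge_neighbor_lists(neighbors_ref_t target, neighbors_ref_t source, std::size_t max_neighbors) {
    std::size_t combined = target.size() + source.size();
    if (combined <= max_neighbors)
        append_neighbors(target, source);                   // nothing is dropped, no recall impact
    else
        reselect_neighbors(target, source, max_neighbors);  // re-compute which neighbors to keep
}
```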
#84
This focuses only on `index_gt`. `index_dense_gt` is out of scope of this PR. If we reach consensus on the implementation approach, we'll be able to implement `index_dense_gt::merge()` too.

This adds a mutable `memory_mapped_file_t`, and you can create a mutable memory-mapped index with it. You can merge multiple indexes into the mutable memory-mapped index without allocating all the data in RAM.

`index_gt::merge()` does just the following: …

The important changes in this PR are the following:

- Mutable `memory_mapped_file_t`
- Mutable `index_gt` with `memory_mapped_file_t`

For the former:

- `memory_mapped_file_t::is_writable` and `memory_mapped_file_t::reserve()` are added.
- `memory_mapped_file_t::open_if_not()` opens a file in write mode when `is_writable`.

For the latter, there are some changes in the following methods:

- `index_gt::reset()`: write the header and levels before the memory-mapped file is closed
- `index_gt::reserve()`: extend the memory-mapped file and adjust `nodes_` after it
- `index_gt::node_malloc_()`: allocate memory from the memory-mapped file
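As a summary, here is a minimal sketch of the shape of these additions as I understand them from the description above; the member names follow the PR description, but the exact signatures and return types are assumptions rather than the code in the PR.

```cpp
// Sketch only: names follow the PR description, signatures are guesses.
class memory_mapped_file_t {
  public:
    bool is_writable() const noexcept;                  // new: whether the file was opened for writing
    serialization_result_t reserve(std::size_t bytes);  // new: grow the on-disk mapping
    serialization_result_t open_if_not();               // opens in write mode when writable
    // ... existing members unchanged ...
};

template <typename... args_at> class index_gt {
  public:
    // New in this PR: add every member of `other` into this (possibly memory-mapped) index.
    template <typename other_index_at, typename progress_at = dummy_progress_t>
    add_result_t merge(other_index_at const& other, progress_at&& progress = {});

    // Changed behavior: reset() writes the header and levels before closing the mapped
    // file, reserve() extends the mapped file and adjusts nodes_, and node_malloc_()
    // allocates node memory from the mapped file.
};
```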