Add: Add index_gt::merge() #572

Open

wants to merge 3 commits into main
Conversation

@kou commented Mar 4, 2025

#84

This focuses only on index_gt; index_dense_gt is out of scope for this PR. If we reach consensus on the implementation approach, we will be able to implement index_dense_gt::merge() too.

This adds a mutable memory_mapped_file_t, and with it you can create a mutable memory-mapped index. You can merge multiple indexes into the mutable memory-mapped index without allocating all the data in RAM.

index_gt::merge() does just the following:

  1. Reserve enough size
  2. Add all values in the source index to the destination index

The important changes in this PR are the following:

  • Mutable memory_mapped_file_t
  • Mutable index_gt with memory_mapped_file_t

For the former: memory_mapped_file_t::is_writable and memory_mapped_file_t::reserve() are added. memory_mapped_file_t::open_if_not() opens the file in write mode when is_writable is set.

For the latter: there are changes in the following methods:

  • index_gt::reset(): Write the header and levels before the memory-mapped file is closed
  • index_gt::reserve(): Extend the memory-mapped file and adjust nodes_ after it
  • index_gt::node_malloc_(): Allocate memory from the memory-mapped file

@@ -3272,7 +3449,6 @@ class index_gt {

// We are loading an empty index, no more work to do
if (!header.size) {
reset();

This is not directly related to this PR, but I included it here because I need a similar change in view(). If this should be a separate PR, I'll open one.

We don't need to call reset() here because we don't change anything after the above reset().

Comment on lines -3423 to +3619
serialization_result_t stream_result = load_from_stream(
[&](void* buffer, std::size_t length) {
if (offset + length > file.size())
return false;
std::memcpy(buffer, file.data() + offset, length);
offset += length;
return true;
},
std::forward<progress_at>(progress));

return stream_result;
is_mutable_ = true;
return {};
} else {
serialization_result_t io_result = file.open_if_not();
if (!io_result)
return io_result;

serialization_result_t stream_result = load_from_stream(
[&](void* buffer, std::size_t length) {
if (offset + length > file.size())
return false;
std::memcpy(buffer, file.data() + offset, length);
offset += length;
return true;
},
std::forward<progress_at>(progress));
return stream_result;
}

Only the indentation changed, because this block moved into the else branch.

if (!header.size) {
reset();

We don't need to call reset() because we don't change anything after the above reset().

@kou kou mentioned this pull request Mar 4, 2025
// Add all values in `index` to this index.
add_result_t merge_result;
for (const auto& member : index) {
auto& value = get_value(member);

If we want to filter the target values as suggested in #84 (comment) , we can do it here.

@ashvardanian

Thank you, @kou! That's a good start! I think our key goal for the "merge" operation is to reduce the asymptotic complexity from ~$O(N \log N)$ to something closer to $O(N)$.

@kou commented Mar 5, 2025

OK. Should we work on the key goal in this PR? Or can we work on it as a follow-up task?

For $O(N)$, we may need to merge two HNSW graphs directly instead of adding values from one index_gt to another. I'm not sure whether we can keep the HNSW efficient that way... Do you have any idea how to implement it, or do you know of any prior work? I haven't read it carefully, but I found an open issue in apache/lucene: apache/lucene#12440

@kou commented Mar 11, 2025

Should we work on the key goal in this PR? Or can we work on it as a follow-up task?

@ashvardanian Do you have any preference on this? I prefer the follow-up approach: then this PR can focus primarily on the mutable memory_mapped_file_t, and a follow-up PR can focus on efficient merging.

@ashvardanian

It's important to nail the algorithm before making other changes. Would you like to try and design it?

@kou commented Mar 14, 2025

@ashvardanian It makes sense.

Here are the "merge graphs" features in other HNSW-related products:

It seems that there is no widely used graph-merging algorithm.

Here is my idea:

A: base graph
B: graph being merged in

  1. Add the B nodes that exist in layers above level 0 to A in the normal way, but keep their existing connections in B (they end up connected to nodes in both A and B)
  2. Add the remaining B nodes (those that exist only in the level-0 layer) to A without changing their connections (they stay connected only to nodes in B)

This will reduce add costs. If there are enough nodes in the layers above level 0, it may be able to preserve recall/quality...?

What do you think about this research result?

@ashvardanian commented Mar 14, 2025

@kou, I think it's a valid idea! Once implemented, we can benchmark it on various datasets, and if recall/performance is preserved, merge!

@kou commented Mar 17, 2025

OK. I'll implement it!

@kou commented Mar 18, 2025

@ashvardanian I've implemented the idea. Do you have any idea how to benchmark the idea?

(I'll update the PR description after we find that the idea works well.)

@ashvardanian

@kou, I think it's worth testing the idea in two modes: balanced and unbalanced. In the balanced case, we should take a large dataset, split it into halves, construct the indexes separately, and then measure the merge time. Then we should also build a solid index at once and compare both the recall and throughput curves. For the unbalanced setup we should do the same, but probably with more of a 10:1 size distribution. Does this make sense?

@kou commented Mar 18, 2025

It makes sense.

Do you have a suggested dataset for this?
And do you think that we can reuse the existing https://github.com/unum-cloud/usearch/blob/main/cpp/bench.cpp for this? Or should we create a new benchmark tool?

@ashvardanian

BENCHMARKS.md links several datasets worth trying. As for code, yes, we can extend the existing suite.

@kou commented Mar 20, 2025

@ashvardanian Thanks. I've added support for benchmarking index_gt::merge() to the existing bench.cpp.

Here is a summary of the benchmark results with the "Unum UForm Wiki" dataset:

| Target | Indexing | Indexing diff | Searching | Searching diff | Recall@1 | Recall@1 diff | Recall | Recall diff |
|---|---|---|---|---|---|---|---|---|
| index_gt (no merge; base score) | 405.5s | 0.0s | 27.6s | 0.0s | 98.79% | 0.0% | 98.84% | 0.0% |
| index_gt (chunks to merge: 2) | 359.9s (336.8s build + 23.1s merge) | -45.6s | 30.2s | +2.6s | 97.21% | -1.58% | 97.26% | -1.58% |
| index_gt (chunks to merge: 10) | 239.1s (199.3s build + 39.8s merge) | -166.4s | 32.4s | +4.8s | 73.95% | -24.84% | 73.99% | -24.85% |
| index_dense_gt (just for reference) | 403.1s | -2.4s | 28.3s | +0.7s | 98.78% | -0.01% | 98.84% | 0.0% |

It seems that splitting into halves isn't bad (less indexing time and a small recall loss), but splitting into 10 chunks is bad (less indexing time but too much recall loss) with this idea.

Here are the details:

index_gt (no merge) (this is the base score)
$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin --index-gt-api
- Dataset: 
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index: 
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark in-memory

------------
Indexing. 1000000 items
100 % completed, 2466 vectors/s, 405.5 s
Search 10. 100000 items
100 % completed, 3628 vectors/s, 27.6 s
Recall@1 98.79 %
Recall 98.84 %
Will benchmark an on-disk view

------------
Search 10. 100000 items
100 % completed, 3567 vectors/s, 28.0 s
Recall@1 98.79 %
Recall 98.84 %
index_gt (chunks to merge: 2)
$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin --index-gt-api --chunks-to-merge 2
- Dataset: 
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index: 
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark merge: 2

------------
Indexing. 500000 items
100 % completed, 2983 vectors/s, 167.6 s
Indexing. 500000 items
100 % completed, 2955 vectors/s, 169.2 s
Merging. 1000000 items
100 % completed, 43323 vectors/s, 23.1 s

------------
Search 10. 100000 items
100 % completed, 3314 vectors/s, 30.2 s
Recall@1 97.21 %
Recall 97.26 %
index_gt (chunks to merge: 10)
$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin --index-gt-api --chunks-to-merge 10
- Dataset: 
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index: 
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark merge: 10

------------
Indexing. 100000 items
100 % completed, 5074 vectors/s, 19.7 s
Indexing. 100000 items
100 % completed, 5066 vectors/s, 19.7 s
Indexing. 100000 items
100 % completed, 4969 vectors/s, 20.1 s
Indexing. 100000 items
100 % completed, 4845 vectors/s, 20.6 s
Indexing. 100000 items
100 % completed, 4992 vectors/s, 20.0 s
Indexing. 100000 items
100 % completed, 5076 vectors/s, 19.7 s
Indexing. 100000 items
100 % completed, 5093 vectors/s, 19.6 s
Indexing. 100000 items
100 % completed, 5141 vectors/s, 19.5 s
Indexing. 100000 items
100 % completed, 5073 vectors/s, 19.7 s
Indexing. 100000 items
100 % completed, 4819 vectors/s, 20.7 s
Merging. 1000000 items
100 % completed, 25104 vectors/s, 39.8 s

------------
Search 10. 100000 items
100 % completed, 3088 vectors/s, 32.4 s
Recall@1 73.95 %
Recall 73.99 %
index_dense_gt (just for reference)
$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin
- Dataset: 
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index: 
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark in-memory

------------
Indexing. 1000000 items
100 % completed, 2481 vectors/s, 403.1 s
Search 10. 100000 items
100 % completed, 3531 vectors/s, 28.3 s
Recall@1 98.79 %
Recall 98.84 %
Join. 1 items
100 % completed, 0 vectors/s, 41.6 s
Recall Joins 98.78 %
Unmatched 0.94 % (9351 items)
Proposals 94.60 / man (94604263 total)
------------

Will benchmark an on-disk view

------------
Search 10. 100000 items
100 % completed, 3644 vectors/s, 27.4 s
Recall@1 98.79 %
Recall 98.84 %
Join. 1 items
100 % completed, 0 vectors/s, 42.6 s
Recall Joins 98.78 %
Unmatched 0.94 % (9351 items)
Proposals 94.60 / man (94604263 total)
------------

@kou commented Mar 20, 2025

FYI: With OpenMP:

| Target | Indexing | Indexing diff | Searching | Searching diff | Recall@1 | Recall@1 diff | Recall | Recall diff |
|---|---|---|---|---|---|---|---|---|
| index_gt (no merge; base score) | 54.0s | 0.0s | 3.7s | 0.0s | 97.75% | 0.0% | 97.79% | 0.0% |
| index_gt (chunks to merge: 2) | 65.8s (44.9s build + 20.9s merge) | +11.8s | 4.0s | +0.3s | 95.44% | -2.31% | 95.49% | -2.30% |
| index_gt (chunks to merge: 10) | 68.3s (28.0s build + 40.3s merge) | +14.3s | 4.6s | +0.9s | 74.34% | -23.41% | 74.37% | -23.42% |
| index_dense_gt (just for reference) | 51.6s | -2.4s | 3.8s | +0.1s | 98.05% | +0.3% | 98.00% | +0.21% |

The merge cases aren't faster than the no-merge case with OpenMP because the merge phase isn't multithreaded.

index_gt (no merge) (this is the base score)
$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin --index-gt-api
- OpenMP threads: 24
- Dataset: 
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index: 
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark in-memory

------------
Indexing. 1000000 items
100 % completed, 18529 vectors/s, 54.0 s
Search 10. 100000 items
100 % completed, 27345 vectors/s, 3.7 s
Recall@1 97.75 %
Recall 97.79 %
index_gt (chunks to merge: 2)
$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin --index-gt-api --chunks-to-merge 2
- OpenMP threads: 24
- Dataset: 
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index: 
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark merge: 2

------------
Indexing. 500000 items
100 % completed, 21720 vectors/s, 23.0 s
Indexing. 500000 items
100 % completed, 22820 vectors/s, 21.9 s
Merging. 1000000 items
100 % completed, 47776 vectors/s, 20.9 s

------------
Search 10. 100000 items
100 % completed, 24934 vectors/s, 4.0 s
Recall@1 95.44 %
Recall 95.49 %
index_gt (chunks to merge: 10)
$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin --index-gt-api --chunks-to-merge 10
- OpenMP threads: 24
- Dataset: 
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index: 
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark merge: 10

------------
Indexing. 100000 items
100 % completed, 37969 vectors/s, 2.6 s
Indexing. 100000 items
100 % completed, 36041 vectors/s, 2.8 s
Indexing. 100000 items
100 % completed, 33451 vectors/s, 3.0 s
Indexing. 100000 items
100 % completed, 34187 vectors/s, 2.9 s
Indexing. 100000 items
100 % completed, 36278 vectors/s, 2.8 s
Indexing. 100000 items
100 % completed, 34528 vectors/s, 2.9 s
Indexing. 100000 items
100 % completed, 36183 vectors/s, 2.8 s
Indexing. 100000 items
100 % completed, 36425 vectors/s, 2.7 s
Indexing. 100000 items
100 % completed, 35838 vectors/s, 2.8 s
Indexing. 100000 items
100 % completed, 36505 vectors/s, 2.7 s
Merging. 1000000 items
100 % completed, 24843 vectors/s, 40.3 s

------------
Search 10. 100000 items
100 % completed, 21528 vectors/s, 4.6 s
Recall@1 74.34 %
Recall 74.37 %
index_dense_gt (just for reference)
$ ./bench_cpp --vectors datasets/wiki_1M/base.1M.fbin --queries datasets/wiki_1M/query.public.100K.fbin --neighbors datasets/wiki_1M/groundtruth.public.100K.ibin
- OpenMP threads: 24
- Dataset: 
-- Base vectors path: datasets/wiki_1M/base.1M.fbin
-- Query vectors path: datasets/wiki_1M/query.public.100K.fbin
-- Ground truth neighbors path: datasets/wiki_1M/groundtruth.public.100K.ibin
-- Dimensions: 256
-- Vectors count: 1000000
-- Queries count: 100000
-- Neighbors per query: 10
- Index: 
-- Connectivity: 16
-- Expansion @ Add: 128
-- Expansion @ Search: 64
-- Quantization: f32
-- Metric: ip
-- Hardware acceleration: serial
Will benchmark in-memory

------------
Indexing. 1000000 items
100 % completed, 19388 vectors/s, 51.6 s
Search 10. 100000 items
100 % completed, 26497 vectors/s, 3.8 s
Recall@1 98.05 %
Recall 98.10 %
Join. 1 items
100 % completed, 0 vectors/s, 43.7 s
Recall Joins 98.00 %
Unmatched 1.69 % (16939 items)
Proposals 104.84 / man (104844832 total)
------------

Will benchmark an on-disk view

------------
Search 10. 100000 items
100 % completed, 27072 vectors/s, 3.7 s
Recall@1 98.05 %
Recall 98.10 %
Join. 1 items
100 % completed, 0 vectors/s, 39.5 s
Recall Joins 98.00 %
Unmatched 1.69 % (16939 items)
Proposals 104.84 / man (104844832 total)
------------

@kou commented Mar 25, 2025

@ashvardanian What do you think about the numbers? How should we proceed from here?

@ashvardanian

Hi @kou! Thanks for sharing the results! The accuracy drop for 10 chunks is too significant. Seems like we need a better algorithm. Do you have ideas on how to improve it?

@kou commented Mar 25, 2025

@ashvardanian I think the accuracy drop is caused by dropping neighbors when we merge neighbor lists (step 1 of the idea in #572 (comment)). We have a limited number of neighbor slots, configured by index_gt::connectivity().

How about re-computing neighbors only when we would need to drop more than N (index_gt::connectivity() * 0.1, or something like that) neighbors in the merge phase? It seems we can use index_gt::update() for the re-computation.
