De-dup raw vectors? #15440

kaivalnp · 2025-11-20T04:19:39Z

Description

Demonstrating the proposal to de-duplicate raw vectors in Lucene!
Note: Right now this is very crude, and only for demonstration purposes.

kaivalnp · 2025-11-20T04:20:17Z

File Layout

Today, the .vec file is partitioned per-field, and looks like:

# field 1 begin:  
(vector for field 1, document d1) # position x0  
(vector for field 1, document d2)  
(vector for field 1, document d3)  
# field 1 end, field 2 begin:  
(vector for field 2, document d1) # position x1  
(vector for field 2, document d3)  
# field 2 end, field 3 begin:  
(vector for field 3, document d1) # position x2  
(vector for field 3, document d2)  
# field 3 end, and so on...

The .vem file contains per-field tuples to denote (position, length) of the corresponding vector "block":

# (field number, offset of vector "block", length of vector "block", ...)  
# "..." represents other metadata, including dimension, ord -> doc mapping, etc.
(1, x0, x1 - x0, ...)  
(2, x1, x2 - x1, ...)  
# and so on...
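
To make the current metadata concrete, here is a minimal Python sketch (not Lucene code) of how the per-field (offset, length) tuples fall out of contiguous per-field blocks. The helper name `per_field_blocks` and the 768-dimension float32 vectors are assumptions for illustration only.

```python
def per_field_blocks(vector_counts, vector_byte_size):
    """Return [(field_number, block_offset, block_length)] for contiguous per-field blocks."""
    blocks = []
    offset = 0
    for field_number, count in vector_counts:
        length = count * vector_byte_size
        blocks.append((field_number, offset, length))
        offset += length  # the next field's block starts where this one ends
    return blocks

# Fields 1, 2, 3 hold 3, 2, 2 vectors, as in the layout above.
# Assumed vector size: 768 float32 dimensions = 3072 bytes.
blocks = per_field_blocks([(1, 3), (2, 2), (3, 2)], vector_byte_size=768 * 4)
print(blocks)  # [(1, 0, 9216), (2, 9216, 6144), (3, 15360, 6144)]
```

Here x0 = 0, x1 = 9216 and x2 = 15360, so each block length is just the difference between consecutive block offsets.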

Proposing to change the .vec file to be partitioned per-document instead, something like:

# document d1 begin:  
(vector for field 1, document d1) # position x0  
(vector for field 2, document d1) # position x1  
(vector for field 3, document d1) # position x2  
# document d1 end, document d2 begin:  
(vector for field 1, document d2) # position x3  
(vector for field 3, document d2) # position x4  
# document d2 end, document d3 begin:  
(vector for field 1, document d3) # position x5  
(vector for field 2, document d3) # position x6  
# document d3 end, and so on...

Correspondingly, the .vem file will contain per-field mappings of ord -> position of vector in the raw file:

# (field number, ord -> position mapping as array, ...)  
# "..." represents other metadata, including dimension, ord -> doc mapping, etc. which is unchanged  
(1, [x0, x3, x5], ...) # {ord 0 -> position x0, ord 1 -> position x3, ord 2 -> position x5}  
(2, [x1, x6], ...) # {ord 0 -> position x1, ord 1 -> position x6}  
(3, [x2, x4], ...) # {ord 0 -> position x2, ord 1 -> position x4}  
# and so on...

In case of duplicate vectors within a document, we can simply "point" to a pre-existing vector, without writing another copy on disk!
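
A rough Python sketch of this idea (not the code in this PR): vectors are written document by document, and a vector already written for the same document is re-used by recording its existing position. The function name `write_per_document`, the exact-equality check for duplicates, and the in-memory list standing in for the .vec file are all assumptions for illustration.

```python
def write_per_document(docs, vector_byte_size):
    """docs: one {field_number: vector} dict per document, in doc-id order."""
    data = []             # stand-in for the .vec file: vectors in write order
    ord_to_position = {}  # field number -> byte positions, indexed by ordinal
    offset = 0
    for fields in docs:
        seen = {}         # vector -> position, scoped to the current document
        for field_number in sorted(fields):
            vector = tuple(fields[field_number])
            if vector not in seen:
                # First time this document has this vector: write it.
                seen[vector] = offset
                data.append(vector)
                offset += vector_byte_size
            # Duplicate or not, the field's next ordinal points at the stored copy.
            ord_to_position.setdefault(field_number, []).append(seen[vector])
    return data, ord_to_position

docs = [
    {1: (0.1, 0.2), 2: (0.1, 0.2)},  # d1: field 2 duplicates field 1
    {1: (0.3, 0.4)},                 # d2: only field 1
    {1: (0.5, 0.6), 2: (0.7, 0.8)},  # d3: two distinct vectors
]
data, ord_to_position = write_per_document(docs, vector_byte_size=8)
print(len(data))        # 4 vectors written instead of 5
print(ord_to_position)  # {1: [0, 8, 16], 2: [0, 24]}
```

Both fields point at position 0 for d1, so the duplicate costs one mapping entry rather than a second copy of the vector.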

Earlier, the offset of the vector at ordinal ord in field f was calculated by seeking to ord * vectorByteSize inside the vector "block" of field f.

Now, we're storing an additional ord -> position of vector mapping to "point" to the vector in the raw vector file, also used during search.
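
The difference between the two lookups can be sketched as follows. This is a simplified Python illustration, not the reader code in this PR; `old_position`, `new_position`, `read_vector` and the little-endian float32 encoding are assumptions made for the example.

```python
import struct

def old_position(block_offset, ordinal, vector_byte_size):
    # Per-field layout: vectors of a field are contiguous, so the position is computed.
    return block_offset + ordinal * vector_byte_size

def new_position(ord_to_position, ordinal):
    # Per-document layout: the position is looked up in the stored mapping.
    return ord_to_position[ordinal]

def read_vector(buf, position, dimension):
    # Decode `dimension` float32 values starting at `position`.
    return struct.unpack_from(f"<{dimension}f", buf, position)

# Three 2-dimensional vectors packed back to back (8 bytes each).
buf = struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)

# Old scheme: ordinal 2 of a field whose block starts at offset 0.
print(read_vector(buf, old_position(0, 2, 8), 2))     # (5.0, 6.0)

# New scheme: a field whose ordinal 0 happens to live at byte position 16.
print(read_vector(buf, new_position([16, 0], 0), 2))  # (5.0, 6.0)
```

The new scheme trades a multiplication for an extra read of the mapping on every vector access.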

kaivalnp · 2025-11-20T04:24:29Z

Notes

- Right now this is a crude implementation, rough and inefficient, only for demonstration purposes!
- Basically a copy of Lucene99FlatVectors*, except that raw vectors are de-duped and written according to the new layout (described above) during flush
  - Additionally, an ord -> position of vector mapping is stored and used during searching
- Does not support an index sort yet
- Does not support merging yet
  - This is mainly an API challenge, because vector merging is expected to be field-by-field -- but seems doable with a new finishMerge API that does the equivalent of flush?

Benchmark

In order to index everything in a single segment, I had to:

- Set number of indexing threads to 1
- Increase the writer buffer to be sufficiently high for all vectors

Made use of the option added in mikemccand/luceneutil#468 (filterStrategy) -- which creates and searches a separate KNN field with a subset of documents (with index-time-filter)

Cohere vectors, 768d, MAXIMUM_INNER_PRODUCT similarity

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)          filterStrategy  filterSelectivity  vec_disk(MB)  vec_RAM(MB)  indexType
 0.899        3.505   3.497        0.998  100000   100      50       32        250         no     6995    157.61        634.49            0.01             1          299.24   query-time-pre-filter               0.50       292.969      292.969       HNSW
 0.915        3.266   3.258        0.998  100000   100      50       32        250         no     5742      0.00      Infinity            0.11             1          299.24   query-time-pre-filter               0.20       292.969      292.969       HNSW
 0.903        2.351   2.343        0.997  100000   100      50       32        250         no     3657      0.00      Infinity            0.10             1          299.24   query-time-pre-filter               0.10       292.969      292.969       HNSW
 1.000        0.357   0.349        0.978  100000   100      50       32        250         no     1039      0.00      Infinity            0.10             1          299.24   query-time-pre-filter               0.01       292.969      292.969       HNSW
 0.498        1.185   1.178        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          299.24  query-time-post-filter               0.50       292.969      292.969       HNSW
 0.202        1.165   1.157        0.993  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          299.24  query-time-post-filter               0.20       292.969      292.969       HNSW
 0.100        1.230   1.222        0.993  100000   100      50       32        250         no     3986      0.00      Infinity            0.11             1          299.24  query-time-post-filter               0.10       292.969      292.969       HNSW
 0.010        1.181   1.173        0.993  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          299.24  query-time-post-filter               0.01       292.969      292.969       HNSW
 0.940        1.065   1.057        0.992  100000   100      50       32        250         no     3939    258.24        387.23            0.01             1          449.10       index-time-filter               0.50       292.969      292.969       HNSW
 0.961        0.913   0.906        0.992  100000   100      50       32        250         no     3568    196.23        509.62            0.01             1          359.42       index-time-filter               0.20       292.969      292.969       HNSW
 0.976        0.679   0.671        0.988  100000   100      50       32        250         no     3172    167.67        596.42            0.01             1          329.38       index-time-filter               0.10       292.969      292.969       HNSW
 1.000        0.160   0.152        0.950  100000   100      50       32        250         no      984    155.23        644.19            0.01             1          302.16       index-time-filter               0.01       292.969      292.969       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)          filterStrategy  filterSelectivity  vec_disk(MB)  vec_RAM(MB)  indexType
 0.904        3.705   3.697        0.998  100000   100      50       32        250         no     6982    159.17        628.28            0.01             1          300.00   query-time-pre-filter               0.50       292.969      292.969       HNSW
 0.917        3.455   3.448        0.998  100000   100      50       32        250         no     5786      0.00      Infinity            0.10             1          300.00   query-time-pre-filter               0.20       292.969      292.969       HNSW
 0.900        2.437   2.430        0.997  100000   100      50       32        250         no     3553      0.00      Infinity            0.10             1          300.00   query-time-pre-filter               0.10       292.969      292.969       HNSW
 1.000        0.366   0.358        0.978  100000   100      50       32        250         no     1023      0.00      Infinity            0.10             1          300.00   query-time-pre-filter               0.01       292.969      292.969       HNSW
 0.506        1.263   1.255        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.50       292.969      292.969       HNSW
 0.206        1.255   1.247        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.20       292.969      292.969       HNSW
 0.100        1.257   1.249        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.10       292.969      292.969       HNSW
 0.010        1.287   1.279        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.01       292.969      292.969       HNSW
 0.940        1.138   1.130        0.993  100000   100      50       32        250         no     3927    249.90        400.17            0.01             1          303.57       index-time-filter               0.50       292.969      292.969       HNSW
 0.963        1.001   0.993        0.992  100000   100      50       32        250         no     3598    188.64        530.11            0.01             1          301.41       index-time-filter               0.20       292.969      292.969       HNSW
 0.977        0.791   0.783        0.990  100000   100      50       32        250         no     3159    168.33        594.09            0.01             1          300.65       index-time-filter               0.10       292.969      292.969       HNSW
 1.000        0.209   0.201        0.962  100000   100      50       32        250         no     1023    155.47        643.22            0.01             1          300.05       index-time-filter               0.01       292.969      292.969       HNSW

Note the reduction in index_size(MB) (when index-time-filter is used) due to re-use of raw vectors!
There is a slight increase in latency with this PR, presumably because of the extra step of looking up each vector's position.
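
As a back-of-the-envelope check on these numbers, the following Python snippet redoes the size arithmetic from the tables. It assumes float32 vectors and that "MB" in the benchmark output means 1024 * 1024 bytes (which matches the reported vec_disk(MB)); the 8-bytes-per-ordinal figure for the mapping is a guess that happens to fit the tables, not something stated in this PR.

```python
MB = 1024 * 1024
n_doc, dim, float_bytes = 100_000, 768, 4

# Raw vectors for one field: exactly 292.96875 MB, reported as 292.969 in vec_disk(MB).
raw_mb = n_doc * dim * float_bytes / MB

# A second field holding 50% of the documents duplicates half of the raw vectors.
dup_mb_at_half = 0.50 * raw_mb  # 146.484375 MB

# index_size(MB) growth from query-time filtering to index-time-filter at selectivity 0.50.
main_growth = 449.10 - 299.24  # about 149.86 MB on main
pr_growth = 303.57 - 300.00    # about 3.57 MB with this PR

# Fixed overhead of this PR with a single field: 300.00 - 299.24 MB.
mapping_overhead = 300.00 - 299.24
guess_8_bytes_per_ord = n_doc * 8 / MB  # about 0.763 MB, if each ordinal costs 8 bytes

print(f"{raw_mb} {dup_mb_at_half}")
print(f"{main_growth:.2f} {pr_growth:.2f} {mapping_overhead:.2f} {guess_8_bytes_per_ord:.3f}")
```

On main, most of the ~149.86 MB growth is the ~146.48 MB of duplicated raw vectors; with this PR the growth drops to ~3.57 MB, since only the additional data for the second field has to be written and the raw vectors are shared.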

Kaival Parikh added 2 commits November 19, 2025 20:02

Copy Lucene99 flat vector format -> Lucene104

3a2b3bf

Crude version of the proposal

c5f1d2b

github-actions bot added module:core/index module:core/codecs labels Nov 20, 2025

kaivalnp mentioned this pull request Nov 20, 2025

Support multiple HNSW graphs backed by the same vectors #14758

Open
