Skip to content

Conversation

@kaivalnp
Copy link
Contributor

Description

Closes #14758

Demonstrating the proposal to de-duplicate raw vectors in Lucene!
Note: Right now this is very crude, and only for demonstration purposes.

@kaivalnp
Copy link
Contributor Author

File Layout

Today, the .vec file is partitioned per-field, and looks like:

# field 1 begin:  
(vector for field 1, document d1) # position x0  
(vector for field 1, document d2)  
(vector for field 1, document d3)  
# field 1 end, field 2 begin:  
(vector for field 2, document d1) # position x1  
(vector for field 2, document d3)  
# field 2 end, field 3 begin:  
(vector for field 3, document d1) # position x2  
(vector for field 3, document d2)  
# field 3 end, and so on...

The .vem file contains per-field tuples to denote (position, length) of the corresponding vector "block":

# (field number, offset of vector "block", length of vector "block", ...)  
# "..." represents other metadata, including dimension, ord -> doc mapping, etc.
(1, x0, x1 - x0, ...)  
(2, x1, x2 - x1, ...)  
# and so on...  

Proposing to change the .vec file to be partitioned per-document instead, something like:

# document d1 begin:  
(vector for field 1, document d1) # position x0  
(vector for field 2, document d1) # position x1  
(vector for field 3, document d1) # position x2  
# document d1 end, document 2 begin:  
(vector for field 1, document d2) # position x3  
(vector for field 3, document d2) # position x4  
# document d2 end, document 3 begin:  
(vector for field 1, document d3) # position x5  
(vector for field 2, document d3) # position x6  
# document d3 end, and so on...

Correspondingly, the .vem file will contain per-field mappings of ord -> position of vector in the raw file:

# (field number, ord -> position mapping as array, ...)  
# "..." represents other metadata, including dimension, ord -> doc mapping, etc. which is unchanged  
(1, [x0, x3, x5], ...) # {ord 0 -> position x0, ord 1 -> position x3, ord 2 -> position x5}  
(2, [x1, x4], ...) # {ord 0 -> position x1, ord 1 -> position x4}  
(3, [x2, x6], ...) # {ord 0 -> position x2, ord 1 -> position x6}  
# and so on...

In case of duplicate vectors within a document, we can simply "point" to a pre-existing vector, without writing another copy on disk!

Earlier, the offset of the vector at ordinal ord in field f was calculated by seeking to ord * vectorByteSize inside the vector "block" of field f.

Now, we're storing an additional ord -> position of vector mapping to "point" to the vector in the raw vector file, also used during search.

@kaivalnp
Copy link
Contributor Author

Notes

  • Right now this is a crude implementation, rough and inefficient, only for demonstration purposes!
  • Basically a copy of Lucene99FlatVectors*, except that raw vectors are de-duped and written according to the new layout^ during flush
    • Additionally, an ord -> position of vector mapping is stored and used during searching
  • Does not support an index sort yet
  • Does not support merging yet
    • This is mainly an API challenge, because vector merging is expected to be field-by-field -- but seems doable with a new finishMerge API that does the equivalent of flush?

Benchmark

In order to index everything in a single segment, I had to:

Made use of the option added in mikemccand/luceneutil#468 (filterStrategy) -- which creates and searches a separate KNN field with a subset of documents (with index-time-filter)

Cohere vectors, 768d, MAXIMUM_INNER_PRODUCT similarity

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)          filterStrategy  filterSelectivity  vec_disk(MB)  vec_RAM(MB)  indexType
 0.899        3.505   3.497        0.998  100000   100      50       32        250         no     6995    157.61        634.49            0.01             1          299.24   query-time-pre-filter               0.50       292.969      292.969       HNSW
 0.915        3.266   3.258        0.998  100000   100      50       32        250         no     5742      0.00      Infinity            0.11             1          299.24   query-time-pre-filter               0.20       292.969      292.969       HNSW
 0.903        2.351   2.343        0.997  100000   100      50       32        250         no     3657      0.00      Infinity            0.10             1          299.24   query-time-pre-filter               0.10       292.969      292.969       HNSW
 1.000        0.357   0.349        0.978  100000   100      50       32        250         no     1039      0.00      Infinity            0.10             1          299.24   query-time-pre-filter               0.01       292.969      292.969       HNSW
 0.498        1.185   1.178        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          299.24  query-time-post-filter               0.50       292.969      292.969       HNSW
 0.202        1.165   1.157        0.993  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          299.24  query-time-post-filter               0.20       292.969      292.969       HNSW
 0.100        1.230   1.222        0.993  100000   100      50       32        250         no     3986      0.00      Infinity            0.11             1          299.24  query-time-post-filter               0.10       292.969      292.969       HNSW
 0.010        1.181   1.173        0.993  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          299.24  query-time-post-filter               0.01       292.969      292.969       HNSW
 0.940        1.065   1.057        0.992  100000   100      50       32        250         no     3939    258.24        387.23            0.01             1          449.10       index-time-filter               0.50       292.969      292.969       HNSW
 0.961        0.913   0.906        0.992  100000   100      50       32        250         no     3568    196.23        509.62            0.01             1          359.42       index-time-filter               0.20       292.969      292.969       HNSW
 0.976        0.679   0.671        0.988  100000   100      50       32        250         no     3172    167.67        596.42            0.01             1          329.38       index-time-filter               0.10       292.969      292.969       HNSW
 1.000        0.160   0.152        0.950  100000   100      50       32        250         no      984    155.23        644.19            0.01             1          302.16       index-time-filter               0.01       292.969      292.969       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)          filterStrategy  filterSelectivity  vec_disk(MB)  vec_RAM(MB)  indexType
 0.904        3.705   3.697        0.998  100000   100      50       32        250         no     6982    159.17        628.28            0.01             1          300.00   query-time-pre-filter               0.50       292.969      292.969       HNSW
 0.917        3.455   3.448        0.998  100000   100      50       32        250         no     5786      0.00      Infinity            0.10             1          300.00   query-time-pre-filter               0.20       292.969      292.969       HNSW
 0.900        2.437   2.430        0.997  100000   100      50       32        250         no     3553      0.00      Infinity            0.10             1          300.00   query-time-pre-filter               0.10       292.969      292.969       HNSW
 1.000        0.366   0.358        0.978  100000   100      50       32        250         no     1023      0.00      Infinity            0.10             1          300.00   query-time-pre-filter               0.01       292.969      292.969       HNSW
 0.506        1.263   1.255        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.50       292.969      292.969       HNSW
 0.206        1.255   1.247        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.20       292.969      292.969       HNSW
 0.100        1.257   1.249        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.10       292.969      292.969       HNSW
 0.010        1.287   1.279        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.01       292.969      292.969       HNSW
 0.940        1.138   1.130        0.993  100000   100      50       32        250         no     3927    249.90        400.17            0.01             1          303.57       index-time-filter               0.50       292.969      292.969       HNSW
 0.963        1.001   0.993        0.992  100000   100      50       32        250         no     3598    188.64        530.11            0.01             1          301.41       index-time-filter               0.20       292.969      292.969       HNSW
 0.977        0.791   0.783        0.990  100000   100      50       32        250         no     3159    168.33        594.09            0.01             1          300.65       index-time-filter               0.10       292.969      292.969       HNSW
 1.000        0.209   0.201        0.962  100000   100      50       32        250         no     1023    155.47        643.22            0.01             1          300.05       index-time-filter               0.01       292.969      292.969       HNSW

Note the reduction in index_size(MB) (when index-time-filter is used) due to re-use of raw vectors!
There is a slight increase in latency with this PR, presumably because of the extra lookup step of the vector position..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support multiple HNSW graphs backed by the same vectors

1 participant