
Add a Better Binary Quantizer format for dense vectors #13651

Draft · wants to merge 163 commits into main

Conversation

@benwtrent (Member) commented Aug 13, 2024

High-level design

RaBitQ is essentially a better binary quantization, and it works across all the models we have tested against. Like PQ, it does require coarse-grained clustering to be effective at higher vector densities (effective meaning only 5x or lower oversampling is required for recall > 95%). But in our testing, the number of vectors required per cluster can be exceptionally large (10s to 100s of millions).

The euclidean vectors as stored in the index:

| quantized vector | distance_to_centroid | vector magnitude |
|---|---|---|
| (vector_dimension/8) bytes | float | float |

For dot-product vectors:

| quantized vector | vector dot-product with binarized self | vector magnitude | centroid dot-product |
|---|---|---|---|
| (vector_dimension/8) bytes | float | float | float |

The vector metadata stores, in addition to all the regular things (similarity, encoding, sparse vector DISI, etc.), the quantization state.
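To make the per-vector layouts above concrete, here is a minimal Java sketch of reading one euclidean-style record from a positioned `IndexInput`. This is not the actual codec code: the record and method names are illustrative, and it assumes the correction floats are written as raw int bits.

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Hypothetical holder for one quantized euclidean vector record.
record BinarizedVectorRecord(byte[] code, float distToCentroid, float magnitude) {}

final class LayoutSketch {
  // Reads (dims/8) quantized bytes followed by the two float corrections.
  static BinarizedVectorRecord readEuclidean(IndexInput in, int dims) throws IOException {
    byte[] code = new byte[dims / 8]; // 1 bit per dimension
    in.readBytes(code, 0, code.length);
    // Assumption: floats stored as int bits, as Lucene codecs commonly do.
    float distToCentroid = Float.intBitsToFloat(in.readInt());
    float magnitude = Float.intBitsToFloat(in.readInt());
    return new BinarizedVectorRecord(code, distToCentroid, magnitude);
  }
}
```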

For indexing into HNSW we actually have a multi-step process. Better binary quantization encodes query vectors differently from index vectors. Consequently, during segment merge and HNSW building, another temporary file is written containing the query-quantized vectors over the configured centroids. One downside is that this temporary file is actually larger than the regular vector index, because we use asymmetric quantization to keep good information around. Once the merge is complete, this file is deleted.

We then read from the temporary query file when adding a vector to the graph, and when exploring HNSW we search against the indexed quantized values.
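For intuition only, below is a rough sketch of what asymmetric quantization can look like: document vectors reduced to 1 bit per dimension, query vectors kept at a wider width (4 bits here, stored as bit-planes) so the dot product reduces to popcounts. The 4-bit query width, the centering assumption, and all names are illustrative assumptions, not the PR's actual encoding; they only show why the query-side file ends up larger than the doc-side index.

```java
// Illustrative asymmetric binary quantization sketch (not the real format;
// centroid handling and correction factors are omitted).
final class AsymmetricBQSketch {

  /** Quantize an (already centered) document vector to 1 bit per dimension. */
  static long[] quantizeDoc(float[] v) {
    long[] bits = new long[(v.length + 63) / 64];
    for (int i = 0; i < v.length; i++) {
      if (v[i] > 0) {
        bits[i >> 6] |= 1L << (i & 63);
      }
    }
    return bits;
  }

  /** Quantize a query vector to 4 bits per dimension, stored as 4 bit-planes. */
  static long[][] quantizeQuery(float[] v, float min, float max) {
    long[][] planes = new long[4][(v.length + 63) / 64];
    float step = Math.max((max - min) / 15f, Float.MIN_NORMAL); // avoid div-by-zero
    for (int i = 0; i < v.length; i++) {
      int q = Math.min(15, Math.max(0, Math.round((v[i] - min) / step)));
      for (int b = 0; b < 4; b++) {
        if (((q >> b) & 1) != 0) {
          planes[b][i >> 6] |= 1L << (i & 63);
        }
      }
    }
    return planes;
  }

  /** Asymmetric dot product: sum over dimensions of queryCode * docBit. */
  static long dot(long[][] queryPlanes, long[] docBits) {
    long sum = 0;
    for (int b = 0; b < 4; b++) {
      long planeSum = 0;
      for (int w = 0; w < docBits.length; w++) {
        planeSum += Long.bitCount(queryPlanes[b][w] & docBits[w]);
      }
      sum += planeSum << b; // bit-plane b carries weight 2^b
    }
    return sum;
  }
}
```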

closes: #13650

@ChrisHegarty requested review from ChrisHegarty and then removed the request, August 14, 2024 13:43
@mayya-sharipova (Contributor) commented Aug 21, 2024

@benwtrent

> possibly switch to LongValues for storing vectorOrd -> centroidOrd mapping

I was thinking about adding the centroid mappings as LongValues at the end of the meta file, but this could potentially make the meta file quite large (for 100M docs, we would need an extra 100MB). We really try to keep meta files small, so I would prefer either:

  • keeping the current approach (add a byte at the end of each vector in the vectors file). Indeed, this may throw off the paging size, but maybe the effect on memory-mapped files is not big?
  • adding an extra file for the centroid mapping. The centroid mapping can be accessed through a memory-mapped file, or loaded directly into memory on first use.

What do you think?

For now, we keep the 1st (current) approach.

@benwtrent (Member, Author)

100MB assumes that, even when compressed, it's a single byte per centroid. 100M vectors might only have 2 centroids and thus only need two bits to store.

Also, I would expect the centroids to be at the end of the "veb" file, not the metadata, like we already do for the sparse vector ord-to-doc resolution.

But either solution needs testing, for sure.
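For reference, if the LongValues route were explored, a minimal sketch using Lucene's `DirectWriter`/`DirectReader` packed-ints utilities might look like the following. The helper names and the idea of sizing the bit width from the centroid count are assumptions for illustration, not code from this PR.

```java
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RandomAccessInput;
import org.apache.lucene.util.LongValues;
import org.apache.lucene.util.packed.DirectReader;
import org.apache.lucene.util.packed.DirectWriter;

final class CentroidMapSketch {
  // Append the vectorOrd -> centroidOrd mapping using the minimal supported bit width,
  // e.g. 1 bit per vector when there are only 2 centroids.
  static void write(IndexOutput out, int[] centroidOrds, int numCentroids) throws IOException {
    int bits = DirectWriter.bitsRequired(Math.max(1, numCentroids - 1));
    DirectWriter writer = DirectWriter.getInstance(out, centroidOrds.length, bits);
    for (int ord : centroidOrds) {
      writer.add(ord);
    }
    writer.finish();
  }

  // Random access over the mapping via a memory-mapped slice; same bit width as at write time.
  static LongValues read(RandomAccessInput slice, int numCentroids) throws IOException {
    int bits = DirectWriter.bitsRequired(Math.max(1, numCentroids - 1));
    return DirectReader.getInstance(slice, bits);
  }
}
```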

@benwtrent marked this pull request as ready for review, October 18, 2024 20:19
@benwtrent (Member, Author)

I will open a PR against Lucene Util to update it to utilize these formats and show y'all some runs with it soon. But the PR is ready for general review.

@john-wagster left a comment


LGTM

@benwtrent (Member, Author)

Here is some Lucene Util benchmarking. Some of these numbers actually contradict some of my previous benchmarking for int4, which is frustrating; I wonder what I did wrong then or now. Or maybe float32 got faster between then and now :)

Regardless, this shows that bit quantization is generally as fast as int4 search or faster, and that you can get good recall with some oversampling. Combined with the 32x reduction in space, it's pretty nice.

The oversampling rates were [1, 1.5, 2, 3, 4, 5]. HNSW params: m=16, efSearch=100. Recall@100.
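As context for how oversampling is typically applied (the general oversample-then-rerank pattern, not necessarily the exact code path in this PR): retrieve roughly `k * oversample` candidates from the quantized index, rescore them with the full-precision vectors, and keep the best `k`. A hedged sketch, with all names hypothetical:

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical candidate hit from the quantized search.
record Hit(int docId, float score) {}

final class OversampleSketch {
  // Rescore the oversampled candidates with full-precision vectors and keep the top-k.
  static List<Hit> rerank(List<Hit> oversampled, float[][] vectors, float[] query, int k) {
    PriorityQueue<Hit> topK = new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
    for (Hit h : oversampled) {
      float exact = dot(vectors[h.docId()], query);
      topK.offer(new Hit(h.docId(), exact));
      if (topK.size() > k) {
        topK.poll(); // drop the current worst
      }
    }
    return topK.stream().sorted(Comparator.comparingDouble(Hit::score).reversed()).toList();
  }

  static float dot(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```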

Cohere v2 1M

| quantization | Index Time | Force Merge time | Mem Required |
|---|---|---|---|
| 1 bit | 395.18 | 411.67 | 175.9MB |
| 4 bit (compress) | 1877.47 | 491.13 | 439.7MB |
| 7 bit | 500.59 | 820.53 | 833.9MB |
| raw | 493.44 | 792.04 | 3132.8MB |

[chart: cohere-v2-bit-1M]

Cohere v3 1M (1024 dims)

| quantization | Index Time | Force Merge time | Mem Required |
|---|---|---|---|
| 1 bit | 338.97 | 342.61 | 208MB |
| 4 bit (compress) | 1113.06 | 5490.36 | 578MB |
| 7 bit | 437.63 | 744.12 | 1094MB |
| raw | 408.75 | 798.11 | 4162MB |

[chart: cohere-v3-bit-1M]

e5Small

| quantization | Index Time | Force Merge time | Mem Required |
|---|---|---|---|
| 1 bit | 161.84 | 42.37 | 57.6MB |
| 4 bit (compress) | 665.54 | 660.33 | 123.2MB |
| 7 bit | 267.13 | 89.99 | 219.6MB |
| raw | 249.26 | 77.81 | 793.5MB |

[chart: e5small-bit-500k]

@ChrisHegarty (Contributor) left a comment


LGTM

* <li><b>vint</b> the vector dimensions
* <li><b>vlong</b> the offset to the vector data in the .veb file
* <li><b>vlong</b> the length of the vector data in the .veb file
* <li><b>vint</b> the number of vectors
@mayya-sharipova (Contributor) commented Oct 30, 2024


Also:

 <li><b>[float]</b> clusterCenter
 <li><b>int</b> dotProduct of clusterCenter with itself

@mayya-sharipova (Contributor) left a comment


Amazing work! Thanks Ben and the team!

@benwtrent marked this pull request as draft, November 1, 2024 15:06
@benwtrent (Member, Author)

Hey @ShashwatShivam, this is the testing script I use: mikemccand/luceneutil@main...benwtrent:luceneutil:bbq

But if Lucene has since been updated with a 101 codec, I would need to update this branch.

@ShashwatShivam

@benwtrent thanks for the link to the testing script, it works! One question: the index size it reports is larger than the HNSW index size. For example, working with a Cohere 768-dim dataset with 500k docs, the index sizes were 1488.83 MB for HNSW and 1544.79 MB for RaBitQ (Lucene101HnswBinaryQuantizedVectorsFormat), which seems incorrect. Could you please tell me why this discrepancy occurs, and whether you've seen this before?

@benwtrent (Member, Author)

@ShashwatShivam why do you think the index size (total size of all the files) should be smaller?

We store both the binary quantized vectors and the floating point vectors, so I would expect about a 5% increase in disk size from the vectors alone.

I have also noticed that the HNSW graph itself ends up being more densely connected, but this is only a marginal increase in disk space as well.
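As a rough back-of-the-envelope illustration (not a figure from the PR itself): a 768-dimension float32 vector occupies 768 × 4 = 3072 bytes, while its 1-bit code occupies 768 / 8 = 96 bytes plus a handful of correction floats, so keeping both adds roughly 3-4% to the raw vector storage before any graph overhead.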

@ShashwatShivam

@benwtrent makes sense, I wasn't accounting for the fact that the floating point vectors are stored too. I guess I should have instead asked how to reproduce the 'memory required' column, which shows a marked reduction for 1-bit quantization vs. raw?

@benwtrent (Member, Author)

@ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed them.

@benwtrent changed the title from "Add a Better Binary Quantizer (RaBitQ) format for dense vectors" to "Add a Better Binary Quantizer format for dense vectors", Nov 8, 2024
@ShashwatShivam

Hey @benwtrent,
Thank you for all your help so far! I have a question about the oversampling used to increase recall. From what I understand, it scales up the top-k and fanout values by the oversampling factor. In the final match set, do we return only the best top-k documents (not scaled up, but the original value)? I couldn't locate the code where the reranking or selection of the best k results from the expanded match set happens. Could you please help me find that part?
Thanks again!


Successfully merging this pull request may close these issues.

Add higher quantization level for kNN vector search