Add a Better Binary Quantizer format for dense vectors #13651

Draft
benwtrent wants to merge 163 commits into main from benwtrent:feature/adv-binarization-format
Conversation

@benwtrent (Member) commented Aug 13, 2024

High-level design

RaBitQ is essentially a better binary quantization, and it works across all the models we have tested against. Like PQ, it does require coarse-grained clustering to be effective at higher vector densities (effective being defined as requiring only 5x or lower oversampling for recall > 95%). But in our testing, the number of vectors a single cluster can serve before more clusters are needed is exceptionally large (tens to hundreds of millions).

The Euclidean vectors as stored in the index:

| quantized vector | distance_to_centroid | vector magnitude |
| --- | --- | --- |
| (vector_dimension / 8) bytes | float | float |

For dot-product vectors:

| quantized vector | dot-product with binarized self | vector magnitude | centroid dot-product |
| --- | --- | --- | --- |
| (vector_dimension / 8) bytes | float | float | float |
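For illustration, these per-vector layouts could be modeled as plain records like the following; the type and field names are hypothetical, not the PR's actual classes:

```java
// Illustrative only: hypothetical records mirroring the on-disk layouts above.
record EuclideanEntry(
    byte[] quantized,          // ceil(dims / 8) bytes, one bit per dimension
    float distanceToCentroid,  // corrective term: distance from the raw vector to its centroid
    float magnitude) {}        // corrective term: magnitude of the raw vector

record DotProductEntry(
    byte[] quantized,            // ceil(dims / 8) bytes, one bit per dimension
    float selfDotProduct,        // dot product of the raw vector with its binarized self
    float magnitude,
    float centroidDotProduct) {} // dot product of the raw vector with the centroid
```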

The vector metadata contains all the regular things (similarity, encoding, sparse vector DISI, etc.) plus the quantization state, such as the cluster centroid.

For indexing into HNSW we actually have a multi-step process. Better binary encodes the query vectors differently than the index vectors. Consequently, during segment merging and HNSW building, another temporary file is written containing the query-quantized vectors over the configured centroids. One downside is that this temporary file will actually be larger than the regular vector index, because we use asymmetric quantization to keep good information around. But once the merge is complete, this file is deleted.

We then read from the query temporary file when adding a vector to the graph, and while exploring HNSW we score against the indexed quantized values.
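To make the asymmetric part concrete, here is a minimal, self-contained sketch of the general technique assumed above (the exact bit widths, packing, and corrective terms in this PR may differ): the indexed vectors keep 1 bit per dimension while the query encoding keeps several bits per dimension stored as bit-planes, so the quantized inner product reduces to an AND plus popcount per plane.

```java
// A hedged sketch of asymmetric bit scoring, not the PR's actual scorer.
final class AsymmetricBitScorer {
  /**
   * docBits: 1 bit per dimension, packed 8 dims per byte.
   * queryPlanes[p]: bit p of the multi-bit query code, packed the same way.
   */
  static long quantizedDotProduct(byte[] docBits, byte[][] queryPlanes) {
    long total = 0;
    for (int p = 0; p < queryPlanes.length; p++) {
      long planeCount = 0;
      for (int i = 0; i < docBits.length; i++) {
        planeCount += Integer.bitCount((docBits[i] & queryPlanes[p][i]) & 0xFF);
      }
      total += planeCount << p; // bit-plane p contributes with weight 2^p
    }
    return total;
  }
}
```

This also shows why the query-side temporary file is the larger one: it holds multiple bit-planes per vector instead of a single one.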

closes: #13650

@ChrisHegarty ChrisHegarty requested review from ChrisHegarty and removed request for ChrisHegarty August 14, 2024 13:43
@mayya-sharipova (Contributor) commented Aug 21, 2024

@benwtrent

> possibly switch to LongValues for storing vectorOrd -> centroidOrd mapping

I was thinking about adding the centroid mappings as LongValues at the end of the meta file, but this could potentially make the meta file quite large (for 100M docs, we would need an extra 100 MB). We really try to keep meta files small, so I would prefer either:

  • keeping the current approach (add a byte at the end of each vector in the vectors file). Indeed, this may throw off the paging size, but maybe the effect on memory-mapped files is not big?
  • adding an extra file for the centroid mapping. It can be accessed through a memory-mapped file, or loaded directly into memory on first use.

What do you think?

For now, we are keeping the 1st (current) approach.

@benwtrent (Member, Author)

100MB assumes that, even when compressed, it's a single byte per centroid. 100M vectors might only have 2 centroids and thus only need two bits to store.

Also, I would expect the centroids to be at the end of the "veb" file, not in the metadata, like we already do for the sparse vector ord-to-doc resolution.

But, either solution needs testing for sure.
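For a rough sense of scale under a packed encoding (my arithmetic, not measured from this PR): the mapping needs about `numVectors * ceil(log2(numCentroids)) / 8` bytes. With 100M vectors and, say, 4 centroids, that is 10^8 * 2 / 8 ≈ 25 MB, versus ~100 MB at a full byte per vector, so packing pays off as long as the centroid count stays small.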

@john-wagster left a comment

LGTM

@benwtrent (Member, Author)

Here is some Lucene Util benchmarking. Some of these numbers actually contradict some of my previous benchmarking for int4, which is frustrating; I wonder what I did wrong then or now. Or maybe float32 got faster between then and now :)

Regardless, this shows that bit quantization is generally as fast as int4 search or faster, and you can get good recall with some oversampling. Combined with the 32x reduction in space, it's pretty nice.

The oversampling rates were [1, 1.5, 2, 3, 4, 5]. HNSW params: m=16, efSearch=100. Recall@100.

Cohere v2 1M

| quantization | Index Time | Force Merge time | Mem Required |
| --- | --- | --- | --- |
| 1 bit | 395.18 | 411.67 | 175.9MB |
| 4 bit (compress) | 1877.47 | 491.13 | 439.7MB |
| 7 bit | 500.59 | 820.53 | 833.9MB |
| raw | 493.44 | 792.04 | 3132.8MB |

[chart: cohere-v2-bit-1M]

Cohere v3 1M (1024 dims)

| quantization | Index Time | Force Merge time | Mem Required |
| --- | --- | --- | --- |
| 1 bit | 338.97 | 342.61 | 208MB |
| 4 bit (compress) | 1113.06 | 5490.36 | 578MB |
| 7 bit | 437.63 | 744.12 | 1094MB |
| raw | 408.75 | 798.11 | 4162MB |

[chart: cohere-v3-bit-1M]

e5Small

| quantization | Index Time | Force Merge time | Mem Required |
| --- | --- | --- | --- |
| 1 bit | 161.84 | 42.37 | 57.6MB |
| 4 bit (compress) | 665.54 | 660.33 | 123.2MB |
| 7 bit | 267.13 | 89.99 | 219.6MB |
| raw | 249.26 | 77.81 | 793.5MB |

[chart: e5small-bit-500k]

@ChrisHegarty (Contributor) left a comment

LGTM

* <li><b>vint</b> the vector dimensions
* <li><b>vlong</b> the offset to the vector data in the .veb file
* <li><b>vlong</b> the length of the vector data in the .veb file
* <li><b>vint</b> the number of vectors
@mayya-sharipova (Contributor) commented Oct 30, 2024

Also:

 <li><b>[float]</b> clusterCenter
 <li><b>int</b> dotProduct of clusterCenter with itself
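For illustration only, here is a sketch of reading fields like the ones listed above with Lucene's DataInput primitives; the field order and the assumption that the final value is stored as raw float bits are mine, not necessarily the PR's actual layout:

```java
import java.io.IOException;
import org.apache.lucene.store.DataInput;

// Hypothetical reader for the per-field meta entry sketched in the javadoc above.
final class MetaEntrySketch {
  int dims;
  long vectorDataOffset;
  long vectorDataLength;
  int numVectors;
  float[] clusterCenter;
  float centerSelfDotProduct;

  void read(DataInput in) throws IOException {
    dims = in.readVInt();              // vint: the vector dimensions
    vectorDataOffset = in.readVLong(); // vlong: offset of the vector data in the .veb file
    vectorDataLength = in.readVLong(); // vlong: length of the vector data in the .veb file
    numVectors = in.readVInt();        // vint: the number of vectors
    clusterCenter = new float[dims];   // [float]: clusterCenter
    for (int i = 0; i < dims; i++) {
      clusterCenter[i] = Float.intBitsToFloat(in.readInt());
    }
    // int: dot product of clusterCenter with itself (assumed stored as raw float bits)
    centerSelfDotProduct = Float.intBitsToFloat(in.readInt());
  }
}
```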

@mayya-sharipova (Contributor) left a comment

Amazing work! Thanks Ben and the team!

@benwtrent benwtrent marked this pull request as draft November 1, 2024 15:06
@benwtrent (Member, Author)

Hey @ShashwatShivam, mikemccand/luceneutil@main...benwtrent:luceneutil:bbq is the testing script I use.

But if Lucene has since been updated with a 101 codec, I would need to update this branch.

@ShashwatShivam

@benwtrent thanks for giving the link to the testing script, it works! One question: the index size it reports is larger than the HNSW index size. For example, I was working with a Cohere 768-dim dataset with 500k docs, and the index sizes were 1488.83 MB and 1544.79 MB for HNSW and RaBitQ (Lucene101HnswBinaryQuantizedVectorsFormat) respectively, which seems incorrect. Could you please tell me why this discrepancy occurs, if you've seen this issue before?

@benwtrent (Member, Author)

@ShashwatShivam why do you think the index size (total size of all the files) should be smaller?

We store both the binary quantized vectors and the floating-point vectors. So I would expect about a 5% increase in disk size from the vectors alone.

I have also noticed that the HNSW graph itself ends up being more densely connected, but this is only a marginal increase in disk space as well.
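As a rough back-of-the-envelope check (my arithmetic, not numbers from the PR): at 768 dims, each raw float32 vector is 768 * 4 = 3072 bytes, while the 1-bit code plus a few corrective floats is roughly 96 + 12 ≈ 108 bytes, i.e. about 3.5% extra on the vector data alone. That lines up with the ~3.8% difference reported above (1544.79 MB vs. 1488.83 MB), with the denser graph accounting for part of the remainder.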

@ShashwatShivam

@benwtrent makes sense, I wasn't accounting for the fact that the floating-point vectors are stored too. I guess I should have instead asked how to reproduce the 'memory required' column, which shows a marked reduction for 1-bit quantization vs. raw?

@benwtrent (Member, Author)

@ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed their sizes together.
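If it helps, here is a hedged sketch of reproducing that kind of number by summing the .veb and .vex file sizes in an index directory (the directory path is an argument; which extensions to include is up to you):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public final class QuantizedSizeSum {
  public static void main(String[] args) throws IOException {
    Path indexDir = Path.of(args[0]); // e.g. the Lucene index directory used by the benchmark
    try (Stream<Path> files = Files.list(indexDir)) {
      long bytes = files
          .filter(p -> p.toString().endsWith(".veb") || p.toString().endsWith(".vex"))
          .mapToLong(p -> {
            try {
              return Files.size(p);
            } catch (IOException e) {
              return 0L; // skip unreadable files in this sketch
            }
          })
          .sum();
      System.out.printf("quantized vectors + graph: %.1f MB%n", bytes / (1024.0 * 1024.0));
    }
  }
}
```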

@benwtrent benwtrent changed the title Add a Better Binary Quantizer (RaBitQ) format for dense vectors Add a Better Binary Quantizer format for dense vectors Nov 8, 2024
@ShashwatShivam

Hey @benwtrent,
Thank you for all your help so far! I have a question about the oversampling used to increase recall. From what I understand, it scales up the top-k and fanout values by the oversampling factor. In the final match set, do we return only the best top-k documents (not scaled up, but the original value)? I couldn't locate the code where the reranking or selection of the best k results from the expanded match set happens. Could you please help me find that part?
Thanks again!

@mikemccand (Member)

> @ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed their sizes together.

Once this cool change is merged let's fix luceneutil's KNN benchy tooling (knnPerfTest.py, KnnGraphTester.java) to compute/report the "memory column" ("hot RAM", "searchable RAM", something)? Basically everything except the original (float32 or byte) vectors. I'll open an upstream luceneutil issue...

@benwtrent (Member, Author)

Quick update: we have been bothered by some of the numbers (for example, models like "gist" perform poorly), and we have some improvements to make before flipping this back to "ready for review".

@mikemccand YES! That would be great! "Memory required" would be the quantized file size + hnsw graph file size (if the graph exists).

@ShashwatShivam

Sorry for the late reply. There are no "out of the box" rescoring actions directly in Lucene, mainly because the individual tools are (mostly) already available to you. You can ask for more vectors overall with one query, and then rescore the individual documents according to the raw vector comparisons. I admit, this requires some Lucene API know-how.

It would be good for a "vector scorer" to indicate whether it's an estimation or not, to allow for smarter actions in the kNN doc collector...
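To make that concrete, here is a hedged sketch of the "ask for more, then rescore with the raw vectors" pattern, written against the Lucene 9.x style FloatVectorValues iterator API (on current main the values are ord-based, so the read step differs); the field name "vector" and the choice of similarity function are assumptions:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

final class OversampleAndRescore {
  static ScoreDoc[] search(IndexSearcher searcher, float[] query, int k, float oversample)
      throws IOException {
    int expandedK = Math.round(k * oversample);
    // 1. Approximate search over the quantized vectors, collecting k * oversample candidates.
    TopDocs approx =
        searcher.search(new KnnFloatVectorQuery("vector", query, expandedK), expandedK);
    // 2. Rescore every candidate against the raw float vectors.
    List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();
    for (ScoreDoc hit : approx.scoreDocs) {
      LeafReaderContext ctx = leaves.get(ReaderUtil.subIndex(hit.doc, leaves));
      FloatVectorValues raw = ctx.reader().getFloatVectorValues("vector");
      int leafDoc = hit.doc - ctx.docBase;
      if (raw != null && raw.advance(leafDoc) == leafDoc) {
        hit.score = VectorSimilarityFunction.DOT_PRODUCT.compare(query, raw.vectorValue());
      }
    }
    // 3. Keep only the best k after rescoring.
    ScoreDoc[] rescored = approx.scoreDocs.clone();
    Arrays.sort(rescored, (a, b) -> Float.compare(b.score, a.score));
    return Arrays.copyOf(rescored, Math.min(k, rescored.length));
  }
}
```

The KnnFloatVectorQuery here collects k * oversample candidates from the quantized index; the loop then replaces each estimated score with an exact one before truncating back to the original k.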

@ShashwatShivam

I conducted a benchmark using Cohere's 768-dimensional data. Here are the steps I followed for reproducibility:

  1. Set up the luceneutil repository following the installation instructions provided.

  2. Switch branches to this specific branch since the latest mainline branch is not compatible with the feature needed for this experiment.

  3. Change the branch of lucene_candidate to benwtrent:feature/adv-binarization-format to incorporate advanced binarization formats.

  4. Run knnPerfTest.py after specifying the document and query file paths to the stored Cohere data files. The runtime parameters were set as follows:

    • nDoc = 500,000
    • topk = 10
    • fanout = 100
    • maxConn = 32
    • beamWidth = 100
    • oversample values tested: {1, 1.5, 2, 3, 4, 5}

    I used quantizeBits = 1 for RaBitQ+HNSW and quantizeBits = 32 for regular HNSW.

A comparison was performed between HNSW and RaBitQ, and I observed the recall-latency tradeoff, which is shown in the attached image:
[attached chart: recall vs. latency for HNSW and RaBitQ]

@tanyaroosta

@gaoj0017

Thanks, Tanya @tanyaroosta, for sharing our blog about RaBitQ in this thread. I am the first author of the RaBitQ paper. I am glad to know that our RaBitQ method has been discussed in the threads here. Regarding the BBQ (Better Binary Quantization) method mentioned in these threads, my understanding is that it largely follows the framework of RaBitQ and makes some minor modifications for practical performance considerations. The claimed key features of BBQ as described in a blog from Elastic, "Better Binary Quantization (BBQ) in Lucene and Elasticsearch", e.g., normalization around a centroid, multiple error correction values, asymmetric quantization, and bit-wise operations, all originate from our RaBitQ paper.

We note that the industry quite often customizes methods from academia to better suit its applications, but it rarely gives the variant a new name and claims it as a new method. For example, the PQ and HNSW methods came from academia and have been widely adopted in the industry with some modifications, but the industry still respects their original names. We believe the same practice should be followed for RaBitQ.

In addition, we would like to share that we have extended RaBitQ to support quantization beyond 1 bit per dimension (e.g., 2-bit, 3-bit, ...). The paper on the extended RaBitQ was made available in September 2024. It achieves this by constructing a larger codebook than that of RaBitQ and can be equivalently understood as an optimized scalar quantization method. For details, please refer to the paper and also a blog that we have recently posted.


Successfully merging this pull request may close these issues.

Add higher quantization level for kNN vector search
10 participants