Add a Better Binary Quantizer format for dense vectors #13651
base: main
Conversation
I was thinking about adding centroid mappings as LongValues at the end of the meta file, but this could potentially make the meta file quite large (for 100M docs, we would need an extra 100MB). We really try to keep meta files small, so I would prefer either:
What do you think? For now, we keep the first (current) approach.
100MB assumes that even when compressed, it's a single byte per centroid. 100M vectors might only have 2 centroids and thus only need two bits to store. Also, I would expect the centroids to be at the end of the "veb" file, not the metadata, like we already do for the sparse vector ord-to-doc resolution. But either solution needs testing for sure.
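For illustration, here is a minimal sketch (not the PR's actual code) of the second approach: bit-packing each vector's centroid ordinal at the end of the .veb data file with Lucene's `DirectWriter`. The class, method, and parameter names are hypothetical; the point is that with very few centroids the mapping costs only a bit or two per vector rather than a full byte.

```java
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.util.packed.DirectWriter;

final class CentroidOrdWriter {
  /** Hypothetical helper: bit-packs each vector's centroid ordinal into the .veb output. */
  static void writeCentroidOrds(IndexOutput vebOut, int[] vectorCentroidOrds, int numCentroids)
      throws IOException {
    if (numCentroids <= 1) {
      return; // a single centroid needs no per-vector mapping at all
    }
    // e.g. 2 centroids -> 1 bit per vector, 16 centroids -> 4 bits; never a full byte wasted
    int bitsPerOrd = DirectWriter.bitsRequired(numCentroids - 1);
    DirectWriter writer = DirectWriter.getInstance(vebOut, vectorCentroidOrds.length, bitsPerOrd);
    for (int ord : vectorCentroidOrds) {
      writer.add(ord);
    }
    writer.finish();
  }
}
```

Reading the mapping back would then use the matching `DirectReader` over the same region, so the meta file only needs to record where the packed ordinals start.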
…t/lucene into feature/adv-binarization-format
LGTM
Here is some Lucene Util benchmarking. Some of these numbers actually contradict some of my previous benchmarking for int4, which is frustrating; I wonder what I did wrong then or now. Or maybe float32 got faster between then and now :) Regardless, this shows that bit quantization is generally as fast as int4 search or faster, and you can get good recall with some oversampling. Combined with the 32x reduction in space, it's pretty nice. The oversampling rates were:

[Benchmark tables elided: Cohere v2 1M, Cohere v3 1024 1M, e5Small]
lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorUtilSupport.java
LGTM
* <li><b>vint</b> the vector dimensions
* <li><b>vlong</b> the offset to the vector data in the .veb file
* <li><b>vlong</b> the length of the vector data in the .veb file
* <li><b>vint</b> the number of vectors
Also:
<li><b>[float]</b> clusterCenter
<li><b>int</b> dotProduct of clusterCenter with itself
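To make the layout above concrete, here is a hedged sketch of writing one per-field meta entry in that order, including the additions suggested in this comment (the cluster center and its self dot product). The method and variable names are illustrative, and the exact encoding used in the PR (for example, how the dot product is written) may differ.

```java
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

final class FieldMetaSketch {
  static void writeFieldMeta(IndexOutput meta, int dims, long vectorDataOffset,
      long vectorDataLength, int numVectors, float[] clusterCenter, float centerDotProduct)
      throws IOException {
    meta.writeVInt(dims);              // vint: the vector dimensions
    meta.writeVLong(vectorDataOffset); // vlong: offset of the vector data in the .veb file
    meta.writeVLong(vectorDataLength); // vlong: length of the vector data in the .veb file
    meta.writeVInt(numVectors);        // vint: the number of vectors
    for (float component : clusterCenter) { // [float]: clusterCenter
      meta.writeInt(Float.floatToIntBits(component));
    }
    // int: dot product of clusterCenter with itself (written here as raw float bits)
    meta.writeInt(Float.floatToIntBits(centerDotProduct));
  }
}
```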
Amazing work! Thanks Ben and the team!
Hey @ShashwatShivam mikemccand/luceneutil@main...benwtrent:luceneutil:bbq that is the testing script I use. But if Lucene has since been updated with a 101 codec, I would need to update this branch.
@benwtrent thanks for giving the link to the testing script, it works! One question: the index size it reports is larger than the HNSW index size. For example, I was working with a Cohere 768-dim dataset with 500k docs, and the index sizes were 1488.83 MB and 1544.79 MB for HNSW and RaBitQ (Lucene101HnswBinaryQuantizedVectorsFormat) respectively, which seems incorrect. Could you please tell me why this discrepancy occurs, if you've seen this issue before?
@ShashwatShivam why do you think the index size (total size of all the files) should be smaller? We store the binary quantized vectors and the floating point vectors. So, I would expect about a 5% increase in disk size from the vectors alone. I have also noticed that the HNSW graph itself ends up being more densely connected, but this is only a marginal increase in disk space as well.
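A rough back-of-envelope check, assuming float32 raw vectors and 1 bit per dimension for the quantized copy: 500,000 × 768 × 4 bytes ≈ 1.43 GiB of raw vectors, versus about 500,000 × 768 / 8 bytes ≈ 46 MB of binary-quantized vectors plus a few corrective floats per vector. Keeping both copies therefore adds only a few percent on top of the raw vectors and the graph, which is consistent with the ~56 MB difference reported above.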
@benwtrent makes sense, I wasn't accounting for the fact that the floating point vectors are being stored too. I guess I should have instead asked how to reproduce the 'memory required' column, which shows a marked reduction for 1-bit quantization vs. raw?
@ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed them together.
Hey @benwtrent,
Once this cool change is merged, let's fix
Quick update: we have been bothered by some of the numbers (for example, models like "gist" perform poorly), and we have some improvements to get done first before flipping back to "ready for review".

@mikemccand YES! That would be great! "Memory required" would be the quantized file size + HNSW graph file size (if the graph exists).

Sorry for the late reply. There are no "out of the box" rescoring actions directly in Lucene, mainly because the individual tools are (mostly) already available to you. You can ask for more overall vectors with one query, and then rescore the individual documents according to the raw vector comparisons. I admit this requires some Lucene API know-how. It would be good for a "vector scorer" to indicate whether it's an estimation or not, to allow for smarter actions in the kNN doc collector...
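As an illustration of that oversample-then-rescore pattern (my sketch, not an out-of-the-box Lucene API): collect k × oversample approximate hits from the quantized index, then rescore them against the raw float vectors. This assumes the Lucene 9.x-style `FloatVectorValues` iterator API, and the field name, similarity function, and oversample factor are arbitrary choices.

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

final class OversampleRescore {
  static ScoreDoc[] oversampleAndRescore(IndexSearcher searcher, String field, float[] query,
      int k, int oversample) throws IOException {
    // 1. Ask the (quantized) index for k * oversample approximate hits.
    TopDocs approx =
        searcher.search(new KnnFloatVectorQuery(field, query, k * oversample), k * oversample);
    // 2. Rescore every hit against its raw float vector.
    //    (A real implementation would sort hits by doc and reuse one iterator per segment.)
    for (ScoreDoc hit : approx.scoreDocs) {
      int leafIndex = ReaderUtil.subIndex(hit.doc, searcher.getIndexReader().leaves());
      LeafReaderContext leaf = searcher.getIndexReader().leaves().get(leafIndex);
      FloatVectorValues raw = leaf.reader().getFloatVectorValues(field);
      int target = hit.doc - leaf.docBase;
      if (raw != null && raw.advance(target) == target) {
        hit.score = VectorSimilarityFunction.DOT_PRODUCT.compare(query, raw.vectorValue());
      }
    }
    // 3. Keep the best k hits by the exact score.
    Arrays.sort(approx.scoreDocs, (a, b) -> Float.compare(b.score, a.score));
    return Arrays.copyOf(approx.scoreDocs, Math.min(k, approx.scoreDocs.length));
  }
}
```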
I conducted a benchmark using Cohere's 768-dimensional data. Here are the steps I followed for reproducibility:
A comparison was performed between HNSW and RaBitQ, and I observed the recall-latency tradeoff, which is shown in the attached image:
FYI, a blog post on RaBitQ: https://dev.to/gaoj0017/quantization-in-the-counterintuitive-high-dimensional-space-4feg |
Thanks, Tanya @tanyaroosta, for sharing our blog about RaBitQ in this thread. I am the first author of the RaBitQ paper. I am glad to know that our RaBitQ method has been discussed in the threads here.

Regarding the BBQ (Better Binary Quantization) method mentioned in these threads, my understanding is that it largely follows the framework of RaBitQ and makes some minor modifications for practical performance considerations. The claimed key features of BBQ as described in a blog from Elastic - Better Binary Quantization (BBQ) in Lucene and Elasticsearch - e.g., normalization around a centroid, multiple error correction values, asymmetric quantization, and bit-wise operations, all originate from our RaBitQ paper. We note that the industry quite often customizes methods from academia to better suit its applications, but it rarely gives the variant a new name and claims it as a new method. For example, the PQ and HNSW methods are from academia and have been widely adopted in the industry with some modifications, but the industry still respects their original names. We believe the same practice should be followed for RaBitQ.

In addition, we would like to share that we have extended RaBitQ to support quantization beyond 1 bit per dimension (e.g., 2-bit, 3-bit, …). The paper on the extended RaBitQ was made available in September 2024. It achieves this by constructing a larger codebook than that of RaBitQ and can be equivalently understood as an optimized scalar quantization method. For details, please refer to the paper and also a blog that we have recently posted.
High-level design
RaBitQ is basically a better binary quantization, which works across all the models we have tested against. Like PQ, it does require coarse-grained clustering to be effective at higher vector densities (effective being defined as requiring only 5x or lower oversampling for recall > 95%). But in our testing, the number of vectors per cluster can be exceptionally large (10s to 100s of millions). See the assignment sketch below.
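For intuition, here is a minimal, hypothetical sketch of the coarse-grained assignment step: each vector is mapped to its nearest centroid before binary quantization. The centroid computation itself (e.g. k-means) is out of scope here, and none of these names come from the PR.

```java
import org.apache.lucene.util.VectorUtil;

final class CoarseAssignment {
  /** Returns the index of the centroid closest (by squared L2 distance) to the vector. */
  static int nearestCentroid(float[] vector, float[][] centroids) {
    int best = 0;
    float bestDistance = Float.POSITIVE_INFINITY;
    for (int i = 0; i < centroids.length; i++) {
      float d = VectorUtil.squareDistance(vector, centroids[i]);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return best;
  }
}
```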
The Euclidean vectors as stored in the index:
For dot-product vectors:
The vector metadata, in addition to all the regular things (similarity, encoding, sparse vector DISI, etc.).
For indexing into HNSW we actually have a multi-step process. Better binary encodes the query vectors differently than the index vectors. Consequently, during segment merge & HNSW building, another temporary file is written containing the query-quantized vectors over the configured centroids. One downside is that this temporary file will actually be larger than the regular vector index. This is because we use asymmetric quantization to keep good information around. But once the merge is complete, this file is deleted.
We then read from the temporary query file when adding a vector to the graph; when exploring HNSW, we search the indexed quantized values.
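A very rough sketch of that merge flow, with hypothetical class and method names (the PR's actual classes differ): write the wider query-side encodings to a temporary file, build the graph by scoring them against the 1-bit index-side encodings, then delete the temporary file.

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

abstract class BbqMergeSketch {
  // hypothetical: produce the wider (asymmetric) query-side encoding of v around the centroid
  protected abstract byte[] quantizeForQuery(float[] v, float[] centroid);

  // hypothetical: build the HNSW graph, scoring query-side encodings read from the temp file
  // against the 1-bit index-side encodings already written to the .veb file
  protected abstract void buildHnswGraph(IndexInput queryEncodings) throws IOException;

  void mergeAndBuildGraph(Directory dir, List<float[]> mergedVectors, float[] centroid)
      throws IOException {
    String tempName;
    try (IndexOutput temp = dir.createTempOutput("bbq_query_vectors", "tmp", IOContext.DEFAULT)) {
      tempName = temp.getName();
      for (float[] v : mergedVectors) {
        byte[] queryEncoded = quantizeForQuery(v, centroid);
        temp.writeBytes(queryEncoded, queryEncoded.length);
      }
    }
    try (IndexInput queryEncodings = dir.openInput(tempName, IOContext.READONCE)) {
      buildHnswGraph(queryEncodings);
    } finally {
      dir.deleteFile(tempName); // the temporary file goes away once the merge completes
    }
  }
}
```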
closes: #13650