Add a Better Binary Quantizer format for dense vectors #13651
base: main
Conversation
I was thinking about adding centroid mappings as LongValues at the end of the meta file, but this could potentially make the meta file quite large (for 100M docs, we would need an extra 100MB). We really try to keep meta files small, so I would prefer either:
What do you think? For now, we keep the 1st (current) approach.
100MB assumes that even when compressed, it's a single byte per centroid. 100M vectors might only have 2 centroids and thus only need two bits to store. Also, I would expect the centroids to be at the end of the "veb" file, not the metadata, like we already do for the sparse vector ord-to-doc resolution. But either solution needs testing for sure.
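To make the size argument concrete, here is a minimal, hypothetical sketch (not the PR's code; the class and method names are illustrative only) of packing one centroid ordinal per vector with the minimum number of bits:

```java
// Hypothetical sketch: pack one centroid ordinal per vector using the minimum
// number of bits, to illustrate why the mapping can stay well below 1 byte/vector
// (e.g. 100M vectors over 16 centroids -> 4 bits each ≈ 50 MB).
final class CentroidOrdinalPacking {

  /** Bits needed to store ordinals in [0, numCentroids). */
  static int bitsPerOrdinal(int numCentroids) {
    return numCentroids <= 1 ? 0 : 32 - Integer.numberOfLeadingZeros(numCentroids - 1);
  }

  /** Packs the per-vector centroid ordinals into a long[] (little-endian bit order). */
  static long[] pack(int[] centroidOrdinals, int numCentroids) {
    int bits = bitsPerOrdinal(numCentroids);
    long totalBits = (long) centroidOrdinals.length * bits;
    long[] packed = new long[(int) ((totalBits + 63) / 64)];
    long bitPos = 0;
    for (int ord : centroidOrdinals) {
      for (int b = 0; b < bits; b++, bitPos++) {
        if (((ord >>> b) & 1) != 0) {
          packed[(int) (bitPos >>> 6)] |= 1L << (bitPos & 63);
        }
      }
    }
    return packed;
  }
}
```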
I will open a PR against Lucene Util to update it to utilize these formats and show y'all some runs with it soon. But the PR is ready for general review.
LGTM
Here is some Lucene Util benchmarking. Some of these numbers actually contradict some of my previous benchmarking for int4, which is frustrating; I wonder what I did wrong then or now. Or maybe float32 got faster between then and now :) Regardless, this shows that bit quantization is generally as fast as int4 search or faster, and you can get good recall with some oversampling. Combined with the 32x reduction in space, it's pretty nice.
[Benchmark tables for Cohere v2 1M, Cohere v3 1024 (1M), and e5Small, including the oversampling rates used, not recovered.]
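For readers unfamiliar with the oversampling knob in these runs, here is a rough sketch of oversampled quantized search followed by full-precision rescoring. This is not luceneutil's or Lucene's actual API; `QuantizedIndex`, `RawVectors`, and `Hit` are hypothetical stand-ins.

```java
import java.util.Comparator;
import java.util.List;

record Hit(int docId, float score) {}

final class OversampledSearch {

  static List<Hit> search(float[] query, int k, float oversample,
                          QuantizedIndex index, RawVectors rawVectors) {
    // 1. Ask the quantized (1-bit) index for more candidates than we need.
    int candidates = Math.round(k * oversample);
    List<Hit> approx = index.approximateSearch(query, candidates);

    // 2. Rescore candidates with the original float32 vectors and keep the top k.
    return approx.stream()
        .map(h -> new Hit(h.docId(), rawVectors.fullPrecisionScore(query, h.docId())))
        .sorted(Comparator.comparingDouble(Hit::score).reversed())
        .limit(k)
        .toList();
  }

  interface QuantizedIndex {
    List<Hit> approximateSearch(float[] query, int candidates);
  }

  interface RawVectors {
    float fullPrecisionScore(float[] query, int docId);
  }
}
```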
lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorUtilSupport.java
LGTM
* <li><b>vint</b> the vector dimensions
* <li><b>vlong</b> the offset to the vector data in the .veb file
* <li><b>vlong</b> the length of the vector data in the .veb file
* <li><b>vint</b> the number of vectors
Also:
<li><b>[float]</b> clusterCenter
<li><b>int</b> dotProduct of clusterCenter with itself
Amazing work! Thanks Ben and the team!
Hey @ShashwatShivam, mikemccand/luceneutil@main...benwtrent:luceneutil:bbq is the testing script I use. But if Lucene has since been updated with a 101 codec, I would need to update this branch.
@benwtrent thanks for giving the link to the testing script, it works! One question - the index size it reports is larger than the HNSW index size. For example, I was working with a Cohere 768-dim dataset with 500k docs, and the index sizes were 1488.83 MB for HNSW and 1544.79 MB for RaBitQ (Lucene101HnswBinaryQuantizedVectorsFormat), which seems incorrect. Could you please tell me why this discrepancy occurs, if you've seen this issue before?
@ShashwatShivam why do you think the index size (total size of all the files) should be smaller? We store both the binary quantized vectors and the floating point vectors, so I would expect about a 5% increase in disk size from the vectors alone. I have also noticed that the HNSW graph itself ends up being more densely connected, but that is only a marginal increase in disk space as well.
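For rough intuition (assuming 4-byte floats, 768 dimensions, 500k docs, and ignoring per-vector corrective factors and the graph): the raw float vectors take 500,000 × 768 × 4 B ≈ 1465 MiB, while the extra 1-bit copies add only 500,000 × 768 / 8 B ≈ 46 MiB, i.e. a bit over 3%, which is in the same ballpark as the ~56 MB difference reported above.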
@benwtrent makes sense, I wasn't accounting for the fact that the floating point vectors are being stored too. I guess I should have instead asked how to reproduce the 'memory required' column, which shows a marked reduction for 1-bit quantization vs. raw?
@ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed them together.
Hey @benwtrent,
High-level design
RaBitQ is basically a better binary quantization, which works across all the models we have tested against. Like PQ, it does require coarse-grained clustering to be effective at higher vector densities (effective being defined as requiring only 5x or lower oversampling for recall > 95%). But in our testing, the number of vectors required per cluster can be exceptionally large (10s to 100s of millions).
The euclidean vectors as stored in the index:
For dot-product vectors:
The vector metadata stores, in addition to all the regular things (similarity, encoding, sparse vector DISI, etc.), the cluster center and the dot product of the cluster center with itself.
For indexing into HNSW, we actually have a multi-step process. Better binary quantization encodes the query vectors differently than the index vectors. Consequently, during segment merge and HNSW building, another temporary file is written containing the query-quantized vectors over the configured centroids. One downside is that this temporary file will actually be larger than the regular vector index; this is because we use asymmetric quantization to keep good information around. But once the merge is complete, this file is deleted.
We then read from the temporary query file when adding a vector to the graph, and when exploring HNSW we search against the indexed quantized values.
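As a rough sketch of the kind of asymmetric comparison involved (not the actual scorer, which also applies corrective factors, and assuming for illustration a 4-bit query quantization stored as packed bit planes against a 1-bit indexed vector):

```java
// Minimal sketch: the query is quantized to 4 bits per dimension and stored as 4 packed
// bit planes, while the indexed vector keeps 1 bit per dimension. The inner product then
// reduces to a few popcounts; corrective factors (centroid distances, norms, etc.) are omitted.
final class AsymmetricBitScore {

  /**
   * @param queryPlanes 4 bit planes of the quantized query, each packed into longs
   *                    (plane p holds bit p of every dimension's 4-bit value)
   * @param docBits     the 1-bit indexed vector, packed into longs
   * @return the integer inner product between the 4-bit query and the 1-bit doc vector
   */
  static long innerProduct(long[][] queryPlanes, long[] docBits) {
    long total = 0;
    for (int plane = 0; plane < queryPlanes.length; plane++) {
      long planeSum = 0;
      for (int i = 0; i < docBits.length; i++) {
        planeSum += Long.bitCount(queryPlanes[plane][i] & docBits[i]);
      }
      total += planeSum << plane; // bit plane p contributes with weight 2^p
    }
    return total;
  }
}
```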
closes: #13650