Add a Better Binary Quantizer format for dense vectors #13651

Draft
wants to merge 163 commits into
base: main
Changes from 159 commits
2c4cca9
iter
benwtrent Aug 12, 2024
d8f1aae
iter
benwtrent Aug 12, 2024
20aa776
iter
benwtrent Aug 12, 2024
df54dde
iter
benwtrent Aug 13, 2024
1b31e3e
iter
benwtrent Aug 13, 2024
9d783ff
iter
benwtrent Aug 13, 2024
3415d52
iter
benwtrent Aug 13, 2024
01acdf2
fleshed out a basic binary quantizer class; needs cleanup/iter
john-wagster Aug 13, 2024
1bf59f4
fleshed out a basic binary quantizer class; needs cleanup/iter
john-wagster Aug 14, 2024
dc0e2aa
iter
benwtrent Aug 14, 2024
71cf39a
iter
benwtrent Aug 14, 2024
938d0ad
iter
benwtrent Aug 14, 2024
91cf834
bin quantizer; cleanup/iter
john-wagster Aug 14, 2024
f6e71d7
iter
benwtrent Aug 15, 2024
d84064a
bin scorer; cleanup/iter
john-wagster Aug 15, 2024
b05f906
bin scorer; cleanup/iter
john-wagster Aug 16, 2024
ecdcd4f
Correct errors in format reading
mayya-sharipova Aug 19, 2024
c56990e
More corrections in format
mayya-sharipova Aug 20, 2024
2499263
bin scorer; cleanup/iter
john-wagster Aug 20, 2024
56b133b
Better centroid re-calculation based on weighted sum
mayya-sharipova Aug 21, 2024
8f4f935
bin scorer; cleanup/iter
john-wagster Aug 22, 2024
0c4d66b
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
john-wagster Aug 22, 2024
19953fc
Merge branch 'main' into feature/adv-binarization-format
ChrisHegarty Aug 22, 2024
fb5faea
remove export from sandbox module-info
ChrisHegarty Aug 22, 2024
c8d295b
fix warnings: unused, forbidden, lint, headers, etc
ChrisHegarty Aug 22, 2024
88f0219
spotless
ChrisHegarty Aug 22, 2024
f2d2896
vectorize ipByteBin on ARM
ChrisHegarty Aug 22, 2024
5e87c1e
bin scorer; cleanup/iter; merged
john-wagster Aug 22, 2024
8a9a827
format cleanup
ChrisHegarty Aug 22, 2024
2163490
Merge remote-tracking branch 'benwtrent/feature/adv-binarization-form…
ChrisHegarty Aug 22, 2024
9c6f02c
bin scorer; cleanup/iter - fixed bad padding and got assertion check …
john-wagster Aug 22, 2024
11880e6
Address when number of centroids > 1
mayya-sharipova Aug 22, 2024
831ff25
Spotless
mayya-sharipova Aug 22, 2024
bc92a2e
bin scorer; cleanup/iter
john-wagster Aug 22, 2024
4993087
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
john-wagster Aug 22, 2024
37a541d
bin scorer; cleanup/iter - additional fixmes and cleanup
john-wagster Aug 22, 2024
32241b8
bin scorer; cleanup/iter - setting up for tests
john-wagster Aug 22, 2024
422406a
bin scorer; cleanup/iter - setting up for tests
john-wagster Aug 23, 2024
1e0c321
bin scorer; cleanup/iter - setting up for tests
john-wagster Aug 23, 2024
f4a44fe
bin scorer; cleanup/iter - setting up for tests
john-wagster Aug 23, 2024
43079e0
bin scorer; cleanup/iter - got very basic euclidian tests working
john-wagster Aug 23, 2024
8dfa060
bin scorer; cleanup/iter - spotless
john-wagster Aug 23, 2024
3a16d80
test Panama and default impls of ipByteBin
ChrisHegarty Aug 23, 2024
dd8348a
add boundary value test for ipByteBin
ChrisHegarty Aug 23, 2024
90febd9
more ipByteBin tests
ChrisHegarty Aug 23, 2024
5004405
bin scorer; cleanup/iter - test fixes and clean
john-wagster Aug 23, 2024
1be3f39
Testing multiple clusters
mayya-sharipova Aug 23, 2024
b4937c9
bin scorer; cleanup/iter - introduce mip throughout the reader, write…
john-wagster Aug 24, 2024
3f01539
bin scorer; cleanup/iter - introduce mip throughout the reader, write…
john-wagster Aug 24, 2024
0cbd0f8
bin scorer; cleanup/iter - introduce MIP tests
john-wagster Aug 24, 2024
680d5b0
bin scorer; cleanup/iter - introduce MIP tests
john-wagster Aug 24, 2024
a523661
bin scorer; cleanup/iter - MIP tests working
john-wagster Aug 24, 2024
1b3b7e8
panama128 minor cleanup
ChrisHegarty Aug 26, 2024
d6fc7ce
Fix some errors in HNSW format
mayya-sharipova Aug 26, 2024
665c3dd
Fix another error
mayya-sharipova Aug 26, 2024
abf81ef
Minor test fix
mayya-sharipova Aug 26, 2024
ae429cc
simplify 128 and add 256 panama impls
ChrisHegarty Aug 27, 2024
bcd7037
bin scorer; cleanup/iter - got tests working?
john-wagster Aug 27, 2024
006aa07
Make clusterID of type short, handle multiple clusters during scoring
mayya-sharipova Aug 27, 2024
6c413b3
bin scorer; cleanup/iter - added cache of target factors
john-wagster Aug 28, 2024
d762ddf
bin scorer; cleanup/iter - clean up
john-wagster Aug 28, 2024
447b3df
Fix error in offheap vector values
mayya-sharipova Aug 28, 2024
8eb5163
bin scorer; cleanup/iter - no lru for now
john-wagster Aug 28, 2024
8fa9a41
bin scorer; cleanup/iter - no cache and added tests
john-wagster Aug 29, 2024
6c5b980
Small modifications to tests
mayya-sharipova Aug 29, 2024
2b6a066
Addressing precommit errors
mayya-sharipova Aug 29, 2024
623ec3d
Add basic documentation for build
mayya-sharipova Aug 29, 2024
82a9498
bin scorer; cleanup/iter - clean up, fixes, and added some temporary …
john-wagster Aug 30, 2024
e68a121
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
john-wagster Aug 30, 2024
23c18af
Fix build failures
mayya-sharipova Aug 30, 2024
2311e0c
bin scorer; cleanup/iter - minor clean up, fixes
john-wagster Sep 1, 2024
1246fba
merge
john-wagster Sep 1, 2024
c426ed0
spotless
john-wagster Sep 1, 2024
5c90e00
fixed test
john-wagster Sep 1, 2024
274fdad
Merge branch 'main' into feature/adv-binarization-format
ChrisHegarty Sep 3, 2024
6ac59b9
Remove default posting format override
ChrisHegarty Sep 3, 2024
b485408
spotless
ChrisHegarty Sep 3, 2024
7788699
Make default number of vectors per cluster static
mayya-sharipova Sep 3, 2024
bd22a92
Add search for Lucene912BinaryQuantizedVectorsReader
mayya-sharipova Sep 3, 2024
df3075d
optimization
benwtrent Sep 3, 2024
5318f10
Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz…
benwtrent Sep 3, 2024
5967fbd
bin scorer; cleanup/iter - mip fixes scores recovered
john-wagster Sep 4, 2024
8d7693a
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
john-wagster Sep 4, 2024
1cb15ab
more fixes
benwtrent Sep 4, 2024
93c252c
Add debug information to writer
mayya-sharipova Sep 4, 2024
42e27cb
adj clustering
benwtrent Sep 4, 2024
f169399
Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz…
benwtrent Sep 4, 2024
27520ba
Fix error of quantizing each query vector separately
mayya-sharipova Sep 4, 2024
b782888
bin scorer; cleanup/iter - only store the set amount of corrective va…
john-wagster Sep 4, 2024
42caf1d
iter
benwtrent Sep 5, 2024
cf54e64
Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz…
benwtrent Sep 5, 2024
a1f99f0
iter
benwtrent Sep 5, 2024
99f88d1
Tidying
mayya-sharipova Sep 5, 2024
e25107e
Correct how query quantized vectors are accessed in the case of multi…
mayya-sharipova Sep 5, 2024
c783378
fixed how errorbounds are calculated and added mip error bounds calc
john-wagster Sep 5, 2024
4bb5de0
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
john-wagster Sep 5, 2024
8f0755e
fixed small bug and test
john-wagster Sep 5, 2024
1a1144b
fixed test now that corrective factors are dynamic
john-wagster Sep 5, 2024
f213606
Temprorarily comment out the test about number of vectors in cluster
mayya-sharipova Sep 5, 2024
22c61a1
Fix test with corrections
mayya-sharipova Sep 5, 2024
ca53157
adjusting centroid storage
benwtrent Sep 5, 2024
d60bb48
Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz…
benwtrent Sep 5, 2024
c918f1f
fixing some tests
benwtrent Sep 5, 2024
a17f0fd
reverting unnecessary change
benwtrent Sep 5, 2024
2fd2c3a
more corrective factor cleanup
john-wagster Sep 5, 2024
961813b
Spotless
mayya-sharipova Sep 6, 2024
594e427
Correct the test to account some wrong assignment of centroids
mayya-sharipova Sep 6, 2024
39de717
updating testbinaryquantization
john-wagster Sep 6, 2024
13520b5
merging
john-wagster Sep 6, 2024
4a899b5
spotless
john-wagster Sep 6, 2024
febf6cb
Fixing scoring
benwtrent Sep 6, 2024
e5d5db5
Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz…
benwtrent Sep 6, 2024
bb706d1
store self centroid dot product alongside each centroid
tteofili Sep 9, 2024
db3d7a9
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
tteofili Sep 9, 2024
8d0a989
Add basic unit test coverage for BQVectorUtils
ChrisHegarty Sep 9, 2024
5bf8dcc
fixing cosine & dp
benwtrent Sep 9, 2024
998f596
Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz…
benwtrent Sep 9, 2024
e1ca1bf
iter
benwtrent Sep 9, 2024
4f463e8
fix error correction for euclidean
benwtrent Sep 9, 2024
5e62f06
clean up
john-wagster Sep 9, 2024
eff98d7
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
john-wagster Sep 9, 2024
6ba73d8
precision
john-wagster Sep 9, 2024
93e2229
fixed tests
john-wagster Sep 9, 2024
01a2719
Fixing scoring to avoid NaN
benwtrent Sep 9, 2024
33888d3
normalize merged centroids
benwtrent Sep 9, 2024
1497e62
removing bias change
benwtrent Sep 9, 2024
c3f067f
fixed a bug in the ipbytebin dims check which was bypassing panama
john-wagster Sep 11, 2024
2389442
Merge branch 'main' into feature/adv-binarization-format
john-wagster Sep 11, 2024
52d39fd
fixing Search to respect updated interface
john-wagster Sep 11, 2024
31d9634
updating since Records were added
john-wagster Sep 11, 2024
2b55ca9
no-commit add more sandbox helpers
benwtrent Sep 12, 2024
5a3bbd6
fixing cosine & dimension padding handling
benwtrent Sep 12, 2024
fd8e7db
Normalize vectors before clustering for COSINE similarity
mayya-sharipova Sep 17, 2024
f30ee8c
Correct error
mayya-sharipova Sep 17, 2024
9f8108c
Spotless
mayya-sharipova Sep 17, 2024
48a8bd0
Corrections:
mayya-sharipova Sep 17, 2024
3acf852
Fixing centroid merge
benwtrent Sep 17, 2024
4f81956
Cast to long when multiplyExact to avoid integer overflow
mayya-sharipova Sep 18, 2024
0835357
adjusting clustering limitations
benwtrent Sep 19, 2024
c0654ee
fixing ip binning
benwtrent Sep 19, 2024
4bf934d
set minimum to 1M vectors per cluster
benwtrent Sep 19, 2024
6e3f5f8
fixing cdotc storage etc.
benwtrent Sep 20, 2024
9e3b099
Fixing more cdotc optimizations
benwtrent Sep 20, 2024
979caf6
removing unnecessary todo comments
benwtrent Sep 20, 2024
2de38ba
fixing tests
benwtrent Sep 20, 2024
dd0033d
removing multiple centroid support
benwtrent Sep 23, 2024
41fce8d
make merging faster
benwtrent Sep 24, 2024
f9a3fbd
removing unused code
benwtrent Sep 25, 2024
183104d
adjusting unused files
benwtrent Oct 16, 2024
6c1577f
removing unnecessary changes and files
benwtrent Oct 16, 2024
08cd4fa
Merge remote-tracking branch 'upstream/main' into feature/adv-binariz…
benwtrent Oct 16, 2024
e85736d
iter
benwtrent Oct 17, 2024
714531f
Merge remote-tracking branch 'upstream/main' into feature/adv-binariz…
benwtrent Oct 17, 2024
c0abb06
iter
benwtrent Oct 17, 2024
0b821a4
more clean up
benwtrent Oct 17, 2024
3b67850
we did it
benwtrent Oct 18, 2024
9fa97fb
adding CHANGES
benwtrent Oct 18, 2024
1f2f41c
adj changes
benwtrent Oct 18, 2024
f4bef77
fixing up docs
benwtrent Oct 18, 2024
f7b0ec0
addressing pr comments
benwtrent Oct 22, 2024
f903e00
adjusting tests
benwtrent Oct 22, 2024
12340ac
Merge remote-tracking branch 'upstream/main' into feature/adv-binariz…
benwtrent Nov 6, 2024
e562aca
merging in main, fixing tests
benwtrent Nov 6, 2024
7 changes: 6 additions & 1 deletion lucene/CHANGES.txt
@@ -39,7 +39,12 @@ API Changes

New Features
---------------------
(No changes)

* GITHUB#13651: New binary quantized vector formats `Lucene101HnswBinaryQuantizedVectorsFormat` and
`Lucene101BinaryQuantizedVectorsFormat`. These give a 32x reduction in memory requirements for fast vector search
while retaining good recall, needing only about 5x oversampling with rescoring on higher-dimensional vectors.
The format is based on the RaBitQ algorithm and paper: https://arxiv.org/abs/2405.12497.
(John Wagster, Mayya Sharipova, Chris Hegarty, Tom Veasey, Ben Trent)

Improvements
---------------------
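The 32x figure in the CHANGES entry follows from storing one bit per dimension instead of a 32-bit float. A minimal sketch of that arithmetic is below; the class name and the 1024-dimension example are hypothetical, and the count ignores the per-vector corrective terms and dimension padding that this PR also stores, so the real ratio is slightly lower.

```java
// Sketch only: per-vector storage for raw float32 values versus 1-bit codes.
final class QuantizedFootprintSketch {
  /** Bytes needed to store one vector as float32 values. */
  static long floatBytes(int dims) {
    return (long) dims * Float.BYTES;
  }

  /** Bytes needed to store one bit per dimension, rounded up to whole bytes. */
  static long binaryBytes(int dims) {
    return (dims + 7L) / 8L;
  }

  public static void main(String[] args) {
    int dims = 1024; // hypothetical embedding size
    System.out.println(floatBytes(dims) + " bytes as float32");
    System.out.println(binaryBytes(dims) + " bytes as 1-bit codes");
    System.out.println(floatBytes(dims) / binaryBytes(dims) + "x smaller");
  }
}
```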
5 changes: 4 additions & 1 deletion lucene/core/src/java/module-info.java
@@ -35,6 +35,7 @@
exports org.apache.lucene.codecs.lucene99;
exports org.apache.lucene.codecs.lucene912;
exports org.apache.lucene.codecs.lucene100;
exports org.apache.lucene.codecs.lucene101;
exports org.apache.lucene.codecs.perfield;
exports org.apache.lucene.codecs;
exports org.apache.lucene.document;
@@ -79,7 +80,9 @@
provides org.apache.lucene.codecs.KnnVectorsFormat with
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat,
org.apache.lucene.codecs.lucene99.Lucene99HnswScalarQuantizedVectorsFormat,
org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat;
org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat,
org.apache.lucene.codecs.lucene101.Lucene101BinaryQuantizedVectorsFormat,
org.apache.lucene.codecs.lucene101.Lucene101HnswBinaryQuantizedVectorsFormat;
provides org.apache.lucene.codecs.PostingsFormat with
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat;
provides org.apache.lucene.index.SortFieldProvider with
@@ -0,0 +1,94 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.codecs.lucene101;

import static org.apache.lucene.util.quantization.BQSpaceUtils.constSqrt;

import java.io.IOException;
import org.apache.lucene.index.ByteVectorValues;
import org.apache.lucene.search.VectorScorer;
import org.apache.lucene.util.VectorUtil;
import org.apache.lucene.util.quantization.BQSpaceUtils;
import org.apache.lucene.util.quantization.BinaryQuantizer;

/**
* A version of {@link ByteVectorValues} that additionally exposes the score correction terms
* needed when scoring binary quantized vectors.
*
* @lucene.experimental
*/
public abstract class BinarizedByteVectorValues extends ByteVectorValues {

/**
* Retrieve the corrective terms for the given vector ordinal. For the dot-product family of
* distances, the corrective terms are, in order:
*
* <ul>
* <li>the dot-product of the normalized, centered vector with its binarized self
* <li>the norm of the centered vector
* <li>the dot-product of the vector with the centroid
* </ul>
*
* For Euclidean distance, the corrective terms are, in order:
*
* <ul>
* <li>the Euclidean distance to the centroid
* <li>the sum of the dimensions divided by the vector norm
* </ul>
*
* @param vectorOrd the vector ordinal
* @return the corrective terms
* @throws IOException if an I/O error occurs
*/
public abstract float[] getCorrectiveTerms(int vectorOrd) throws IOException;
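To make the dot-product corrective terms in the Javadoc above concrete, here is a rough sketch of how the three values could be computed for one vector. It is not the PR's `BinaryQuantizer`; the class is hypothetical, and it assumes (following the RaBitQ paper) that the binarized vector has entries of plus or minus `1/sqrt(dims)`. The variable names mirror the locals used in the scorer (`ooq`, `normOC`, `oDotC`).

```java
// Sketch only: computes {ooq, normOC, oDotC} for a single vector and centroid.
final class DotProductCorrectionsSketch {
  static float[] correctiveTerms(float[] vector, float[] centroid) {
    int dims = vector.length;
    float[] centered = new float[dims];
    double normSq = 0;
    double oDotC = 0;
    for (int i = 0; i < dims; i++) {
      centered[i] = vector[i] - centroid[i];
      normSq += centered[i] * centered[i];
      oDotC += vector[i] * centroid[i];
    }
    double normOC = Math.sqrt(normSq);
    double ooq = 0;
    if (normOC > 0) {
      // Assumption: each binarized entry is sign(centered[i]) / sqrt(dims).
      double scale = 1.0 / Math.sqrt(dims);
      for (int i = 0; i < dims; i++) {
        double sign = centered[i] > 0 ? 1.0 : -1.0;
        ooq += (centered[i] / normOC) * sign * scale;
      }
    }
    // When the vector equals the centroid, normOC and ooq stay 0.
    return new float[] {(float) ooq, (float) normOC, (float) oDotC};
  }
}
```

A vector that coincides with the centroid yields `normOC == 0` and `ooq == 0`, which is the case the scorer guards against before dividing by `ooq`.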

/**
* @return the quantizer used to quantize the vectors
*/
public abstract BinaryQuantizer getQuantizer();

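/**
* @return the centroid that vectors are centered on before binarization
* @throws IOException if an I/O error occurs
*/
public abstract float[] getCentroid() throws IOException;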

int discretizedDimensions() {
return BQSpaceUtils.discretize(dimension(), 64);
}

float sqrtDimensions() {
return (float) constSqrt(dimension());
}

float maxX1() {
return (float) (1.9 / constSqrt(discretizedDimensions() - 1.0));
}
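The three helpers above depend on `BQSpaceUtils.discretize` and `constSqrt`, which are not shown in this diff. The sketch below assumes `discretize(value, 64)` rounds up to the next multiple of 64 and that `constSqrt` behaves like `Math.sqrt`; both are assumptions, and the class name is hypothetical.

```java
// Sketch only: stand-ins for the dimension helpers, under the assumptions stated above.
final class BinarizedDimensionsSketch {
  /** Assumed behavior of BQSpaceUtils.discretize: round value up to a multiple of bucket. */
  static int discretize(int value, int bucket) {
    return ((value + bucket - 1) / bucket) * bucket;
  }

  static float sqrtDimensions(int dims) {
    return (float) Math.sqrt(dims);
  }

  /** Same expression as maxX1() in the diff, using the padded dimension count. */
  static float maxX1(int dims) {
    return (float) (1.9 / Math.sqrt(discretize(dims, 64) - 1.0));
  }
}
```

Under these assumptions a 100-dimension vector is padded to 128 bits, so its code occupies two 64-bit words.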

/**
* Return a {@link VectorScorer} for the given query vector.
*
* @param query the query vector
* @return a {@link VectorScorer} instance or null
*/
public abstract VectorScorer scorer(float[] query) throws IOException;

@Override
public abstract BinarizedByteVectorValues copy() throws IOException;

float getCentroidDP() throws IOException {
// this only gets executed on-merge
float[] centroid = getCentroid();
return VectorUtil.dotProduct(centroid, centroid);
}
}
@@ -0,0 +1,268 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.codecs.lucene101;

import static org.apache.lucene.index.VectorSimilarityFunction.COSINE;
import static org.apache.lucene.index.VectorSimilarityFunction.EUCLIDEAN;
import static org.apache.lucene.index.VectorSimilarityFunction.MAXIMUM_INNER_PRODUCT;

import java.io.IOException;
import org.apache.lucene.codecs.hnsw.FlatVectorsScorer;
import org.apache.lucene.index.KnnVectorValues;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.VectorUtil;
import org.apache.lucene.util.hnsw.RandomVectorScorer;
import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
import org.apache.lucene.util.quantization.BQSpaceUtils;
import org.apache.lucene.util.quantization.BinaryQuantizer;

/** Vector scorer over binarized vector values */
public class Lucene101BinaryFlatVectorsScorer implements FlatVectorsScorer {
private final FlatVectorsScorer nonQuantizedDelegate;

public Lucene101BinaryFlatVectorsScorer(FlatVectorsScorer nonQuantizedDelegate) {
this.nonQuantizedDelegate = nonQuantizedDelegate;
}

@Override
public RandomVectorScorerSupplier getRandomVectorScorerSupplier(
VectorSimilarityFunction similarityFunction, KnnVectorValues vectorValues)
throws IOException {
if (vectorValues instanceof BinarizedByteVectorValues) {
throw new UnsupportedOperationException(
"getRandomVectorScorerSupplier(VectorSimilarityFunction,KnnVectorValues) not implemented for binarized format");
}
return nonQuantizedDelegate.getRandomVectorScorerSupplier(similarityFunction, vectorValues);
}

@Override
public RandomVectorScorer getRandomVectorScorer(
VectorSimilarityFunction similarityFunction, KnnVectorValues vectorValues, float[] target)
throws IOException {
if (vectorValues instanceof BinarizedByteVectorValues binarizedVectors) {
BinaryQuantizer quantizer = binarizedVectors.getQuantizer();
float[] centroid = binarizedVectors.getCentroid();
// FIXME: precompute this once?
int discretizedDimensions = BQSpaceUtils.discretize(target.length, 64);
if (similarityFunction == COSINE) {
float[] copy = ArrayUtil.copyOfSubArray(target, 0, target.length);
VectorUtil.l2normalize(copy);
target = copy;
}
byte[] quantized = new byte[BQSpaceUtils.B_QUERY * discretizedDimensions / 8];
BinaryQuantizer.QueryFactors factors =
quantizer.quantizeForQuery(target, quantized, centroid);
BinaryQueryVector queryVector = new BinaryQueryVector(quantized, factors);
return new BinarizedRandomVectorScorer(queryVector, binarizedVectors, similarityFunction);
}
return nonQuantizedDelegate.getRandomVectorScorer(similarityFunction, vectorValues, target);
}
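The query-side preparation in the method above can be sketched without Lucene types. This is an illustration only: it assumes `BQSpaceUtils.B_QUERY` is 4 (bits per query dimension) and that `discretize` rounds up to a multiple of 64, neither of which is visible in this diff. It also shows why the COSINE branch normalizes a copy, so the caller's array is left untouched.

```java
// Sketch only: buffer sizing and defensive normalization for a query vector.
final class QueryPreparationSketch {
  static final int B_QUERY = 4; // assumed value of BQSpaceUtils.B_QUERY

  /** Size of the quantized query buffer: B_QUERY bits for each padded dimension. */
  static int quantizedQueryBytes(int dims) {
    int discretized = ((dims + 63) / 64) * 64;
    return B_QUERY * discretized / 8;
  }

  /** Returns an L2-normalized copy, leaving the caller's array unmodified. */
  static float[] normalizedCopy(float[] target) {
    float[] copy = target.clone();
    double normSq = 0;
    for (float v : copy) {
      normSq += v * v;
    }
    float norm = (float) Math.sqrt(normSq);
    if (norm == 0f) {
      throw new IllegalArgumentException("cannot normalize a zero-length vector");
    }
    for (int i = 0; i < copy.length; i++) {
      copy[i] /= norm;
    }
    return copy;
  }
}
```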

@Override
public RandomVectorScorer getRandomVectorScorer(
VectorSimilarityFunction similarityFunction, KnnVectorValues vectorValues, byte[] target)
throws IOException {
return nonQuantizedDelegate.getRandomVectorScorer(similarityFunction, vectorValues, target);
}

RandomVectorScorerSupplier getRandomVectorScorerSupplier(
VectorSimilarityFunction similarityFunction,
Lucene101BinaryQuantizedVectorsWriter.OffHeapBinarizedQueryVectorValues scoringVectors,
BinarizedByteVectorValues targetVectors) {
return new BinarizedRandomVectorScorerSupplier(
scoringVectors, targetVectors, similarityFunction);
}

@Override
public String toString() {
return "Lucene101BinaryFlatVectorsScorer(nonQuantizedDelegate=" + nonQuantizedDelegate + ")";
}

/** Vector scorer supplier over binarized vector values */
static class BinarizedRandomVectorScorerSupplier implements RandomVectorScorerSupplier {
private final Lucene101BinaryQuantizedVectorsWriter.OffHeapBinarizedQueryVectorValues
queryVectors;
private final BinarizedByteVectorValues targetVectors;
private final VectorSimilarityFunction similarityFunction;

BinarizedRandomVectorScorerSupplier(
Lucene101BinaryQuantizedVectorsWriter.OffHeapBinarizedQueryVectorValues queryVectors,
BinarizedByteVectorValues targetVectors,
VectorSimilarityFunction similarityFunction) {
this.queryVectors = queryVectors;
this.targetVectors = targetVectors;
this.similarityFunction = similarityFunction;
}

@Override
public RandomVectorScorer scorer(int ord) throws IOException {
byte[] vector = queryVectors.vectorValue(ord);
float[] correctiveTerms = queryVectors.getCorrectiveTerms(ord);
assert correctiveTerms.length == (similarityFunction != EUCLIDEAN ? 6 : 4);
float distanceToCentroid = correctiveTerms[0];
float lower = correctiveTerms[1];
float width = correctiveTerms[2];
final float quantizedSum;
float normVmC = 0f;
float vDotC = 0f;
if (similarityFunction != EUCLIDEAN) {
normVmC = correctiveTerms[3];
vDotC = correctiveTerms[4];
quantizedSum = correctiveTerms[5];
} else {
quantizedSum = correctiveTerms[3];
}
BinaryQueryVector binaryQueryVector =
new BinaryQueryVector(
vector,
new BinaryQuantizer.QueryFactors(
quantizedSum, distanceToCentroid, lower, width, normVmC, vDotC));
return new BinarizedRandomVectorScorer(binaryQueryVector, targetVectors, similarityFunction);
}
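The index-based unpacking in `scorer(int ord)` is easy to misread, so here is a small sketch of the layout it implies: six terms for the dot-product family and four for Euclidean, with `quantizedSum` always last. The `Factors` record is a hypothetical stand-in for `BinaryQuantizer.QueryFactors`, using the same argument order as the constructor call in the diff.

```java
// Sketch only: maps the stored corrective-term array onto named query factors.
final class QueryFactorsSketch {
  /** Hypothetical stand-in for BinaryQuantizer.QueryFactors. */
  record Factors(
      float quantizedSum, float distToC, float lower, float width, float normVmC, float vDotC) {}

  static Factors unpack(float[] terms, boolean euclidean) {
    int expected = euclidean ? 4 : 6;
    if (terms.length != expected) {
      throw new IllegalArgumentException("expected " + expected + " terms, got " + terms.length);
    }
    float distToC = terms[0];
    float lower = terms[1];
    float width = terms[2];
    if (euclidean) {
      // Euclidean stores no normVmC or vDotC; quantizedSum sits at index 3.
      return new Factors(terms[3], distToC, lower, width, 0f, 0f);
    }
    return new Factors(terms[5], distToC, lower, width, terms[3], terms[4]);
  }
}
```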

@Override
public RandomVectorScorerSupplier copy() throws IOException {
return new BinarizedRandomVectorScorerSupplier(
queryVectors.copy(), targetVectors.copy(), similarityFunction);
}
}

/** A binarized query representing its quantized form along with factors */
public record BinaryQueryVector(byte[] vector, BinaryQuantizer.QueryFactors factors) {}

/** Vector scorer over binarized vector values */
public static class BinarizedRandomVectorScorer
extends RandomVectorScorer.AbstractRandomVectorScorer {
private final BinaryQueryVector queryVector;
private final BinarizedByteVectorValues targetVectors;
private final VectorSimilarityFunction similarityFunction;

private final float sqrtDimensions;
private final float maxX1;

public BinarizedRandomVectorScorer(
BinaryQueryVector queryVector,
BinarizedByteVectorValues targetVectors,
VectorSimilarityFunction similarityFunction) {
super(targetVectors);
this.queryVector = queryVector;
this.targetVectors = targetVectors;
this.similarityFunction = similarityFunction;
// FIXME: precompute this once?
this.sqrtDimensions = targetVectors.sqrtDimensions();
this.maxX1 = targetVectors.maxX1();
}

@Override
public float score(int targetOrd) throws IOException {
byte[] quantizedQuery = queryVector.vector();
float quantizedSum = queryVector.factors().quantizedSum();
float lower = queryVector.factors().lower();
float width = queryVector.factors().width();
float distanceToCentroid = queryVector.factors().distToC();
if (similarityFunction == EUCLIDEAN) {
return euclideanScore(
targetOrd,
sqrtDimensions,
quantizedQuery,
distanceToCentroid,
lower,
quantizedSum,
width);
}

float vmC = queryVector.factors().normVmC();
float vDotC = queryVector.factors().vDotC();
float cDotC = targetVectors.getCentroidDP();
byte[] binaryCode = targetVectors.vectorValue(targetOrd);
float[] correctiveTerms = targetVectors.getCorrectiveTerms(targetOrd);
assert correctiveTerms.length == 3;
float ooq = correctiveTerms[0];
float normOC = correctiveTerms[1];
float oDotC = correctiveTerms[2];

float qcDist = VectorUtil.ipByteBinByte(quantizedQuery, binaryCode);

float xbSum = (float) VectorUtil.popCount(binaryCode);
final float dist;
// If ||o-c|| == 0, the doc vector equals the centroid, so the rest of the equation can be
// dropped and `oDotC + vDotC - cDotC` used directly
if (normOC == 0 || ooq == 0) {
dist = oDotC + vDotC - cDotC;
} else {
// If ||o-c|| != 0, we should assume that `ooq` is finite
assert Float.isFinite(ooq);
float estimatedDot =
(2 * width / sqrtDimensions * qcDist
+ 2 * lower / sqrtDimensions * xbSum
- width / sqrtDimensions * quantizedSum
- sqrtDimensions * lower)
/ ooq;
dist = vmC * normOC * estimatedDot + oDotC + vDotC - cDotC;
}
assert Float.isFinite(dist);

float ooqSqr = (float) Math.pow(ooq, 2);
float errorBound = (float) (vmC * normOC * (maxX1 * Math.sqrt((1 - ooqSqr) / ooqSqr)));
float score = Float.isFinite(errorBound) ? dist - errorBound : dist;
if (similarityFunction == MAXIMUM_INNER_PRODUCT) {
return VectorUtil.scaleMaxInnerProductScore(score);
}
return Math.max((1f + score) / 2f, 0);
}
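The arithmetic in `score` can be isolated into pure functions to make it easier to review and unit test. The sketch below copies the expressions from the method above into a hypothetical helper class; it covers the COSINE and dot-product mapping only, since MAXIMUM_INNER_PRODUCT goes through `VectorUtil.scaleMaxInnerProductScore` instead.

```java
// Sketch only: the estimator, error bound, and final mapping from score(), as pure functions.
final class DotProductEstimateSketch {
  static float estimatedDot(
      float width,
      float lower,
      float sqrtDims,
      float qcDist,
      float xbSum,
      float quantizedSum,
      float ooq) {
    return (2 * width / sqrtDims * qcDist
            + 2 * lower / sqrtDims * xbSum
            - width / sqrtDims * quantizedSum
            - sqrtDims * lower)
        / ooq;
  }

  /** Becomes infinite or NaN when ooq is 0, which toScore() treats as "no bound". */
  static float errorBound(float vmC, float normOC, float maxX1, float ooq) {
    float ooqSqr = ooq * ooq;
    return (float) (vmC * normOC * (maxX1 * Math.sqrt((1 - ooqSqr) / ooqSqr)));
  }

  /** Subtracts a finite error bound, then maps the estimate into a non-negative score. */
  static float toScore(float dist, float errorBound) {
    float adjusted = Float.isFinite(errorBound) ? dist - errorBound : dist;
    return Math.max((1f + adjusted) / 2f, 0f);
  }
}
```

Subtracting the bound makes the estimate pessimistic, and the `Float.isFinite` check keeps a degenerate `ooq` from turning the score into NaN.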

private float euclideanScore(
int targetOrd,
float sqrtDimensions,
byte[] quantizedQuery,
float distanceToCentroid,
float lower,
float quantizedSum,
float width)
throws IOException {
byte[] binaryCode = targetVectors.vectorValue(targetOrd);
float[] correctiveTerms = targetVectors.getCorrectiveTerms(targetOrd);
assert correctiveTerms.length == 2;

float targetDistToC = correctiveTerms[0];
float x0 = correctiveTerms[1];
float sqrX = targetDistToC * targetDistToC;
double xX0 = targetDistToC / x0;

float xbSum = (float) VectorUtil.popCount(binaryCode);
float factorPPC =
(float) (-2.0 / sqrtDimensions * xX0 * (xbSum * 2.0 - targetVectors.dimension()));
float factorIP = (float) (-2.0 / sqrtDimensions * xX0);

long qcDist = VectorUtil.ipByteBinByte(quantizedQuery, binaryCode);
float score =
sqrX
+ distanceToCentroid
+ factorPPC * lower
+ (qcDist * 2 - quantizedSum) * factorIP * width;
float projectionDist = (float) Math.sqrt(xX0 * xX0 - targetDistToC * targetDistToC);
float error = 2.0f * maxX1 * projectionDist;
float y = (float) Math.sqrt(distanceToCentroid);
float errorBound = y * error;
if (Float.isFinite(errorBound)) {
score = score + errorBound;
}
return Math.max(1 / (1f + score), 0);
}
}
}
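Both scoring paths call `VectorUtil.ipByteBinByte(quantizedQuery, binaryCode)`, whose body is not part of this diff. One plausible way to compute an inner product between a multi-bit query and a 1-bit document code is to store the query as bit planes and combine popcounts, as sketched below. The plane layout (least significant plane first, four planes) is an assumption, and the class is hypothetical rather than Lucene's implementation.

```java
// Sketch only: inner product of a 4-bit query (stored as bit planes) with a 1-bit code.
final class BitPlaneInnerProductSketch {
  static final int B_QUERY = 4; // assumed number of query bit planes

  /** q holds B_QUERY planes back to back, least significant first; d is the 1-bit code. */
  static long ipByteBin(byte[] q, byte[] d) {
    if (q.length != d.length * B_QUERY) {
      throw new IllegalArgumentException("query must be " + B_QUERY + "x the code length");
    }
    long sum = 0;
    for (int plane = 0; plane < B_QUERY; plane++) {
      long planeCount = 0;
      for (int j = 0; j < d.length; j++) {
        // Count positions where both the query plane bit and the document bit are set.
        planeCount += Integer.bitCount((q[plane * d.length + j] & d[j]) & 0xFF);
      }
      // Each plane contributes its popcount weighted by 2^plane.
      sum += planeCount << plane;
    }
    return sum;
  }
}
```

For example, with query values 5 and 3 in the first two positions, the planes are `0b11`, `0b10`, `0b01`, `0b00`, and a document code with both bits set gives an inner product of 8.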