Add new pluggable vector similarity to field info #13200
Conversation
Tests still all fail, but now I think it compiles. There are many deprecation warnings still to go through and clean up. As for the interface for the API itself, it seems OK, but due to the old API we were doing some weird things with the random vector scorers, etc. I need to revisit all that to see what's up.
 * @param targetOrd the ordinal of the target vector
 * @return the float vector value
 */
float[] vectorValue(int targetOrd) throws IOException;
Getting a vector value here looks right, but I want to see if we can push the interface a little lower-level! What's going on here is that the vectorValue method seeks to targetOrd * vector_dims. This is convenient, but ultimately depends on the underlying access to the data. Maybe these providers can have a common supertype that allows low-level access; all we need is access to the IndexInput, the dims, and the size (number of vectors). Then an implementation can choose how to access the data, if it pleases.
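A minimal sketch of the kind of common supertype described above: expose only the raw data, the dims, and the size, and derive the convenient vectorValue as a default method. A ByteBuffer stands in for Lucene's IndexInput here, and the names RandomVectorDataAccess / FlatFloatVectors are illustrative, not Lucene API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical low-level supertype: raw data + dims + size.
interface RandomVectorDataAccess {
    ByteBuffer data(); // stand-in for Lucene's IndexInput in this sketch
    int dims();
    int size();        // number of vectors

    // Convenience derived from the low-level accessors: seek to
    // targetOrd * dims * elementSize and decode the floats.
    default float[] vectorValue(int targetOrd) {
        float[] v = new float[dims()];
        ByteBuffer slice = data().duplicate().order(ByteOrder.LITTLE_ENDIAN);
        slice.position(targetOrd * dims() * Float.BYTES);
        for (int i = 0; i < v.length; i++) {
            v[i] = slice.getFloat();
        }
        return v;
    }
}

// One possible implementation: vectors stored as flat little-endian floats.
class FlatFloatVectors implements RandomVectorDataAccess {
    private final ByteBuffer data;
    private final int dims;
    private final int size;

    FlatFloatVectors(float[][] vectors) {
        this.size = vectors.length;
        this.dims = vectors[0].length;
        this.data =
            ByteBuffer.allocate(size * dims * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (float[] v : vectors) {
            for (float f : v) {
                data.putFloat(f);
            }
        }
        data.flip();
    }

    public ByteBuffer data() { return data; }
    public int dims() { return dims; }
    public int size() { return size; }
}
```

An implementation that stores vectors differently (compressed, quantized, etc.) could override vectorValue entirely while still exposing the same low-level accessors.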
Agreed, I am currently reading around in our HNSW searcher/builder logic to see how the interfaces will fit there. I have some ideas and will ping you again once I have it better nailed down.
 * Returns the {@link IndexInput} for the byte vector data or null if the data is not stored in a
 * file.
 */
default IndexInput vectorData() throws IOException {
I think that these three methods are exactly what we need. The IndexInput will start at the beginning of the actual vector data, and with the offset and element size, we can determine the address of any vector ordinal.
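The address arithmetic described above is simple enough to sketch in one helper; the class and method names here are hypothetical, not part of the PR:

```java
// Given the start of the vector data, the dimension count, and the
// per-element byte size, locate the byte offset of any vector ordinal.
final class VectorAddress {
    static long byteOffset(int targetOrd, int dims, int elementSize) {
        // long arithmetic to avoid overflow for large indices
        return (long) targetOrd * dims * elementSize;
    }
}
```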
I am unsure about this actually; I just reverted it. Maybe I can add it back. I am a little hesitant, as it seems like this could be leaking a bit of the codec. For example, non-trivial knowledge would be required to handle an off-heap int4 or binary calculation, or anything that stores the vectors in any format other than flat bytes.
It seems to me that the codec itself should provide optimized vector similarities for known similarities. Requiring users to provide something like int4CompressedOptimizedDotProduct to get the best experience, when the codec should "just do it", seems weird.
The SPI interface and naming of vector similarity look fine from the FieldInfos side and their encoding in field metadata. The code looks copy-pasted (including the Holder class) from docvalues/postings, so it fits perfectly into our framework.
I haven't checked the old enum — do we really need all the backwards cruft if we make the SPI a new feature for Lucene 10? We could then remove all the old classes and just adapt reader code for the old codec so it can read the old byte values as identifiers for similarities and map them to strings for SPI lookup.
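The byte-to-string mapping suggested here could be sketched as below; the ordinal order mirrors Lucene's VectorSimilarityFunction enum (EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT), but the table and class name are illustrative, not the PR's actual code:

```java
// Hypothetical mapping from a legacy byte ordinal (as stored by the old
// codec) to a string name suitable for SPI lookup.
final class LegacySimilarity {
    private static final String[] LEGACY_NAMES = {
        "EUCLIDEAN", "DOT_PRODUCT", "COSINE", "MAXIMUM_INNER_PRODUCT"
    };

    static String nameForOrdinal(byte ord) {
        if (ord < 0 || ord >= LEGACY_NAMES.length) {
            throw new IllegalArgumentException("invalid distance function: " + ord);
        }
        return LEGACY_NAMES[ord];
    }
}
```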
There are many places still using the deprecated logic internally, mainly because I haven't gone through and cleaned them all up, as we have never decided that this is 100% the direction we want to go. The main places where the old enum is used on write are:

I have removed the deprecated
The changes look good to me.
While reviewing (also considering the topic of backward-codecs), I found myself wondering whether HnswGraphSearcher, ScalarQuantizer, and similar classes used by Lucene*VectorReader/Writer shouldn't be marked as experimental or just internal.
I still need to run Lucene Util to ensure all these changes didn't add a weird overhead. This leads me to add back the deprecated APIs for the vector fields. I know I would be frustrated, even if the API is marked as experimental.
private static final VectorSimilarityFunction SIMILARITY_FUNCTION =
    VectorSimilarityFunction.DOT_PRODUCT;
private static final VectorSimilarity SIMILARITY_FUNCTION =
    VectorSimilarity.DotProductSimilarity.INSTANCE;
Shouldn't this be renamed?
@@ -61,6 +61,8 @@
// Open certain packages for the test framework (ram usage tester).
opens org.apache.lucene.document to
    org.apache.lucene.test_framework;
opens org.apache.lucene.codecs to
Why is this needed?
final byte distanceFunctionOrdinal = input.readByte();
if (distanceFunctionOrdinal < 0
    || distanceFunctionOrdinal >= VectorSimilarity.LEGACY_VALUE_LENGTH) {
  throw new IllegalArgumentException(
      "invalid distance function: " + distanceFunctionOrdinal);
I think this should be a CorruptIndexException. Same below. At least the old code threw this exception.
@@ -409,7 +385,11 @@ public void write(
}
output.writeVInt(fi.getVectorDimension());
output.writeByte((byte) fi.getVectorEncoding().ordinal());
output.writeByte(distFuncToOrd(fi.getVectorSimilarityFunction()));
if (fi.getVectorSimilarity() == null) {
Maybe:

Optional.ofNullable(fi.getVectorSimilarity())
    .orElse(VectorSimilarity.EuclideanDistanceSimilarity.INSTANCE)
    .getName()
Thanks for persevering on this @benwtrent! I am a bit concerned about the generalization here. The whole similarity is currently modeled around the usage of the HNSW codec, which assumes that vectors are compared randomly. This makes the interface heavily geared towards this use case. It also assumes that we always have access to the raw vectors to perform the similarity, which is a false premise if we think about product quantization, LSH, or any transformational codec.
@jimczi I match your wall of text with my own :).
Yeah, I hear ya. That is sort of the difficulty with any pluggable infrastructure: finding a general enough API.
++
While writing this pluggable interface and looking at all the assumptions it has to make, I kept coming back to "Wait, don't we already have a pluggable thing for customizing fields? Yeah, it's our codecs…" And many hyper-optimized implementations of various vector similarities would have to know how the vectors are laid out in memory. That logic and work should be coupled to the codec.
++
Agreed, I am going to open a PR soon to deprecate cosine. It’s probably the most useless one we have now.
The scoring consistency is a nice benefit. But, tying implementation and scoring to the codec does give us a natural way to deprecate and move forward support for similarities. For example, I could see
I agree. As for providing an option for normalizing, Lucene already has an optimized VectorUtil function for this.
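For reference, the operation Lucene's VectorUtil.l2normalize performs is plain L2 normalization; a standalone sketch (not the Lucene implementation, which is SIMD-optimized) looks like:

```java
// Normalize a vector to unit length in place (L2 normalization).
final class Norm {
    static float[] l2normalize(float[] v) {
        double sumOfSquares = 0;
        for (float x : v) {
            sumOfSquares += (double) x * x;
        }
        float inv = (float) (1.0 / Math.sqrt(sumOfSquares));
        for (int i = 0; i < v.length; i++) {
            v[i] *= inv;
        }
        return v;
    }
}
```

With normalized vectors, dot product and cosine similarity coincide, which is part of why cosine becomes redundant as a separate similarity.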
It is true that some vector storage mechanisms would disallow certain similarities and others would allow more particular ones. I have to think about this more. I agree: stepping back and looking at the code required to have pluggable similarities, and the various backwards-compatibility (and capability in general) woes it could incur, is frustrating. It may not be worth it at all given the cost.
So, I took another take on this in #13288. The main idea there is that instead of adding another pluggable thing, we rely on formats & custom functions. It's an idea similar to the custom compression options for other formats. The LOC is way smaller and, IMO, the implementation ends up being much cleaner in general.
Usurped by #13288; closing.
This adjusts and refactors the vector similarity interface to be pluggable. Meaning, all methods providing VectorSimilarityFunction are now deprecated or removed, and users should move to using VectorSimilarity instead. This is pluggable via SPI within FieldInfo itself, and backwards-compatible reading is handled internally for any older field infos that stored the now-deprecated ordinals.
closes #13182