
Add new pluggable vector similarity to field info #13200

Closed

Conversation

@benwtrent (Member) commented Mar 22, 2024

This adjusts and refactors the vector similarity interface to be pluggable: all methods providing VectorSimilarityFunction are now deprecated or removed, and users should move to VectorSimilarity instead.

Similarities are pluggable via SPI within FieldInfo itself, and backwards-compatible reading is handled internally for any older field infos that used the now-deprecated ordinals.

closes #13182

@benwtrent (Member Author)

All tests still fail, but I think it compiles now. Many deprecation warnings still to go through and clean up.

One concern I had was on FieldInfo. Do we want to ask for a fully realized VectorSimilarity object or the String name by which it should be loaded?
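For concreteness, a sketch of the two options (the class and field names here are hypothetical, and the forName lookup is assumed to mirror other Lucene SPIs such as Codec.forName):

```java
// Hypothetical sketch, not code from this PR.
final class FieldInfoSimilaritySketch {
  // Option 1: hold the fully realized similarity object.
  private final VectorSimilarity similarity;
  // Option 2: hold only the SPI name and resolve it lazily.
  private final String similarityName;

  FieldInfoSimilaritySketch(VectorSimilarity similarity, String similarityName) {
    this.similarity = similarity;
    this.similarityName = similarityName;
  }

  VectorSimilarity resolve() {
    // With option 2, SPI resolution is deferred until the similarity is
    // actually needed, so merely reading FieldInfos never triggers a lookup.
    return similarity != null ? similarity : VectorSimilarity.forName(similarityName);
  }
}
```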

As for the interface of the API itself, it seems OK, but due to the old API we were doing some weird things with the random vector scorers, etc. I need to revisit all that to see what's up.

```java
/**
 * @param targetOrd the ordinal of the target vector
 * @return the float vector value
 */
float[] vectorValue(int targetOrd) throws IOException;
```
(Contributor)

Getting a vector value here looks right, but what I want to see is whether we can push this a little lower-level! What's going on here is that the vectorValue method seeks to targetOrd * vector_dims. This is convenient, but ultimately depends on the underlying access to the data. Maybe these providers can have a common supertype that allows low-level access; all we need is access to the IndexInput, dims, and size (number of vectors). Then an implementation can choose how to access the data, if it pleases.
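A minimal sketch of such a common supertype (interface and method names here are mine, not from the patch):

```java
import org.apache.lucene.store.IndexInput;

// Low-level access to the stored vectors: the raw input plus enough geometry
// for an implementation to address any ordinal however it likes.
interface VectorDataAccess {
  IndexInput input();  // positioned at the start of the vector data
  int dimension();     // components per vector
  int size();          // number of vectors stored
}
```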

(Member Author)

Agreed, I am currently reading around in our HNSW searcher/builder logic to see how the interfaces will fit there. I have some ideas and will ping you again once I have it better nailed down.

@uschindler (Contributor)

Hi,
I will check the general setup of the SPI interface this week. Sorry for the delay.
Uwe

```java
/**
 * Returns the {@link IndexInput} for the byte vector data or null if the data is not stored in a
 * file.
 */
default IndexInput vectorData() throws IOException {
  return null; // default: data is not backed by a file
}
```
(Contributor)

I think that these three methods are exactly what we need. The IndexInput will start at the beginning of the actual vector data, and with the offset and element size, we can determine the address of any vector ordinal.
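For example, a sketch of that addressing under the assumption of a flat float layout (vectorData(), dims, and offset are the quantities discussed above; readFloats is available on IndexInput in current Lucene):

```java
// Vector `ord` starts at offset + ord * dims * Float.BYTES in a flat layout.
long vectorByteSize = (long) dims * Float.BYTES;
IndexInput in = vectorData().clone(); // clone so each reader gets its own position
in.seek(offset + ord * vectorByteSize);
float[] vector = new float[dims];
in.readFloats(vector, 0, dims);
```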

(Member Author)

I am unsure about this, actually; I just reverted it. Maybe I can add it back. I am a little hesitant, as it seems like this could leak a bit of the codec. For example, non-trivial knowledge would be required to handle an off-heap int4 or binary calculation, or anything that stores the vectors in any format other than flat bytes.

It seems to me that the codec itself should provide optimized vector similarities for known similarities.

Requiring users to provide something like int4CompressedOptimizedDotProduct to get the best experience, when the codec should "just do it", seems weird.

@benwtrent force-pushed the feature/pluggable-vector-similarities branch from 197200f to 57c476f on April 2, 2024
@uschindler (Contributor)

The SPI interface and the naming of vector similarities look fine from the FieldInfos side and their encoding in the field metadata. The code looks copy-pasted (including the Holder class) from docvalues/postings, so it fits perfectly into our framework.
For the naming of the similarities the approach looks fine; I am just not sure if the usual LuceneXY naming of SPIs is needed here.

@uschindler (Contributor)

I haven't checked the old enum; do we really need all the backwards cruft if we make the SPI a new feature for Lucene 10? We can remove all the old classes then and just adapt the reader code for the old codec to make it able to read the old byte values as identifiers for similarities and map them to strings for the SPI lookup.
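A sketch of that mapping (the ordinal order follows VectorSimilarityFunction's declaration order; the string names are illustrative, not necessarily the PR's actual SPI names):

```java
static String legacyOrdinalToName(byte ord) throws CorruptIndexException {
  switch (ord) {
    case 0: return "euclidean";         // VectorSimilarityFunction.EUCLIDEAN
    case 1: return "dot_product";       // VectorSimilarityFunction.DOT_PRODUCT
    case 2: return "cosine";            // VectorSimilarityFunction.COSINE
    case 3: return "max_inner_product"; // VectorSimilarityFunction.MAXIMUM_INNER_PRODUCT
    default:
      throw new CorruptIndexException("invalid similarity ordinal: " + ord, "FieldInfos");
  }
}
```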

@benwtrent (Member Author)

@uschindler

> We can remove all the old classes then and just adapt the reader code for the old codec to make it able to read the old byte values as identifiers for similarities and map them to strings for the SPI lookup.

There are many places still using the deprecated logic internally, mainly because I haven't gone through and cleaned them all up, as we have never decided that this is 100% the direction we want to go.

The main places where the old enum is used on write are [Byte|Float]VectorField; those and their companion queries are marked as experimental. So I can remove the interaction there pretty easily. Then the enumeration could be flagged as deprecated so that external users can stop using it. I am sure it's being used more generally outside of the codecs themselves.

VectorSimilarityFunction isn't flagged as experimental itself, and it sits in the index package, which makes me think it could well be in use. So any upgrade path would require us to deprecate it and provide an alternative, no?

@benwtrent added this to the 9.11.0 milestone on April 4, 2024
@benwtrent marked this pull request as ready for review on April 4, 2024
@benwtrent (Member Author)

I have removed the deprecated VectorSimilarityFunction methods from the kNN fields. I thought this was acceptable because those fields are flagged as experimental. However, I realize this might be frustrating for users, so I can revert that change and add back deprecated methods that let users supply the deprecated enumeration.

@tteofili (Contributor) left a comment

The changes look good to me.
While reviewing (also considering the topic of backward-codecs) I found myself wondering whether HnswGraphSearcher, ScalarQuantizer, and similar classes used by Lucene*VectorReader/Writer shouldn't be marked as experimental or just internal.

@benwtrent (Member Author)

I still need to run Lucene Util to ensure all these changes didn't add weird overhead.

This also leads me to add back the deprecated APIs for the vector fields. I know I would be frustrated as a user if they simply disappeared, even if the API is marked as experimental.

```diff
-private static final VectorSimilarityFunction SIMILARITY_FUNCTION =
-    VectorSimilarityFunction.DOT_PRODUCT;
+private static final VectorSimilarity SIMILARITY_FUNCTION =
+    VectorSimilarity.DotProductSimilarity.INSTANCE;
```
(Contributor)

Shouldn't this be renamed?

```diff
@@ -61,6 +61,8 @@
   // Open certain packages for the test framework (ram usage tester).
   opens org.apache.lucene.document to
       org.apache.lucene.test_framework;
+  opens org.apache.lucene.codecs to
```
(Contributor)

Why is this needed?

```java
final byte distanceFunctionOrdinal = input.readByte();
if (distanceFunctionOrdinal < 0
    || distanceFunctionOrdinal >= VectorSimilarity.LEGACY_VALUE_LENGTH) {
  throw new IllegalArgumentException("invalid distance function: " + distanceFunctionOrdinal);
}
```
(Contributor)

I think this should be a CorruptIndexException. Same below. At least the old code threw this exception.
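A sketch of the suggested change, assuming the surrounding reader has the IndexInput in scope to use as the resource description:

```java
if (distanceFunctionOrdinal < 0
    || distanceFunctionOrdinal >= VectorSimilarity.LEGACY_VALUE_LENGTH) {
  throw new CorruptIndexException(
      "invalid distance function: " + distanceFunctionOrdinal, input);
}
```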

```diff
@@ -409,7 +385,11 @@ public void write(
   }
   output.writeVInt(fi.getVectorDimension());
   output.writeByte((byte) fi.getVectorEncoding().ordinal());
-  output.writeByte(distFuncToOrd(fi.getVectorSimilarityFunction()));
+  if (fi.getVectorSimilarity() == null) {
```
(Contributor)

Maybe:

```java
Optional.ofNullable(fi.getVectorSimilarity())
    .orElse(VectorSimilarity.EuclideanDistanceSimilarity.INSTANCE)
    .getName()
```

@jimczi (Contributor) commented Apr 4, 2024

Thanks for persevering on this @benwtrent !
A few late thoughts on my end:

- I am a bit concerned about the generalization here. The whole similarity is currently modeled around the usage of the HNSW codec that assumes that vectors are compared randomly. This makes the interface heavily geared towards this use case. It also assumes that we always have access to the raw vectors to perform the similarity which is a false premise if we think about product quantization, LSH or any transformational codec.
- I wonder if we took the similarity too far in that aspect. In my opinion the similarity should be set at the knn format level and the options could depend on what the format can provide. For HNSW and flat codec, they could continue to share the simple enum we have today with an option to override the implementation in a custom codec. Users that want to implement funky similarities on top could do it in a custom codec that overrides the base ones. We can make this customization more easily accessible in our base codecs if needed.
- The point of the base codecs is to provide good out of the box functionalities that work in all cases. Blindly accepting any type of similarities generally is a recipe for failure, we should consider adding a new similarity as something very expert that requires dealing with a new format entirely.
- I am also keen to reopen the discussion around simplifying the similarity we currently support.
- I personally like the fact that it's a simple enum with very few values. The issue in how it is exposed today is because each value is linked with an implementation. I think it would be valuable to make the implementation of similarities a detail that each knn format needs to provide. Defining similarity independently of the format complicates matters without much benefit. The only perceived advantage currently is ensuring consistent scoring when querying a field with different knn formats within a single index. However, I question the practicality and necessity of this capability.
- If we were to start again I'd argue that just supporting dot-product would be enough and cosine would be left out. I think we can still do that in Lucene 10 and provide the option to normalize the vectors during indexing/querying.

My main question here is whether the similarity should be abstracted at the knn format level. In my opinion, framing similarity as a universal interface for all knn formats is misleading and could hinder the implementation of other valid knn formats.

@benwtrent (Member Author)

@jimczi I match your wall of text with my own :).

> I am a bit concerned about the generalization here. The whole similarity is currently modeled around the usage of the HNSW codec that assumes that vectors are compared randomly. This makes the interface heavily geared towards this use case.

Yeah, I hear ya. That is sort of the difficulty with any pluggable infrastructure: finding a general enough API.

> It also assumes that we always have access to the raw vectors to perform the similarity which is a false premise if we think about product quantization, LSH or any transformational codec.

++

> I wonder if we took the similarity too far in that aspect. In my opinion the similarity should be set at the knn format level and the options could depend on what the format can provide. For HNSW and flat codec, they could continue to share the simple enum we have today with an option to override the implementation in a custom codec. Users that want to implement funky similarities on top could do it in a custom codec that overrides the base ones. We can make this customization more easily accessible in our base codecs if needed.

While writing this pluggable interface and looking at all the assumptions it has to make, I kept coming back to "Wait, don't we already have a pluggable thing for customizing fields? Yeah, it's our codecs…"

And many hyper-optimized implementations of various vector similarities would have to know how the vectors are laid out in memory. That logic and work should be coupled to the codec.

> The point of the base codecs is to provide good out of the box functionalities that work in all cases. Blindly accepting any type of similarities generally is a recipe for failure, we should consider adding a new similarity as something very expert that requires dealing with a new format entirely.

++

> I am also keen to reopen the discussion around simplifying the similarity we currently support.

Agreed. I am going to open a PR soon to deprecate cosine; it's probably the most useless one we have now.

> I personally like the fact that it's a simple enum with very few values. The issue in how it is exposed today is because each value is linked with an implementation. I think it would be valuable to make the implementation of similarities a detail that each knn format needs to provide. Defining similarity independently of the format complicates matters without much benefit. The only perceived advantage currently is ensuring consistent scoring when querying a field with different knn formats within a single index. However, I question the practicality and necessity of this capability.

The scoring consistency is a nice benefit. But tying implementation and scoring to the codec does give us a natural way to deprecate and evolve support for similarities.

For example, I could see dot_product and maximum_inner_product being merged into one similarity in the future. The main sticking point is the danged scoring, as the scaling for the dot_product similarity is so unique and different.
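For reference, this is roughly how the two similarities scale raw dot products into non-negative scores today (paraphrasing VectorSimilarityFunction in Lucene 9.x, not a verbatim excerpt):

```java
// DOT_PRODUCT assumes unit-length vectors, so the raw value is in [-1, 1]:
float dotScore = (1 + dot) / 2;                     // maps into [0, 1]
// MAXIMUM_INNER_PRODUCT accepts unbounded values but must keep scores positive:
float mipScore = dot < 0 ? 1 / (1 - dot) : dot + 1; // maps into (0, +inf)
```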

> If we were to start again I'd argue that just supporting dot-product would be enough and cosine would be left out. I think we can still do that in Lucene 10 and provide the option to normalize the vectors during indexing/querying.

I agree. As for providing an option for normalizing, Lucene already has an optimized VectorUtil function for this.
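For example, VectorUtil.l2normalize scales a vector to unit length in place:

```java
import org.apache.lucene.util.VectorUtil;

float[] v = {3f, 4f};
VectorUtil.l2normalize(v); // v is now {0.6f, 0.8f}; dot product then equals cosine
```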

> My main question here is whether the similarity should be abstracted at the knn format level.
> In my opinion, framing similarity as a universal interface for all knn formats is misleading and could hinder the implementation of other valid knn formats.

It is true that some vector storage mechanisms would disallow certain similarities and others would allow more particular ones (for example, hamming only makes sense for binary values stored in bytes and would fail for floating point).

I have to think about this more. I agree; stepping back and looking at the code required to have pluggable similarities, and the various backwards-compatibility (and capability in general) woes it could incur, is frustrating.

It may not be worth it at all given the cost.

@benwtrent (Member Author)

So, I took another stab at this in #13288.

The main idea there is that instead of adding another pluggable thing, we rely on formats and custom functions. It's an idea similar to the custom compression options for other formats.

The LOC is way smaller, and IMO the implementation ends up much cleaner in general.

@benwtrent (Member Author)

Usurped by #13288; closing.
