
Add new pluggable vector similarity to field info #13200

Closed

Conversation

@benwtrent (Member) commented Mar 22, 2024

This adjusts and refactors the vector similarity interface to be pluggable: all methods providing VectorSimilarityFunction are now deprecated or removed, and users should move to VectorSimilarity instead.

Similarities are pluggable via SPI within FieldInfo itself, and backwards-compatible reading is handled internally for any older field infos that used the now-deprecated ordinals.

closes #13182

@benwtrent (Member Author)

All tests still fail, but I think it compiles now. Many deprecation warnings still to go through and clean up.

One concern I had was on FieldInfo. Do we want to ask for a fully realized VectorSimilarity object or the String name by which it should be loaded?
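For concreteness, a sketch of the two options (the class and field names here are hypothetical, and the forName lookup is assumed to mirror other Lucene SPIs such as Codec.forName):

```java
// Hypothetical sketch, not code from this PR.
final class FieldInfoSimilaritySketch {
  // Option 1: hold the fully realized similarity object.
  private final VectorSimilarity similarity;
  // Option 2: hold only the SPI name and resolve it lazily.
  private final String similarityName;

  FieldInfoSimilaritySketch(VectorSimilarity similarity, String similarityName) {
    this.similarity = similarity;
    this.similarityName = similarityName;
  }

  VectorSimilarity resolve() {
    // With option 2, SPI resolution is deferred until the similarity is
    // actually needed, so merely reading FieldInfos never triggers a lookup.
    return similarity != null ? similarity : VectorSimilarity.forName(similarityName);
  }
}
```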

As for the interface of the API itself, it seems OK, but due to the old API we were doing some weird things with the random vector scorers, etc. I need to revisit all that to see what's up.

```java
/**
 * @param targetOrd the ordinal of the target vector
 * @return the float vector value
 */
float[] vectorValue(int targetOrd) throws IOException;
```
(Contributor)

Getting a vector value here looks right, but what I want to see is whether we can push this a little lower-level! What's going on here is that the vectorValue method seeks to targetOrd * vector_dims. This is convenient, but ultimately depends on the underlying access to the data. Maybe these providers can have a common supertype that allows low-level access; all we need is access to the IndexInput, dims, and size (number of vectors). Then an implementation can choose how to access the data, if it pleases.
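A minimal sketch of such a common supertype (interface and method names here are mine, not from the patch):

```java
import org.apache.lucene.store.IndexInput;

// Low-level access to the stored vectors: the raw input plus enough geometry
// for an implementation to address any ordinal however it likes.
interface VectorDataAccess {
  IndexInput input();  // positioned at the start of the vector data
  int dimension();     // components per vector
  int size();          // number of vectors stored
}
```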

(Member Author)

Agreed, I am currently reading around in our HNSW searcher/builder logic to see how the interfaces will fit there. I have some ideas and will ping you again once I have it better nailed down.

@uschindler (Contributor)

Hi,
I will check the general setup of the SPI interface this week. Sorry for the delay.
Uwe

```java
/**
 * Returns the {@link IndexInput} for the byte vector data or null if the data is not stored in a
 * file.
 */
default IndexInput vectorData() throws IOException {
  return null; // default: data is not backed by a file
}
```
(Contributor)

I think that these three methods are exactly what we need. The IndexInput will start at the beginning of the actual vector data, and with the offset and element size, we can determine the address of any vector ordinal.
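For example, a sketch of that addressing under the assumption of a flat float layout (vectorData(), dims, and offset are the quantities discussed above; readFloats is available on IndexInput in current Lucene):

```java
// Vector `ord` starts at offset + ord * dims * Float.BYTES in a flat layout.
long vectorByteSize = (long) dims * Float.BYTES;
IndexInput in = vectorData().clone(); // clone so each reader gets its own position
in.seek(offset + ord * vectorByteSize);
float[] vector = new float[dims];
in.readFloats(vector, 0, dims);
```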

(Member Author)

I am unsure about this, actually; I just reverted it. Maybe I can add it back. I am a little hesitant, as it seems like this could leak a bit of the codec. For example, non-trivial knowledge would be required to handle an off-heap int4 or binary calculation, or anything that stores the vectors in any format other than flat bytes.

It seems to me that the codec itself should provide optimized vector similarities for known similarities.

Requiring users to provide something like int4CompressedOptimizedDotProduct to get the best experience, when the codec should "just do it", seems weird.

@benwtrent force-pushed the feature/pluggable-vector-similarities branch from 197200f to 57c476f on April 2, 2024
@uschindler (Contributor)

The SPI interface and the naming of vector similarities look fine from the FieldInfos side and their encoding in the field metadata. The code looks copy-pasted (including the Holder class) from docvalues/postings, so it fits perfectly into our framework.
For the naming of the similarities the approach looks fine; I am just not sure if the usual LuceneXY naming of SPIs is needed here.

@uschindler (Contributor)

I haven't checked the old enum; do we really need all the backwards cruft if we make the SPI a new feature for Lucene 10? We can remove all the old classes then and just adapt the reader code for the old codec to make it able to read the old byte values as identifiers for similarities and map them to strings for the SPI lookup.
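A sketch of that mapping (the ordinal order follows VectorSimilarityFunction's declaration order; the string names are illustrative, not necessarily the PR's actual SPI names):

```java
static String legacyOrdinalToName(byte ord) throws CorruptIndexException {
  switch (ord) {
    case 0: return "euclidean";         // VectorSimilarityFunction.EUCLIDEAN
    case 1: return "dot_product";       // VectorSimilarityFunction.DOT_PRODUCT
    case 2: return "cosine";            // VectorSimilarityFunction.COSINE
    case 3: return "max_inner_product"; // VectorSimilarityFunction.MAXIMUM_INNER_PRODUCT
    default:
      throw new CorruptIndexException("invalid similarity ordinal: " + ord, "FieldInfos");
  }
}
```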

@benwtrent (Member Author)

@uschindler

> We can remove all the old classes then and just adapt the reader code for the old codec to make it able to read the old byte values as identifiers for similarities and map them to strings for the SPI lookup.

There are many places still using the deprecated logic internally, mainly because I haven't gone through and cleaned them all up, as we have never decided that this is 100% the direction we want to go.

The main places where the old enum is used on write are [Byte|Float]VectorField; those and their companion queries are marked as experimental. So I can remove the interaction there pretty easily. Then the enumeration could be flagged as deprecated so that external users can stop using it. I am sure it's being used more generally outside of the codecs themselves.

VectorSimilarityFunction isn't flagged as experimental itself, and it sits in the index package, which makes me think it could well be in use. So any upgrade path would require us to deprecate it and provide an alternative, no?

@benwtrent added this to the 9.11.0 milestone on April 4, 2024
@benwtrent marked this pull request as ready for review on April 4, 2024
@benwtrent (Member Author)

I have removed the deprecated VectorSimilarityFunction methods from the kNN fields. I thought this was acceptable because those fields are flagged as experimental. However, I realize this might be frustrating for users, so I can revert that change and add back deprecated methods that let users supply the deprecated enumeration.

@tteofili (Contributor) left a comment

The changes look good to me.
While reviewing (also considering the topic of backward-codecs) I found myself wondering whether HnswGraphSearcher, ScalarQuantizer, and similar classes used by Lucene*VectorReader/Writer shouldn't be marked as experimental or just internal.

@benwtrent (Member Author)

I still need to run Lucene Util to ensure all these changes didn't add weird overhead.

This also leads me to add back the deprecated APIs for the vector fields. I know I would be frustrated as a user if they simply disappeared, even if the API is marked as experimental.

```diff
-private static final VectorSimilarityFunction SIMILARITY_FUNCTION =
-    VectorSimilarityFunction.DOT_PRODUCT;
+private static final VectorSimilarity SIMILARITY_FUNCTION =
+    VectorSimilarity.DotProductSimilarity.INSTANCE;
```
(Contributor)

Shouldn't this be renamed?

```diff
@@ -61,6 +61,8 @@
   // Open certain packages for the test framework (ram usage tester).
   opens org.apache.lucene.document to
       org.apache.lucene.test_framework;
+  opens org.apache.lucene.codecs to
```
(Contributor)

Why is this needed?

```java
final byte distanceFunctionOrdinal = input.readByte();
if (distanceFunctionOrdinal < 0
    || distanceFunctionOrdinal >= VectorSimilarity.LEGACY_VALUE_LENGTH) {
  throw new IllegalArgumentException("invalid distance function: " + distanceFunctionOrdinal);
}
```
(Contributor)

I think this should be a CorruptIndexException. Same below. At least the old code threw this exception.
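A sketch of the suggested change, assuming the surrounding reader has the IndexInput in scope to use as the resource description:

```java
if (distanceFunctionOrdinal < 0
    || distanceFunctionOrdinal >= VectorSimilarity.LEGACY_VALUE_LENGTH) {
  throw new CorruptIndexException(
      "invalid distance function: " + distanceFunctionOrdinal, input);
}
```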

```diff
@@ -409,7 +385,11 @@ public void write(
   }
   output.writeVInt(fi.getVectorDimension());
   output.writeByte((byte) fi.getVectorEncoding().ordinal());
-  output.writeByte(distFuncToOrd(fi.getVectorSimilarityFunction()));
+  if (fi.getVectorSimilarity() == null) {
```
(Contributor)

Maybe:

```java
Optional.ofNullable(fi.getVectorSimilarity())
    .orElse(VectorSimilarity.EuclideanDistanceSimilarity.INSTANCE)
    .getName()
```

@jimczi (Contributor) commented Apr 4, 2024

Thanks for persevering on this @benwtrent !
A few late thoughts on my end:

- I am a bit concerned about the generalization here. The whole similarity is currently modeled around the usage of the HNSW codec that assumes that vectors are compared randomly. This makes the interface heavily geared towards this use case. It also assumes that we always have access to the raw vectors to perform the similarity which is a false premise if we think about product quantization, LSH or any transformational codec.
- I wonder if we took the similarity too far in that aspect. In my opinion the similarity should be set at the knn format level and the options could depend on what the format can provide. For HNSW and flat codec, they could continue to share the simple enum we have today with an option to override the implementation in a custom codec. Users that want to implement funky similarities on top could do it in a custom codec that overrides the base ones. We can make this customization more easily accessible in our base codecs if needed.
- The point of the base codecs is to provide good out of the box functionalities that work in all cases. Blindly accepting any type of similarities generally is a recipe for failure, we should consider adding a new similarity as something very expert that requires dealing with a new format entirely.
- I am also keen to reopen the discussion around simplifying the similarity we currently support.
- I personally like the fact that it's a simple enum with very few values. The issue in how it is exposed today is because each value is linked with an implementation. I think it would be valuable to make the implementation of similarities a detail that each knn format needs to provide. Defining similarity independently of the format complicates matters without much benefit. The only perceived advantage currently is ensuring consistent scoring when querying a field with different knn formats within a single index. However, I question the practicality and necessity of this capability.
- If we were to start again I'd argue that just supporting dot-product would be enough and cosine would be left out. I think we can still do that in Lucene 10 and provide the option to normalize the vectors during indexing/querying.

My main question here is whether the similarity should be abstracted at the knn format level. In my opinion, framing similarity as a universal interface for all knn formats is misleading and could hinder the implementation of other valid knn formats.

@benwtrent (Member Author)

@jimczi I match your wall of text with my own :).

> I am a bit concerned about the generalization here. The whole similarity is currently modeled around the usage of the HNSW codec that assumes that vectors are compared randomly. This makes the interface heavily geared towards this use case.

Yeah, I hear ya. That is sort of the difficulty with any pluggable infrastructure: finding a general enough API.

> It also assumes that we always have access to the raw vectors to perform the similarity which is a false premise if we think about product quantization, LSH or any transformational codec.

++

> I wonder if we took the similarity too far in that aspect. In my opinion the similarity should be set at the knn format level and the options could depend on what the format can provide. For HNSW and flat codec, they could continue to share the simple enum we have today with an option to override the implementation in a custom codec. Users that want to implement funky similarities on top could do it in a custom codec that overrides the base ones. We can make this customization more easily accessible in our base codecs if needed.

While writing this pluggable interface and looking at all the assumptions it has to make, I kept coming back to "Wait, don't we already have a pluggable thing for customizing fields? Yeah, it's our codecs…"

And many hyper-optimized implementations of various vector similarities would have to know how the vectors are laid out in memory. That logic and work should be coupled to the codec.

> The point of the base codecs is to provide good out of the box functionalities that work in all cases. Blindly accepting any type of similarities generally is a recipe for failure, we should consider adding a new similarity as something very expert that requires dealing with a new format entirely.

++

> I am also keen to reopen the discussion around simplifying the similarity we currently support.

Agreed. I am going to open a PR soon to deprecate cosine; it's probably the most useless one we have now.

> I personally like the fact that it's a simple enum with very few values. The issue in how it is exposed today is because each value is linked with an implementation. I think it would be valuable to make the implementation of similarities a detail that each knn format needs to provide. Defining similarity independently of the format complicates matters without much benefit. The only perceived advantage currently is ensuring consistent scoring when querying a field with different knn formats within a single index. However, I question the practicality and necessity of this capability.

The scoring consistency is a nice benefit. But tying implementation and scoring to the codec does give us a natural way to deprecate and evolve support for similarities.

For example, I could see dot_product and maximum_inner_product being merged into one similarity in the future. The main sticking point is the danged scoring, as the scaling for the dot_product similarity is so unique and different.
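For reference, this is roughly how the two similarities scale raw dot products into non-negative scores today (paraphrasing VectorSimilarityFunction in Lucene 9.x, not a verbatim excerpt):

```java
// DOT_PRODUCT assumes unit-length vectors, so the raw value is in [-1, 1]:
float dotScore = (1 + dot) / 2;                     // maps into [0, 1]
// MAXIMUM_INNER_PRODUCT accepts unbounded values but must keep scores positive:
float mipScore = dot < 0 ? 1 / (1 - dot) : dot + 1; // maps into (0, +inf)
```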

> If we were to start again I'd argue that just supporting dot-product would be enough and cosine would be left out. I think we can still do that in Lucene 10 and provide the option to normalize the vectors during indexing/querying.

I agree. As for providing an option for normalizing, Lucene already has an optimized VectorUtil function for this.
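For example, VectorUtil.l2normalize scales a vector to unit length in place:

```java
import org.apache.lucene.util.VectorUtil;

float[] v = {3f, 4f};
VectorUtil.l2normalize(v); // v is now {0.6f, 0.8f}; dot product then equals cosine
```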

> My main question here is whether the similarity should be abstracted at the knn format level.
> In my opinion, framing similarity as a universal interface for all knn formats is misleading and could hinder the implementation of other valid knn formats.

It is true that some vector storage mechanisms would disallow certain similarities and others would allow more particular ones (for example, hamming only makes sense for binary values stored in bytes and would fail for floating point).

I have to think about this more. I agree; stepping back and looking at the code required to have pluggable similarities, and the various backwards-compatibility (and capability in general) woes it could incur, is frustrating.

It may not be worth it at all given the cost.

@benwtrent (Member Author)

So, I took another stab at this in #13288.

The main idea there is that instead of adding another pluggable thing, we rely on formats and custom functions. It's an idea similar to the custom compression options for other formats.

The LOC is way smaller, and IMO the implementation ends up much cleaner in general.

@benwtrent (Member Author)

Usurped by #13288; closing.
