
Use SPI instead of Enum for VectorSimilarityFunctions #13401

Open
wants to merge 7 commits into
base: main

Conversation

Pulkitg64
Contributor

@Pulkitg64 Pulkitg64 commented May 21, 2024

Description

This PR is to get feedback on the idea and any major changes required in the commit.

This commit uses Java SPI instead of an enum to define the VectorSimilarityFunction used for vector search. Lucene's current implementation tightly couples the enum to the index: enum ordinals are written directly into field infos, which are stored in the index and read back later. This makes adding or removing a similarity function very difficult, because removing an entry shifts the ordinal values of the remaining functions and causes a mismatch when reading field infos from an existing index. Java SPI, on the other hand, provides pluggable implementations of similarity functions: with SPI we can add or remove a similarity function without changing anything on the indexing or searching side.

For backward compatibility, I have kept an ordinal with each similarity function, which can be used directly when writing/reading fields to/from the index. For Lucene versions >= 10 we can avoid ordinal values and use the function name directly when reading/writing the index.
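As an illustration of the direction described above, a name-plus-ordinal provider contract might look like the following sketch; all type names, method names, and the ordinal value here are hypothetical, not the PR's actual code:

```java
// Hypothetical SPI contract: a stable name for Lucene 10+ field infos, plus a
// legacy ordinal so indexes written by earlier versions stay readable.
interface VectorSimilarity {
    String getName();
    int getOrdinal();
    float compare(float[] v1, float[] v2);
}

// Example provider; real implementations would be registered via a
// META-INF/services entry so the service loader can discover them.
final class DotProductSimilarity implements VectorSimilarity {
    @Override
    public String getName() {
        return "dot_product";
    }

    @Override
    public int getOrdinal() {
        return 1; // illustrative; kept fixed forever once written to an index
    }

    @Override
    public float compare(float[] v1, float[] v2) {
        float dot = 0f;
        for (int i = 0; i < v1.length; i++) {
            dot += v1[i] * v2[i];
        }
        // map the raw dot product into a non-negative score, as the enum does
        return Math.max((1 + dot) / 2, 0);
    }
}
```

Because the ordinal travels with the provider rather than being derived from declaration order, removing one implementation would no longer shift the values of the others.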

@Pulkitg64
Contributor Author

@benwtrent @uschindler @ChrisHegarty
Could you please take a look, if you get a chance?

@navneet1v
Contributor

navneet1v commented May 23, 2024

@Pulkitg64
+1 on the feature and functionality. I would like to recommend one thing here:

Can we add SPI-reloading functionality for VectorSimilarityFunctions, just like we have for codecs? Ref: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/Codec.java#L126-L137

This will ensure that vector similarity functions that are not bundled with Lucene can be plugged in, if anyone wants to use a different similarity function such as L1 or Linf for similarity search.
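Mirroring Codec's reload hook, such a loader could look roughly like this sketch; the holder class and the VectorSimilarity interface here are assumptions for illustration, not the PR's actual code:

```java
import java.util.ServiceLoader;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical SPI interface standing in for the real similarity type.
interface VectorSimilarity {
    String getName();
}

final class VectorSimilarityLoader {
    private static volatile ServiceLoader<VectorSimilarity> loader =
            ServiceLoader.load(VectorSimilarity.class);

    /** Re-scans the given classpath for providers, analogous to Codec#reloadCodecs. */
    static void reload(ClassLoader classLoader) {
        loader = ServiceLoader.load(VectorSimilarity.class, classLoader);
    }

    /** Names of all similarities currently visible to the loader. */
    static Set<String> availableSimilarities() {
        Set<String> names = new TreeSet<>();
        for (VectorSimilarity s : loader) {
            names.add(s.getName());
        }
        return names;
    }
}
```

With such a hook, third-party similarities dropped onto the classpath become visible without any change to Lucene itself.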

@Pulkitg64
Contributor Author

Makes sense. Thanks @navneet1v for the suggestion.

Contributor

@uschindler uschindler left a comment

I have some comments, but this is not a final review. Just things that I stumbled upon on first walkthrough.

I will not have time to do a closer review soon, so please give me some time.

* </ul>
* </ul>
*
* @lucene.experimental
*/
public final class Lucene90FieldInfosFormat extends FieldInfosFormat {

private static final Map<Integer, String> SIMILARITY_FUNCTIONS_MAP = new HashMap<>();
Contributor

Use Java 9+ Map.of() here

/** Returns VectorSimilarityFunction from index input and ordinal value */
public static VectorSimilarityFunction getDistFunc(IndexInput input, byte b) throws IOException {
if ((int) b < 0 || (int) b >= SIMILARITY_FUNCTIONS_MAP.size()) {
throw new CorruptIndexException("invalid distance function: " + b, input);
Contributor

this check is not correct if we have a sparse set of IDs.

It is better to just use SIMILARITY_FUNCTIONS_MAP.containsKey(Integer.valueOf(b))
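The two fixes discussed in this thread (an immutable Map.of() and a key-presence check that tolerates sparse ids) could be combined as in this sketch; the ordinal-to-name pairs and the helper class are illustrative, not Lucene's actual table:

```java
import java.io.IOException;
import java.util.Map;

final class SimilarityIds {
    // Immutable Java 9+ map, per the Map.of() suggestion.
    static final Map<Integer, String> SIMILARITY_FUNCTIONS_MAP = Map.of(
            0, "euclidean",
            1, "dot_product",
            2, "cosine",
            3, "maximum_inner_product");

    /** Resolves an on-disk ordinal; a key check stays correct even for sparse id sets. */
    static String nameForOrdinal(byte b) throws IOException {
        Integer key = Integer.valueOf(b);
        if (!SIMILARITY_FUNCTIONS_MAP.containsKey(key)) {
            throw new IOException("invalid distance function: " + b);
        }
        return SIMILARITY_FUNCTIONS_MAP.get(key);
    }
}
```

Unlike a range check against the map size, the containsKey lookup keeps working if, say, id 2 is later retired while ids 0, 1, and 3 remain.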

Contributor Author

Done

}
return VectorSimilarityFunction.values()[b];
return VectorSimilarityFunction.forName(SIMILARITY_FUNCTIONS_MAP.get((int) b));
Contributor

use 'Integer.valueOf(b)' so we have type safety; we should not accidentally use the wrong type when looking up the map.

Contributor Author

Done

* SIMILARITY_FUNCTIONS_MAP containing the hardcoded mapping from ordinal to vector similarity function
* name
*/
public static final Map<Integer, String> SIMILARITY_FUNCTIONS_MAP = new HashMap<>();
Contributor

same here, use Map.of()

Contributor Author

Done

return Math.max((1 + dotProduct(v1, v2)) / 2, 0);
static NamedSPILoader<VectorSimilarityFunction> getLoader() {
if (LOADER == null) {
throw new IllegalStateException();
Contributor

add a useful message here, like in the other SPI holder classes (PostingsFormat, Codec, DocValuesFormat)

Contributor Author

Done
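The holder pattern with a descriptive failure message, as used by the other SPI holder classes, might look like this sketch; the class name and message text are illustrative:

```java
import java.util.ServiceLoader;

// Hypothetical similarity type for illustration.
interface VectorSimilarity {
    String getName();
}

final class Holder {
    // Would normally be set during class initialization; left null here to
    // demonstrate the error path.
    private static volatile ServiceLoader<VectorSimilarity> LOADER = null;

    static ServiceLoader<VectorSimilarity> getLoader() {
        if (LOADER == null) {
            throw new IllegalStateException(
                    "You tried to look up a VectorSimilarityFunction by name before the loader was initialized. "
                            + "This can happen if you call forName from a similarity's constructor.");
        }
        return LOADER;
    }
}
```

A message like this tells the caller not just that the lookup failed, but why it commonly fails, which is much easier to debug than a bare IllegalStateException.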

@@ -328,6 +332,17 @@ private int getVectorsMaxDimensions(String fieldName) {
return Codec.getDefault().knnVectorsFormat().getMaxDimensions(fieldName);
}

private VectorSimilarityFunction randomSimilarity() {
Contributor

this should be rewritten to not use ServiceLoader directly and instead use the NamedSPILoader in the provider. If it is needed for tests, you can add a function to retrieve all registered similarities in VectorSimilarityFunction class.

We have something similar for analyzers.

Contributor

Implement it like this for codecs:

/** returns a list of all available codec names */
public static Set<String> availableCodecs() {
return Holder.getLoader().availableServices();
}

Contributor Author

Done

@@ -1235,17 +1235,17 @@ private static String flags(org.apache.lucene.luke.models.documents.DocumentFiel
sb.append("K");
sb.append(String.format(Locale.ENGLISH, "%04d", f.getVectorDimension()));
sb.append("/");
switch (f.getVectorSimilarity()) {
case COSINE:
switch (f.getVectorSimilarity().getName()) {
Contributor

I would change this to just return the name and remove the switch completely.
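The suggestion amounts to appending the similarity's name directly instead of switching over every known constant, roughly as follows; the class and method names here are illustrative, not the Luke code:

```java
import java.util.Locale;

final class VectorFlags {
    /** Builds the flag string by appending the similarity name directly. */
    static String format(int dimension, String similarityName) {
        StringBuilder sb = new StringBuilder();
        sb.append("K");
        sb.append(String.format(Locale.ENGLISH, "%04d", dimension));
        sb.append("/");
        sb.append(similarityName); // no per-constant switch needed
        return sb.toString();
    }
}
```

This also means newly registered SPI similarities render correctly without touching this code again.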

@@ -19,6 +19,7 @@
@SuppressWarnings({"module", "requires-automatic", "requires-transitive-automatic"})
module org.apache.lucene.test_framework {
uses org.apache.lucene.codecs.KnnVectorsFormat;
uses org.apache.lucene.index.VectorSimilarityFunction;
Contributor

this can be removed if you implement it correctly (see below)

Contributor

@uschindler uschindler May 23, 2024

While you are at it, please also remove the ServiceLoader usage in LuceneTestCase that introduces the KnnVectorsFormat here, too.

In general, both should have a method to list all available formats/similarities.

Contributor

This will also speed up tests that use the randomness, since the heavy lookup is avoided rather than repeated over and over.

Contributor Author

Done! Thanks for the feedback.

@benwtrent
Member

I attempted this kind of change before, and the added complexity and backwards-compatibility concerns just didn't seem warranted. This is why scorer pluggability was added to the codecs instead.

How do you communicate the similarity kind to codecs?

For example, vector quantization needs to know whether the similarity is cosine. Do we do a "cosine".equals(similarity.name())? That is very fragile. What if someone adds an "optimized cosine"? Name conflict, or they just can't use vector quantization.

What if someone adds an SPI similarity that should be normalized before being quantized?

I don't see a way of doing this without leaking this internal dependency (normalization being required) into the similarity SPI, which is weird to me.

I just don't think this is feasible.

@benwtrent
Member

My old PR: #13200

@Pulkitg64
Contributor Author

Pulkitg64 commented May 24, 2024

I have some comments, but this is not a final review. Just things that I stumbled upon on first walkthrough.

I will not have time to do a closer review soon, so please give me some time.

Thank you @uschindler for all the comments; I have tried to address them in the latest commit. Looking forward to more.

@Pulkitg64
Contributor Author

I attempted this kind of change before, and the added complexity and backwards-compatibility concerns just didn't seem warranted. This is why scorer pluggability was added to the codecs instead.

How do you communicate the similarity kind to codecs?

For example, vector quantization needs to know whether the similarity is cosine. Do we do a "cosine".equals(similarity.name())? That is very fragile. What if someone adds an "optimized cosine"? Name conflict, or they just can't use vector quantization.

What if someone adds an SPI similarity that should be normalized before being quantized?

I don't see a way of doing this without leaking this internal dependency (normalization being required) into the similarity SPI, which is weird to me.

I just don't think this is feasible.

Thank you @benwtrent for the feedback and for sharing your PR. I didn't know this had been attempted before. The main motivation behind this change was to get rid of the enum implementation, which is tightly coupled to field infos. That coupling has made it inconvenient to deprecate the COSINE function. I'm not sure what the best approach is then. I will take a look at your PR and comments to understand more about this.

@Pulkitg64 Pulkitg64 requested a review from uschindler May 26, 2024 05:30
@ChrisHegarty
Contributor

The main motivation behind this change was to get rid of the enum implementation, which is tightly coupled to field infos. That coupling has made it inconvenient to deprecate the COSINE function. I'm not sure what the best approach is then. I will take a look at your PR and comments to understand more about this.

This is not a very compelling reason for the proposed change. I agree with @benwtrent; there are several challenges that would need to be ironed out before we could consider moving this forward.

The recent addition of FlatVectorsScorer has certainly improved extensibility in this area. For now at least, it appears to offer what is needed to plug in new implementations.

@@ -3216,10 +3215,11 @@ public static BytesRef newBytesRef(byte[] bytesIn, int offset, int length) {
}

protected KnnVectorsFormat randomVectorFormat(VectorEncoding vectorEncoding) {
Contributor

I fixed this already in #13428

Contributor Author

Oh okay, thank you for doing this.

}

/** Return list of all VectorSimilarity functions name */
public static List<String> getAvailableVectorSimilarityFunction() {
Contributor

please do this the same way as in the other formats and return a set

Contributor

the function name is wrong: needs to be plural

}

/** Returns Iterator to VectorSimilarityFunctions */
public static Iterator<VectorSimilarityFunction> getIterator() {
Contributor

remove this, not needed. The iterator is highly internal and should not be used.


This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jun 12, 2024
@benwtrent
Member

Wanted to touch base on this PR as it seems to have stalled, mainly because of me.

The only format that would support pluggable similarities would be Lucene99HnswVectorsFormat. Any of the quantized codecs would have to throw an exception on an unknown similarity name.

This complicates a user's mental support matrix: they now have to consider not only codecs but also similarities, all of which are pluggable.

This inherent complexity of making things pluggable is why I think the implication of "pluggable similarities" is "just make your own format".

However, I am not against moving away from the enum and towards a nominal/id set of core interfaces. Enums are notoriously painful for BWC, as removing a value shifts the ordinals, and the resulting edge cases then have to be smoothed out all over the place.

All this talk of a pluggable SPI for vector similarities grew out of the complexity of adding fully backwards-compatible similarity functions and the difficulty of deprecating and moving on.

So, I propose:

  • We deprecate cosine (as we already have) and remove it from being writable in v10
  • Move away from enums to an id/nominal system for the similarities (what this PR could do)
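A minimal sketch of the second bullet, assuming hypothetical ids and names: each constant carries an explicit id that is written to the index, so retiring a value (here COSINE's id 2) leaves the remaining on-disk ids untouched even though the Java ordinals shift.

```java
import java.util.HashMap;
import java.util.Map;

enum SimilarityId {
    EUCLIDEAN(0),
    DOT_PRODUCT(1),
    // id 2 (COSINE) retired; the value is never reused, so on-disk ids stay stable
    MAXIMUM_INNER_PRODUCT(3);

    final int id; // written to the index instead of ordinal()

    SimilarityId(int id) {
        this.id = id;
    }

    private static final Map<Integer, SimilarityId> BY_ID = new HashMap<>();

    static {
        for (SimilarityId s : values()) {
            BY_ID.put(s.id, s);
        }
    }

    static SimilarityId forId(int id) {
        SimilarityId s = BY_ID.get(id);
        if (s == null) {
            throw new IllegalArgumentException("unknown similarity id: " + id);
        }
        return s;
    }
}
```

The key property is that forId never consults ordinal(), so declaration order stops being part of the index format.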

@github-actions github-actions bot removed the Stale label Jun 19, 2024

github-actions bot commented Jul 3, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jul 3, 2024