
Use SPI instead of Enum for VectorSimilarityFunctions #13401

Open
wants to merge 7 commits into
base: main

Conversation

Pulkitg64
Contributor

@Pulkitg64 Pulkitg64 commented May 21, 2024

Description

This PR is to get feedback on the idea and any major changes required in the commit.

This commit uses Java SPI instead of an enum to define the VectorSimilarityFunction used for vector search. Lucene's current implementation tightly couples the enum to the index: enum ordinals are written directly into field infos, which are stored in the index and read back later. This makes adding or removing a similarity function very difficult, because removing an entry shifts the ordinal values of the remaining functions and causes a mismatch when reading field infos from an existing index. Java SPI, on the other hand, provides pluggable implementations of similarity functions: with SPI we can add or remove a similarity function without changing anything on the indexing or searching side.

For backward compatibility, I have kept an ordinal with each similarity function, which can be used directly when writing/reading fields to/from the index. For Lucene versions >= 10 we can avoid ordinal values and use the function name directly when reading/writing the index.
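As an illustration of the direction described above, a name-plus-ordinal provider contract might look like the following sketch; all type names, method names, and the ordinal value here are hypothetical, not the PR's actual code:

```java
// Hypothetical SPI contract: a stable name for Lucene 10+ field infos, plus a
// legacy ordinal so indexes written by earlier versions stay readable.
interface VectorSimilarity {
    String getName();
    int getOrdinal();
    float compare(float[] v1, float[] v2);
}

// Example provider; real implementations would be registered via a
// META-INF/services entry so the service loader can discover them.
final class DotProductSimilarity implements VectorSimilarity {
    @Override
    public String getName() {
        return "dot_product";
    }

    @Override
    public int getOrdinal() {
        return 1; // illustrative; kept fixed forever once written to an index
    }

    @Override
    public float compare(float[] v1, float[] v2) {
        float dot = 0f;
        for (int i = 0; i < v1.length; i++) {
            dot += v1[i] * v2[i];
        }
        // map the raw dot product into a non-negative score, as the enum does
        return Math.max((1 + dot) / 2, 0);
    }
}
```

Because the ordinal travels with the provider rather than being derived from declaration order, removing one implementation would no longer shift the values of the others.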

@Pulkitg64
Contributor Author

@benwtrent @uschindler @ChrisHegarty
Could you please take a look, if you get a chance?

@navneet1v
Contributor

navneet1v commented May 23, 2024

@Pulkitg64
+1 on the feature and functionality. I would like to recommend one thing here:

Can we add SPI-reloading functionality for VectorSimilarityFunctions, just like we have for codecs? Ref: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/Codec.java#L126-L137

This will ensure that vector similarity functions that are not bundled with Lucene can be plugged in, if anyone wants to use a different similarity function such as L1 or Linf for similarity search.
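Mirroring Codec's reload hook, such a loader could look roughly like this sketch; the holder class and the VectorSimilarity interface here are assumptions for illustration, not the PR's actual code:

```java
import java.util.ServiceLoader;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical SPI interface standing in for the real similarity type.
interface VectorSimilarity {
    String getName();
}

final class VectorSimilarityLoader {
    private static volatile ServiceLoader<VectorSimilarity> loader =
            ServiceLoader.load(VectorSimilarity.class);

    /** Re-scans the given classpath for providers, analogous to Codec#reloadCodecs. */
    static void reload(ClassLoader classLoader) {
        loader = ServiceLoader.load(VectorSimilarity.class, classLoader);
    }

    /** Names of all similarities currently visible to the loader. */
    static Set<String> availableSimilarities() {
        Set<String> names = new TreeSet<>();
        for (VectorSimilarity s : loader) {
            names.add(s.getName());
        }
        return names;
    }
}
```

With such a hook, third-party similarities dropped onto the classpath become visible without any change to Lucene itself.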

@Pulkitg64
Contributor Author

Makes sense. Thanks @navneet1v for the suggestion.

Contributor

@uschindler uschindler left a comment

I have some comments, but this is not a final review. Just things that I stumbled upon on first walkthrough.

I will not have time to do a closer review soon, so please give me some time.

* </ul>
* </ul>
*
* @lucene.experimental
*/
public final class Lucene90FieldInfosFormat extends FieldInfosFormat {

private static final Map<Integer, String> SIMILARITY_FUNCTIONS_MAP = new HashMap<>();
Contributor

Use Java 9+ Map.of() here

/** Returns VectorSimilarityFunction from index input and ordinal value */
public static VectorSimilarityFunction getDistFunc(IndexInput input, byte b) throws IOException {
if ((int) b < 0 || (int) b >= SIMILARITY_FUNCTIONS_MAP.size()) {
throw new CorruptIndexException("invalid distance function: " + b, input);
Contributor

this check is not correct if we have a sparse set of IDs.

It is better to just use SIMILARITY_FUNCTIONS_MAP.containsKey(Integer.valueOf(b))
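The two fixes discussed in this thread (an immutable Map.of() and a key-presence check that tolerates sparse ids) could be combined as in this sketch; the ordinal-to-name pairs and the helper class are illustrative, not Lucene's actual table:

```java
import java.io.IOException;
import java.util.Map;

final class SimilarityIds {
    // Immutable Java 9+ map, per the Map.of() suggestion.
    static final Map<Integer, String> SIMILARITY_FUNCTIONS_MAP = Map.of(
            0, "euclidean",
            1, "dot_product",
            2, "cosine",
            3, "maximum_inner_product");

    /** Resolves an on-disk ordinal; a key check stays correct even for sparse id sets. */
    static String nameForOrdinal(byte b) throws IOException {
        Integer key = Integer.valueOf(b);
        if (!SIMILARITY_FUNCTIONS_MAP.containsKey(key)) {
            throw new IOException("invalid distance function: " + b);
        }
        return SIMILARITY_FUNCTIONS_MAP.get(key);
    }
}
```

Unlike a range check against the map size, the containsKey lookup keeps working if, say, id 2 is later retired while ids 0, 1, and 3 remain.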

Contributor Author

Done

}
return VectorSimilarityFunction.values()[b];
return VectorSimilarityFunction.forName(SIMILARITY_FUNCTIONS_MAP.get((int) b));
Contributor

use 'Integer.valueOf(b)' so we have type safety; we should not accidentally use the wrong type when looking up the map.

Contributor Author

Done

* SIMILARITY_FUNCTIONS_MAP containing the hardcoded mapping from ordinal to vector similarity function
* name
*/
public static final Map<Integer, String> SIMILARITY_FUNCTIONS_MAP = new HashMap<>();
Contributor

same here, use Map.of()

Contributor Author

Done

return Math.max((1 + dotProduct(v1, v2)) / 2, 0);
static NamedSPILoader<VectorSimilarityFunction> getLoader() {
if (LOADER == null) {
throw new IllegalStateException();
Contributor

add a useful message here, like in the other SPI holder classes (PostingsFormat, Codec, DocValuesFormat)

Contributor Author

Done
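The holder pattern with a descriptive failure message, as used by the other SPI holder classes, might look like this sketch; the class name and message text are illustrative:

```java
import java.util.ServiceLoader;

// Hypothetical similarity type for illustration.
interface VectorSimilarity {
    String getName();
}

final class Holder {
    // Would normally be set during class initialization; left null here to
    // demonstrate the error path.
    private static volatile ServiceLoader<VectorSimilarity> LOADER = null;

    static ServiceLoader<VectorSimilarity> getLoader() {
        if (LOADER == null) {
            throw new IllegalStateException(
                    "You tried to look up a VectorSimilarityFunction by name before the loader was initialized. "
                            + "This can happen if you call forName from a similarity's constructor.");
        }
        return LOADER;
    }
}
```

A message like this tells the caller not just that the lookup failed, but why it commonly fails, which is much easier to debug than a bare IllegalStateException.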

@@ -328,6 +332,17 @@ private int getVectorsMaxDimensions(String fieldName) {
return Codec.getDefault().knnVectorsFormat().getMaxDimensions(fieldName);
}

private VectorSimilarityFunction randomSimilarity() {
Contributor

this should be rewritten to not use ServiceLoader directly and instead use the NamedSPILoader in the provider. If it is needed for tests, you can add a function to retrieve all registered similarities in VectorSimilarityFunction class.

We have something similar for analyzers.

Contributor

Implement it like this for codecs:

/** returns a list of all available codec names */
public static Set<String> availableCodecs() {
return Holder.getLoader().availableServices();
}

Contributor Author

Done

@@ -1235,17 +1235,17 @@ private static String flags(org.apache.lucene.luke.models.documents.DocumentFiel
sb.append("K");
sb.append(String.format(Locale.ENGLISH, "%04d", f.getVectorDimension()));
sb.append("/");
switch (f.getVectorSimilarity()) {
case COSINE:
switch (f.getVectorSimilarity().getName()) {
Contributor

I would change this to just return the name and remove the switch completely.
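The suggestion amounts to appending the similarity's name directly instead of switching over every known constant, roughly as follows; the class and method names here are illustrative, not the Luke code:

```java
import java.util.Locale;

final class VectorFlags {
    /** Builds the flag string by appending the similarity name directly. */
    static String format(int dimension, String similarityName) {
        StringBuilder sb = new StringBuilder();
        sb.append("K");
        sb.append(String.format(Locale.ENGLISH, "%04d", dimension));
        sb.append("/");
        sb.append(similarityName); // no per-constant switch needed
        return sb.toString();
    }
}
```

This also means newly registered SPI similarities render correctly without touching this code again.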

@@ -19,6 +19,7 @@
@SuppressWarnings({"module", "requires-automatic", "requires-transitive-automatic"})
module org.apache.lucene.test_framework {
uses org.apache.lucene.codecs.KnnVectorsFormat;
uses org.apache.lucene.index.VectorSimilarityFunction;
Contributor

this can be removed if you implement it correctly (see below)

Contributor

@uschindler uschindler May 23, 2024

While you are at it, please also remove the ServiceLoader usage in LuceneTestCase that introduces the KnnVectorsFormat here, too.

In general, both should have a method to list all available formats/similarities.

Contributor

This will also speed up tests that use the randomness, since the heavy lookup is avoided rather than repeated over and over.

Contributor Author

Done! Thanks for the feedback.

@benwtrent
Member

I attempted this kind of change before, and the added complexity and backwards-compatibility concerns just didn't seem warranted. This is why scorer pluggability was added to the codecs instead.

How do you communicate the similarity kind to codecs?

For example, vector quantization needs to know whether the similarity is cosine. Do we do a "cosine".equals(similarity.name())? That is very fragile. What if someone adds an "optimized cosine"? Name conflict, or they just can't use vector quantization.

What if someone adds an SPI similarity that should be normalized before being quantized?

I don't see a way of doing this without leaking this internal dependency (normalization being required) into the similarity SPI, which is weird to me.

I just don't think this is feasible.

@benwtrent
Member

My old PR: #13200

@Pulkitg64
Contributor Author

Pulkitg64 commented May 24, 2024

I have some comments, but this is not a final review. Just things that I stumbled upon on first walkthrough.

I will not have time to do a closer review soon, so please give me some time.

Thank you @uschindler for all the comments; I have tried to address them in the latest commit. Looking forward to more.

@Pulkitg64
Contributor Author

I attempted this kind of change before, and the added complexity and backwards-compatibility concerns just didn't seem warranted. This is why scorer pluggability was added to the codecs instead.

How do you communicate the similarity kind to codecs?

For example, vector quantization needs to know whether the similarity is cosine. Do we do a "cosine".equals(similarity.name())? That is very fragile. What if someone adds an "optimized cosine"? Name conflict, or they just can't use vector quantization.

What if someone adds an SPI similarity that should be normalized before being quantized?

I don't see a way of doing this without leaking this internal dependency (normalization being required) into the similarity SPI, which is weird to me.

I just don't think this is feasible.

Thank you @benwtrent for the feedback and for sharing your PR. I didn't know this had been attempted before. The main motivation behind this change was to get rid of the enum implementation, which is tightly coupled to field infos. That coupling has made it inconvenient to deprecate the COSINE function. I'm not sure what the best approach is then. I will take a look at your PR and comments to understand more about this.

@Pulkitg64 Pulkitg64 requested a review from uschindler May 26, 2024 05:30
@ChrisHegarty
Contributor

The main motivation behind this change was to get rid of the enum implementation, which is tightly coupled to field infos. That coupling has made it inconvenient to deprecate the COSINE function. I'm not sure what the best approach is then. I will take a look at your PR and comments to understand more about this.

This is not a very compelling reason for the proposed change. I agree with @benwtrent; there are several challenges that would need to be ironed out before we could consider moving this forward.

The recent addition of FlatVectorsScorer has certainly improved extensibility in this area. For now at least, it appears to offer what is needed to plug in new implementations.

@@ -3216,10 +3215,11 @@ public static BytesRef newBytesRef(byte[] bytesIn, int offset, int length) {
}

protected KnnVectorsFormat randomVectorFormat(VectorEncoding vectorEncoding) {
Contributor

I fixed this already in #13428

Contributor Author

Oh okay, thank you for doing this.

}

/** Return list of all VectorSimilarity functions name */
public static List<String> getAvailableVectorSimilarityFunction() {
Contributor

please do this the same way as in the other formats and return a set

Contributor

the function name is wrong: needs to be plural

}

/** Returns Iterator to VectorSimilarityFunctions */
public static Iterator<VectorSimilarityFunction> getIterator() {
Contributor

remove this, not needed. The iterator is highly internal and should not be used.


This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jun 12, 2024
@benwtrent
Member

Wanted to touch base on this PR as it seems to have stalled, mainly because of me.

The only format that would support pluggable similarities would be Lucene99HnswVectorsFormat. Any of the quantized codecs would have to throw an exception on an unknown similarity name.

This complicates a user's mental support matrix: they now have to consider not only codecs but also similarities, all of which are pluggable.

This inherent complexity of making things pluggable is why I think the implication of "pluggable similarities" is "just make your own format".

However, I am not against moving away from the enum and towards a nominal/id set of core interfaces. Enums are notoriously painful for BWC, as removing a value shifts the ordinals, and the resulting edge cases then have to be smoothed out all over the place.

All this talk of a pluggable SPI for vector similarities grew out of the complexity of adding fully backwards-compatible similarity functions and the difficulty of deprecating and moving on.

So, I propose:

  • We deprecate cosine (as we already have) and remove it from being writable in v10
  • Move away from enums to an id/nominal system for the similarities (what this PR could do)
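A minimal sketch of the second bullet, assuming hypothetical ids and names: each constant carries an explicit id that is written to the index, so retiring a value (here COSINE's id 2) leaves the remaining on-disk ids untouched even though the Java ordinals shift.

```java
import java.util.HashMap;
import java.util.Map;

enum SimilarityId {
    EUCLIDEAN(0),
    DOT_PRODUCT(1),
    // id 2 (COSINE) retired; the value is never reused, so on-disk ids stay stable
    MAXIMUM_INNER_PRODUCT(3);

    final int id; // written to the index instead of ordinal()

    SimilarityId(int id) {
        this.id = id;
    }

    private static final Map<Integer, SimilarityId> BY_ID = new HashMap<>();

    static {
        for (SimilarityId s : values()) {
            BY_ID.put(s.id, s);
        }
    }

    static SimilarityId forId(int id) {
        SimilarityId s = BY_ID.get(id);
        if (s == null) {
            throw new IllegalArgumentException("unknown similarity id: " + id);
        }
        return s;
    }
}
```

The key property is that forId never consults ordinal(), so declaration order stops being part of the index format.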

@github-actions github-actions bot removed the Stale label Jun 19, 2024

github-actions bot commented Jul 3, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jul 3, 2024