-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use SPI instead of Enum for VectorSimilarityFunctions #13401
base: main
Are you sure you want to change the base?
Conversation
@benwtrent @uschindler @ChrisHegarty |
@Pulkitg64 Can we add the reloading the SPIs functionality for VectorSimilarityFunctions just like we have for codecs. Ref: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/Codec.java#L126-L137 this will ensure that vector similarity functions which are not inside Lucene can be created if anyone want to put a different similarity functions like l1, linf etc for doing similarity search. |
Makes sense. Thanks @navneet1v for the suggestion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some comments, but this is not a final review. Just things that I stumbled upon on first walkthrough.
I will have no time to do a closer review soon, so please give me some time.
* </ul> | ||
* </ul> | ||
* | ||
* @lucene.experimental | ||
*/ | ||
public final class Lucene90FieldInfosFormat extends FieldInfosFormat { | ||
|
||
private static final Map<Integer, String> SIMILARITY_FUNCTIONS_MAP = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Java 9+ Map.of()
here
throw new CorruptIndexException("invalid distance function: " + b, input); | ||
/** Returns VectorSimilarityFunction from index input and ordinal value */ | ||
public static VectorSimilarityFunction getDistFunc(IndexInput input, byte b) throws IOException { | ||
if ((int) b < 0 || (int) b >= SIMILARITY_FUNCTIONS_MAP.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check is not correct if we have a sparse set of IDs.
It is better to just use SIMILARITY_FUNCTIONS_MAP.contains(Integer.valueOf(b))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
return VectorSimilarityFunction.values()[b]; | ||
return VectorSimilarityFunction.forName(SIMILARITY_FUNCTIONS_MAP.get((int) b)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use 'Integer.valueOf(b)' so we have type safety, as we should not accidentally use wrong type when looking up in map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* SIMILAIRTY_FUNCTION_MAP containing hardcoded mapping for ordinal to vectorSimilarityFunction | ||
* name | ||
*/ | ||
public static final Map<Integer, String> SIMILARITY_FUNCTIONS_MAP = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, use Map.of()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
return Math.max((1 + dotProduct(v1, v2)) / 2, 0); | ||
static NamedSPILoader<VectorSimilarityFunction> getLoader() { | ||
if (LOADER == null) { | ||
throw new IllegalStateException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add useful message here like in the other SPI holder classes (Postingsformats, codec, docvalues)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -328,6 +332,17 @@ private int getVectorsMaxDimensions(String fieldName) { | |||
return Codec.getDefault().knnVectorsFormat().getMaxDimensions(fieldName); | |||
} | |||
|
|||
private VectorSimilarityFunction randomSimilarity() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be rewritten to not use ServiceLoader directly and instead use the NamedSPILoader in the provider. If it is needed for tests, you can add a function to retrieve all registered similarities in VectorSimilarityFunction class.
We have something similar for analyzers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implement it like this for codecs:
lucene/lucene/core/src/java/org/apache/lucene/codecs/Codec.java
Lines 121 to 124 in 05f04aa
/** returns a list of all available codec names */ | |
public static Set<String> availableCodecs() { | |
return Holder.getLoader().availableServices(); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -1235,17 +1235,17 @@ private static String flags(org.apache.lucene.luke.models.documents.DocumentFiel | |||
sb.append("K"); | |||
sb.append(String.format(Locale.ENGLISH, "%04d", f.getVectorDimension())); | |||
sb.append("/"); | |||
switch (f.getVectorSimilarity()) { | |||
case COSINE: | |||
switch (f.getVectorSimilarity().getName()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would change this to just return the name and remove switch completely.
@@ -19,6 +19,7 @@ | |||
@SuppressWarnings({"module", "requires-automatic", "requires-transitive-automatic"}) | |||
module org.apache.lucene.test_framework { | |||
uses org.apache.lucene.codecs.KnnVectorsFormat; | |||
uses org.apache.lucene.index.VectorSimilarityFunction; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be removed if you impelemt it correctly (see below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are on it, please also remove the ServiceLoader usage in LuceneTestCase that introduces the KnnVectorsFormat here, too.
In general both should have a method to list all available formats/simliarities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also speed up tests using the randomness as the heavy lookup is prevented. This should not be done over and over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! Thanks for the feedback.
I did kind of change before, and the added complexity and backwards compatibility concerns just didn't seem warranted. This is why the decision to do the scorer pluggability was added to the codecs. How do you communicate to codecs the similarity kind? For example, Vector quantization needs to know if the similarity is cosine. Do we do a "cosine".equals(similarity.name())? That is very fragile. What if some one does an "optimized cosine"? Name conflict or they just cant use vector quantization. What if someone adds an SPI similarity that should be normalized before being quantized? I don't see a way of doing this without leaking this internal dependency (normalization being required) into the similarity SPI, which is weird to me. I just don't think this is feasible. |
My old PR: #13200 |
Thank you @uschindler for all the comments, I have tried to address them in the latest commit. Looking forward for more. |
Thank you @benwtrent for the feedback and sharing your PR. I didn't know, this attempt was already made before. The main motivation behind this change was to get rid of ENUM implementation which is tightly coupled to field-info. This has caused inconvenience in deprecating the COSINE function from the list. Not sure what's the best approach then. Will take a look at your PR and comments to understand more about this. |
This is not a very compelling reason supporting the proposed change. I agree with @benwtrent, there are several challenges that would need to be ironed out before we could consider moving this forward. The recent addition of FlatVectorsScorer has certainly improved the extensibility in this area. For now at least, it appears to offer what is needed to plugin in new implementations. |
@@ -3216,10 +3215,11 @@ public static BytesRef newBytesRef(byte[] bytesIn, int offset, int length) { | |||
} | |||
|
|||
protected KnnVectorsFormat randomVectorFormat(VectorEncoding vectorEncoding) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed this already in #13428
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh okay, thank you for doing this.
} | ||
|
||
/** Return list of all VectorSimilarity functions name */ | ||
public static List<String> getAvailableVectorSimilarityFunction() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please do this in the same way like in the other formats and return a set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the function name is wrong: needs to be plural
} | ||
|
||
/** Returns Iterator to VectorSimilarityFunctions */ | ||
public static Iterator<VectorSimilarityFunction> getIterator() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this, not needed. The iterator is highly internal and should not be used.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
Wanted to touch base on this PR as it seems to have been stalled, mainly by me. The only format that would support pluggable similarities would be This now complicates a user's mental model support matrix. Having to consider not only codecs, but similarities, all of which are pluggable. This inherit complexity of making things pluggable is why I think the implication of the "pluggable similarities" is "just make your own format". However, I am not against moving away from All this talk of a pluggable SPI for vector similarities spawned out of the complexities of adding fully BWC similarity functions and the difficulty of deprecating and moving on. So, I propose:
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
Description
This PR is to get feedback on the idea and any major changes required in the commit.
In this commit we are using Java SPI instead of ENUM to define VectorSimilarityFunction used for vector search. Current implementation in Lucene tightly couples ENUM to Lucene index i.e. enum values are directly used in field info which are stored in index and later used for reading purpose. This makes adding or removing any similarity function to the ENUM very difficult as removing any function from ENUM can change the ordinal value of the functions and cause mismatch of field while reading fields from the Lucene index. On the other hand Java SPI makes life easier by providing pluggable implementation of the similarity function. With SPI, we can extend or remove similarity function easily without the need for changing things at indexing and searching sides.
For backward compatibility, I have kept ordinals with the similarity functions which can be directly used for writing/reading fields to/from the index. For Lucene version >=10 we can avoid using ordinal values and directly use function name for reading/writing the index.