add RawTFSimilarity class #13749
Initially I thought that this would require a custom
lucene/core/src/java/org/apache/lucene/search/similarities/TFSimilarity.java
Outdated
@Override
public long computeNorm(FieldInvertState state) {
  return 1; // we dont care
}
I'd recommend against this. It would encode the norm differently from the other similarities, making it impossible for users to switch similarities without reindexing.
I would encode it the same way as BM25Similarity, TFIDFSimilarity, SimilarityBase, etc.
Just ignore the value in scorer(), as you are doing already.
If the user does not want to encode norms, they can omit them on the field; that is a separate concern.
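The suggestion above can be sketched roughly as follows, without any Lucene dependency so that it runs on its own. The class and method names (`NormIgnoringSketch`, `lengthNorm`, `rawTfScore`, `lengthNormalizedScore`) are hypothetical stand-ins, not Lucene API, and the identity "encoding" of the norm is an assumption made for brevity; the real similarities compress the length into a single byte. The length-normalized formula is only an illustrative example of a scorer that needs a meaningful norm.

```java
// Hypothetical stand-ins for Similarity.computeNorm and SimScorer.score; not Lucene API.
final class NormIgnoringSketch {

  /** Index time: store the field length the way a length-aware similarity expects it. */
  static long lengthNorm(int fieldLength) {
    // Assumption: identity encoding; the real code uses a lossy one-byte encoding.
    return fieldLength;
  }

  /** Query time: raw term frequency times boost. The norm is accepted but ignored. */
  static float rawTfScore(float boost, float freq, long norm) {
    return boost * freq;
  }

  /** Illustrative length-aware scorer that can be swapped in without reindexing. */
  static float lengthNormalizedScore(float boost, float freq, long norm) {
    return boost * freq / (float) Math.sqrt(norm);
  }

  public static void main(String[] args) {
    long norm = lengthNorm(16);
    System.out.println(rawTfScore(2f, 3f, norm));
    System.out.println(lengthNormalizedScore(1f, 4f, norm));
  }
}
```

Because the index still holds the real length, both scorers work against the same data. Had the norm been written as a constant 1, the second scorer would silently compute wrong scores until the index was rebuilt.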
I'd also support removing this footgun as a separate PR. Similarity is supposed to be an "expert" interface, but this abstract method that impacts the index format is trappy, for the reasons shown here.
Maybe it should have a default implementation (currently copied in about 3 or 4 different places in subclasses)? https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L114
I wouldn't mess with any other method, except their javadocs. If we have a default implementation for computeNorm, then we can suggest in these places a code snippet showing how to decode it (e.g. link to SmallFloat.byte4ToInt or whatever) to help users:
- https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L158
- https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L167
It would make the class easier to use. Users might look at BM25Similarity to try to figure out how to do it today, but that one has heavy optimizations such as a precomputed length table. A couple of javadocs would go a long way on the query side there.
Started #13757 to factor out a Similarity.doComputeNorm (or similar) method.
return new SimScorer() {
  @Override
  public float score(float freq, long norm) {
    return boost * freq;
I think I'd prefer a different name, such as RawTFSimilarity.
Otherwise, users will think this is just the "tf" component of "tf/idf" or something. But it is missing even some of the logic for the tf component, e.g. there's no saturation.
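To make the saturation point concrete, here is a small standalone sketch contrasting a raw term frequency with a saturating one. The class and method names are hypothetical, and the saturating form `freq / (freq + k1)` is a simplified BM25-style term-frequency component with length normalization left out; it is an illustration, not the code of any Lucene similarity.

```java
// Hypothetical illustration only; not Lucene API.
final class TfSaturationSketch {

  /** Raw term frequency: grows without bound as freq grows. */
  static float rawTf(float freq) {
    return freq;
  }

  /** Simplified BM25-style saturation: approaches 1 as freq grows, never reaching it. */
  static float saturatedTf(float freq, float k1) {
    return freq / (freq + k1);
  }

  public static void main(String[] args) {
    float k1 = 1.2f; // assumed value, chosen for illustration
    for (float freq : new float[] {1f, 10f, 100f, 1000f}) {
      System.out.println(freq + " raw=" + rawTf(freq) + " saturated=" + saturatedTf(freq, k1));
    }
  }
}
```

With the raw variant, a term occurring 1000 times contributes 1000 times as much as a single occurrence, which is why a name that signals "raw" is less likely to mislead.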
Renamed in 861034b commit.
Your reference to
final int numTerms;
if (state.getIndexOptions() == IndexOptions.DOCS && state.getIndexCreatedVersionMajor() >= 8) {
  numTerms = state.getUniqueTermCount();
} else if (discountOverlaps) {
  numTerms = state.getLength() - state.getNumOverlap();
} else {
  numTerms = state.getLength();
}
return SmallFloat.intToByte4(numTerms);
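The branch logic in the hunk above can be exercised in isolation with plain parameters. This is a sketch under stated assumptions: `NormLengthSketch.numTerms` and its parameter names are hypothetical, `docsOnly` stands in for the `IndexOptions.DOCS` check, and the final `SmallFloat.intToByte4` encoding step is deliberately left out.

```java
// Hypothetical helper mirroring the term-count selection; not Lucene API.
final class NormLengthSketch {

  static int numTerms(
      boolean docsOnly,          // stands in for getIndexOptions() == IndexOptions.DOCS
      int createdVersionMajor,   // stands in for getIndexCreatedVersionMajor()
      boolean discountOverlaps,
      int length,
      int numOverlap,
      int uniqueTermCount) {
    if (docsOnly && createdVersionMajor >= 8) {
      // Frequencies are not indexed, so the unique term count is used as the length.
      return uniqueTermCount;
    } else if (discountOverlaps) {
      // Overlapping tokens are not counted toward the length.
      return length - numOverlap;
    } else {
      return length;
    }
  }

  public static void main(String[] args) {
    System.out.println(numTerms(true, 9, true, 12, 2, 7));
    System.out.println(numTerms(false, 9, true, 12, 2, 7));
    System.out.println(numTerms(false, 9, false, 12, 2, 7));
  }
}
```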
Marking this as draft until #13757 is available and used here.
lucene/core/src/java/org/apache/lucene/search/similarities/RawTFSimilarity.java
Outdated
…TFSimilarity.java
Looks great, thanks @cpoerschke!
(cherry picked from commit a817426 with adjustment to TestRawTFSimilarity.java w.r.t. topDocs.totalHits.value[()] signature)
Motivation is to use the TF like a payload but without needing to have payloads and positions.
see also
PayloadScoreQuery
javadoc update w.r.t. SpanQuery
use