Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add RawTFSimilarity class #13749

Merged
merged 9 commits into from
Sep 17, 2024
Merged

add RawTFSimilarity class #13749

merged 9 commits into from
Sep 17, 2024

Conversation

cpoerschke
Copy link
Contributor

@cpoerschke cpoerschke commented Sep 10, 2024

Motivation is to use the TF like a payload but without needing to have payloads and positions.

see also

@cpoerschke
Copy link
Contributor Author

... use the TF like a payload ...

Initially I thought that this would require a custom Term[Query|Scorer] style change but then from a brief chat with @seanmacavaney (thank you!) I learnt that maybe TFSimilarity could be a thing here instead and it turns out TestConjunctions already had it as an inner class also.

@cpoerschke cpoerschke marked this pull request as ready for review September 10, 2024 09:16
@cpoerschke cpoerschke marked this pull request as draft September 10, 2024 10:24
@cpoerschke cpoerschke marked this pull request as ready for review September 10, 2024 11:01
@Override
public long computeNorm(FieldInvertState state) {
return 1; // we dont care
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend against this. It would encode the norm differently, making it impossible for users to switch similarities without reindexing.

I would encode it the same as BM25Similarity, TFIDFSimilarity, SimilarityBase, etc.

just ignore the value in scorer() as you are doing already.

if the user wants to not encode norms then they can omit them on the field, it is a separate concern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also support removing this footgun as a separate PR. Similarity is supposed to be an "expert" interface, but this abstract method that impacts the index format is trappy, for the reasons shown here.

Maybe it should have a default implementation (currently copied about 3 or 4 different places in subclasses)? https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L114

I wouldn't mess with any other method, except their javadocs. If we have a default implementation for computeNorm then we can suggest in these places a code snippet of how to decode it (e.g. link to SmallFloat.byte4ToInt or whatever) to help users:

It would make the class easier to use. Users might look at BM25Similarity to try to figure out how to do it today, but that one has heavy optimizations such as precomputed length table. A couple javadocs would go a long way on the query-side there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started #13757 to factor out a Similarity.doComputeNorm (or similar) method.

return new SimScorer() {
@Override
public float score(float freq, long norm) {
return boost * freq;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i'd prefer a different name such as RawTFSimilarity.

Otherwise, user thinks this is just "tf" component of "tf/idf" or something. But it is missing even some of the logic for tf component, e.g. there's no saturation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in 861034b commit.

@jpountz
Copy link
Contributor

jpountz commented Sep 10, 2024

Your reference to DelimitedTermFrequencyTokenFilter suggests that the freq here is more a feature than an actual frequency of a term in a doc. From an API perspective, this would make me want to expose it via an IndexableField sub class, with a query factory, a bit like FeatureQuery but for integer values?

Comment on lines 55 to 63
final int numTerms;
if (state.getIndexOptions() == IndexOptions.DOCS && state.getIndexCreatedVersionMajor() >= 8) {
numTerms = state.getUniqueTermCount();
} else if (discountOverlaps) {
numTerms = state.getLength() - state.getNumOverlap();
} else {
numTerms = state.getLength();
}
return SmallFloat.intToByte4(numTerms);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking as draft here until the #13757 is available and used here.

@cpoerschke cpoerschke marked this pull request as draft September 11, 2024 07:54
@cpoerschke cpoerschke changed the title add TFSimilarity class add RawTFSimilarity class Sep 12, 2024
@cpoerschke cpoerschke marked this pull request as ready for review September 13, 2024 09:14
Copy link

@seanmacavaney seanmacavaney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @cpoerschke!

@cpoerschke cpoerschke merged commit a817426 into apache:main Sep 17, 2024
3 checks passed
@cpoerschke cpoerschke deleted the TFSimilarity branch September 17, 2024 12:11
asfgit pushed a commit that referenced this pull request Sep 17, 2024
(cherry picked from commit a817426 with adjustment to TestRawTFSimilarity.java w.r.t. topDocs.totalHits.value[()] signature)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants