add RawTFSimilarity class #13749

cpoerschke · 2024-09-10T09:00:02Z

Motivation is to use the TF like a payload but without needing to have payloads and positions.

see also

PayloadScoreQuery javadoc update w.r.t. SpanQuery use #13731
https://github.com/apache/lucene/blob/releases/lucene/9.11.1/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.java for indexing
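As a rough illustration of the idea, the sketch below models in plain Java (no Lucene dependency) what a delimited term frequency conveys at index time: a token such as `quality|7` is split into the term and an integer that is stored as its term frequency, and a raw-TF similarity can then hand that integer back as the score. The names `DelimitedTfSketch`, `parse` and `rawScore` are hypothetical, and the handling of repeated terms (summing) is an assumption of this model, not a statement about the filter's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical plain-Java model of "term|freq" tokens; this is not Lucene code. */
final class DelimitedTfSketch {

  /**
   * Splits whitespace-separated tokens at the first delimiter.
   * A token without a delimiter counts as frequency 1.
   * A non-numeric frequency makes Integer.parseInt throw NumberFormatException.
   */
  static Map<String, Integer> parse(String text, char delimiter) {
    Map<String, Integer> freqs = new LinkedHashMap<>();
    for (String token : text.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue; // empty input yields a single empty token
      }
      int idx = token.indexOf(delimiter);
      if (idx < 0) {
        freqs.merge(token, 1, Integer::sum);
      } else {
        int freq = Integer.parseInt(token.substring(idx + 1));
        // Assumption of this model: repeated terms add up their frequencies.
        freqs.merge(token.substring(0, idx), freq, Integer::sum);
      }
    }
    return freqs;
  }

  /** Raw-TF style score: the stored frequency times the query boost, 0 if the term is absent. */
  static float rawScore(Map<String, Integer> freqs, String term, float boost) {
    return boost * freqs.getOrDefault(term, 0);
  }
}
```

With this model, `parse("quality|7 price|3", '|')` maps `quality` to 7, and a query on `quality` with boost 2 scores 14, which is the "TF as payload" behaviour the description is after.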

cpoerschke · 2024-09-10T09:01:46Z

... use the TF like a payload ...

Initially I thought that this would require a custom Term[Query|Scorer] style change, but from a brief chat with @seanmacavaney (thank you!) I learnt that a TFSimilarity might work here instead. It turns out TestConjunctions already had one as an inner class.


rmuir · 2024-09-10T18:00:33Z

lucene/core/src/java/org/apache/lucene/search/similarities/TFSimilarity.java

+  @Override
+  public long computeNorm(FieldInvertState state) {
+    return 1; // we dont care
+  }


I'd recommend against this. It would encode the norm differently, making it impossible for users to switch similarities without reindexing.

I would encode it the same as BM25Similarity, TFIDFSimilarity, SimilarityBase, etc.

Just ignore the value in scorer() as you are doing already.

If the user does not want to encode norms, they can omit them on the field; that is a separate concern.
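A minimal sketch of that separation, written in plain Java so it runs without Lucene: the norm is still passed to the scorer (because it is still written to the index in the usual encoding), and the raw-TF scorer simply does not read it. `ScorerSketch` and `RawTfScorerSketch` are hypothetical stand-ins that only mirror the `score(float freq, long norm)` shape quoted in this review.

```java
/** Hypothetical stand-in for the shape of a per-term scorer: score(freq, norm). */
interface ScorerSketch {
  float score(float freq, long norm);
}

/** Hypothetical factory; not the actual RawTFSimilarity implementation. */
final class RawTfScorerSketch {

  /** Returns a scorer that multiplies the frequency by the boost and ignores the norm. */
  static ScorerSketch rawTf(float boost) {
    return (freq, norm) -> boost * freq; // norm is accepted but deliberately unused
  }

  public static void main(String[] args) {
    ScorerSketch scorer = rawTf(2f);
    // Same frequency, different norms: the score does not change.
    System.out.println(scorer.score(3f, 1L));
    System.out.println(scorer.score(3f, 255L));
  }
}
```

Because the norm value is only ignored at query time, an index written this way could later be searched with a length-normalizing similarity without reindexing, which is the concern raised above.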

I'd also support removing this footgun as a separate PR. Similarity is supposed to be an "expert" interface, but this abstract method that impacts the index format is trappy, for the reasons shown here.

Maybe it should have a default implementation (currently copied in about 3 or 4 different places in subclasses)? https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L114

I wouldn't mess with any other method, except their javadocs. If we have a default implementation for computeNorm then we can suggest in these places a code snippet of how to decode it (e.g. link to SmallFloat.byte4ToInt or whatever) to help users:

https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L158

https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L167

It would make the class easier to use. Users might look at BM25Similarity to try to figure out how to do it today, but that one has heavy optimizations such as a precomputed length table. A couple of javadocs would go a long way on the query side there.

Started #13757 to factor out a Similarity.doComputeNorm (or similar) method.

rmuir · 2024-09-10T18:16:19Z

lucene/core/src/java/org/apache/lucene/search/similarities/TFSimilarity.java

+    return new SimScorer() {
+      @Override
+      public float score(float freq, long norm) {
+        return boost * freq;


I think I'd prefer a different name such as RawTFSimilarity.

Otherwise, a user might think this is just the "tf" component of "tf/idf" or something. But it is missing even some of the logic for a tf component, e.g. there's no saturation.

Renamed in 861034b commit.
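To make the "no saturation" point concrete, here is a small plain-Java comparison. The saturating variant uses a simplified BM25-style form, `freq / (freq + k1)`, with length normalization left out; that simplification, the class name `TfSaturationSketch` and the parameter value used in the usage note are assumptions for illustration only.

```java
/** Hypothetical comparison of raw and saturating term-frequency scoring; not Lucene code. */
final class TfSaturationSketch {

  /** Raw TF: grows linearly with the frequency, without any upper bound. */
  static float rawTf(float boost, float freq) {
    return boost * freq;
  }

  /**
   * Simplified BM25-style saturation (length normalization omitted):
   * approaches boost as freq grows, so very frequent terms cannot dominate.
   */
  static float saturatedTf(float boost, float freq, float k1) {
    return boost * freq / (freq + k1);
  }
}
```

For a frequency of 100 and a boost of 1, `rawTf` returns 100 while `saturatedTf` with `k1 = 1.2f` stays below 1, which is why the "Raw" prefix describes the behaviour more accurately.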

jpountz · 2024-09-10T20:44:34Z

Your reference to DelimitedTermFrequencyTokenFilter suggests that the freq here is more of a feature than an actual frequency of a term in a doc. From an API perspective, this would make me want to expose it via an IndexableField subclass, with a query factory, a bit like FeatureQuery but for integer values?


cpoerschke · 2024-09-11T07:54:46Z

lucene/core/src/java/org/apache/lucene/search/similarities/RawTFSimilarity.java

+    final int numTerms;
+    if (state.getIndexOptions() == IndexOptions.DOCS && state.getIndexCreatedVersionMajor() >= 8) {
+      numTerms = state.getUniqueTermCount();
+    } else if (discountOverlaps) {
+      numTerms = state.getLength() - state.getNumOverlap();
+    } else {
+      numTerms = state.getLength();
+    }
+    return SmallFloat.intToByte4(numTerms);


Marking as draft here until #13757 is available and used here.
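For readers skimming the diff above, the length selection can be restated as a plain-Java function. This sketch takes the relevant values as parameters instead of reading them from FieldInvertState, and it stops before the one-byte encoding step (SmallFloat.intToByte4 in the quoted code); the class and parameter names are hypothetical.

```java
/** Hypothetical restatement of the field-length selection in the quoted computeNorm; not Lucene code. */
final class NormLengthSketch {

  /**
   * @param docsOnly            true if the field indexes documents only (no frequencies)
   * @param createdVersionMajor major version the index was created with
   * @param uniqueTermCount     number of distinct terms in the field
   * @param length              total number of tokens in the field
   * @param numOverlap          tokens with a position increment of zero (e.g. synonyms)
   * @param discountOverlaps    whether overlapping tokens are excluded from the length
   */
  static int numTerms(
      boolean docsOnly,
      int createdVersionMajor,
      int uniqueTermCount,
      int length,
      int numOverlap,
      boolean discountOverlaps) {
    if (docsOnly && createdVersionMajor >= 8) {
      return uniqueTermCount; // frequencies are not indexed, so count distinct terms
    } else if (discountOverlaps) {
      return length - numOverlap;
    } else {
      return length;
    }
  }
}
```

For example, a field with 10 tokens, 2 of them overlaps, yields 8 with `discountOverlaps` set and 10 without it.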


seanmacavaney

Looks great, thanks @cpoerschke!


add TFSimilarity class

e1d6e70

cpoerschke marked this pull request as ready for review September 10, 2024 09:16

seanmacavaney reviewed Sep 10, 2024


lucene/core/src/java/org/apache/lucene/search/similarities/TFSimilarity.java (outdated, resolved)

cpoerschke marked this pull request as draft September 10, 2024 10:24

cpoerschke added 2 commits September 10, 2024 11:26

Update lucene/core/src/java/org/apache/lucene/search/similarities/TFS…

81775ab

…imilarity.java

add TestTFSimilarity.testBoostQuery()

d2a91c6

cpoerschke marked this pull request as ready for review September 10, 2024 11:01

rmuir reviewed Sep 10, 2024


cpoerschke added 3 commits September 11, 2024 08:21

replace TestBooleanQueryVisitSubscorers.CountingSimilarity with TFSim…

64559ff

…ilarity

action code review feedback w.r.t. TFSimilarity.computeNorm

b8feaa2

action code review feedback w.r.t. [Raw]TFSimilarity class name

861034b

cpoerschke commented Sep 11, 2024


cpoerschke marked this pull request as draft September 11, 2024 07:54

cpoerschke commented Sep 11, 2024


lucene/core/src/java/org/apache/lucene/search/similarities/RawTFSimilarity.java (outdated, resolved)

Update lucene/core/src/java/org/apache/lucene/search/similarities/Raw…

17d6619

…TFSimilarity.java

cpoerschke changed the title ~~add TFSimilarity class~~ add RawTFSimilarity class Sep 12, 2024

cpoerschke added 2 commits September 13, 2024 09:47

Merge remote-tracking branch 'origin/main' into TFSimilarity

eacd46e

post origin/main merge adjustments

e143faa

cpoerschke marked this pull request as ready for review September 13, 2024 09:14

cpoerschke requested review from rmuir and seanmacavaney September 13, 2024 09:14

seanmacavaney approved these changes Sep 13, 2024


cpoerschke requested review from jpountz and ChrisHegarty September 16, 2024 10:02

cpoerschke merged commit a817426 into apache:main Sep 17, 2024
3 checks passed

cpoerschke deleted the TFSimilarity branch September 17, 2024 12:11

asfgit pushed a commit that referenced this pull request Sep 17, 2024

add RawTFSimilarity class (#13749)

fef2560

(cherry picked from commit a817426 with adjustment to TestRawTFSimilarity.java w.r.t. topDocs.totalHits.value[()] signature)

cpoerschke mentioned this pull request Sep 18, 2024

add RawTFSimilarityFactory class apache/solr#2715

Draft
