Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; #13757

Merged
merged 7 commits into from
Sep 13, 2024

Conversation

cpoerschke
Copy link
Contributor

No description provided.

Comment on lines 117 to 118
/** TODO */
protected static long doComputeNorm(FieldInvertState state, boolean discountOverlaps) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how best to javadoc(ument) this, suggestions welcome.

Also perhaps this could be a static method in the FieldInvertState class instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's avoid the static method

@cpoerschke cpoerschke marked this pull request as ready for review September 11, 2024 08:02
@rmuir
Copy link
Member

rmuir commented Sep 11, 2024

Can we avoid doXXX and just give computeNorm a default implementation?

@cpoerschke
Copy link
Contributor Author

Can we avoid doXXX and just give computeNorm a default implementation?

Sure, though where would the discountOverlaps value come from?

@rmuir
Copy link
Member

rmuir commented Sep 11, 2024

the straightforward way would be to add:

  • Similarity(boolean)
  • Similarity() -> this(true)
  • add getter (only), no setter. this is also duplicated in subclasses.

For me this stuff is ok, because we are making the tradeoff to simplify just the index-time side of this thing, by giving it a default implementation.

@rmuir
Copy link
Member

rmuir commented Sep 11, 2024

I would also be in favor of adding some kind of warning/note, to the computeNorm: "expert: if you change this, index will not be different and not compatible with other similarities/require reindexing to change" and to the constructor taking the bool?

Idea is just that user extends Similarity and it "works", but other things are still possible

@cpoerschke
Copy link
Contributor Author

the straightforward way would be to add:

  • Similarity(boolean)
  • Similarity() -> this(true)
  • add getter (only), no setter. this is also duplicated in subclasses.

For me this stuff is ok, because we are making the tradeoff to simplify just the index-time side of this thing, by giving it a default implementation.

cbab8fb is a variation of that i.e. no changes to the constructor signature.

I'm still investigating w.r.t. some subclasses having setters and how that would interact with the constructor taking a boolean too.

@cpoerschke cpoerschke changed the title factor out protected static Similarity.doComputeNorm method provide default Similarity.computeNorm implementation Sep 11, 2024
@cpoerschke
Copy link
Contributor Author

I'm still investigating w.r.t. some subclasses having setters and how that would interact with the constructor taking a boolean too.

91cfd5e is to have the flag and accessors in the base class, that's okay I think so long as the accessors are final (and the flag has private visibility).

@cpoerschke
Copy link
Contributor Author

I would also be in favor of adding some kind of warning/note, to the computeNorm: "expert: if you change this, index will not be different and not compatible with other similarities/require reindexing to change" and to the constructor taking the bool?

Generally I like the idea of more context but am not sure how to make it very very clear e.g. when we say "if you change this" does that mean overriding-the-method or overriding-the-method-and-then-using-the-similarity-with-the-override-with-a-field-that-previously-used-the-original-similarity-to-use-the-custom-similarity or something else? Likewise when we say "not compatible/require reindexing" - is that works-but-gives-wrong-results or does-not-work-and-throws-exceptions or a variation of the two?

If (perhaps) there is a general assumption of "you cannot change the similarity of an existing field" and "similarity (and some other) changes require reindexing" then probably I would favour not having warnings or notes that convey that in some circumstances some changes can be compatible and don't require reindexing.

@cpoerschke cpoerschke requested a review from rmuir September 11, 2024 12:17
* True if overlap tokens (tokens with a position of increment of zero) are discounted from the
* document's length.
*/
private boolean discountOverlaps = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this final (no need to set it to a value here as ctor always sets it) and remove the setter.

Otherwise, the way you had it without a boolean (abstract discountOverlaps) looked fine too?

But I think we shouldn't add mutability to the base Similarity class when setter is totally not needed by anyone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... when setter is totally not needed by anyone.

There is some residual setter use -- https://github.com/search?q=repo%3Aapache%2Flucene%20setDiscountOverlaps&type=code -- let me see what it would take to get rid of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

44c3f76 removes the residual setter use, seems a bit ugly but oh well.

@rmuir rmuir requested a review from jpountz September 11, 2024 13:08
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplification. Agreed on setting discountOverlap through the ctor rather than a setter.

@cpoerschke cpoerschke changed the title provide default Similarity.computeNorm implementation similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; Sep 11, 2024
@cpoerschke cpoerschke requested review from jpountz and rmuir September 11, 2024 16:22
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. @rmuir For my understanding, is there ever a good reason to set discountOverlaps to false?

@rmuir
Copy link
Member

rmuir commented Sep 11, 2024

This looks good to me. @rmuir For my understanding, is there ever a good reason to set discountOverlaps to false?

The discountOverlaps is something we had from times where TF/IDF was the only scoring and index statistics were limited to docfreq() and maxdoc().

It is "easy" to understand: should the document's length be punished by synonyms? But at the same time, it makes it tricky to measure how well it is working, as the lucene user has the ability to easily inject a lot of artificial "synonym-like-things" in nearly infinite ways (e.g. word-delimiter-filters and stuff) with the analysis chain. So what would even be a fair measure?

Most of the modern scorers are doing something like BM25's dl/avgdl which makes this option harder to reason about. For example discountOverlaps still works in BM25 case: document doesn't get punished relative to other documents simply because it happened to have more synonyms or word-delimiters. But all documents get "skewed" in their scoring: this option only "discounts" the per-document dl, but does not impact the avgdl for the field (which is computed solely from term dictionary statistics). Since this is same skew to all documents: if you have tons of injected synonyms and are worried about it, you might want to try decreasing BM25's b parameter to adjust? Sorry, I haven't played too much with this, just general thoughts.

@jpountz
Copy link
Contributor

jpountz commented Sep 11, 2024

Thanks, that makes sense. I'm wondering if we should remove the discountOverlap parameter from our similarities, and add support for it via a Similarity wrapper somehow, that would wrap/clone the FieldInvertState that is passed to computeNorm.

@rmuir
Copy link
Member

rmuir commented Sep 11, 2024

Thanks, that makes sense. I'm wondering if we should remove the discountOverlap parameter from our similarities, and add support for it via a Similarity wrapper somehow, that would wrap/clone the FieldInvertState that is passed to computeNorm.

I don't like that this wacky index-time encoding is in Similarity class myself, it happens at a totally different time ("index time") than the rest of Similarity logic ("search time"), and you have to pass your Similarity to both IndexWriter and IndexSearcher, making it more confusing to the user and more complicated.

@@ -111,7 +133,17 @@ protected Similarity() {}
* @param state current processing state for this field
* @return computed norm value
*/
public abstract long computeNorm(FieldInvertState state);
public long computeNorm(FieldInvertState state) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to document what the default implementation is doing? e.g. at least it should state that it uses SmallFloat.intToByte4 lossy compression to encode the length as a single byte?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel knowledgeable enough to properly document/edit on this.

"Allow edits and access to secrets by maintainers" is enabled for the pull request -- please feel free to (in-browser or otherwise) change the documentation here. Thanks!

@cpoerschke cpoerschke requested a review from rmuir September 12, 2024 10:21
Try to emphasize these happen at index-time, as opposed to the other
methods that happen at search-time, and that if you change them, you
need to reindex for changes to take effect.
@rmuir
Copy link
Member

rmuir commented Sep 12, 2024

I took a stab at improving the docs in 47d4fa0

Copy link
Member

@rmuir rmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another pass, I think it makes the Similarity class easier to use out-of-box? I think that modifying the index-time norm is more expert, especially after Adrien fixed the encoding to work better. It is still possible, but out-of-box you don't have to think about it as you have a default implementation?

@cpoerschke
Copy link
Contributor Author

I took a stab at improving the docs in 47d4fa0

Thanks!

@cpoerschke cpoerschke requested a review from jpountz September 12, 2024 14:24
@cpoerschke cpoerschke merged commit 7c056ab into apache:main Sep 13, 2024
3 checks passed
@cpoerschke cpoerschke deleted the Similarity-doComputeNorm branch September 13, 2024 08:27
asfgit pushed a commit that referenced this pull request Sep 13, 2024
…ining discountOverlaps setters; (#13757)

Co-authored-by: Robert Muir <rmuir@apache.org>
(cherry picked from commit 7c056ab)
@cpoerschke
Copy link
Contributor Author

#13845 to add some missing constructor variants.

@javanna
Copy link
Contributor

javanna commented Oct 2, 2024

This is missing a CHANGES entry. Did it also need to be added to the migrate guide? I added the 10.0 milestone as well while I was at it.

@javanna javanna added this to the 10.0.0 milestone Oct 2, 2024
@cpoerschke
Copy link
Contributor Author

This is missing a CHANGES entry. Did it also need to be added to the migrate guide? I added the 10.0 milestone as well while I was at it.

071eb03 added to #13845 to add CHANGES.txt entries.

Not sure if additional migrate guide steps are needed: the 8x to 9x steps -- https://github.com/apache/lucene/blob/releases/lucene/9.12.0/lucene/MIGRATE.md#bm25similaritysetdiscountoverlaps-and-legacybm25similaritysetdiscountoverlaps-methods-removed-lucene-9646 -- already contain something similar and it is relatively obvious that the setting moved to the constructor?

@javanna javanna removed this from the 10.0.0 milestone Oct 2, 2024
@javanna
Copy link
Contributor

javanna commented Oct 2, 2024

I did not realize that this change went into 9.12. I thought it was a 10.0 change. Sorry about the confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants