-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; #13757
Conversation
/** TODO */ | ||
protected static long doComputeNorm(FieldInvertState state, boolean discountOverlaps) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how best to javadoc(ument) this, suggestions welcome.
Also perhaps this could be a static method in the FieldInvertState
class instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's avoid the static method
Can we avoid |
Sure, though where would the |
the straightforward way would be to add:
For me this stuff is ok, because we are making the tradeoff to simplify just the index-time side of this thing, by giving it a default implementation. |
I would also be in favor of adding some kind of warning/note, to the Idea is just that user extends Similarity and it "works", but other things are still possible |
…orm implementation
cbab8fb is a variation of that i.e. no changes to the constructor signature. I'm still investigating w.r.t. some subclasses having setters and how that would interact with the constructor taking a boolean too. |
91cfd5e is to have the flag and accessors in the base class, that's okay I think so long as the accessors are final (and the flag has private visibility). |
Generally I like the idea of more context but am not sure how to make it very very clear e.g. when we say "if you change this" does that mean overriding-the-method or overriding-the-method-and-then-using-the-similarity-with-the-override-with-a-field-that-previously-used-the-original-similarity-to-use-the-custom-similarity or something else? Likewise when we say "not compatible/require reindexing" - is that works-but-gives-wrong-results or does-not-work-and-throws-exceptions or a variation of the two? If (perhaps) there is a general assumption of "you cannot change the similarity of an existing field" and "similarity (and some other) changes require reindexing" then probably I would favour not having warnings or notes that convey that in some circumstances some changes can be compatible and don't require reindexing. |
* True if overlap tokens (tokens with a position of increment of zero) are discounted from the | ||
* document's length. | ||
*/ | ||
private boolean discountOverlaps = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this final
(no need to set it to a value here as ctor always sets it) and remove the setter.
Otherwise, the way you had it without a boolean (abstract discountOverlaps
) looked fine too?
But I think we shouldn't add mutability to the base Similarity class when setter is totally not needed by anyone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... when setter is totally not needed by anyone.
There is some residual setter use -- https://github.com/search?q=repo%3Aapache%2Flucene%20setDiscountOverlaps&type=code -- let me see what it would take to get rid of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
44c3f76 removes the residual setter use, seems a bit ugly but oh well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the simplification. Agreed on setting discountOverlap through the ctor rather than a setter.
lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. @rmuir For my understanding, is there ever a good reason to set discountOverlaps to false?
lucene/core/src/java/org/apache/lucene/search/similarities/DFRSimilarity.java
Show resolved
Hide resolved
The It is "easy" to understand: should the document's length be punished by synonyms? But at the same time, it makes it tricky to measure how well it is working, as the lucene user has the ability to easily inject a lot of artificial "synonym-like-things" in nearly infinite ways (e.g. word-delimiter-filters and stuff) with the analysis chain. So what would even be a fair measure? Most of the modern scorers are doing something like BM25's |
Thanks, that makes sense. I'm wondering if we should remove the discountOverlap parameter from our similarities, and add support for it via a Similarity wrapper somehow, that would wrap/clone the FieldInvertState that is passed to computeNorm. |
I don't like that this wacky index-time encoding is in Similarity class myself, it happens at a totally different time ("index time") than the rest of Similarity logic ("search time"), and you have to pass your Similarity to both IndexWriter and IndexSearcher, making it more confusing to the user and more complicated. |
@@ -111,7 +133,17 @@ protected Similarity() {} | |||
* @param state current processing state for this field | |||
* @return computed norm value | |||
*/ | |||
public abstract long computeNorm(FieldInvertState state); | |||
public long computeNorm(FieldInvertState state) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to document what the default implementation is doing? e.g. at least it should state that it uses SmallFloat.intToByte4
lossy compression to encode the length as a single byte?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel knowledgeable enough to properly document/edit on this.
"Allow edits and access to secrets by maintainers" is enabled for the pull request -- please feel free to (in-browser or otherwise) change the documentation here. Thanks!
Try to emphasize these happen at index-time, as opposed to the other methods that happen at search-time, and that if you change them, you need to reindex for changes to take effect.
I took a stab at improving the docs in 47d4fa0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took another pass, I think it makes the Similarity class easier to use out-of-box? I think that modifying the index-time norm is more expert, especially after Adrien fixed the encoding to work better. It is still possible, but out-of-box you don't have to think about it as you have a default implementation?
Thanks! |
#13845 to add some missing constructor variants. |
This is missing a CHANGES entry. Did it also need to be added to the migrate guide? I added the 10.0 milestone as well while I was at it. |
071eb03 added to #13845 to add CHANGES.txt entries. Not sure if additional migrate guide steps are needed: the 8x to 9x steps -- https://github.com/apache/lucene/blob/releases/lucene/9.12.0/lucene/MIGRATE.md#bm25similaritysetdiscountoverlaps-and-legacybm25similaritysetdiscountoverlaps-methods-removed-lucene-9646 -- already contain something similar and it is relatively obvious that the setting moved to the constructor? |
I did not realize that this change went into 9.12. I thought it was a 10.0 change. Sorry about the confusion. |
No description provided.