Remove the postings highlighter and make unified the default highlighter choice #25028
missing t
clintongormley
left a comment
Added a few docs comments
What do you mean by the offset strategy?
"Highlighters allow you to produce highlighted snippets from one or more fields in your search results."
The unified highlighter (which is used by default if no highlighter type is specified) uses the Lucene Unified Highlighter.
Offsets Strategy requires some explanation, perhaps:
In order to create meaningful search snippets from the terms being queried, a highlighter needs to know the start and end character offsets of each word in the original text. These offsets can be obtained from:
- The postings list (fields mapped as "index_options": "offsets").
- Term vectors (fields mapped as "term_vectors": "with_positions_offsets").
- The original field, by reanalysing the text on-the-fly.
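The three offset sources listed above can be sketched as a small lookup over a field mapping. This is only an illustration, not the highlighter's actual code: `pick_offset_source` is a hypothetical helper, the mapping is a plain dict, and the check order (term vectors, then postings, then re-analysis) follows what the PR description states for the `unified` highlighter. The sketch also assumes the mapping parameter is spelled `term_vector`, which is how I understand the Elasticsearch mapping API names it; the comment above writes `term_vectors`.

```python
from typing import Any, Dict


def pick_offset_source(field_mapping: Dict[str, Any]) -> str:
    """Return which offset source a highlighter could use for a field.

    Hypothetical helper for illustration only. The order mirrors the PR
    description: term vectors first, then postings, then re-analysis.
    """
    # Assumption: the mapping parameter is spelled `term_vector`.
    if field_mapping.get("term_vector") == "with_positions_offsets":
        return "term_vectors"
    # Offsets stored in the postings list at index time.
    if field_mapping.get("index_options") == "offsets":
        return "postings"
    # Nothing stored: fall back to re-analysing the original text.
    return "reanalysis"


if __name__ == "__main__":
    print(pick_offset_source({"type": "text", "index_options": "offsets"}))
```

A field mapped with neither option still gets highlighted; it just pays the cost of re-analysing the text at query time.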
nik9000
left a comment
I left a few minors but LGTM. I guess next is to deprecate the postings highlighter in 5.x, right?
Can you keep the indentation?
Sorry about that :(
Here too.
I expect it isn't actually faster for very short strings. At least, that was my experience with the experimental highlighter.
Missing "this is only supported by the fast vector highlighter", I think.
Thanks @clintongormley and @nik9000
We can make this change transparent by allowing
I'm afraid of something sneaky like that. I think we're better off deprecating in 5.x so users of the postings highlighter know that they should test with the unified highlighter. Even if it is the same code I'd prefer they know about the change rather than get some kind of surprise on upgrade.
Not sure, but I am wondering if we should rather have two separate PRs, one for removing postings which gets replaced by unified (probably already potentially breaking although if it breaks users it's because of bugs? but certainly breaking for people specifying postings as highlighter type), and another one for changing the default highlighter type (more breaking as it affects also people relying on plain or fvh just because they don't have offsets or they have term vectors). I tend to agree with @nik9000 on not making unified a synonym of postings under the hood.
Ok so I'll start with the deprecation in 5.x
I think it affects 5.x only where we have two options:
I am leaning toward option 2 since the
If the desired behavior for 6.x is to use the
I opened #25073 for the deprecation in 5.x
…ter choice

This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves exactly like the `unified` highlighter when index_options is set to `offsets`:
https://issues.apache.org/jira/browse/LUCENE-7815

It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided). The strategy used internally by this highlighter remains the same as before: it checks `term_vectors` first, then `postings`, and ultimately it re-analyzes the text.

Finally, it rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such. There are a few features that the `unified` highlighter is not able to handle, which is why the other highlighters (`plain` and `fvh`) are still available. I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features has been added to the `unified` highlighter.
c1e357f to 4feaaca
This change adds a deprecation warning for removal in 6.0. Relates elastic#25028
This change adds a deprecation warning for removal in 6.0. Only one deprecation is logged per request. Relates #25028
* master: (53 commits)
  Log checkout so SHA is known
  Add link to community Rust Client (elastic#22897)
  "shard started" should show index and shard ID (elastic#25157)
  await fix testWithRandomException
  Change BWC versions on create index response
  Return the index name on a create index response
  Remove incorrect bwc branch logic from master
  Correctly format arrays in output
  [Test] Extending parsing checks for SearchResponse (elastic#25148)
  Scripting: Change keys for inline/stored scripts to source/id (elastic#25127)
  [Test] Add test for custom requests in High Level Rest Client (elastic#25106)
  nested: In case of a single type the _id field should be added to the nested document instead of _uid field.
  `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586)
  fix highlighting docs
  Fix NPE in token_count datatype with null value (elastic#25046)
  Remove the postings highlighter and make unified the default highlighter choice (elastic#25028)
  [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144)
  Fix typo in shards.asciidoc (elastic#25143)
  List Hibernate Search (elastic#25145)
  [DOCS] update maxRetryTimeout in java REST client usage page
  ...
* master: (80 commits)
  Test: remove faling test that relies on merge order
  Log checkout so SHA is known
  Add link to community Rust Client (elastic#22897)
  "shard started" should show index and shard ID (elastic#25157)
  await fix testWithRandomException
  Change BWC versions on create index response
  Return the index name on a create index response
  Remove incorrect bwc branch logic from master
  Correctly format arrays in output
  [Test] Extending parsing checks for SearchResponse (elastic#25148)
  Scripting: Change keys for inline/stored scripts to source/id (elastic#25127)
  [Test] Add test for custom requests in High Level Rest Client (elastic#25106)
  nested: In case of a single type the _id field should be added to the nested document instead of _uid field.
  `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586)
  fix highlighting docs
  Fix NPE in token_count datatype with null value (elastic#25046)
  Remove the postings highlighter and make unified the default highlighter choice (elastic#25028)
  [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144)
  Fix typo in shards.asciidoc (elastic#25143)
  List Hibernate Search (elastic#25145)
  ...
* master: (1889 commits)
  Test: remove faling test that relies on merge order
  Log checkout so SHA is known
  Add link to community Rust Client (elastic#22897)
  "shard started" should show index and shard ID (elastic#25157)
  await fix testWithRandomException
  Change BWC versions on create index response
  Return the index name on a create index response
  Remove incorrect bwc branch logic from master
  Correctly format arrays in output
  [Test] Extending parsing checks for SearchResponse (elastic#25148)
  Scripting: Change keys for inline/stored scripts to source/id (elastic#25127)
  [Test] Add test for custom requests in High Level Rest Client (elastic#25106)
  nested: In case of a single type the _id field should be added to the nested document instead of _uid field.
  `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586)
  fix highlighting docs
  Fix NPE in token_count datatype with null value (elastic#25046)
  Remove the postings highlighter and make unified the default highlighter choice (elastic#25028)
  [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144)
  Fix typo in shards.asciidoc (elastic#25143)
  List Hibernate Search (elastic#25145)
  ...
This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves exactly like the `unified` highlighter when index_options is set to `offsets`:
https://issues.apache.org/jira/browse/LUCENE-7815

It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided). The strategy used internally by this highlighter remains the same as before: it checks `term_vectors` first, then `postings`, and ultimately it re-analyzes the text.

This change also rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such. There are a few features that the `unified` highlighter is not able to handle, which is why the other highlighters (`plain` and `fvh`) are still available. I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features has been added to the `unified` highlighter.
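For readers who want to see what the change means for a search request, here is a rough sketch that builds a request body with a `highlight` section. `build_highlight_request` is a hypothetical helper and `content` is a made-up field name. Omitting `type` leaves the choice to the server, which after this change would be `unified`. Rejecting `postings` on the client side is only an illustration of the removal; how the server itself responds to that value is not described in this PR.

```python
import json
from typing import Any, Dict, Optional


def build_highlight_request(
    field: str, text: str, highlighter_type: Optional[str] = None
) -> Dict[str, Any]:
    """Build a search request body that highlights `field`.

    Hypothetical helper: when `highlighter_type` is None, no `type` is
    sent and the server default applies (`unified` after this change).
    """
    if highlighter_type == "postings":
        # Illustration only: the `postings` type is removed by this PR.
        raise ValueError("the postings highlighter was removed; try unified")
    field_options: Dict[str, Any] = {}
    if highlighter_type is not None:
        field_options["type"] = highlighter_type
    return {
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: field_options}},
    }


if __name__ == "__main__":
    # No explicit type: the server picks its default highlighter.
    print(json.dumps(build_highlight_request("content", "quick fox"), indent=2))
```

Passing `"plain"` or `"fvh"` explicitly remains the way to reach the features that, according to the description above, `unified` does not yet cover.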