-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid double term construction in DfsPhase #38716
Conversation
Pinging @elastic/es-search |
A follow up to this could be altering |
@elasticmachine run elasticsearch-ci/1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments, while I understand why you did that I am not sure that the gain in performance outweigh the API change required to do it.
@@ -225,7 +223,7 @@ public Explanation explain(int topLevelDocId, IndexSearcher searcher, RescoreCon | |||
} | |||
|
|||
@Override | |||
public void extractTerms(IndexSearcher searcher, RescoreContext rescoreContext, Set<Term> termsSet) { | |||
public void extractTerms(IndexSearcher searcher, RescoreContext rescoreContext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new signature is a bit weird, the only option is to call createWeight
on the searcher but it's obfuscated so you need to check an actual implementation to realize that.
if (fieldStatistics.containsKey(term.field()) == false) { | ||
final CollectionStatistics collectionStatistics = context.searcher().collectionStatistics(term.field()); | ||
if (collectionStatistics != null) { | ||
fieldStatistics.put(term.field(), collectionStatistics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is really equivalent. Some queries are going to build term statistics even though they don't add terms in extractTerms
for various reasons (GlobalOrdinalsQuery
for instance) so we'll end up with more terms than before. However the good side of this change is that we'd extract only the terms that are used for scoring instead of all terms that are present in the query.
The API change for the rescorer is too obfuscated IMO but one thing I fully agree with is that we don't need the special map so we could rely on plain HashMap
to cleanup the code a bit ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some queries are going to build term statistics even though they don't add terms in
extractTerms
for various reasons (GlobalOrdinalsQuery
for instance)
I think this is incorrect - queries will only build term statistics if they actually need them, whereas extractTerms
doesn't always track that. TermWeight
for example always adds its term, even if it has been created with ScoreMode.COMPLETE_NO_SCORES
, while GlobalOrdinalsQuery
creates its child weight with COMPLETE_NO_SCORES
so no term stats will be pulled here.
I agree about the API change on rescorer though, let me think about a better way to do that.
I've pushed a change removing
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the new approach , thanks @romseygeek
* elastic/master: Avoid double term construction in DfsPhase (elastic#38716) Fix typo in DateRange docs (yyy → yyyy) (elastic#38883) Introduced class reuses follow parameter code between ShardFollowTasks (elastic#38910) Ensure random timestamps are within search boundary (elastic#38753) [CI] Muting method testFollowIndex in IndexFollowingIT Update Lucene snapshot repo for 7.0.0-beta1 (elastic#38946) SQL: Doc on syntax (identifiers in particular) (elastic#38662) Upgrade to Gradle 5.2.1 (elastic#38880) Tie break search shard iterator comparisons on cluster alias (elastic#38853) Also mmap cfs files for hybridfs (elastic#38940) Build: Fix issue with test status logging (elastic#38799) Adapt FullClusterRestartIT on master (elastic#38856) Fix testAutoFollowing test to use createLeaderIndex() helper method. Migrate muted auto follow rolling upgrade test and unmute this test (elastic#38900) ShardBulkAction ignore primary response on primary (elastic#38901) Recover peers from translog, ignoring soft deletes (elastic#38904) Fix NPE on Stale Index in IndicesService (elastic#38891) Smarter CCR concurrent file chunk fetching (elastic#38841) Fix intermittent failure in ApiKeyIntegTests (elastic#38627) re-enable SmokeTestWatcherWithSecurityIT (elastic#38814)
DfsPhase captures terms used for scoring a query in order to build global term statistics across multiple shards for more accurate scoring. It currently does this by building the query's `Weight` and calling `extractTerms` on it to collect terms, and then calling `IndexSearcher.termStatistics()` for each collected term. This duplicates work, however, as the various `Weight` implementations will already have collected these statistics at construction time. This commit replaces this round-about way of collecting stats, instead using a delegating IndexSearcher that collects the term contexts and statistics when `IndexSearcher.termStatistics()` is called from the Weight. It also fixes a bug when using rescorers, where a `QueryRescorer` would calculate distributed term statistics, but ignore field statistics. `Rescorer.extractTerms` has been removed, and replaced with a new method on `RescoreContext` that returns any queries used by the rescore implementation. The delegating IndexSearcher then collects term contexts and statistics in the same way described above for each Query.
DfsPhase captures terms used for scoring a query in order to build global term statistics across
multiple shards for more accurate scoring. It currently does this by building the query's
Weight
andcalling
extractTerms
on it to collect terms, and then callingIndexSearcher.termStatistics()
for eachcollected term. This duplicates work, however, as the various
Weight
implementations will alreadyhave collected these statistics at construction time.
This commit replaces this round-about way of collecting stats, instead using a delegating
IndexSearcher that collects the term contexts and statistics when
IndexSearcher.termStatistics()
is called from the Weight.
As this also applies to rescorers, the signature of
RescoreContext.extractTerms
has changed slightly.