-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Set terminate_early to trackTotalHitsUpTo on non-scoring boolean queries for performance #18842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Set terminate_early to trackTotalHitsUpTo on non-scoring boolean queries for performance #18842
Conversation
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
|
❌ Gradle check result for 0066d64: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
|
❌ Gradle check result for 52ee010: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
|
❌ Gradle check result for 44c6677: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
|
❌ Gradle check result for 9fccadc: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Flaky test: #17271 |
Signed-off-by: Peter Alfonsi <peter.alfonsi@gmail.com>
|
❌ Gradle check result for 8929416: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #18842 +/- ##
============================================
+ Coverage 72.67% 72.70% +0.02%
+ Complexity 68610 68600 -10
============================================
Files 5577 5577
Lines 315375 315391 +16
Branches 45772 45784 +12
============================================
+ Hits 229209 229300 +91
+ Misses 67613 67518 -95
- Partials 18553 18573 +20 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| // but at this point the CSS logic already assumes it can't be changed. We can revisit this in future. | ||
| } | ||
| // TODO: In future we can do the same for any constant-score query | ||
| return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just disable CSS when early termination is enabled?
| assert from != -1 : "Cannot call `tryEnablingEarlyTermination` until after `from` has been set"; | ||
| if (query == null) return false; | ||
|
|
||
| if (terminateAfter != DEFAULT_TERMINATE_AFTER) return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we check for deeper queries, such as
- Nested BooleanQueries inside FILTER clauses
- ConstantScoreQuery wrapping (common in filter contexts)
- MinimumShouldMatch requirements not exposed in BooleanQuery API
|
|
||
| // We can only set terminateAfter to trackTotalHitsUpTo if we only have filter and must_not clauses | ||
| if (bq.getClauses(Occur.MUST).isEmpty() && bq.getClauses(Occur.SHOULD).isEmpty()) { | ||
| terminateAfter = Math.max(size, trackTotalHitsUpTo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we not be checking for null query here? Returning false for null query is wrong. A null query is equivalent to MatchAllQuery in OpenSearch, which IS a constant-scoring query suitable for early termination.
|
|
||
| if (terminateAfter != DEFAULT_TERMINATE_AFTER) return false; | ||
| if (!(query instanceof BooleanQuery bq)) return false; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No check for search profiling. Early termination will produce misleading profiler output showing only partial query execution.
| int trackTotalHitsUpTo = 500; | ||
|
|
||
| // Enforce 1 shard per node, so that no shard has < trackTotalHitsUpTo matching docs and cannot actually terminate early | ||
| assertAcked( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are we validating multi shard behaviour?
| .setTerminateAfter(1_000_000) | ||
| .get(); | ||
| assertFalse(originalQueryResponse.isTerminatedEarly()); // Returns false not null when TA was set but not reached | ||
| assertHitCount(originalQueryResponse, trackTotalHitsUpTo, GREATER_THAN_OR_EQUAL_TO); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also test for mixed clause types that look like filters?
| .get(); | ||
| assertTrue(response.isTerminatedEarly()); | ||
| // Note: queries that have finished early with terminate_after will return "eq" for hit relation | ||
| assertHitCount(response, trackTotalHitsUpTo, EQUAL_TO); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we validate the correctness of the returned documents as well?
| if (!(query instanceof BooleanQuery bq)) return false; | ||
|
|
||
| if (aggregations() != null) return false; | ||
| if (from > 0 || searchAfter != null) return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why cant early termination not work with search_after?
|
This PR is stalled because it has been open for 30 days with no activity. |
|
This PR is stalled because it has been open for 30 days with no activity. |
|
This PR is stalled because it has been open for 30 days with no activity. |
Description
On boolean queries with only filter or must_not clauses, all documents have equal scores. We only track the total number of hits up to
trackTotalHitsUpTo, 10k by default. Since all documents are tied in score, and we use lower doc IDs as a tiebreaker, we will never return any documents aftertrackTotalHitsUpToand they won't impact the returned hit count. (This assumes no aggregations/sorting/pagination/scrolling). Therefore, we should enableterminate_afteron these queries, to avoid scanning through further documents after 10k hits are gathered. This can cause significant speedups. For more explanation see the issue.This is similar to
ApproximatePointRangeQueryin terms of its effect. However there are some pros and cons to each, and after discussion with @harshavamsi we think these can go in parallel:Query/Scorerlogicinstanceofcheck for that query typeterminate_after's performance is slightly better, mostly because collectors operate on the shard level rather than the segment level, so we stop searching after 10k documents instead of segment_count * 10k documents as we would with approximate queries. Custom query logic overhead is probably also a factor.terminate_aftercannot.I ran this branch of OSB using
http_logs. Each query is a boolean query with 2 filter clauses. The selectivity of the queries varies a lot.Results with concurrent segment search off:
We can see the speedup is greatest the more docs the query matches, which makes sense.
Results with concurrent segment search on:
Note there were some concerns around CSS + terminate_after in the past. However, after discussion with @jed326 it seems it is ok in this case. The main concern was a user manually setting terminate_after to some value like 1k. Then, if they enabled CSS with 4 slices for example, they would suddenly be gathering 4 * 1k docs, increasing resource usage. This isn't a concern in our case, because we only apply this change if
terminate_afteris not set on the query, and the number of docs processed can only ever decrease or stay the same. So, enabling this optimization will never cause increased resource usage with CSS.Note that once
terminate_afteris set to 10k, CSS actually performs a little worse than non-CSS, which is expected because of the slight overhead + >1 slices. In future we could disable CSS when this optimization enablesterminate_afterbut it seems as of today, at this point in the query processing CSS can't be enabled/disabled. However the speedup is still significant so I think it's ok to leave this for later.Related Issues
Resolves #18510
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.