Skip to content

Conversation

@peteralfonsi
Copy link
Contributor

Description

On boolean queries with only filter or must_not clauses, all documents have equal scores. We only track the total number of hits up to trackTotalHitsUpTo, 10k by default. Since all documents are tied in score, and we use lower doc IDs as a tiebreaker, we will never return any documents after trackTotalHitsUpTo and they won't impact the returned hit count. (This assumes no aggregations/sorting/pagination/scrolling). Therefore, we should enable terminate_after on these queries, to avoid scanning through further documents after 10k hits are gathered. This can cause significant speedups. For more explanation see the issue.

This is similar to ApproximatePointRangeQuery in terms of its effect. However there are some pros and cons to each, and after discussion with @harshavamsi we think these can go in parallel:

  • The terminate_after method is much simpler and requires no changes to Query/Scorer logic
  • Extending it to other field types happens automatically, and extending it to other constant-scoring query types is as simple as an instanceof check for that query type
  • terminate_after's performance is slightly better, mostly because collectors operate on the shard level rather than the segment level, so we stop searching after 10k documents instead of segment_count * 10k documents as we would with approximate queries. Custom query logic overhead is probably also a factor.
  • However, approximate queries can be used on sorted queries, while terminate_after cannot.

I ran this branch of OSB using http_logs. Each query is a boolean query with 2 filter clauses. The selectivity of the queries varies a lot.

Results with concurrent segment search off:

Query Original p50 (ms) Contender p50 (ms) Speedup as fraction of original
"request" matches "images" & "status" matches "200" 783 12.6 62x
"request" matches "images" & "status" matches "500" 12.5 9.8 1.3x
"request" matches "images" & "status" matches "400" 39.9 15.4 2.6x
"request" matches "images" & "timestamp" from 6/10-6/13 432 8.8 49x
"timestamp" from 6/10-6/13 & "request.raw" in list of top 10 terms 235 9.1 26x

We can see the speedup is greatest the more docs the query matches, which makes sense.

Results with concurrent segment search on:

Query Original p50 (ms) Contender p50 (ms) Speedup as fraction of original
"request" matches "images" & "status" matches "200" 596 17.8 33x
"request" matches "images" & "status" matches "500" 11.8 11.3 1.04x
"request" matches "images" & "status" matches "404" 31.6 18.3 1.7x
"request" matches "images" & "timestamp" from 6/10-6/13 346 11.1 31x
"timestamp" from 6/10-6/13 & "request.raw" in list of top 10 terms 173 12.9 13x

Note there were some concerns around CSS + terminate_after in the past. However, after discussion with @jed326 it seems it is ok in this case. The main concern was a user manually setting terminate_after to some value like 1k. Then, if they enabled CSS with 4 slices for example, they would suddenly be gathering 4 * 1k docs, increasing resource usage. This isn't a concern in our case, because we only apply this change if terminate_after is not set on the query, and the number of docs processed can only ever decrease or stay the same. So, enabling this optimization will never cause increased resource usage with CSS.

Note that once terminate_after is set to 10k, CSS actually performs a little worse than non-CSS, which is expected because of the slight overhead + >1 slices. In future we could disable CSS when this optimization enables terminate_after but it seems as of today, at this point in the query processing CSS can't be enabled/disabled. However the speedup is still significant so I think it's ok to leave this for later.

Related Issues

Resolves #18510

Check List

  • Functionality includes testing.
  • [N/A] API changes companion pull request created, if applicable.
  • [N/A] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Peter Alfonsi added 6 commits July 1, 2025 14:09
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
@github-actions
Copy link
Contributor

❌ Gradle check result for 0066d64: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Peter Alfonsi <petealft@amazon.com>
@github-actions
Copy link
Contributor

❌ Gradle check result for 52ee010: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Peter Alfonsi <petealft@amazon.com>
@github-actions
Copy link
Contributor

❌ Gradle check result for 44c6677: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@peteralfonsi
Copy link
Contributor Author

Flaky tests: #14509, #18490, #18157

Signed-off-by: Peter Alfonsi <petealft@amazon.com>
@github-actions
Copy link
Contributor

❌ Gradle check result for 9fccadc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@peteralfonsi
Copy link
Contributor Author

Flaky test: #17271

Signed-off-by: Peter Alfonsi <peter.alfonsi@gmail.com>
@github-actions
Copy link
Contributor

❌ Gradle check result for 8929416: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@peteralfonsi
Copy link
Contributor Author

#18872

Signed-off-by: Peter Alfonsi <petealft@amazon.com>
@github-actions
Copy link
Contributor

✅ Gradle check result for 6eec6df: SUCCESS

@codecov
Copy link

codecov bot commented Jul 30, 2025

Codecov Report

❌ Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.70%. Comparing base (3b7603e) to head (6eec6df).
⚠️ Report is 442 commits behind head on main.

Files with missing lines Patch % Lines
...va/org/opensearch/search/DefaultSearchContext.java 78.57% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18842      +/-   ##
============================================
+ Coverage     72.67%   72.70%   +0.02%     
+ Complexity    68610    68600      -10     
============================================
  Files          5577     5577              
  Lines        315375   315391      +16     
  Branches      45772    45784      +12     
============================================
+ Hits         229209   229300      +91     
+ Misses        67613    67518      -95     
- Partials      18553    18573      +20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

// but at this point the CSS logic already assumes it can't be changed. We can revisit this in future.
}
// TODO: In future we can do the same for any constant-score query
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just disable CSS when early termination is enabled?

assert from != -1 : "Cannot call `tryEnablingEarlyTermination` until after `from` has been set";
if (query == null) return false;

if (terminateAfter != DEFAULT_TERMINATE_AFTER) return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check for deeper queries, such as

  1. Nested BooleanQueries inside FILTER clauses
  2. ConstantScoreQuery wrapping (common in filter contexts)
  3. MinimumShouldMatch requirements not exposed in BooleanQuery API


// We can only set terminateAfter to trackTotalHitsUpTo if we only have filter and must_not clauses
if (bq.getClauses(Occur.MUST).isEmpty() && bq.getClauses(Occur.SHOULD).isEmpty()) {
terminateAfter = Math.max(size, trackTotalHitsUpTo);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not be checking for null query here? Returning false for null query is wrong. A null query is equivalent to MatchAllQuery in OpenSearch, which IS a constant-scoring query suitable for early termination.


if (terminateAfter != DEFAULT_TERMINATE_AFTER) return false;
if (!(query instanceof BooleanQuery bq)) return false;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No check for search profiling. Early termination will produce misleading profiler output showing only partial query execution.

int trackTotalHitsUpTo = 500;

// Enforce 1 shard per node, so that no shard has < trackTotalHitsUpTo matching docs and cannot actually terminate early
assertAcked(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are we validating multi shard behaviour?

.setTerminateAfter(1_000_000)
.get();
assertFalse(originalQueryResponse.isTerminatedEarly()); // Returns false not null when TA was set but not reached
assertHitCount(originalQueryResponse, trackTotalHitsUpTo, GREATER_THAN_OR_EQUAL_TO);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also test for mixed clause types that look like filters?

.get();
assertTrue(response.isTerminatedEarly());
// Note: queries that have finished early with terminate_after will return "eq" for hit relation
assertHitCount(response, trackTotalHitsUpTo, EQUAL_TO);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we validate the correctness of the returned documents as well?

if (!(query instanceof BooleanQuery bq)) return false;

if (aggregations() != null) return false;
if (from > 0 || searchAfter != null) return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why cant early termination not work with search_after?

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Sep 4, 2025
@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Oct 9, 2025
@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Nov 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Enhancement or improvement to existing feature or request lucene Search:Performance stalled Issues that have stalled

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request] Set terminate_after to trackTotalHitsUpTo for speedup in boolean queries with only filter clauses

2 participants