
Conversation

@romseygeek
Contributor

Top-k searches using result pruning can already skip segments if they know
that the best possible match in that segment is uncompetitive. We can take
this a step further by looking at the minimum and maximum values of a field
in each segment, and then ordering the segments such that those with more
competitive values in general come earlier in the search. This is particularly
useful for adversarial cases such as a query sorted by timestamp over an index
with an inverse timestamp sort.

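The ordering idea can be sketched outside Lucene with a toy model, assuming a hypothetical `Segment` record carrying a field's per-segment min/max values (the names here are illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SegmentOrderSketch {

  // Hypothetical stand-in for a segment plus the min/max values of the
  // sort field inside it (e.g. a timestamp).
  record Segment(String name, long minValue, long maxValue) {}

  // For an ascending sort, a segment whose minimum value is smallest can
  // contain the most competitive hits, so visit it first; once the top-k
  // queue is full, later segments whose minimum exceeds the current worst
  // collected value can be skipped entirely.
  static List<Segment> orderForAscendingSort(List<Segment> segments) {
    List<Segment> ordered = new ArrayList<>(segments);
    ordered.sort(Comparator.comparingLong(Segment::minValue));
    return ordered;
  }

  public static void main(String[] args) {
    // An adversarial layout: index sorted by inverse timestamp, so the
    // newest values come first on disk.
    List<Segment> onDisk = List.of(
        new Segment("_c", 500, 900),
        new Segment("_b", 200, 600),
        new Segment("_a", 0, 250));
    // Reordered for an ascending timestamp query: _a, _b, _c.
    orderForAscendingSort(onDisk).forEach(s -> System.out.println(s.name()));
  }
}
```

This only sketches the ordering step; the skipping itself is the existing top-k pruning described above.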
```java
int iterations = limit + random().nextInt(limit);
long seqNoGenerator = random().nextInt(1000);
for (long i = 0; i < iterations; i++) {
  int copies = random().nextInt(100) <= 5 ? 1 : 1 + random().nextInt(5);
```
Contributor Author

Note - this is necessary because with segment sorting, the order in which documents with equal sort values are returned may not be stable between runs if the documents end up in different segments.
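As a toy illustration (plain Java, not Lucene code) of that instability: with a stable sort on the sort value alone, documents with equal values come back in whichever order their segments happen to be visited.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TieBreakSketch {

  record Doc(String id, long sortValue) {}

  // Gather hits segment by segment, then stable-sort on the sort value
  // only (no tiebreak), mimicking a top-k collector.
  static List<String> idsInResultOrder(List<List<Doc>> segmentsInVisitOrder) {
    List<Doc> all = new ArrayList<>();
    segmentsInVisitOrder.forEach(all::addAll);
    all.sort(Comparator.comparingLong(Doc::sortValue)); // List.sort is stable
    return all.stream().map(Doc::id).toList();
  }

  public static void main(String[] args) {
    List<Doc> segA = List.of(new Doc("a", 7));
    List<Doc> segB = List.of(new Doc("b", 7)); // equal sort value
    System.out.println(idsInResultOrder(List.of(segA, segB))); // [a, b]
    System.out.println(idsInResultOrder(List.of(segB, segA))); // [b, a]
  }
}
```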

Contributor

I think this test was added to catch a bug: https://issues.apache.org/jira/browse/LUCENE-10106
Would removing this randomization mean we no longer catch the problem that LUCENE-10106 tried to fix?

Contributor Author

I think that the other tests added in that patch should also catch it, but maybe @dnhatn has an opinion here?

Member

It's okay to remove this.

@github-actions github-actions bot added this to the 10.4.0 milestone Nov 19, 2025
Contributor

@tteofili left a comment

Nice idea, LGTM!
P.S.: did you run any benchmark to see how this impacts performance?

@romseygeek
Contributor Author

did you run any benchmark to see how this impacts performance?

Not yet, I'm playing with luceneutil to see if I can add some adverse sort queries. I'll report numbers back here.

```java
collector.setWeight(weight);

if (collector.scoreMode().isExhaustive() == false) {
  Comparator<LeafReaderContext> leafComparator = collector.getLeafReaderComparator();
```
Contributor

Perhaps we shouldn't gate this here by isExhaustive. Maybe let the Collector author decide when to return non-null. It keeps the IndexSearcher logic simpler, removing a condition that is non-obvious to the uninitiated.

Contributor Author

Good call, will update.
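For illustration, the proposed contract might look like the following self-contained sketch, with hypothetical minimal stand-ins for Lucene's Collector and LeafReaderContext: the searcher no longer checks isExhaustive, and a collector that wants the original order simply returns null.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LeafOrderSketch {

  // Hypothetical stand-ins; the real types live in org.apache.lucene.search.
  record LeafContext(int ord) {}

  interface Collector {
    // Return null to keep the original leaf order (e.g. for exhaustive
    // collectors); return a comparator to request reordering.
    Comparator<LeafContext> getLeafReaderComparator();
  }

  // Searcher-side logic with no isExhaustive gate.
  static List<LeafContext> orderLeaves(List<LeafContext> leaves, Collector collector) {
    Comparator<LeafContext> cmp = collector.getLeafReaderComparator();
    if (cmp == null) {
      return leaves;
    }
    List<LeafContext> ordered = new ArrayList<>(leaves);
    ordered.sort(cmp);
    return ordered;
  }
}
```

Under this contract an exhaustive collector just implements `getLeafReaderComparator()` as `return null`, and the searcher needs no special case.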

Contributor

@martijnvg left a comment

Great idea @romseygeek! I left a few questions, but otherwise this looks good.

```java
private final Long missingValue;
private final ToLongFunction<byte[]> pointDecoder;

public NumericFieldReaderContextComparator(
```
Contributor

Can this constructor also be package protected?

```java
}

@Override
public Comparator<LeafReaderContext> getLeafReaderComparator() {
```
Contributor

In the case where the original ordering of leaves is already optimal (because a leaf sorter has been configured on IndexWriterConfig), would this be the place where subclasses override and return null?

In that case there shouldn't be a need to re-order the segments?

Contributor Author

The case where there's a configured leaf sorter is tricky, as it might be optimal or it might be entirely adverse depending on whether the query sort is reversed or not. You could override here and turn off query-time segment sorting if you wanted. But maybe we need a better escape hatch?

Contributor

But maybe we need a better escape hatch?

I think so. What do you think would be a good escape hatch here?


Contributor

@martijnvg left a comment

LGTM

@romseygeek
Contributor Author

I added a couple of sorted MatchAll queries to wikimedium.10M.tasks and tested this out on an index sorted by lastMod. In this case it basically doesn't make any difference at all:

```
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
        MatchAllDateTimeDescSort       29.30     (37.7%)       27.29     (23.6%)   -6.9% ( -49% -   87%) 0.490
           HighTermDayOfYearSort       42.40     (10.8%)       40.19     (10.7%)   -5.2% ( -24% -   18%) 0.126
            TermDateTimeDescSort      222.80      (4.0%)      219.28      (4.4%)   -1.6% (  -9% -    7%) 0.236
            HighTermTitleBDVSort        6.88      (4.0%)        6.87      (3.2%)   -0.1% (  -7% -    7%) 0.904
            MatchAllDateTimeSort        9.01     (11.3%)        9.04      (9.6%)    0.3% ( -18% -   23%) 0.921
                        PKLookup      130.26      (2.3%)      130.92      (2.2%)    0.5% (  -3% -    5%) 0.478
                      TermDTSort       52.50     (11.2%)       53.30     (15.5%)    1.5% ( -22% -   31%) 0.721
               HighTermMonthSort       37.38      (9.4%)       39.34      (9.2%)    5.2% ( -12% -   26%) 0.074
```

The lastMod values are fairly evenly distributed between segments, so segment sorting doesn't really have an effect. I think a more interesting experiment would be with something like time series data where the input is naturally close to sorted and so the sort values in segments are mostly disjoint. I'll see if I can mock something up and run these tests again.

On the plus side, it seems that there isn't a noticeable penalty for doing this sorting, so the escape hatch may not be necessary. But I want to make sure that there are real benefits as well!
