
Conversation

@romseygeek
Contributor

Top-k searches using result pruning can already skip segments if they know
that the best possible match in that segment is uncompetitive. We can take
this a step further by looking at the minimum and maximum values of a field
in each segment, and then ordering the segments such that those with more
competitive values in general come earlier in the search. This is particularly
useful for adversarial cases such as a query sorted by timestamp over an index
with an inverse timestamp sort.

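The ordering idea can be sketched outside Lucene with a toy model, assuming a hypothetical `Segment` record carrying a field's per-segment min/max values (the names here are illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SegmentOrderSketch {

  // Hypothetical stand-in for a segment plus the min/max values of the
  // sort field inside it (e.g. a timestamp).
  record Segment(String name, long minValue, long maxValue) {}

  // For an ascending sort, a segment whose minimum value is smallest can
  // contain the most competitive hits, so visit it first; once the top-k
  // queue is full, later segments whose minimum exceeds the current worst
  // collected value can be skipped entirely.
  static List<Segment> orderForAscendingSort(List<Segment> segments) {
    List<Segment> ordered = new ArrayList<>(segments);
    ordered.sort(Comparator.comparingLong(Segment::minValue));
    return ordered;
  }

  public static void main(String[] args) {
    // An adversarial layout: index sorted by inverse timestamp, so the
    // newest values come first on disk.
    List<Segment> onDisk = List.of(
        new Segment("_c", 500, 900),
        new Segment("_b", 200, 600),
        new Segment("_a", 0, 250));
    // Reordered for an ascending timestamp query: _a, _b, _c.
    orderForAscendingSort(onDisk).forEach(s -> System.out.println(s.name()));
  }
}
```

This only sketches the ordering step; the skipping itself is the existing top-k pruning described above.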
```java
int iterations = limit + random().nextInt(limit);
long seqNoGenerator = random().nextInt(1000);
for (long i = 0; i < iterations; i++) {
  int copies = random().nextInt(100) <= 5 ? 1 : 1 + random().nextInt(5);
```
Contributor Author

Note - this is necessary because with segment sorting, the order in which documents with equal sort values are returned may not be stable between runs if the documents end up in different segments.
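As a toy illustration (plain Java, not Lucene code) of that instability: with a stable sort on the sort value alone, documents with equal values come back in whichever order their segments happen to be visited.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TieBreakSketch {

  record Doc(String id, long sortValue) {}

  // Gather hits segment by segment, then stable-sort on the sort value
  // only (no tiebreak), mimicking a top-k collector.
  static List<String> idsInResultOrder(List<List<Doc>> segmentsInVisitOrder) {
    List<Doc> all = new ArrayList<>();
    segmentsInVisitOrder.forEach(all::addAll);
    all.sort(Comparator.comparingLong(Doc::sortValue)); // List.sort is stable
    return all.stream().map(Doc::id).toList();
  }

  public static void main(String[] args) {
    List<Doc> segA = List.of(new Doc("a", 7));
    List<Doc> segB = List.of(new Doc("b", 7)); // equal sort value
    System.out.println(idsInResultOrder(List.of(segA, segB))); // [a, b]
    System.out.println(idsInResultOrder(List.of(segB, segA))); // [b, a]
  }
}
```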

Contributor

I think this test was added to catch a bug: https://issues.apache.org/jira/browse/LUCENE-10106
Would removing this randomization mean we no longer catch the problem that LUCENE-10106 tried to fix?

Contributor Author

I think that the other tests added in that patch should also catch it, but maybe @dnhatn has an opinion here?

Member

It's okay to remove this.

@github-actions github-actions bot added this to the 10.4.0 milestone Nov 19, 2025
Contributor

@tteofili left a comment

Nice idea, LGTM!
P.S.: did you run any benchmark to see how this impacts performance?

@romseygeek
Contributor Author

did you run any benchmark to see how this impacts performance?

Not yet, I'm playing with luceneutil to see if I can add some adverse sort queries. I'll report numbers back here.

```java
collector.setWeight(weight);

if (collector.scoreMode().isExhaustive() == false) {
  Comparator<LeafReaderContext> leafComparator = collector.getLeafReaderComparator();
```
Contributor

Perhaps we shouldn't gate this here by isExhaustive. Maybe let the Collector author decide when to return non-null. It keeps the IndexSearcher logic simpler, removing a condition that is non-obvious to the uninitiated.

Contributor Author

Good call, will update.
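For illustration, the proposed contract might look like the following self-contained sketch, with hypothetical minimal stand-ins for Lucene's Collector and LeafReaderContext: the searcher no longer checks isExhaustive, and a collector that wants the original order simply returns null.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LeafOrderSketch {

  // Hypothetical stand-ins; the real types live in org.apache.lucene.search.
  record LeafContext(int ord) {}

  interface Collector {
    // Return null to keep the original leaf order (e.g. for exhaustive
    // collectors); return a comparator to request reordering.
    Comparator<LeafContext> getLeafReaderComparator();
  }

  // Searcher-side logic with no isExhaustive gate.
  static List<LeafContext> orderLeaves(List<LeafContext> leaves, Collector collector) {
    Comparator<LeafContext> cmp = collector.getLeafReaderComparator();
    if (cmp == null) {
      return leaves;
    }
    List<LeafContext> ordered = new ArrayList<>(leaves);
    ordered.sort(cmp);
    return ordered;
  }
}
```

Under this contract an exhaustive collector just implements `getLeafReaderComparator()` as `return null`, and the searcher needs no special case.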

Contributor

@martijnvg left a comment

Great idea @romseygeek! I left a few questions, but otherwise this looks good.

```java
private final Long missingValue;
private final ToLongFunction<byte[]> pointDecoder;

public NumericFieldReaderContextComparator(
```
Contributor

Can this constructor also be package protected?

```java
}

@Override
public Comparator<LeafReaderContext> getLeafReaderComparator() {
```
Contributor

In the case where the original ordering of leaves is already optimal (because a leaf sorter has been configured on IndexWriterConfig), would this be the place where subclasses override and return null?

In that case there shouldn't be a need to re-order the segments?

Contributor Author

The case where there's a configured leaf sorter is tricky, as it might be optimal or it might be entirely adverse depending on whether the query sort is reversed or not. You could override here and turn off query-time segment sorting if you wanted. But maybe we need a better escape hatch?

Contributor

But maybe we need a better escape hatch?

I think so. What do you think would be a good escape hatch here?


Contributor

@martijnvg left a comment

LGTM

@romseygeek
Contributor Author

I added a couple of sorted MatchAll queries to wikimedium.10M.tasks and tested this out on an index sorted by lastMod. In this case it basically doesn't make any difference at all:

```
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
        MatchAllDateTimeDescSort       29.30     (37.7%)       27.29     (23.6%)   -6.9% ( -49% -   87%) 0.490
           HighTermDayOfYearSort       42.40     (10.8%)       40.19     (10.7%)   -5.2% ( -24% -   18%) 0.126
            TermDateTimeDescSort      222.80      (4.0%)      219.28      (4.4%)   -1.6% (  -9% -    7%) 0.236
            HighTermTitleBDVSort        6.88      (4.0%)        6.87      (3.2%)   -0.1% (  -7% -    7%) 0.904
            MatchAllDateTimeSort        9.01     (11.3%)        9.04      (9.6%)    0.3% ( -18% -   23%) 0.921
                        PKLookup      130.26      (2.3%)      130.92      (2.2%)    0.5% (  -3% -    5%) 0.478
                      TermDTSort       52.50     (11.2%)       53.30     (15.5%)    1.5% ( -22% -   31%) 0.721
               HighTermMonthSort       37.38      (9.4%)       39.34      (9.2%)    5.2% ( -12% -   26%) 0.074
```

The lastMod values are fairly evenly distributed between segments, so segment sorting doesn't really have an effect. I think a more interesting experiment would be with something like time series data where the input is naturally close to sorted and so the sort values in segments are mostly disjoint. I'll see if I can mock something up and run these tests again.

On the plus side, it seems that there isn't a noticeable penalty for doing this sorting, so the escape hatch may not be necessary. But I want to make sure that there are real benefits as well!
