Allow Collectors to re-order segments for non-exhaustive searches #15436
Top-k searches using result pruning can already skip segments if they know that the best possible match in that segment is uncompetitive. We can take this a step further by looking at the minimum and maximum values of a field in each segment, and then ordering the segments such that those with more competitive values in general come earlier in the search. This is particularly useful for adversarial cases such as a query sorted by timestamp over an index with an inverse timestamp sort.
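The idea can be sketched with plain collections. Note that the `Segment` record and its field names below are hypothetical stand-ins for illustration, not the actual Lucene API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a segment plus its per-field min/max metadata.
record Segment(String name, long minTimestamp, long maxTimestamp) {}

class SegmentOrdering {
  // For a top-k query sorted by timestamp descending, visit the segments
  // with the largest maximum timestamp first: they are the most likely to
  // contain competitive hits, so later segments can be pruned sooner.
  static List<Segment> orderForDescendingSort(List<Segment> segments) {
    List<Segment> ordered = new ArrayList<>(segments);
    ordered.sort(Comparator.comparingLong(Segment::maxTimestamp).reversed());
    return ordered;
  }
}
```

In the adversarial case from the description (index sorted by ascending timestamp, query sorted descending), index order would visit the least competitive segments first; this reordering inverts that.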
```java
int iterations = limit + random().nextInt(limit);
long seqNoGenerator = random().nextInt(1000);
for (long i = 0; i < iterations; i++) {
  int copies = random().nextInt(100) <= 5 ? 1 : 1 + random().nextInt(5);
```
Note - this is necessary because with segment sorting, the order in which documents with equal sort values are returned may not be stable between runs if the documents end up in different segments.
I think this test was added to catch a bug: https://issues.apache.org/jira/browse/LUCENE-10106
Would removing this randomization mean that we no longer catch the problem that LUCENE-10106 tried to fix?
I think that the other tests added in that patch should also catch it, but maybe @dnhatn has an opinion here?
it's okay to remove this.
tteofili left a comment:
nice idea, LGTM!
p.s.: did you run any benchmark to see how this impacts performance?
Not yet, I'm playing with luceneutil to see if I can add some adverse sort queries. Will report numbers back here.
```java
collector.setWeight(weight);

if (collector.scoreMode().isExhaustive() == false) {
  Comparator<LeafReaderContext> leafComparator = collector.getLeafReaderComparator();
```
Perhaps we shouldn't gate this here on isExhaustive. Maybe let the Collector author decide when to return non-null: it keeps the IndexSearcher logic simpler by removing a condition that is non-obvious to the uninitiated.
Good call, will update.
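A sketch of what dropping the isExhaustive gate might look like. All type names here are simplified stand-ins invented for illustration, not the real IndexSearcher code: the searcher asks unconditionally and reorders only when the collector returns non-null.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for LeafReaderContext and Collector.
record LeafCtx(int ord, long maxValue) {}

interface LeafOrderingCollector {
  // Returns null when the collector has no preference, e.g. for
  // exhaustive score modes where every segment is visited anyway.
  Comparator<LeafCtx> getLeafReaderComparator();
}

class SearcherSketch {
  // The searcher no longer checks scoreMode().isExhaustive(); a non-null
  // comparator is the collector's way of asking for a visit order.
  static List<LeafCtx> maybeReorder(List<LeafCtx> leaves, LeafOrderingCollector c) {
    Comparator<LeafCtx> cmp = c.getLeafReaderComparator();
    if (cmp == null) {
      return leaves; // keep the reader's existing leaf order
    }
    List<LeafCtx> ordered = new ArrayList<>(leaves);
    ordered.sort(cmp);
    return ordered;
  }
}
```

This keeps the policy (when reordering is worthwhile) inside the Collector and leaves the searcher with a single mechanical null check.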
martijnvg left a comment:
Great idea @romseygeek! I left a few questions, but otherwise this looks good.
```java
private final Long missingValue;
private final ToLongFunction<byte[]> pointDecoder;

public NumericFieldReaderContextComparator(
```
Can this constructor also be package protected?
```java
}

@Override
public Comparator<LeafReaderContext> getLeafReaderComparator() {
```
In the case that the original ordering of leaves is already optimal (because a leaf sorter has been configured on IndexWriterConfig), would this be the place where subclasses override and return null? In that case there should be no need to re-order the segments, right?
The case where there's a configured leaf sorter is tricky, as it might be optimal or it might be entirely adverse depending on whether the query sort is reversed or not. You could override here and turn off query-time segment sorting if you wanted. But maybe we need a better escape hatch?
But maybe we need a better escape hatch?
I think so. What do you think would be a good escape hatch here?
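One possible escape hatch, sketched below with a hypothetical stand-in `Leaf` type and a made-up collector class (not a real Lucene API): a collector that knows the index-time leaf sorter already matches the query sort simply returns null and keeps the on-disk order.

```java
import java.util.Comparator;

// Hypothetical stand-in for LeafReaderContext.
record Leaf(int ord) {}

class PreSortedAwareCollector {
  private final boolean indexOrderMatchesQuerySort;

  PreSortedAwareCollector(boolean indexOrderMatchesQuerySort) {
    this.indexOrderMatchesQuerySort = indexOrderMatchesQuerySort;
  }

  // Returning null opts out of query-time segment reordering, e.g. when a
  // leaf sorter configured on IndexWriterConfig already matches the query
  // sort; otherwise fall back to some reordering comparator.
  Comparator<Leaf> getLeafReaderComparator() {
    if (indexOrderMatchesQuerySort) {
      return null; // escape hatch: trust the existing leaf order
    }
    return Comparator.comparingInt(Leaf::ord); // placeholder ordering
  }
}
```

The open question in the thread is how the collector would learn whether the index order matches the query sort; the boolean flag above just stands in for that decision.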
martijnvg left a comment:
LGTM
I added a couple of sorted MatchAll queries. On the plus side, it seems that there isn't a noticeable penalty for doing this sorting, so the escape hatch may not be necessary. But I want to make sure that there are actual benefits as well!