Pre-sort shards based on the max/min value of the primary sort field #49092
This change automatically pre-sorts search shards on search requests that use a primary sort based on the value of a field. When possible, the can_match phase extracts the min/max value (depending on the provided sort order) of each shard and uses it to pre-sort the shards before running the subsequent phases. This feature can be useful to ensure that shards containing recent data are executed first, so that intermediate merges have a better chance of containing contiguous data (think of date_histogram, for instance). It could also be used in a follow-up to terminate early those sorted top-hits queries that don't require the total hit count, which could significantly speed up the retrieval of the most/least recent documents from time-based indices.

I took two shortcuts here:

* I reused the can_match phase to add the information required for the shard sort. We could instead introduce a new phase, but it makes sense to me to use the existing phase to carry more information as long as the additional operations are lightweight.
* The shard sort is done automatically if the primary search sort is based on a field. However, this sorting only makes sense if the ranges of values in the shards don't overlap (time-based indices sorted on timestamp, for instance). We could add a new option to enable/disable this behavior, or even add an additional `shard_sort` criterion, but I also like the fact that users don't need to set any option to benefit from this feature.

Relates elastic#49091
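To make the pre-sort step concrete, here is a minimal sketch of the idea, not the code from this PR. `ShardPreSortSketch`, `ShardBounds`, and `preSort` are hypothetical names, and the sketch assumes the sort key is a `long` and that the shard order uses the min value for an ascending sort and the max value for a descending sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ShardPreSortSketch {

    /** Hypothetical holder for the sort-field bounds that one shard reports. */
    static final class ShardBounds {
        final String shardName;
        final long min;
        final long max;

        ShardBounds(String shardName, long min, long max) {
            this.shardName = shardName;
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Returns shard names in the order they would be queried:
     * ascending sort puts the smallest min first,
     * descending sort puts the largest max first.
     */
    static List<String> preSort(List<ShardBounds> shards, boolean ascending) {
        Comparator<ShardBounds> order = ascending
            ? Comparator.comparingLong((ShardBounds s) -> s.min)
            : Comparator.comparingLong((ShardBounds s) -> s.max).reversed();
        List<ShardBounds> sorted = new ArrayList<>(shards); // leave the input untouched
        sorted.sort(order);
        List<String> names = new ArrayList<>();
        for (ShardBounds s : sorted) {
            names.add(s.shardName);
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up, non-overlapping ranges, as in time-based indices.
        List<ShardBounds> shards = Arrays.asList(
            new ShardBounds("logs-2019-10", 100L, 199L),
            new ShardBounds("logs-2019-11", 200L, 299L),
            new ShardBounds("logs-2019-09", 0L, 99L));
        System.out.println(preSort(shards, false)); // descending sort on the field
        System.out.println(preSort(shards, true));  // ascending sort on the field
    }
}
```

With non-overlapping ranges, the first shard queried already holds the global top hits for the requested order, which is what would make the early-termination follow-up possible.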
Pinging @elastic/es-search (:Search/Search)
@elasticmachine run elasticsearch-ci/1
I left some comments but I like it. To your questions:
+1 to reusing can_match, this is better than introducing an additional phase
+1 to not requiring users to tune anything. I wonder whether we should return both the min and the max value so that we could check whether there is significant overlap or not on the coordinating node and optimize accordingly.
I think this logic might make wrong decisions when sorting on a date and some fields have the date mapped as `date` and other ones as `date_nanos`? I don't think it's a big deal, we wouldn't return wrong results, I'm just checking my understanding.
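A small sketch of the concern above, assuming the shard bounds are compared as raw `long` values without normalization: a `date` field stores milliseconds since the epoch while a `date_nanos` field stores nanoseconds since the epoch, so the raw numbers are not comparable. `MixedDateResolutionSketch` and its helpers are hypothetical names used only for illustration.

```java
import java.time.Instant;

class MixedDateResolutionSketch {

    /** Epoch milliseconds, the resolution of a `date` field. */
    static long asMillis(Instant t) {
        return t.toEpochMilli();
    }

    /** Epoch nanoseconds, the resolution of a `date_nanos` field. */
    static long asNanos(Instant t) {
        return Math.addExact(Math.multiplyExact(t.getEpochSecond(), 1_000_000_000L), t.getNano());
    }

    public static void main(String[] args) {
        Instant newer = Instant.parse("2019-11-14T00:00:00Z");
        Instant older = Instant.parse("2019-01-01T00:00:00Z");

        long maxOfDateShard = asMillis(newer);     // shard whose field is mapped as date
        long maxOfDateNanosShard = asNanos(older); // shard whose field is mapped as date_nanos

        // Comparing raw values ranks the older shard as the more recent one.
        System.out.println(maxOfDateNanosShard > maxOfDateShard);
        // Comparing in a common resolution restores the real order.
        System.out.println(asNanos(older) < asNanos(newer));
    }
}
```

Both lines print `true`. Only the order in which shards are visited is affected, which matches the remark that results would not be wrong.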
Thanks for looking @jpountz. I pushed another commit to address your review. The can_match phase now returns both the minimum and the maximum value of the primary sort field, and the coordinating node picks which of the two to sort on depending on the provided order.
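Having both bounds on the coordinating node also makes the overlap check suggested in the review possible. The following is a rough sketch under stated assumptions, not part of this PR: `ShardOverlapSketch`, `Range`, and `anyOverlap` are hypothetical names, values are plain `long`s, and the check only reports whether any two shard ranges overlap at all. What would count as "significant" overlap is left open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ShardOverlapSketch {

    /** Hypothetical [min, max] range of the primary sort field in one shard. */
    static final class Range {
        final long min;
        final long max;

        Range(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** Returns true if at least two ranges share a value (bounds are inclusive). */
    static boolean anyOverlap(List<Range> ranges) {
        List<Range> byMin = new ArrayList<>(ranges);
        byMin.sort(Comparator.comparingLong((Range r) -> r.min));
        long highestMaxSoFar = Long.MIN_VALUE;
        for (int i = 0; i < byMin.size(); i++) {
            Range r = byMin.get(i);
            // After sorting by min, a range overlaps an earlier one
            // exactly when it starts at or before the highest max seen so far.
            if (i > 0 && r.min <= highestMaxSoFar) {
                return true;
            }
            highestMaxSoFar = Math.max(highestMaxSoFar, r.max);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(anyOverlap(Arrays.asList(new Range(0, 99), new Range(100, 199))));
        System.out.println(anyOverlap(Arrays.asList(new Range(0, 150), new Range(100, 199))));
    }
}
```

The first call prints `false` and the second prints `true`. A coordinating node could use such a signal to decide whether pre-sorting is worth relying on for a given request.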
The change looks good, I left some nitpicks about type safety and usage of the comparator API.
…reFilterSearchPhase.java Co-Authored-By: Adrien Grand <jpountz@gmail.com>
…ardsIterator.java Co-Authored-By: Adrien Grand <jpountz@gmail.com>
@elasticmachine run elasticsearch-ci/packaging-sample-matrix
`MinAndMax` encapsulates the min and max values for a shard. It uses generics to ensure that both values are of the same type and are comparable. However, the class currently produces warnings wherever it is used; this commit addresses them. Relates to elastic#49092
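For readers unfamiliar with the pattern, here is a simplified sketch of a generic min/max holder. It is an illustration only and not the actual `MinAndMax` class: `MinAndMaxSketch`, its constructor check, and the `comparator` helper are assumptions made for this example.

```java
import java.util.Comparator;

class MinAndMaxSketch<T extends Comparable<? super T>> {
    private final T min;
    private final T max;

    MinAndMaxSketch(T min, T max) {
        // The shared type parameter guarantees both bounds can be compared.
        if (min.compareTo(max) > 0) {
            throw new IllegalArgumentException("min must not be greater than max");
        }
        this.min = min;
        this.max = max;
    }

    T getMin() {
        return min;
    }

    T getMax() {
        return max;
    }

    /** Orders holders by min for an ascending sort, by max (largest first) for a descending sort. */
    static <T extends Comparable<? super T>> Comparator<MinAndMaxSketch<T>> comparator(boolean ascending) {
        if (ascending) {
            return (a, b) -> a.getMin().compareTo(b.getMin());
        }
        return (a, b) -> b.getMax().compareTo(a.getMax());
    }

    public static void main(String[] args) {
        MinAndMaxSketch<Long> older = new MinAndMaxSketch<>(0L, 99L);
        MinAndMaxSketch<Long> newer = new MinAndMaxSketch<>(100L, 199L);
        // Negative: for a descending sort, the shard with the larger max comes first.
        System.out.println(MinAndMaxSketch.<Long>comparator(false).compare(newer, older) < 0);
    }
}
```

Declaring variables as `MinAndMaxSketch<Long>` rather than the raw `MinAndMaxSketch` is what lets the compiler check uses of the class without raw-type or unchecked warnings.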