Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-sort shards based on the max/min value of the primary sort field #49092

Merged
merged 14 commits into from
Nov 21, 2019

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Nov 14, 2019

This change automatically pre-sort search shards on search requests that use a primary sort based on the value of a field. When possible, the can_match phase will extract the min/max (depending on the provided sort order) values of each shard and use it to pre-sort the shards prior to running the subsequent phases. This feature can be useful to ensure that shards that contain recent data are executed first so that intermediate merges have more chance to contain contiguous data (think of date_histogram for instance) but it could also be used in a follow up to early terminate sorted
top-hits queries that don't require the total hit count. The latter could significantly speed up the retrieval of the most/least recent documents from time-based indices.
I took two shortcuts here that require some discussions:

  • I reused the can_match phase to add the required information for the shard sort. We could instead introduce a new phase but it make sense to me to use the existing phase to add more informations as long as the additional ops are lightweight.
  • The shard sort is done automatically if the primary search sort is based on a field. However this sorting only makes sense if the range of values in each shard doesn't overlap (time-based indices sorted on timestamp for instance). We could add a new option to enable/disable this behavior or even add an additional shard_sort criteria but I also like the fact that users don't need to set any option to benefit from this feature.

Relates #49091

This change automatically pre-sort search shards on search requests that use a primary sort based on the value
of a field. When possible, the can_match phase will extract the min/max (depending on the provided sort order) values
of each shard and use it to pre-sort the shards prior to running the subsequent phases. This feature can be useful to
ensure that shards that contain recent data are executed first so that intermediate merge have more chance to contain
contiguous data (think of date_histogram for instance) but it could also be used in a follow up to early terminate sorted
top-hits queries that don't require the total hit count. The latter could significantly speed up the retrieval of the most/least
recent documents from time-based indices.
I took two shortcuts here:
* I reused the can_match phase to add the required information for the shard sort. We could instead introduce a new phase
 but it make sense to me to use the existing phase to add more informations as long as the additional ops are lightweight.
* The shard sort is done automatically if the primary search sort is based on a field. However this sorting only makes sense
if the range of values in each shard doesn't overlap (time-based indices sorted on timestamp for instance). We could add
a new option to enable/disable this behavior or even add an additional `shard_sort` criteria but I also like the fact that
users don't need to set any option to benefit from this feature.

Relates elastic#49091
@jimczi jimczi added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.6.0 labels Nov 14, 2019
@jimczi jimczi requested a review from javanna November 14, 2019 14:05
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

@javanna javanna mentioned this pull request Nov 14, 2019
6 tasks
@jimczi
Copy link
Contributor Author

jimczi commented Nov 15, 2019

@elasticmachine run elasticsearch-ci/1

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments but I like it. To your questions:
+1 to reusing can_match, this is better than introducing an additional phase
+1 to not requiring users to tune anything. I wonder whether we should return both the min and the max value so that we could check whether there is significant overlap or not on the coordinating node and optimize accordingly.

I think this logic might make wrong decisions when sorting on a date and some fields have the date mapped as date and other ones as date_nanos? I don't think it's a bug deal, we wouldn't return wrong results, I'm just checking my understanding.

@jimczi
Copy link
Contributor Author

jimczi commented Nov 20, 2019

Thanks for looking @jpountz. I pushed another commit to address your review. The can_match phase now returns the minimum and the maximum value of the primary sort and we pick the sort value in the coordinator node depending on the provided order.

@jpountz
Copy link
Contributor

jpountz commented Nov 20, 2019

The change looks good, Ieft some nit picks about type safety and usage of the comparator API.

jimczi and others added 8 commits November 20, 2019 23:38
…reFilterSearchPhase.java

Co-Authored-By: Adrien Grand <jpountz@gmail.com>
…reFilterSearchPhase.java

Co-Authored-By: Adrien Grand <jpountz@gmail.com>
…reFilterSearchPhase.java

Co-Authored-By: Adrien Grand <jpountz@gmail.com>
…reFilterSearchPhase.java

Co-Authored-By: Adrien Grand <jpountz@gmail.com>
…ardsIterator.java

Co-Authored-By: Adrien Grand <jpountz@gmail.com>
Co-Authored-By: Adrien Grand <jpountz@gmail.com>
@jimczi
Copy link
Contributor Author

jimczi commented Nov 21, 2019

@elasticmachine run elasticsearch-ci/packaging-sample-matrix

@jimczi jimczi merged commit b8ce07b into elastic:master Nov 21, 2019
@jimczi jimczi deleted the shard_sort branch November 21, 2019 21:45
jimczi added a commit that referenced this pull request Nov 22, 2019
…49092)

This change automatically pre-sort search shards on search requests that use a primary sort based on the value of a field. When possible, the can_match phase will extract the min/max (depending on the provided sort order) values of each shard and use it to pre-sort the shards prior to running the subsequent phases. This feature can be useful to ensure that shards that contain recent data are executed first so that intermediate merge have more chance to contain contiguous data (think of date_histogram for instance) but it could also be used in a follow up to early terminate sorted top-hits queries that don't require the total hit count. The latter could significantly speed up the retrieval of the most/least
recent documents from time-based indices.

Relates #49091
jimczi added a commit that referenced this pull request Nov 22, 2019
javanna added a commit to javanna/elasticsearch that referenced this pull request Feb 21, 2020
The MinAndMax encapsulates min and max values for a shard. It uses generics to make sure that the values are of the same type and are also comparable. Though there are warnings whenever this class is currently used, which are addressed with this commit.

Relates to elastic#49092
javanna added a commit that referenced this pull request Mar 3, 2020
`MinAndMax` encapsulates min and max values for a shard. It uses generics to make sure that the values are of the same type and are also comparable. Though there are warnings whenever this class is currently used, which are addressed with this commit.

Relates to #49092
javanna added a commit to javanna/elasticsearch that referenced this pull request Mar 3, 2020
`MinAndMax` encapsulates min and max values for a shard. It uses generics to make sure that the values are of the same type and are also comparable. Though there are warnings whenever this class is currently used, which are addressed with this commit.

Relates to elastic#49092
javanna added a commit that referenced this pull request Mar 3, 2020
`MinAndMax` encapsulates min and max values for a shard. It uses generics to make sure that the values are of the same type and are also comparable. Though there are warnings whenever this class is currently used, which are addressed with this commit.

Relates to #49092
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories v7.6.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants