Add a new merge policy that interleaves old and new segments on force merge #48533
… merge This change adds a new merge policy that interleaves the oldest and newest segments picked by MergePolicy#findForcedMerges and MergePolicy#findForcedDeletesMerges. This allows time-based indices, which usually have the oldest documents first, to be efficient at finding the most recent documents too. We wrap the merge policy for all indices even though it is mostly useful for time-based indices; there should be no overhead for other types of indices, so this is simpler than adding a setting to enable it. This change is needed to ensure that the optimizations that we are working on in # remain efficient even after running a force merge. Relates elastic#37043
Pinging @elastic/es-distributed (:Distributed/Engine)
// and then interleave them to colocate oldest and most recent segments together.
private List<SegmentCommitInfo> interleaveList(List<SegmentCommitInfo> infos) throws IOException {
    List<SegmentCommitInfo> newInfos = new ArrayList<>(infos.size());
    Collections.sort(infos, Comparator.comparing(a -> a.info.name));
I think we should avoid changing `infos` in place.
Making a copy would also help ensure that the list supports random-access.
++, I pushed aae5c30
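For readers following this thread, here is a minimal sketch of the idea under discussion: sort a copy of the input (so the caller's list is untouched and random access is guaranteed), then pair the oldest entries with the newest. It is not the code from the PR. It uses plain strings instead of Lucene's `SegmentCommitInfo`, the class and method names are hypothetical, and it assumes the supplied comparator orders segments from oldest to newest.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not the class added by this PR.
final class InterleaveSketch {

    // Returns a new list that alternates oldest and newest elements.
    // The input list is never modified.
    static <T> List<T> interleave(List<T> items, Comparator<? super T> oldestFirst) {
        // Copy into an ArrayList: protects the caller's list and gives O(1) get().
        List<T> sorted = new ArrayList<>(items);
        sorted.sort(oldestFirst);

        List<T> result = new ArrayList<>(sorted.size());
        int left = 0;
        int right = sorted.size() - 1;
        while (left <= right) {
            result.add(sorted.get(left));
            if (left != right) {
                // Avoid adding the middle element twice for odd sizes.
                result.add(sorted.get(right));
            }
            left++;
            right--;
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption for the demo: these names sort in creation order.
        List<String> segments = List.of("_3", "_0", "_4", "_1", "_2");
        System.out.println(interleave(segments, Comparator.naturalOrder()));
        // prints [_0, _4, _1, _3, _2]
    }
}
```

Because the sort runs on the copy, an immutable input such as `List.of(...)` works as well, which an in-place `Collections.sort` would reject.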
// We wrap the merge policy for all indices even though it is mostly useful for time-based indices
// but there should be no overhead for other type of indices so it's simpler than adding a setting
// to enable it.
mergePolicy = new ShuflleForcedMergePolicy(mergePolicy);
I agree with doing it all the time for simplicity, but can you add an escape hatch in case it proves problematic for some use-cases?
Sure, I added a system property in aae5c30.
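The kind of escape hatch requested here could look roughly like the sketch below: a boolean system property that defaults to enabled and, when set to `false`, leaves the original policy unwrapped. The property name `example.shuffle_forced_merge` and every identifier in the block are hypothetical, since the thread does not state the name used in aae5c30. The policy is modelled as a generic type so that the example runs without Lucene.

```java
import java.util.function.UnaryOperator;

// Hypothetical illustration; names do not come from the PR.
final class EscapeHatchSketch {

    // Assumed property name, for illustration only.
    static final String PROPERTY = "example.shuffle_forced_merge";

    // Enabled unless the property is explicitly set to something other than "true".
    static boolean shuffleEnabled() {
        return Boolean.parseBoolean(System.getProperty(PROPERTY, "true"));
    }

    // Applies the wrapper only when the escape hatch has not been pulled.
    static <P> P maybeWrap(P policy, UnaryOperator<P> wrapper) {
        return shuffleEnabled() ? wrapper.apply(policy) : policy;
    }

    public static void main(String[] args) {
        // Strings stand in for merge policies in this demo.
        System.out.println(maybeWrap("tiered", p -> "shuffle(" + p + ")"));
    }
}
```

Running the demo with `-Dexample.shuffle_forced_merge=false` would print the unwrapped value, which is the behaviour an operator would rely on if the wrapper proved problematic.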
… merge (#48533) This change adds a new merge policy that interleaves the oldest and newest segments picked by MergePolicy#findForcedMerges and MergePolicy#findForcedDeletesMerges. This allows time-based indices, which usually have the oldest documents first, to be efficient at finding the most recent documents too. We wrap the merge policy for all indices even though it is mostly useful for time-based indices; there should be no overhead for other types of indices, so this is simpler than adding a setting to enable it. This change is needed to ensure that the optimizations that we are working on in # remain efficient even after running a force merge. Relates #37043
Measure the performance of sort operations after force merging to 1 segment. PR elastic/elasticsearch#48533 adds a new merge policy that interleaves old and new segments on force merge. This checks the sort performance with this policy after docs are merged to 1 segment.
This rewrites a sort on a long field as a `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents. Depending on the dataset, the speedup can be 2 to 10 times. The optimization can be disabled by setting the system property `es.search.rewrite_sort` to `false`. The optimization is skipped when 50% or more of the data in an index has the same value.

The optimization is done through:
1. Rewriting the sort as a `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents.
2. Sorting segments according to the primary numeric sort field (#44021). This allows non-competitive segments to be skipped.
3. Using a collector manager. When we optimize a sort, we sort segments by their min/max value. As a collector expects to receive segments in order, we cannot use a single collector for sorted segments. We use a collectorManager, where a dedicated collector is created for every segment.
4. Using Lucene's shared TopFieldCollector manager. This collector manager is able to exchange the minimum competitive score between collectors, which allows us to efficiently skip whole segments that don't contain competitive scores.
5. When an index is force merged to a single segment, interleaving old and new segments (#48533) allows for this optimization as well, as blocks with non-competitive docs can be skipped.

Closes #37043
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
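To make the segment-skipping idea concrete, here is a simplified sketch of a descending top-k search that visits segments in order of their maximum value and stops once no remaining segment can be competitive. It is not Lucene's `DistanceFeatureQuery` or collector machinery. The `Segment` class, the in-memory arrays, and all names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration; not Lucene or Elasticsearch code.
final class SegmentSkipSketch {

    // A toy segment: a name plus the values of one long field.
    static final class Segment {
        final String name;
        final long[] values;
        Segment(String name, long... values) { this.name = name; this.values = values; }
        long max() {
            long m = Long.MIN_VALUE;
            for (long v : values) m = Math.max(m, v);
            return m;
        }
    }

    static final class Result {
        final List<Long> topK;
        final int segmentsVisited;
        Result(List<Long> topK, int segmentsVisited) { this.topK = topK; this.segmentsVisited = segmentsVisited; }
    }

    static Result topKDescending(List<Segment> segments, int k) {
        // Visit the segments with the largest max first.
        List<Segment> ordered = new ArrayList<>(segments);
        ordered.sort(Comparator.comparingLong((Segment s) -> s.max()).reversed());

        PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap of the current top k
        int visited = 0;
        for (Segment s : ordered) {
            // Once the heap is full, its smallest entry is the minimum competitive value.
            // Segments are ordered by max, so every remaining segment is non-competitive too.
            if (heap.size() == k && s.max() <= heap.peek()) {
                break;
            }
            visited++;
            for (long v : s.values) {
                if (heap.size() < k) {
                    heap.add(v);
                } else if (v > heap.peek()) {
                    heap.poll();
                    heap.add(v);
                }
            }
        }
        List<Long> top = new ArrayList<>(heap);
        top.sort(Comparator.reverseOrder());
        return new Result(top, visited);
    }

    public static void main(String[] args) {
        List<Segment> segments = List.of(
            new Segment("a", 1, 2, 3), new Segment("b", 4, 5, 6), new Segment("c", 7, 8, 9));
        Result r = topKDescending(segments, 2);
        System.out.println(r.topK + " after visiting " + r.segmentsVisited + " segment(s)");
        // prints [9, 8] after visiting 1 segment(s)
    }
}
```

In this toy time-ordered layout only one of three segments is read. After a force merge to a single segment there are no segments left to skip, which is why the description relies on skipping blocks inside the merged segment instead.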
This rewrites a sort on a long field as a `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents. Depending on the dataset, the speedup can be 2 to 10 times. The optimization can be disabled by setting the system property `es.search.rewrite_sort` to `false`. The optimization is skipped when 50% or more of the data in an index has the same value.

The optimization is done through:
1. Rewriting the sort as a `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents.
2. Sorting segments according to the primary numeric sort field (#44021). This allows non-competitive segments to be skipped.
3. Using a collector manager. When we optimize a sort, we sort segments by their min/max value. As a collector expects to receive segments in order, we cannot use a single collector for sorted segments. We use a collectorManager, where a dedicated collector is created for every segment.
4. Using Lucene's shared TopFieldCollector manager. This collector manager is able to exchange the minimum competitive score between collectors, which allows us to efficiently skip whole segments that don't contain competitive scores.
5. When an index is force merged to a single segment, interleaving old and new segments (#48533) allows for this optimization as well, as blocks with non-competitive docs can be skipped.

Backport for #48804
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
This change adds a new merge policy that interleaves the oldest and newest segments picked by MergePolicy#findForcedMerges and MergePolicy#findForcedDeletesMerges. This allows time-based indices, which usually have the oldest documents first, to be efficient at finding the most recent documents too. We wrap the merge policy for all indices even though it is mostly useful for time-based indices; there should be no overhead for other types of indices, so this is simpler than adding a setting to enable it. This change is needed to ensure that the optimizations that we are working on in #37043 remain efficient even after running a force merge.
Relates #37043