Add a new merge policy that interleaves old and new segments on force merge #48533

jimczi · 2019-10-25T14:59:07Z

This change adds a new merge policy that interleaves the eldest and newest segments picked by MergePolicy#findForcedMerges and MergePolicy#findForcedDeletesMerges. This allows time-based indices, which usually have the eldest documents first, to be efficient at finding the most recent documents too. We wrap this merge policy for all indices even though it is mostly useful for time-based ones; there should be no overhead for other types of indices, so this is simpler than adding a setting to enable it. This change is needed to ensure that the optimizations we are working on in #37043 remain efficient even after running a force merge.

Relates #37043


elasticmachine · 2019-10-25T14:59:10Z

Pinging @elastic/es-distributed (:Distributed/Engine)

jpountz · 2019-10-25T15:03:42Z

server/src/main/java/org/apache/lucene/index/ShuflleForcedMergePolicy.java

+    // and then interleave them to colocate oldest and most recent segments together.
+    private List<SegmentCommitInfo> interleaveList(List<SegmentCommitInfo> infos) throws IOException {
+        List<SegmentCommitInfo> newInfos = new ArrayList<>(infos.size());
+        Collections.sort(infos, Comparator.comparing(a -> a.info.name));


I think we should avoid changing infos in place.

Making a copy would also help ensure that the list supports random-access.

++, I pushed aae5c30
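The two review points above (do not sort `infos` in place, and copy so that random access is guaranteed) can be illustrated with a minimal sketch. This is not the code pushed in aae5c30: the class `InterleaveSketch`, the method `interleave`, and the generic element type are hypothetical stand-ins for `SegmentCommitInfo`, and the sketch only alternates oldest and newest entries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: names and generic element type are assumptions, not the PR's code.
final class InterleaveSketch {

    // Returns a new list that alternates the oldest and newest remaining elements.
    // The caller's list is never modified.
    static <T> List<T> interleave(List<T> items, Comparator<? super T> byAge) {
        // Copy first: avoids mutating the input and guarantees O(1) get(i).
        List<T> sorted = new ArrayList<>(items);
        sorted.sort(byAge);

        List<T> result = new ArrayList<>(sorted.size());
        int left = 0;
        int right = sorted.size() - 1;
        while (left < right) {
            result.add(sorted.get(left++));  // oldest remaining
            result.add(sorted.get(right--)); // newest remaining
        }
        if (left == right) {
            result.add(sorted.get(left));    // middle element of an odd-sized list
        }
        return result;
    }
}
```

Because the method works on a copy, it also accepts immutable inputs such as `List.of(...)`, which an in-place `Collections.sort` would reject.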

jpountz · 2019-10-25T15:39:27Z

server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

+        // We wrap the merge policy for all indices even though it is mostly useful for time-based indices
+        // but there should be no overhead for other type of indices so it's simpler than adding a setting
+        // to enable it.
+        mergePolicy = new ShuflleForcedMergePolicy(mergePolicy);


I agree with doing it all the time for simplicity, but can you add an escape hatch in case it proves problematic for some use-cases?

Sure I added a system property in aae5c30
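A system-property escape hatch of this kind could look roughly like the sketch below. The thread does not state the actual property name, so `example.shuffle_forced_merge` is a hypothetical placeholder, as are `EscapeHatchSketch` and `maybeWrap`; the wrapping is enabled by default and skipped only when the property is set to `false`.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch of an opt-out switch; the property name is an assumption.
final class EscapeHatchSketch {

    // Placeholder name: the real property used in aae5c30 is not given in this thread.
    static final String PROPERTY = "example.shuffle_forced_merge";

    // Enabled unless the property is explicitly set to something other than "true".
    static boolean isEnabled() {
        return Boolean.parseBoolean(System.getProperty(PROPERTY, "true"));
    }

    // Applies the wrapper (e.g. policy -> new ShuflleForcedMergePolicy(policy)) only when enabled.
    static <P> P maybeWrap(P policy, UnaryOperator<P> wrapper) {
        return isEnabled() ? wrapper.apply(policy) : policy;
    }
}
```

With such a switch, a node started with `-Dexample.shuffle_forced_merge=false` would keep the unwrapped merge policy.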


Measure the performance of sort operations after force merging to 1 segment. PR elastic/elasticsearch#48533 adds a new merge policy that interleaves old and new segments on force merge. This checks the sort performance with this policy after docs are merged to 1 segment.

This rewrites long sort as a `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents. Depending on the dataset, the speedups can be 2 to 10 times. The optimization can be disabled by setting the system property `es.search.rewrite_sort` to `false`. The optimization is skipped when an index has 50% or more data with the same value.

The optimization is done through:

1. Rewriting sort as `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents.
2. Sorting segments according to the primary numeric sort field (#44021). This allows skipping non-competitive segments.
3. Using a collector manager. When we optimize sort, we sort segments by their min/max value. As a collector expects to have segments in order, we cannot use a single collector for sorted segments. We use collectorManager, where a dedicated collector is created for every segment.
4. Using Lucene's shared TopFieldCollector manager. This collector manager is able to exchange the minimum competitive score between collectors, which allows us to efficiently skip whole segments that don't contain competitive scores.
5. When an index is force merged to a single segment, #48533 (interleaving old and new segments) allows for this optimization as well, as blocks with non-competitive docs can be skipped.

Closes #37043

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>

This rewrites long sort as a `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents. Depending on the dataset, the speedups can be 2 to 10 times. The optimization can be disabled by setting the system property `es.search.rewrite_sort` to `false`. The optimization is skipped when an index has 50% or more data with the same value.

The optimization is done through:

1. Rewriting sort as `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents.
2. Sorting segments according to the primary numeric sort field (#44021). This allows skipping non-competitive segments.
3. Using a collector manager. When we optimize sort, we sort segments by their min/max value. As a collector expects to have segments in order, we cannot use a single collector for sorted segments. We use collectorManager, where a dedicated collector is created for every segment.
4. Using Lucene's shared TopFieldCollector manager. This collector manager is able to exchange the minimum competitive score between collectors, which allows us to efficiently skip whole segments that don't contain competitive scores.
5. When an index is force merged to a single segment, #48533 (interleaving old and new segments) allows for this optimization as well, as blocks with non-competitive docs can be skipped.

Backport for #48804

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
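The idea of ordering segments by their min/max value and skipping those that cannot be competitive can be sketched in plain Java. This is a simplified, single-threaded illustration and not Lucene's collector code: `SegmentSkipSketch`, `Segment`, `topK`, and `visitedSegments` are hypothetical names, and a min-heap stands in for the shared minimum competitive value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: top-k descending sort over segments with known min/max values.
final class SegmentSkipSketch {

    // Stand-in for a segment: the values of one long field plus their min and max.
    static final class Segment {
        final long[] values;
        final long min;
        final long max;

        Segment(long... values) {
            this.values = values;
            long lo = Long.MAX_VALUE;
            long hi = Long.MIN_VALUE;
            for (long v : values) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
            this.min = lo;
            this.max = hi;
        }
    }

    // Number of segments scanned by the last call to topK (for illustration only).
    static int visitedSegments;

    static List<Long> topK(List<Segment> segments, int k) {
        visitedSegments = 0;
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Visit the segments with the largest values first.
        List<Segment> ordered = new ArrayList<>(segments);
        ordered.sort(Comparator.comparingLong((Segment s) -> s.max).reversed());

        // Min-heap: its head is the weakest of the current top k (the competitive bound).
        PriorityQueue<Long> heap = new PriorityQueue<>();
        for (Segment segment : ordered) {
            if (heap.size() == k && segment.max <= heap.peek()) {
                break; // this segment and all later ones cannot beat the current top k
            }
            visitedSegments++;
            for (long v : segment.values) {
                if (heap.size() < k) {
                    heap.add(v);
                } else if (v > heap.peek()) {
                    heap.poll();
                    heap.add(v);
                }
            }
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Comparator.reverseOrder());
        return result;
    }
}
```

After a force merge to one segment there is nothing left to skip at the segment level, which is why the commit message relies on interleaving so that skipping can still happen at the block level.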

jimczi added >enhancement, :Distributed Indexing/Engine, v8.0.0, v7.6.0 labels Oct 25, 2019

jimczi requested review from jpountz and mayya-sharipova October 25, 2019 14:59

jpountz approved these changes Oct 25, 2019


address review

aae5c30

jpountz approved these changes Oct 28, 2019


jimczi merged commit 5297e5a into elastic:master Oct 29, 2019

jimczi deleted the interleaved_forced_merge branch October 29, 2019 08:00

dnhatn mentioned this pull request Oct 31, 2019

testForceMergeWithSoftDeletesRetentionAndRecoverySource fails #48735

Closed

mayya-sharipova mentioned this pull request Nov 13, 2019

http_logs add force merge to 1 segment elastic/rally-tracks#90

Merged

mayya-sharipova mentioned this pull request Nov 29, 2019

Optimize sort on numeric long and date fields. #49732

Merged

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
