Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new merge policy that interleaves old and new segments on force merge #48533

Merged
merged 2 commits into from
Oct 29, 2019

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Oct 25, 2019

This change adds a new merge policy that interleaves eldest and newest segments picked by MergePolicy#findForcedMerges and MergePolicy#findForcedDeletesMerges. This allows time-based indices, that usually have the eldest documents first, to be efficient at finding the most recent documents too. Although we wrap this merge policy for all indices even though it is mostly useful for time-based but there should be no overhead for other type of indices so it's simpler
than adding a setting to enable it. This change is needed in order to ensure that the optimizations that we are working on in #37043 remain efficient even after running a force merge.

Relates #37043

… merge

This change adds a new merge policy that interleaves eldest and newest segments picked by MergePolicy#findForcedMerges
and MergePolicy#findForcedDeletesMerges. This allows time-based indices, that usually have the eldest documents
first, to be efficient at finding the most recent documents too. Although we wrap this merge policy for all indices
even though it is mostly useful for time-based but there should be no overhead for other type of indices so it's simpler
than adding a setting to enable it. This change is needed in order to ensure that the optimizations that we are working
on in # remain efficient even after running a force merge.

Relates elastic#37043
@jimczi jimczi added >enhancement :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v8.0.0 v7.6.0 labels Oct 25, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Engine)

// and then interleave them to colocate oldest and most recent segments together.
private List<SegmentCommitInfo> interleaveList(List<SegmentCommitInfo> infos) throws IOException {
List<SegmentCommitInfo> newInfos = new ArrayList<>(infos.size());
Collections.sort(infos, Comparator.comparing(a -> a.info.name));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should avoid changing infos in place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making a copy would also help ensure that the list supports random-access.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++, I pushed aae5c30

// We wrap the merge policy for all indices even though it is mostly useful for time-based indices
// but there should be no overhead for other type of indices so it's simpler than adding a setting
// to enable it.
mergePolicy = new ShuflleForcedMergePolicy(mergePolicy);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with doing it all the time for simplicity, but can you add an escape hatch in case it proves problematic for some use-cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I added a system property in aae5c30

@jimczi jimczi merged commit 5297e5a into elastic:master Oct 29, 2019
@jimczi jimczi deleted the interleaved_forced_merge branch October 29, 2019 08:00
jimczi added a commit that referenced this pull request Oct 29, 2019
… merge (#48533)

This change adds a new merge policy that interleaves eldest and newest segments picked by MergePolicy#findForcedMerges
and MergePolicy#findForcedDeletesMerges. This allows time-based indices, that usually have the eldest documents
first, to be efficient at finding the most recent documents too. Although we wrap this merge policy for all indices
even though it is mostly useful for time-based but there should be no overhead for other type of indices so it's simpler
than adding a setting to enable it. This change is needed in order to ensure that the optimizations that we are working
on in # remain efficient even after running a force merge.

Relates #37043
mayya-sharipova added a commit to elastic/rally-tracks that referenced this pull request Nov 14, 2019
Measure the performance of sort operations
after force merging to 1 segment.

PR elastic/elasticsearch#48533 adds a new merge policy that
interleaves old and new segments on force merge. 
This checks the sort performance with this policy 
after docs are merged to 1 segment.
mayya-sharipova added a commit to elastic/rally-tracks that referenced this pull request Nov 14, 2019
Measure the performance of sort operations
after force merging to 1 segment.

PR elastic/elasticsearch#48533 adds a new merge policy that
interleaves old and new segments on force merge. 
This checks the sort performance with this policy 
after docs are merged to 1 segment.
mayya-sharipova added a commit that referenced this pull request Nov 26, 2019
This rewrites long sort as a `DistanceFeatureQuery`, which can
efficiently skip non-competitive blocks and segments of documents.
Depending on the dataset, the speedups can be 2 - 10 times.

The optimization can be disabled with setting the system property
`es.search.rewrite_sort` to `false`.

Optimization is skipped when an index has 50% or more data with
the same value.

Optimization is done through:
1. Rewriting sort as `DistanceFeatureQuery` which can
efficiently skip non-competitive blocks and segments of documents.

2. Sorting segments according to the primary numeric sort field(#44021)
This allows to skip non-competitive segments.

3. Using collector manager.
When we optimize sort, we sort segments by their min/max value.
As a collector expects to have segments in order,
we can not use a single collector for sorted segments.
We use collectorManager, where for every segment a dedicated collector
will be created.

4. Using Lucene's shared TopFieldCollector manager
This collector manager is able to exchange minimum competitive
score between collectors, which allows us to efficiently skip
the whole segments that don't contain competitive scores.

5. When index is force merged to a single segment, #48533 interleaving
old and new segments allows for this optimization as well,
as blocks with non-competitive docs can be skipped.

Closes #37043

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
mayya-sharipova added a commit that referenced this pull request Nov 29, 2019
This rewrites long sort as a `DistanceFeatureQuery`, which can
efficiently skip non-competitive blocks and segments of documents.
Depending on the dataset, the speedups can be 2 - 10 times.

The optimization can be disabled with setting the system property
`es.search.rewrite_sort` to `false`.

Optimization is skipped when an index has 50% or more data with
the same value.

Optimization is done through:
1. Rewriting sort as `DistanceFeatureQuery` which can
efficiently skip non-competitive blocks and segments of documents.

2. Sorting segments according to the primary numeric sort field(#44021)
This allows to skip non-competitive segments.

3. Using collector manager.
When we optimize sort, we sort segments by their min/max value.
As a collector expects to have segments in order,
we can not use a single collector for sorted segments.
We use collectorManager, where for every segment a dedicated collector
will be created.

4. Using Lucene's shared TopFieldCollector manager
This collector manager is able to exchange minimum competitive
score between collectors, which allows us to efficiently skip
the whole segments that don't contain competitive scores.

5. When index is force merged to a single segment, #48533 interleaving
old and new segments allows for this optimization as well,
as blocks with non-competitive docs can be skipped.

Backport for #48804


Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v7.6.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants