Use LogByteSizeMergePolicy instead of TieredMergePolicy for time-based data. #92684
Conversation
`TieredMergePolicy` is better than `LogByteSizeMergePolicy` at computing cheaper merges, but it does so by allowing itself to merge non-adjacent segments. An important property we get when only merging adjacent segments and data gets indexed in time order is that segments have non-overlapping time ranges. This means that a range query on the time field will only partially match at most 2 segments, and other segments will either fully match or not match at all. This also means that for segments that partially match a range query, the matching docs will likely be clustered together in a sequential range of doc IDs, i.e. there will be good data locality.
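For context, here is a minimal, self-contained Lucene sketch (not the Elasticsearch code from this PR; the parameter values are only illustrative) of configuring an `IndexWriter` with `LogByteSizeMergePolicy`, which only merges adjacent segments:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.ByteBuffersDirectory;

public class AdjacentMergeExample {
    public static void main(String[] args) throws Exception {
        LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        mergePolicy.setMergeFactor(16);      // merge up to ~16 adjacent segments per level
        mergePolicy.setMinMergeMB(2.0);      // treat segments below 2MB as 2MB for level computation
        mergePolicy.setMaxMergeMB(5 * 1024); // segments above ~5GB are no longer merge candidates

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer()).setMergePolicy(mergePolicy);
        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), config)) {
            // Index documents in @timestamp order here. Because only adjacent segments are
            // merged, each segment ends up covering a (mostly) disjoint slice of time.
        }
    }
}
```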
Hi @jpountz, I've created a changelog YAML for you.
Pinging @elastic/es-distributed (Team:Distributed)
This looks good, left a couple of comments.
```java
}

void setMaxMergedSegment(ByteSizeValue maxMergedSegment) {
    mergePolicy.setMaxMergedSegmentMB(maxMergedSegment.getMbFrac());
    tieredMergePolicy.setMaxMergedSegmentMB(maxMergedSegment.getMbFrac());
    // Note: max merge MB has different semantics on LogByteSizeMergePolicy: it's the maximum size for a segment to be considered for a
```
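As the (truncated) code comment above hints, the "max" knobs of the two Lucene policies don't mean the same thing. A small illustration of the difference (the 5GB value is just an example):

```java
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.TieredMergePolicy;

class MaxMergeSemantics {
    static void configure() {
        LogByteSizeMergePolicy logPolicy = new LogByteSizeMergePolicy();
        // On LogByteSizeMergePolicy, maxMergeMB bounds the size of a *source* segment that
        // may still be selected for merging; a merge of several ~5GB segments can therefore
        // produce a result well above 5GB.
        logPolicy.setMaxMergeMB(5 * 1024);

        TieredMergePolicy tieredPolicy = new TieredMergePolicy();
        // On TieredMergePolicy, maxMergedSegmentMB is an (approximate) cap on the size of
        // the segment *produced* by a merge.
        tieredPolicy.setMaxMergedSegmentMB(5 * 1024);
    }
}
```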
I think this opens us up to degenerate cases at both ends, either now stopping at 2.5GB or going up to 25GB. I imagine this being provoked by different input styles (document size, frequency, etc.), i.e. some data streams work well (hitting around 5GB) whereas others may suffer from worse search or indexing performance (due to the larger merges)?
I wonder if we could adapt the log merge policy to be closer to the tiered merge policy here. I.e., if it can merge together 2 adjacent segments but not 3, then merge the 2?
Thanks for your question, it made me check the actual `LogByteSizeMergePolicy` behavior, and my comment is actually not right: `LogByteSizeMergePolicy` has worked the way you describe since May last year via apache/lucene#935, which introduced this behavior as part of fixing another problem with this merge policy. The fact that this merge policy packs segments together is even tested.
```java
@Override
MergePolicy getMergePolicy(MergePolicyConfig config, boolean isTimeSeriesIndex) {
    if (isTimeSeriesIndex) {
        // TieredMergePolicy is better than LogByteSizeMergePolicy at computing cheaper merges, but it does so by allowing
```
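The method is truncated in the diff above; as a hedged sketch (class and field names here are assumptions, not the actual Elasticsearch code), the selection boils down to something like:

```java
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.TieredMergePolicy;

class MergePolicySelectionSketch {
    private final LogByteSizeMergePolicy logByteSizeMergePolicy = new LogByteSizeMergePolicy();
    private final TieredMergePolicy tieredMergePolicy = new TieredMergePolicy();

    MergePolicy getMergePolicy(boolean isTimeSeriesIndex) {
        // Time-based indices get adjacent-only merging so that segments keep (mostly)
        // non-overlapping @timestamp ranges; everything else keeps the default tiered
        // policy, which is better at finding cheap merges.
        return isTimeSeriesIndex ? logByteSizeMergePolicy : tieredMergePolicy;
    }
}
```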
This is formulated as if it could in practice be (substantially) more expensive to use the log merge policy; perhaps we can elaborate here on why this is unlikely to be the case for data streams?
Agreed, I pushed an update that makes it sound like a better trade-off.
I ran two Rally runs on the NYC taxis dataset with the following options to compare ingestion rates:
For reference here are final segment sizes in both cases:
I dug into this and this looks like an expected property of
So e.g. in the final segment structure we get with
I ran into some simulations using Lucene's
Another idea I've been wanting to play with consists of merging segments less aggressively for time-based data. So I ran the NYC taxis indexing benchmark again with
New iteration: instead of trying to reuse the same "segments per tier" configuration option of
I'll do more analysis of segment distributions with these two different merge policies.
I simulated an index that gets loaded with segments whose size is anywhere between 0 and 2MB until it reaches 50GB with both
Overall I find it very appealing: the fact that
The way the number of segments was computed, as the average since the beginning, gave too much weight to early point-in-time views of the index, so I changed it to be a moving average over the last 1,000 point-in-time views of the index and decreased the merge factor to 16. It's now essentially impossible to tell which one of the two merge policies has more segments on average, while the patch still has a much lower write amplification. Here is a new video. I've had quite a few iterations on this PR so I'll try to summarize what this change is and the benefits:
So what are the downsides?
All these downsides feel minor to me, especially in the context of time-based data.
I am not sure I follow/see this part:

> The way the number of segments was computed, as the average since the beginning, gave too much weight to early point-in-time views of the index, so I changed it to be a moving average over the last 1,000 point-in-time views of the index and decreased the merge factor to 16.

Can you elaborate on where this is manifested in the code?
```java
private final Logger logger;
private final boolean mergesEnabled;
private volatile Type mergePolicyType;

public static final double DEFAULT_EXPUNGE_DELETES_ALLOWED = 10d;
public static final ByteSizeValue DEFAULT_FLOOR_SEGMENT = new ByteSizeValue(2, ByteSizeUnit.MB);
public static final int DEFAULT_MAX_MERGE_AT_ONCE = 10;
public static final ByteSizeValue DEFAULT_MAX_MERGED_SEGMENT = new ByteSizeValue(5, ByteSizeUnit.GB);
```
Should we consider raising this default "a bit" when using the log-byte-size policy, to get to a similar segment size as with the tiered merge policy?
I say this because I could imagine us wanting to merge up all the remaining small segments after rollover. And with just below 5GB, that may give 11 segments rather than 10, which seems wasteful. It would be good to have a ~5.1GB average large segment size - to match our default of 50GB shards.
For the record, `TieredMergePolicy` may stop merging anywhere between 2.5GB and 5GB (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java#L375-L390). This is less frequent than with `LogByteSizeMergePolicy` because it can pack small segments together with large segments thanks to its ability to merge non-adjacent segments, but if one of the background merges produces a 4GB segment, then this segment won't get merged further because the merge would mostly consist of rewriting this large segment, which is wasteful.
In my opinion, what we should do depends on what we think is the best practice for time-based indices (a small configuration sketch of these options follows the list):
- If we think that old indices should get force-merged to a single segment, it might make more sense to lower the max merged segment size to something like 1GB in order to save merging while the index is actively being written: since all segments will get rewritten in the end, it doesn't really matter where we end up with background merges, and leaning towards smaller segments would help decrease the merging overhead on ingestion.
- If we think it's a better practice to only merge the smaller segments and end up with segments on the order of 5GB on average, as you are suggesting, then it might indeed make more sense to set the max merged segment size to something like 6GB.
- Or maybe we should not configure a maximum merged segment size at all on time-based indices, because the rollover size already somewhat bounds the maximum size of segments. If you have 10 1GB segments, why would you do two 5-segment merges to get two 5GB segments when doing a single 10-segment merge has roughly the same cost and results in fewer segments?
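To make the three options concrete, here is a small sketch in terms of Lucene's `LogByteSizeMergePolicy` (my own illustration; the values come from the numbers discussed in this thread, not necessarily what the PR ships):

```java
import org.apache.lucene.index.LogByteSizeMergePolicy;

class MaxMergedSegmentOptions {
    // Option 1: low roof (~1GB), relying on a force-merge after rollover to clean up.
    static LogByteSizeMergePolicy lowRoof() {
        LogByteSizeMergePolicy p = new LogByteSizeMergePolicy();
        p.setMaxMergeMB(1024);
        return p;
    }

    // Option 2: slightly above 5GB (~6GB) so background merges land near 5GB segments.
    static LogByteSizeMergePolicy tieredLikeRoof() {
        LogByteSizeMergePolicy p = new LogByteSizeMergePolicy();
        p.setMaxMergeMB(6 * 1024);
        return p;
    }

    // Option 3: effectively no roof; a very high value (e.g. the 100GB adopted later in
    // this thread) so that the ~50GB rollover size is the practical bound on segment size.
    static LogByteSizeMergePolicy effectivelyUnbounded() {
        LogByteSizeMergePolicy p = new LogByteSizeMergePolicy();
        p.setMaxMergeMB(100 * 1024);
        return p;
    }
}
```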
I like the 3rd option, unless there are practical limits we could run into (heap and memory primarily). We may still want some "desired segment size" for tail merging, but that can be added as part of that effort (if we implement it). Given that it is now a factor of 16, it seems very likely that with any limit we will be doing several smaller merges to fit under the indicated roof. Compared to the 5GB limit we have now, we should not see 16 of those being merged together. Correct me if I am wrong, but removing the roof is thus unlikely to lead to more actual merging than with the 5GB limit?
I suppose we might be able to ease the merging a bit with a 1GB roof. Perhaps it's worth trying out what the amplification would be in the two cases?
There are no practical limits that I can think of and I agree that there shouldn't be more merging than what we're seeing now because 5GB is already within one mergeFactor of the rollover size of 50GB.
@henningandersen Here's a video that shows the difference between max segment size = 1GB and an unbounded max segment size: https://photos.app.goo.gl/cyWZeTmthhWwC3fZ9.
Sorry I forgot to mention that this video was created with 300kB flushes on average rather than 1MB.
It would be good to have a new video then comparing 5GB max segment size to unbounded - with same flush size. Just to verify our mental model around the write amplification effect of the max segment size.
I wonder if we'd still want something like 100GB max segment size as protection against users using a non-default rollover size or hiccups in ILM and the rollover process.
Here's a video that compares 5GB to unbounded with `TieredMergePolicy`: https://photos.app.goo.gl/JEDJdGFVcUAbXaKP9, and here with `LogByteSizeMergePolicy` (mergeFactor=16, same parameters as `TieredMergePolicy` otherwise): https://photos.app.goo.gl/cp3my5RzEbTHrAXSA. Very similar numbers of segments and write amplification, but the bigger segments are bigger when the max segment size is unbounded. You'd need to ingest more than 50GB into an index to start seeing a bigger difference.
This looks good to me, I think a 100GB (or 50GB) roof should be our new default with log byte size merge policy.
I pushed a change that sets a roof of 100GB on time-based data.
Sorry, this comment was about the first video that I posted, not the code. It adds about 50k segments to an index in sequence, checks after each segment whether merges need to run, and then looks at the number of segments in the index. The first video computed the number of segments as the average number of segments since the beginning. The second video uses a moving window of 1k segment flushes.
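For what it's worth, the windowing part of that measurement could look like the following sketch (my own illustration, not the actual simulation code): a moving average of segment counts over the last 1,000 flushes instead of an average since the beginning.

```java
import java.util.ArrayDeque;

class MovingSegmentCountAverage {
    private final ArrayDeque<Integer> window = new ArrayDeque<>();
    private final int windowSize;
    private long sum;

    MovingSegmentCountAverage(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Record the segment count observed after one simulated flush + merge round. */
    void observe(int segmentCount) {
        window.addLast(segmentCount);
        sum += segmentCount;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();
        }
    }

    double average() {
        return window.isEmpty() ? 0 : (double) sum / window.size();
    }

    public static void main(String[] args) {
        MovingSegmentCountAverage avg = new MovingSegmentCountAverage(1_000);
        // The real experiment flushes ~50k small segments and lets the merge policy run after
        // each one; here we only feed placeholder segment counts to show the windowing.
        for (int i = 0; i < 50_000; i++) {
            avg.observe(10 + (i % 7));
        }
        System.out.println("moving average over the last 1k flushes: " + avg.average());
    }
}
```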
I also tested this change on the TSDB track to get another data point. Similar number of segments in the end, but a lower merge count. I'm unclear as to why the merge time is the same; I need to dig.
(Force-pushed 8185866 to 7cd99d6.)
LGTM.
I don't understand why my team was pinged for a code owner review here; it would be great if someone could look into why that is happening. Could be due to the force push.
I'm approving so as not to block this PR, seeing that it's already approved.
@miltonhultgren I did something wrong in a local rebase which I only noticed after pushing, and that looks like what caused this ping for a code review, though I don't fully understand why. Sorry for the annoyance.
This is a follow-up to #92684. #92684 switched from `TieredMergePolicy` to `LogByteSizeMergePolicy` for time-based data, trying to retain similar numbers of segments in shards. This change goes further, and takes advantage of the fact that adjacent segment merging gives segments (mostly) non-overlapping time ranges, to reduce merging overhead without hurting the efficiency of range queries on the timestamp field. In general the trade-off of this change is that it yields:
- Faster ingestion thanks to reduced merging overhead.
- Similar performance for range queries on the timestamp field.
- Very slightly degraded performance of term queries due to the increased number of segments. This should be hardly noticeable in most cases.
- Possibly degraded performance of fuzzy and wildcard queries, as well as range queries on fields other than the timestamp field.
`TieredMergePolicy` is better than `LogByteSizeMergePolicy` at computing cheaper merges, but it does so by allowing itself to merge non-adjacent segments and by delaying merges a bit until it has collected more segments to pick from. An important property we get when only merging adjacent segments and data gets indexed in time order is that segments have non-overlapping time ranges. This means that a range query on the time field will only partially match at most 2 segments, and other segments will either fully match or not match at all. This also means that for segments that partially match a range query, the matching docs will likely be clustered together in a sequential range of doc IDs, i.e. there will be good data locality.
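To illustrate the non-overlapping-ranges property, here is a small standalone Lucene sketch (my own example, assuming the timestamp is indexed as a `LongPoint` named `@timestamp` and using example range bounds) that classifies each segment of an index against a range: with time-ordered ingestion and adjacent-only merging, at most two segments should land in the PARTIAL bucket.

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class SegmentRangeOverlap {
    enum Match { FULL, PARTIAL, NONE }

    public static void main(String[] args) throws Exception {
        long queryMin = 1_672_531_200_000L; // example range bounds (epoch millis)
        long queryMax = 1_675_209_600_000L;
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
            for (LeafReaderContext ctx : reader.leaves()) {
                PointValues points = ctx.reader().getPointValues("@timestamp");
                if (points == null) {
                    continue; // segment has no timestamps
                }
                long segMin = LongPoint.decodeDimension(points.getMinPackedValue(), 0);
                long segMax = LongPoint.decodeDimension(points.getMaxPackedValue(), 0);
                final Match match;
                if (segMax < queryMin || segMin > queryMax) {
                    match = Match.NONE;    // disjoint: the whole segment can be skipped
                } else if (segMin >= queryMin && segMax <= queryMax) {
                    match = Match.FULL;    // fully contained: every doc with a timestamp matches
                } else {
                    match = Match.PARTIAL; // needs per-doc evaluation; matches are clustered by time
                }
                System.out.println("segment " + ctx.ord + " [" + segMin + ".." + segMax + "]: " + match);
            }
        }
    }
}
```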