
Use LogByteSizeMergePolicy instead of TieredMergePolicy for time-based data. #92684

Merged
jpountz merged 17 commits into elastic:main from adjacent_merges_data_streams on Feb 9, 2023

Conversation


@jpountz jpountz commented Jan 4, 2023

TieredMergePolicy is better than LogByteSizeMergePolicy at computing cheaper merges, but it does so by allowing itself to merge non-adjacent segments and by delaying merges a bit until it has collected more segments to pick from. An important property we get when only merging adjacent segments and indexing data in time order is that segments have non-overlapping time ranges. This means that a range query on the time field will partially match at most two segments, while every other segment will either fully match or not match at all. It also means that for segments that partially match a range query, the matching docs will likely be clustered together in a sequential range of doc IDs, i.e. there will be good data locality.

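For context, at the Lucene level the switch amounts to picking a different `MergePolicy` implementation on the `IndexWriterConfig`. The sketch below only illustrates the two policies involved; Elasticsearch wires this up internally, and the parameter values shown are illustrative rather than the defaults chosen by this PR.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicySketch {

    // Hypothetical helper: pick a merge policy depending on whether the index is time-based.
    static IndexWriterConfig configFor(boolean timeBasedIndex) {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        if (timeBasedIndex) {
            // LogByteSizeMergePolicy only merges adjacent segments, which preserves
            // non-overlapping time ranges when documents are indexed in time order.
            LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
            mergePolicy.setMergeFactor(16);       // illustrative value
            mergePolicy.setMaxMergeMB(5 * 1024);  // max size of a segment considered for merging
            config.setMergePolicy(mergePolicy);
        } else {
            // TieredMergePolicy may merge non-adjacent segments to find cheaper merges.
            TieredMergePolicy mergePolicy = new TieredMergePolicy();
            mergePolicy.setSegmentsPerTier(10);
            mergePolicy.setMaxMergedSegmentMB(5 * 1024);  // approximate max size of a merged segment
            config.setMergePolicy(mergePolicy);
        }
        return config;
    }
}
```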
@jpountz jpountz added >enhancement :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v8.7.0 labels Jan 4, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jan 4, 2023
@elasticsearchmachine
Collaborator

Hi @jpountz, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@henningandersen henningandersen (Contributor) left a comment

This looks good, left a couple of comments.

    }

    void setMaxMergedSegment(ByteSizeValue maxMergedSegment) {
        mergePolicy.setMaxMergedSegmentMB(maxMergedSegment.getMbFrac());
        tieredMergePolicy.setMaxMergedSegmentMB(maxMergedSegment.getMbFrac());
        // Note: max merge MB has different semantics on LogByteSizeMergePolicy: it's the maximum size for a segment to be considered for a
Contributor

I think this opens us up to degenerate cases at both ends, either now stopping at 2.5GB or going up to 25GB. I imagine this being provoked by different input styles (document size, frequency, etc.), i.e. some data streams would work well (hitting around 5GB) whereas others may suffer from worse search or indexing performance (due to the larger merges)?

I wonder if we could adapt the log merge policy to be closer to the tiered merge policy here. I.e., if it can merge together 2 adjacent segments but not 3, then merge the 2?

Contributor Author

Thanks for your question, it made me check the actual LogByteSizeMergePolicy behavior, and my comment is actually not right: LogByteSizeMergePolicy has worked the way you describe since May last year via apache/lucene#935, which introduced that behavior as part of fixing another problem with this merge policy. The fact that this merge policy packs segments together is even tested.

    @Override
    MergePolicy getMergePolicy(MergePolicyConfig config, boolean isTimeSeriesIndex) {
        if (isTimeSeriesIndex) {
            // TieredMergePolicy is better than LogByteSizeMergePolicy at computing cheaper merges, but it does so by allowing
Contributor

This is formulated as if it could, in practice, be (substantially) more expensive to use the log merge policy; perhaps we can elaborate here on why this is unlikely to be the case for data streams?

Contributor Author

Agreed, I pushed an update that makes it sound like a better trade-off.


jpountz commented Jan 9, 2023

I ran two Rally runs on the NYC taxis dataset with the following options to compare ingestion rates: /rally race --track=nyc_taxis --challenge=append-no-conflicts-index-only --preserve-install --car=4gheap. In both cases, the segment count is significantly lower in the end, which suggests that LogByteSizeMergePolicy merges more aggressively. I need to dig more into why.

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                        Metric |   Task |         Baseline |        Contender |        Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|-------:|-----------------:|-----------------:|------------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |        |    111.81        |    111.056       |    -0.75413 |    min |   -0.67% |
|             Min cumulative indexing time across primary shard |        |    111.81        |    111.056       |    -0.75413 |    min |   -0.67% |
|          Median cumulative indexing time across primary shard |        |    111.81        |    111.056       |    -0.75413 |    min |   -0.67% |
|             Max cumulative indexing time across primary shard |        |    111.81        |    111.056       |    -0.75413 |    min |   -0.67% |
|           Cumulative indexing throttle time of primary shards |        |      0           |      0           |     0       |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |        |      0           |      0           |     0       |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |        |      0           |      0           |     0       |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |        |      0           |      0           |     0       |    min |    0.00% |
|                       Cumulative merge time of primary shards |        |     40.9411      |     43.5569      |     2.61578 |    min |   +6.39% |
|                      Cumulative merge count of primary shards |        |     99           |    106           |     7       |        |   +7.07% |
|                Min cumulative merge time across primary shard |        |     40.9411      |     43.5569      |     2.61578 |    min |   +6.39% |
|             Median cumulative merge time across primary shard |        |     40.9411      |     43.5569      |     2.61578 |    min |   +6.39% |
|                Max cumulative merge time across primary shard |        |     40.9411      |     43.5569      |     2.61578 |    min |   +6.39% |
|              Cumulative merge throttle time of primary shards |        |      4.63548     |      2.23013     |    -2.40535 |    min |  -51.89% |
|       Min cumulative merge throttle time across primary shard |        |      4.63548     |      2.23013     |    -2.40535 |    min |  -51.89% |
|    Median cumulative merge throttle time across primary shard |        |      4.63548     |      2.23013     |    -2.40535 |    min |  -51.89% |
|       Max cumulative merge throttle time across primary shard |        |      4.63548     |      2.23013     |    -2.40535 |    min |  -51.89% |
|                     Cumulative refresh time of primary shards |        |      1.5524      |      1.75107     |     0.19867 |    min |  +12.80% |
|                    Cumulative refresh count of primary shards |        |     72           |     74           |     2       |        |   +2.78% |
|              Min cumulative refresh time across primary shard |        |      1.5524      |      1.75107     |     0.19867 |    min |  +12.80% |
|           Median cumulative refresh time across primary shard |        |      1.5524      |      1.75107     |     0.19867 |    min |  +12.80% |
|              Max cumulative refresh time across primary shard |        |      1.5524      |      1.75107     |     0.19867 |    min |  +12.80% |
|                       Cumulative flush time of primary shards |        |      2.26853     |      2.25813     |    -0.0104  |    min |   -0.46% |
|                      Cumulative flush count of primary shards |        |     31           |     32           |     1       |        |   +3.23% |
|                Min cumulative flush time across primary shard |        |      2.26853     |      2.25813     |    -0.0104  |    min |   -0.46% |
|             Median cumulative flush time across primary shard |        |      2.26853     |      2.25813     |    -0.0104  |    min |   -0.46% |
|                Max cumulative flush time across primary shard |        |      2.26853     |      2.25813     |    -0.0104  |    min |   -0.46% |
|                                       Total Young Gen GC time |        |     34.426       |     35.662       |     1.236   |      s |   +3.59% |
|                                      Total Young Gen GC count |        |   2247           |   2334           |    87       |        |   +3.87% |
|                                         Total Old Gen GC time |        |      0           |      0           |     0       |      s |    0.00% |
|                                        Total Old Gen GC count |        |      0           |      0           |     0       |        |    0.00% |
|                                                    Store size |        |     33.7444      |     26.326       |    -7.41844 |     GB |  -21.98% |
|                                                 Translog size |        |      5.12227e-08 |      5.12227e-08 |     0       |     GB |    0.00% |
|                                        Heap used for segments |        |      0           |      0           |     0       |     MB |    0.00% |
|                                      Heap used for doc values |        |      0           |      0           |     0       |     MB |    0.00% |
|                                           Heap used for terms |        |      0           |      0           |     0       |     MB |    0.00% |
|                                           Heap used for norms |        |      0           |      0           |     0       |     MB |    0.00% |
|                                          Heap used for points |        |      0           |      0           |     0       |     MB |    0.00% |
|                                   Heap used for stored fields |        |      0           |      0           |     0       |     MB |    0.00% |
|                                                 Segment count |        |     47           |     28           |   -19       |        |  -40.43% |
|                                   Total Ingest Pipeline count |        |      0           |      0           |     0       |        |    0.00% |
|                                    Total Ingest Pipeline time |        |      0           |      0           |     0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |        |      0           |      0           |     0       |        |    0.00% |
|                                                Min Throughput |  index | 149073           | 142754           | -6319.44    | docs/s |   -4.24% |
|                                               Mean Throughput |  index | 153128           | 146763           | -6365.24    | docs/s |   -4.16% |
|                                             Median Throughput |  index | 152939           | 146332           | -6606.65    | docs/s |   -4.32% |
|                                                Max Throughput |  index | 155999           | 149679           | -6320.11    | docs/s |   -4.05% |
|                                       50th percentile latency |  index |    427.853       |    439.047       |    11.1942  |     ms |   +2.62% |
|                                       90th percentile latency |  index |    516.156       |    561.514       |    45.3582  |     ms |   +8.79% |
|                                       99th percentile latency |  index |   2460.01        |   2598           |   137.99    |     ms |   +5.61% |
|                                     99.9th percentile latency |  index |   4652.12        |   4815.94        |   163.819   |     ms |   +3.52% |
|                                    99.99th percentile latency |  index |   5583.57        |   5876.35        |   292.782   |     ms |   +5.24% |
|                                      100th percentile latency |  index |   5859.37        |   7330.19        |  1470.82    |     ms |  +25.10% |
|                                  50th percentile service time |  index |    427.853       |    439.047       |    11.1942  |     ms |   +2.62% |
|                                  90th percentile service time |  index |    516.156       |    561.514       |    45.3582  |     ms |   +8.79% |
|                                  99th percentile service time |  index |   2460.01        |   2598           |   137.99    |     ms |   +5.61% |
|                                99.9th percentile service time |  index |   4652.12        |   4815.94        |   163.819   |     ms |   +3.52% |
|                               99.99th percentile service time |  index |   5583.57        |   5876.35        |   292.782   |     ms |   +5.24% |
|                                 100th percentile service time |  index |   5859.37        |   7330.19        |  1470.82    |     ms |  +25.10% |
|                                                    error rate |  index |      0           |      0           |     0       |      % |    0.00% |

|                                                        Metric |   Task |         Baseline |        Contender |       Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|-------:|-----------------:|-----------------:|-----------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |        |    107.493       |    107.775       |    0.28148 |    min |   +0.26% |
|             Min cumulative indexing time across primary shard |        |    107.493       |    107.775       |    0.28148 |    min |   +0.26% |
|          Median cumulative indexing time across primary shard |        |    107.493       |    107.775       |    0.28148 |    min |   +0.26% |
|             Max cumulative indexing time across primary shard |        |    107.493       |    107.775       |    0.28148 |    min |   +0.26% |
|           Cumulative indexing throttle time of primary shards |        |      0           |      0           |    0       |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |        |      0           |      0           |    0       |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |        |      0           |      0           |    0       |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |        |      0           |      0           |    0       |    min |    0.00% |
|                       Cumulative merge time of primary shards |        |     37.9238      |     39.9655      |    2.04177 |    min |   +5.38% |
|                      Cumulative merge count of primary shards |        |     95           |     97           |    2       |        |   +2.11% |
|                Min cumulative merge time across primary shard |        |     37.9238      |     39.9655      |    2.04177 |    min |   +5.38% |
|             Median cumulative merge time across primary shard |        |     37.9238      |     39.9655      |    2.04177 |    min |   +5.38% |
|                Max cumulative merge time across primary shard |        |     37.9238      |     39.9655      |    2.04177 |    min |   +5.38% |
|              Cumulative merge throttle time of primary shards |        |      3.38232     |      2.8477      |   -0.53462 |    min |  -15.81% |
|       Min cumulative merge throttle time across primary shard |        |      3.38232     |      2.8477      |   -0.53462 |    min |  -15.81% |
|    Median cumulative merge throttle time across primary shard |        |      3.38232     |      2.8477      |   -0.53462 |    min |  -15.81% |
|       Max cumulative merge throttle time across primary shard |        |      3.38232     |      2.8477      |   -0.53462 |    min |  -15.81% |
|                     Cumulative refresh time of primary shards |        |      1.91515     |      1.86215     |   -0.053   |    min |   -2.77% |
|                    Cumulative refresh count of primary shards |        |     71           |     71           |    0       |        |    0.00% |
|              Min cumulative refresh time across primary shard |        |      1.91515     |      1.86215     |   -0.053   |    min |   -2.77% |
|           Median cumulative refresh time across primary shard |        |      1.91515     |      1.86215     |   -0.053   |    min |   -2.77% |
|              Max cumulative refresh time across primary shard |        |      1.91515     |      1.86215     |   -0.053   |    min |   -2.77% |
|                       Cumulative flush time of primary shards |        |      2.98818     |      2.65518     |   -0.333   |    min |  -11.14% |
|                      Cumulative flush count of primary shards |        |     29           |     29           |    0       |        |    0.00% |
|                Min cumulative flush time across primary shard |        |      2.98818     |      2.65518     |   -0.333   |    min |  -11.14% |
|             Median cumulative flush time across primary shard |        |      2.98818     |      2.65518     |   -0.333   |    min |  -11.14% |
|                Max cumulative flush time across primary shard |        |      2.98818     |      2.65518     |   -0.333   |    min |  -11.14% |
|                                       Total Young Gen GC time |        |     35.336       |     35.723       |    0.387   |      s |   +1.10% |
|                                      Total Young Gen GC count |        |   2279           |   2328           |   49       |        |   +2.15% |
|                                         Total Old Gen GC time |        |      0           |      0           |    0       |      s |    0.00% |
|                                        Total Old Gen GC count |        |      0           |      0           |    0       |        |    0.00% |
|                                                    Store size |        |     28.2845      |     27.7215      |   -0.56301 |     GB |   -1.99% |
|                                                 Translog size |        |      5.12227e-08 |      5.12227e-08 |    0       |     GB |    0.00% |
|                                        Heap used for segments |        |      0           |      0           |    0       |     MB |    0.00% |
|                                      Heap used for doc values |        |      0           |      0           |    0       |     MB |    0.00% |
|                                           Heap used for terms |        |      0           |      0           |    0       |     MB |    0.00% |
|                                           Heap used for norms |        |      0           |      0           |    0       |     MB |    0.00% |
|                                          Heap used for points |        |      0           |      0           |    0       |     MB |    0.00% |
|                                   Heap used for stored fields |        |      0           |      0           |    0       |     MB |    0.00% |
|                                                 Segment count |        |     44           |     19           |  -25       |        |  -56.82% |
|                                   Total Ingest Pipeline count |        |      0           |      0           |    0       |        |    0.00% |
|                                    Total Ingest Pipeline time |        |      0           |      0           |    0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |        |      0           |      0           |    0       |        |    0.00% |
|                                                Min Throughput |  index | 142391           | 145147           | 2755.87    | docs/s |   +1.94% |
|                                               Mean Throughput |  index | 146888           | 148094           | 1205.95    | docs/s |   +0.82% |
|                                             Median Throughput |  index | 146300           | 147390           | 1090.09    | docs/s |   +0.75% |
|                                                Max Throughput |  index | 151926           | 152498           |  571.718   | docs/s |   +0.38% |
|                                       50th percentile latency |  index |    434.912       |    429.054       |   -5.85768 |     ms |   -1.35% |
|                                       90th percentile latency |  index |    588.696       |    565.785       |  -22.9107  |     ms |   -3.89% |
|                                       99th percentile latency |  index |   2536.5         |   2623.29        |   86.7887  |     ms |   +3.42% |
|                                     99.9th percentile latency |  index |   4399.82        |   4526.09        |  126.27    |     ms |   +2.87% |
|                                    99.99th percentile latency |  index |   6521.79        |   5968.96        | -552.833   |     ms |   -8.48% |
|                                      100th percentile latency |  index |   6912.31        |   6465.72        | -446.591   |     ms |   -6.46% |
|                                  50th percentile service time |  index |    434.912       |    429.054       |   -5.85768 |     ms |   -1.35% |
|                                  90th percentile service time |  index |    588.696       |    565.785       |  -22.9107  |     ms |   -3.89% |
|                                  99th percentile service time |  index |   2536.5         |   2623.29        |   86.7887  |     ms |   +3.42% |
|                                99.9th percentile service time |  index |   4399.82        |   4526.09        |  126.27    |     ms |   +2.87% |
|                               99.99th percentile service time |  index |   6521.79        |   5968.96        | -552.833   |     ms |   -8.48% |
|                                 100th percentile service time |  index |   6912.31        |   6465.72        | -446.591   |     ms |   -6.46% |
|                                                    error rate |  index |      0           |      0           |    0       |      % |    0.00% |


jpountz commented Jan 9, 2023

For reference, here are the final segment sizes in both cases:

TieredMergePolicy:

5.1gb
5.1gb
4.1gb
3.9gb
3.1gb
799.8mb
584mb
326.1mb
197.1mb
193.6mb
177.7mb
138.3mb
116.2mb
100.3mb
91.9mb
91.3mb
83.5mb
66.9mb
57.5mb
8.2mb
8.3mb
7mb
6.3mb
5.1mb
3.9mb
3.4mb
2.7mb
813kb
776.9kb

LogByteSizeMergePolicy:

4.8gb
4.2gb
4gb
2.1gb
2.1gb
2gb
1.8gb
1.2gb
1gb
104.7mb
100.1mb
65.2mb
55.5mb
51.2mb
41.9mb
30.7mb
26.1mb
9mb
1.7mb


jpountz commented Jan 9, 2023

I dug into this, and it looks like an expected property of LogByteSizeMergePolicy:

  • TieredMergePolicy computes a budget of number of segments given the total size of the shard, and computes the cheapest merge to run if the current number exceeds this budget.
  • LogByteSizeMergePolicy returns a merge as soon as it can find a sequence of mergeFactor segments where maxSegmentSize / minSegmentSize <= mergeFactor^0.75. For the default merge factor of 10, the threshold is ~5.6.

So, for example, in the final segment structure we get with TieredMergePolicy, LogByteSizeMergePolicy would merge the segments whose sizes are listed below, because they are adjacent and 326.1/83.5 ≈ 3.9 < 10^0.75 ≈ 5.6.

326.1mb
197.1mb
193.6mb
177.7mb
138.3mb
116.2mb
100.3mb
91.9mb
91.3mb
83.5mb

I ran some simulations using Lucene's BaseMergePolicyTestCase to see how this translates in terms of write amplification, and the increase looks modest overall. This seems to be confirmed by the fact that we're seeing similar merge counts and times.
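For illustration, here is a minimal, self-contained sketch of the selection rule described above, applied to the ten adjacent mid-sized segments listed earlier. It is not the actual Lucene implementation, just the maxSize/minSize <= mergeFactor^0.75 criterion on a window of adjacent segments:

```java
import java.util.List;

public class LogByteSizeMergeRuleSketch {

    /**
     * Simplified version of the rule described above: a window of mergeFactor
     * adjacent segments is merged as soon as maxSize / minSize <= mergeFactor^0.75.
     */
    static boolean wouldMerge(List<Double> windowSizesMb, int mergeFactor) {
        if (windowSizesMb.size() < mergeFactor) {
            return false;
        }
        double max = windowSizesMb.stream().mapToDouble(Double::doubleValue).max().orElseThrow();
        double min = windowSizesMb.stream().mapToDouble(Double::doubleValue).min().orElseThrow();
        return max / min <= Math.pow(mergeFactor, 0.75);
    }

    public static void main(String[] args) {
        // The 10 adjacent mid-sized segments from the TieredMergePolicy run above, in MB.
        List<Double> window = List.of(326.1, 197.1, 193.6, 177.7, 138.3, 116.2, 100.3, 91.9, 91.3, 83.5);
        // Prints true: 326.1 / 83.5 is about 3.9, which is below 10^0.75, about 5.6.
        System.out.println(wouldMerge(window, 10));
    }
}
```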


jpountz commented Jan 13, 2023

Another idea I've been wanting to play with is merging segments less aggressively for time-based data. So I ran the NYC taxis indexing benchmark again with LogByteSizeMergePolicy and a number of segments per tier equal to 16. The benchmark reports 62 merges that took 38 minutes overall, and 23 segments in the end. Interestingly, all three numbers are equal to or lower than what I was getting with TieredMergePolicy and the default of 10 segments per tier (~95 merges, ~38 minutes spent merging, ~45 segments).


jpountz commented Jan 23, 2023

New iteration: instead of trying to reuse TieredMergePolicy's "segments per tier" configuration option to configure the "merge factor" of LogByteSizeMergePolicy, I introduced a separate "merge factor" setting, which I defaulted to 20. Here is what Rally gives: the final segment count is in the same ballpark, fewer merges ran overall, and a similar amount of time was spent merging:

|                                                        Metric |   Task |         Baseline |        Contender |       Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|-------:|-----------------:|-----------------:|-----------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |        |    110.922       |    107.054       |   -3.8684  |    min |   -3.49% |
|             Min cumulative indexing time across primary shard |        |    110.922       |    107.054       |   -3.8684  |    min |   -3.49% |
|          Median cumulative indexing time across primary shard |        |    110.922       |    107.054       |   -3.8684  |    min |   -3.49% |
|             Max cumulative indexing time across primary shard |        |    110.922       |    107.054       |   -3.8684  |    min |   -3.49% |
|           Cumulative indexing throttle time of primary shards |        |      0           |      0           |    0       |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |        |      0           |      0           |    0       |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |        |      0           |      0           |    0       |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |        |      0           |      0           |    0       |    min |    0.00% |
|                       Cumulative merge time of primary shards |        |     40.5229      |     41.7068      |    1.18392 |    min |   +2.92% |
|                      Cumulative merge count of primary shards |        |     99           |     56           |  -43       |        |  -43.43% |
|                Min cumulative merge time across primary shard |        |     40.5229      |     41.7068      |    1.18392 |    min |   +2.92% |
|             Median cumulative merge time across primary shard |        |     40.5229      |     41.7068      |    1.18392 |    min |   +2.92% |
|                Max cumulative merge time across primary shard |        |     40.5229      |     41.7068      |    1.18392 |    min |   +2.92% |
|              Cumulative merge throttle time of primary shards |        |      4.17575     |      4.5495      |    0.37375 |    min |   +8.95% |
|       Min cumulative merge throttle time across primary shard |        |      4.17575     |      4.5495      |    0.37375 |    min |   +8.95% |
|    Median cumulative merge throttle time across primary shard |        |      4.17575     |      4.5495      |    0.37375 |    min |   +8.95% |
|       Max cumulative merge throttle time across primary shard |        |      4.17575     |      4.5495      |    0.37375 |    min |   +8.95% |
|                     Cumulative refresh time of primary shards |        |      1.5781      |      1.79573     |    0.21763 |    min |  +13.79% |
|                    Cumulative refresh count of primary shards |        |     72           |     84           |   12       |        |  +16.67% |
|              Min cumulative refresh time across primary shard |        |      1.5781      |      1.79573     |    0.21763 |    min |  +13.79% |
|           Median cumulative refresh time across primary shard |        |      1.5781      |      1.79573     |    0.21763 |    min |  +13.79% |
|              Max cumulative refresh time across primary shard |        |      1.5781      |      1.79573     |    0.21763 |    min |  +13.79% |
|                       Cumulative flush time of primary shards |        |      1.92293     |      2.7306      |    0.80767 |    min |  +42.00% |
|                      Cumulative flush count of primary shards |        |     30           |     40           |   10       |        |  +33.33% |
|                Min cumulative flush time across primary shard |        |      1.92293     |      2.7306      |    0.80767 |    min |  +42.00% |
|             Median cumulative flush time across primary shard |        |      1.92293     |      2.7306      |    0.80767 |    min |  +42.00% |
|                Max cumulative flush time across primary shard |        |      1.92293     |      2.7306      |    0.80767 |    min |  +42.00% |
|                                       Total Young Gen GC time |        |     35.158       |     34.629       |   -0.529   |      s |   -1.50% |
|                                      Total Young Gen GC count |        |   2321           |   2281           |  -40       |        |   -1.72% |
|                                         Total Old Gen GC time |        |      0           |      0           |    0       |      s |    0.00% |
|                                        Total Old Gen GC count |        |      0           |      0           |    0       |        |    0.00% |
|                                                    Store size |        |     34.7863      |     29.8789      |   -4.90745 |     GB |  -14.11% |
|                                                 Translog size |        |      5.12227e-08 |      5.12227e-08 |    0       |     GB |    0.00% |
|                                        Heap used for segments |        |      0           |      0           |    0       |     MB |    0.00% |
|                                      Heap used for doc values |        |      0           |      0           |    0       |     MB |    0.00% |
|                                           Heap used for terms |        |      0           |      0           |    0       |     MB |    0.00% |
|                                           Heap used for norms |        |      0           |      0           |    0       |     MB |    0.00% |
|                                          Heap used for points |        |      0           |      0           |    0       |     MB |    0.00% |
|                                   Heap used for stored fields |        |      0           |      0           |    0       |     MB |    0.00% |
|                                                 Segment count |        |     34           |     30           |   -4       |        |  -11.76% |
|                                   Total Ingest Pipeline count |        |      0           |      0           |    0       |        |    0.00% |
|                                    Total Ingest Pipeline time |        |      0           |      0           |    0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |        |      0           |      0           |    0       |        |    0.00% |
|                                                Min Throughput |  index | 150420           | 149587           | -833.572   | docs/s |   -0.55% |
|                                               Mean Throughput |  index | 153917           | 156610           | 2693.33    | docs/s |   +1.75% |
|                                             Median Throughput |  index | 153556           | 156535           | 2978.36    | docs/s |   +1.94% |
|                                                Max Throughput |  index | 156250           | 160905           | 4654.22    | docs/s |   +2.98% |
|                                       50th percentile latency |  index |    421.743       |    416.996       |   -4.74713 |     ms |   -1.13% |
|                                       90th percentile latency |  index |    506.589       |    526.923       |   20.3344  |     ms |   +4.01% |
|                                       99th percentile latency |  index |   2542.61        |   2476.12        |  -66.4871  |     ms |   -2.61% |
|                                     99.9th percentile latency |  index |   4755.45        |   4653.35        | -102.106   |     ms |   -2.15% |
|                                    99.99th percentile latency |  index |   5662.68        |   5773.84        |  111.153   |     ms |   +1.96% |
|                                      100th percentile latency |  index |   6126.71        |   7736.29        | 1609.58    |     ms |  +26.27% |
|                                  50th percentile service time |  index |    421.743       |    416.996       |   -4.74713 |     ms |   -1.13% |
|                                  90th percentile service time |  index |    506.589       |    526.923       |   20.3344  |     ms |   +4.01% |
|                                  99th percentile service time |  index |   2542.61        |   2476.12        |  -66.4871  |     ms |   -2.61% |
|                                99.9th percentile service time |  index |   4755.45        |   4653.35        | -102.106   |     ms |   -2.15% |
|                               99.99th percentile service time |  index |   5662.68        |   5773.84        |  111.153   |     ms |   +1.96% |
|                                 100th percentile service time |  index |   6126.71        |   7736.29        | 1609.58    |     ms |  +26.27% |
|                                                    error rate |  index |      0           |      0           |    0       |      % |    0.00% |

I'll do more analysis of segment distributions with these two different merge policies.


jpountz commented Jan 23, 2023

I simulated an index that gets loaded with segments whose size is anywhere between 0 and 2MB until it reaches 50GB, with both TieredMergePolicy (today's default) and LogByteSizeMergePolicy with the parameters suggested in this PR, in particular mergeFactor=20. Here's a video that shows the distribution of segment sizes as data gets loaded. For context,

  • amp is the write amplification due to merges, i.e. the number of bytes written to disk due to merges divided by the total index size.
  • segs is the average number of segments in the index since it was created.

Overall I find it very appealing: because LogByteSizeMergePolicy merges segments more aggressively, it allowed significantly increasing the number of segments that get merged together without substantially increasing the number of segments in the index, while massively decreasing write amplification.
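For reference, here is a minimal sketch of how the two metrics above can be tracked in such a simulation. It is an illustration of how the numbers are defined, not the actual harness used for the video:

```java
/** Tracks the two metrics shown in the video: write amplification and segment count. */
class MergeMetrics {
    private long bytesWrittenByMerges;  // bytes rewritten to disk because of merges
    private long segmentCountSum;       // sum of segment counts across point-in-time views
    private long pointInTimeViews;

    void onMerge(long mergedSegmentBytes) {
        bytesWrittenByMerges += mergedSegmentBytes;
    }

    void onPointInTimeView(int segmentCount) {
        segmentCountSum += segmentCount;
        pointInTimeViews++;
    }

    /** "amp": bytes written to disk due to merges divided by the total index size. */
    double writeAmplification(long totalIndexSizeBytes) {
        return (double) bytesWrittenByMerges / totalIndexSizeBytes;
    }

    /** "segs": average number of segments across the point-in-time views seen so far. */
    double averageSegmentCount() {
        return (double) segmentCountSum / pointInTimeViews;
    }
}
```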


jpountz commented Jan 24, 2023

The way the number of segments was computed, as the average since the beginning, gave too much weight to early point-in-time views of the index, so I changed it to be a moving average over the last 1,000 point-in-time views of the index and decreased the merge factor to 16. It's now essentially impossible to tell which one of the two merge policies has more segments on average, while the patch still has a much lower write amplification. Here is a new video.

I've had quite a few iterations on this PR so I'll try to summarize what this change is and the benefits:

  • Only doing adjacent merges helps speed up range queries on the @timestamp field if documents get ingested in (approximate) @timestamp order.
  • This would also play better with things like the frozen tier as documents with a similar @timestamp would naturally get colocated.
  • The new merge policy also has a lower write amplification without increasing the average number of segments in an index in a meaningful way, which should help speed up ingestion.

So what are the downsides?

  • While the average number of segments is approximately the same as before, the variance of the number of segments is going to be a bit higher.
  • The constraint of only merging adjacent segments makes this merge policy bad at handling deletes efficiently.
  • Segments at the max merged segment size (5GB by default) will not be as close to the max size limit as they were before, because the requirement to only merge adjacent segments prevents greedily adding small segments to the merge until it reaches 5GB.

All these downsides feel minor to me, especially in the context of time-based data.

@henningandersen henningandersen (Contributor) left a comment

I am not sure I follow/see this part:

The way the number of segments was computed, as the average since the beginning, gave too much weight to early point-in-time views of the index, so I changed it to be a moving average over the last 1,000 point-in-time views of the index and decreased the merge factor to 16.

Can you elaborate on where this is manifested in the code?

private final Logger logger;
private final boolean mergesEnabled;
private volatile Type mergePolicyType;

public static final double DEFAULT_EXPUNGE_DELETES_ALLOWED = 10d;
public static final ByteSizeValue DEFAULT_FLOOR_SEGMENT = new ByteSizeValue(2, ByteSizeUnit.MB);
public static final int DEFAULT_MAX_MERGE_AT_ONCE = 10;
public static final ByteSizeValue DEFAULT_MAX_MERGED_SEGMENT = new ByteSizeValue(5, ByteSizeUnit.GB);
Contributor

Should we consider raising this default "a bit" when using the log-byte-size policy, to get to a similar segment size as with the tiered merge policy?

I say this because I could imagine us wanting to merge up all the remaining small segments after rollover. And with just below 5GB, that may give 11 segments rather than 10, which seems wasteful. It would be good to have a ~5.1GB average large segment size, to match our default of 50GB shards.

Contributor Author

For the record, TieredMergePolicy may stop merging anywhere between 2.5GB and 5GB (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java#L375-L390). This is less frequent than with LogByteSizeMergePolicy because it can pack small segments together with large segments thanks to its ability to merge non-adjacent segments, but if one of the background merges produces a 4GB segment, then this segment won't get merged further because the merge would mostly consist of rewriting this large segment, which is wasteful.

In my opinion, what we should do depends on what we think is the best practice for time-based indices:

  • If we think that old indices should get force-merged to a single segment, it might make more sense to lower the max merged segment size to something like 1GB in order to save merging while the index is actively being written: since all segments will get rewritten in the end, it doesn't really matter where we end up with background merges, and leaning towards smaller segments would help decrease the merging overhead on ingestion.
  • If we think that it's a better practice to only merge the smaller segments and end up with segments in the order of 5GB on average, as you are suggesting, then it might indeed make more sense to set the max merged segment size to something like 6GB.
  • Or maybe we should not configure a maximum merged segment size at all on time-based indices, because the rollover size already bounds the maximum size of segments. If you have 10 1GB segments, why would you do two 5-segment merges to get two 5GB segments when a single 10-segment merge has roughly the same cost and results in fewer segments?

Contributor

I like the 3rd option, unless there are practical limits we could run into (heap and memory primarily). We may still want some "desired segment size" for tail merging, but that can be added as part of that effort (if we implement it). Given that the merge factor is now 16, it seems very likely that with any limit we would be doing several smaller merges to fit under the indicated roof; compared to the 5GB limit we have now, we should not see 16 of those being merged together. Correct me if I am wrong, but removing the roof is thus unlikely to lead to more actual merging than with the 5GB limit?
I suppose we might be able to ease the merging a bit with a 1GB roof. Perhaps it is worth trying out what the amplification would be in the two cases?

Contributor Author

There are no practical limits that I can think of, and I agree that there shouldn't be more merging than what we're seeing now, because 5GB is already within one mergeFactor of the 50GB rollover size.

Contributor Author

@henningandersen Here's a video that shows the difference between max segment size = 1GB and an unbounded max segment size: https://photos.app.goo.gl/cyWZeTmthhWwC3fZ9.

Contributor Author

Sorry, I forgot to mention that this video was created with 300kB flushes on average rather than 1MB.

Contributor

It would be good to have a new video then, comparing a 5GB max segment size to unbounded with the same flush size, just to verify our mental model around the write amplification effect of the max segment size.
I wonder if we'd still want something like a 100GB max segment size as protection against users using a non-default rollover size or hiccups in ILM and the rollover process.

Contributor Author

Here's a video that compares 5GB to unbounded with TieredMergePolicy: https://photos.app.goo.gl/JEDJdGFVcUAbXaKP9, and here with LogByteSizeMergePolicy (mergeFactor=16, same parameters as TieredMergePolicy otherwise): https://photos.app.goo.gl/cp3my5RzEbTHrAXSA. Very similar numbers of segments and write amplification, but the bigger segments are bigger when the max segment size is unbounded. You'd need to ingest more than 50GB into an index to start seeing a bigger difference.

Contributor

This looks good to me, I think a 100GB (or 50GB) roof should be our new default with log byte size merge policy.

Contributor Author

I pushed a change that sets a roof of 100GB on time-based data.


jpountz commented Jan 25, 2023

Can you elaborate on where this is manifested in the code?

Sorry, this comment was about the first video that I posted, not the code. The simulation adds about 50k segments to an index in sequence, checks after each segment whether merges need to run, and then looks at the number of segments in the index. The first video computed the number of segments as the average since the beginning. The second video uses a moving window over the last 1k segment flushes.
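To make the difference concrete, here is a small sketch of the moving-window average described above (an illustration based on this description, not code from the actual harness):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Average of the segment count over the last windowSize point-in-time views. */
class MovingAverageSegmentCount {
    private final int windowSize;
    private final Deque<Integer> window = new ArrayDeque<>();
    private long sum;

    MovingAverageSegmentCount(int windowSize) {
        this.windowSize = windowSize;
    }

    void onPointInTimeView(int segmentCount) {
        window.addLast(segmentCount);
        sum += segmentCount;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();  // drop the oldest view once the window is full
        }
    }

    double average() {
        return window.isEmpty() ? 0 : (double) sum / window.size();
    }
}
```

With a window size of 1,000, early point-in-time views stop dominating the average once enough flushes have happened.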


jpountz commented Jan 25, 2023

I also tested this change on the TSDB track to get another data point. Similar number of segments in the end, but a lower merge count. I'm unclear why the merge time is the same; I need to dig into it.

|                                                        Metric |   Task |        Baseline |       Contender |        Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|-------:|----------------:|----------------:|------------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |        |   594.616       |   616.01        |    21.3944  |    min |   +3.60% |
|             Min cumulative indexing time across primary shard |        |   594.616       |   616.01        |    21.3944  |    min |   +3.60% |
|          Median cumulative indexing time across primary shard |        |   594.616       |   616.01        |    21.3944  |    min |   +3.60% |
|             Max cumulative indexing time across primary shard |        |   594.616       |   616.01        |    21.3944  |    min |   +3.60% |
|           Cumulative indexing throttle time of primary shards |        |     1.98028     |     1.28493     |    -0.69535 |    min |  -35.11% |
|    Min cumulative indexing throttle time across primary shard |        |     1.98028     |     1.28493     |    -0.69535 |    min |  -35.11% |
| Median cumulative indexing throttle time across primary shard |        |     1.98028     |     1.28493     |    -0.69535 |    min |  -35.11% |
|    Max cumulative indexing throttle time across primary shard |        |     1.98028     |     1.28493     |    -0.69535 |    min |  -35.11% |
|                       Cumulative merge time of primary shards |        |   224.364       |   226.241       |     1.87725 |    min |   +0.84% |
|                      Cumulative merge count of primary shards |        |  1322           |   822           |  -500       |        |  -37.82% |
|                Min cumulative merge time across primary shard |        |   224.364       |   226.241       |     1.87725 |    min |   +0.84% |
|             Median cumulative merge time across primary shard |        |   224.364       |   226.241       |     1.87725 |    min |   +0.84% |
|                Max cumulative merge time across primary shard |        |   224.364       |   226.241       |     1.87725 |    min |   +0.84% |
|              Cumulative merge throttle time of primary shards |        |     0.0394167   |     0.0350833   |    -0.00433 |    min |  -10.99% |
|       Min cumulative merge throttle time across primary shard |        |     0.0394167   |     0.0350833   |    -0.00433 |    min |  -10.99% |
|    Median cumulative merge throttle time across primary shard |        |     0.0394167   |     0.0350833   |    -0.00433 |    min |  -10.99% |
|       Max cumulative merge throttle time across primary shard |        |     0.0394167   |     0.0350833   |    -0.00433 |    min |  -10.99% |
|                     Cumulative refresh time of primary shards |        |     5.40807     |     5.06647     |    -0.3416  |    min |   -6.32% |
|                    Cumulative refresh count of primary shards |        |   309           |   321           |    12       |        |   +3.88% |
|              Min cumulative refresh time across primary shard |        |     5.40807     |     5.06647     |    -0.3416  |    min |   -6.32% |
|           Median cumulative refresh time across primary shard |        |     5.40807     |     5.06647     |    -0.3416  |    min |   -6.32% |
|              Max cumulative refresh time across primary shard |        |     5.40807     |     5.06647     |    -0.3416  |    min |   -6.32% |
|                       Cumulative flush time of primary shards |        |    22.9169      |    21.5721      |    -1.34473 |    min |   -5.87% |
|                      Cumulative flush count of primary shards |        |   289           |   297           |     8       |        |   +2.77% |
|                Min cumulative flush time across primary shard |        |    22.9169      |    21.5721      |    -1.34473 |    min |   -5.87% |
|             Median cumulative flush time across primary shard |        |    22.9169      |    21.5721      |    -1.34473 |    min |   -5.87% |
|                Max cumulative flush time across primary shard |        |    22.9169      |    21.5721      |    -1.34473 |    min |   -5.87% |
|                                       Total Young Gen GC time |        |    62.985       |    69.324       |     6.339   |      s |  +10.06% |
|                                      Total Young Gen GC count |        |  3109           |  3434           |   325       |        |  +10.45% |
|                                         Total Old Gen GC time |        |     0           |     0           |     0       |      s |    0.00% |
|                                        Total Old Gen GC count |        |     0           |     0           |     0       |        |    0.00% |
|                                                    Store size |        |    32.1394      |    31.935       |    -0.20442 |     GB |   -0.64% |
|                                                 Translog size |        |     5.12227e-08 |     5.12227e-08 |     0       |     GB |    0.00% |
|                                        Heap used for segments |        |     0           |     0           |     0       |     MB |    0.00% |
|                                      Heap used for doc values |        |     0           |     0           |     0       |     MB |    0.00% |
|                                           Heap used for terms |        |     0           |     0           |     0       |     MB |    0.00% |
|                                           Heap used for norms |        |     0           |     0           |     0       |     MB |    0.00% |
|                                          Heap used for points |        |     0           |     0           |     0       |     MB |    0.00% |
|                                   Heap used for stored fields |        |     0           |     0           |     0       |     MB |    0.00% |
|                                                 Segment count |        |    36           |    35           |    -1       |        |   -2.78% |
|                                   Total Ingest Pipeline count |        |     0           |     0           |     0       |        |    0.00% |
|                                    Total Ingest Pipeline time |        |     0           |     0           |     0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |        |     0           |     0           |     0       |        |    0.00% |
|                                                Min Throughput |  index | 64763.4         | 64111           |  -652.394   | docs/s |   -1.01% |
|                                               Mean Throughput |  index | 68614.4         | 68457.6         |  -156.766   | docs/s |   -0.23% |
|                                             Median Throughput |  index | 67918.5         | 68519.4         |   600.897   | docs/s |   +0.88% |
|                                                Max Throughput |  index | 74873.7         | 72540.8         | -2332.91    | docs/s |   -3.12% |
|                                       50th percentile latency |  index |  1608.74        |  1662.19        |    53.4449  |     ms |   +3.32% |
|                                       90th percentile latency |  index |  2545.07        |  2330.09        |  -214.978   |     ms |   -8.45% |
|                                       99th percentile latency |  index |  3955.35        |  4626.4         |   671.043   |     ms |  +16.97% |
|                                     99.9th percentile latency |  index |  4756.7         |  6007.28        |  1250.58    |     ms |  +26.29% |
|                                    99.99th percentile latency |  index |  5497.25        |  6334.82        |   837.573   |     ms |  +15.24% |
|                                      100th percentile latency |  index |  5911.13        |  7862.44        |  1951.31    |     ms |  +33.01% |
|                                  50th percentile service time |  index |  1608.74        |  1662.19        |    53.4449  |     ms |   +3.32% |
|                                  90th percentile service time |  index |  2545.07        |  2330.09        |  -214.978   |     ms |   -8.45% |
|                                  99th percentile service time |  index |  3955.35        |  4626.4         |   671.043   |     ms |  +16.97% |
|                                99.9th percentile service time |  index |  4756.7         |  6007.28        |  1250.58    |     ms |  +26.29% |
|                               99.99th percentile service time |  index |  5497.25        |  6334.82        |   837.573   |     ms |  +15.24% |
|                                 100th percentile service time |  index |  5911.13        |  7862.44        |  1951.31    |     ms |  +33.01% |
|                                                    error rate |  index |     0           |     0           |     0       |      % |    0.00% |


cla-checker-service bot commented Feb 2, 2023

💚 CLA has been signed

@jpountz jpountz changed the base branch from main to 8.6 February 2, 2023 10:26
@jpountz jpountz changed the base branch from 8.6 to main February 2, 2023 10:26
@jpountz jpountz force-pushed the adjacent_merges_data_streams branch from 8185866 to 7cd99d6 Compare February 2, 2023 10:33
@henningandersen henningandersen (Contributor) left a comment

LGTM.

@miltonhultgren miltonhultgren (Contributor) left a comment

I don't understand why my team was pinged for a code owner review here; it would be great if someone could look into why that is happening. It could be due to the force push.

I'm approving so as not to block this PR, seeing that it's already approved.

@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023

jpountz commented Feb 9, 2023

@miltonhultgren I did something wrong in a local rebase which I only noticed after pushing, and that looks like what caused this code review ping, though I don't fully understand why. Sorry for the annoyance.

@jpountz jpountz merged commit f55360d into elastic:main Feb 9, 2023
@jpountz jpountz deleted the adjacent_merges_data_streams branch February 9, 2023 09:08
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Feb 25, 2023
This is a follow-up to elastic#92684. elastic#92684 switched from `TieredMergePolicy` to
`LogByteSizeMergePolicy` for time-based data, trying to retain similar numbers
of segments in shards. This change goes further, and takes advantage of the
fact that adjacent segment merging gives segments (mostly) non-overlapping time
ranges, to reduce merging overhead without hurting the efficiency of range
queries on the timestamp field.

In general the trade-off of this change is that it yields:
 - Faster ingestion thanks to reduced merging overhead.
 - Similar performance for range queries on the timestamp field.
 - Very slightly degraded performance of term queries due to the increased
   number of segments. This should be hardly noticeable in most cases.
 - Possibly degraded performance of fuzzy and wildcard queries, as well as range
   queries on fields other than the timestamp field.
elasticsearchmachine pushed a commit that referenced this pull request Mar 1, 2023
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.8.0