
Use LogByteSizeMergePolicy instead of TieredMergePolicy for time-based data. #92684

Merged
jpountz merged 17 commits into elastic:main from adjacent_merges_data_streams on Feb 9, 2023

Conversation


@jpountz jpountz commented Jan 4, 2023

TieredMergePolicy is better than LogByteSizeMergePolicy at computing cheaper merges, but it does so by allowing itself to merge non-adjacent segments and by delaying merges a bit until it has collected more segments to pick from. An important property we get when only merging adjacent segments and indexing data in time order is that segments have non-overlapping time ranges. This means that a range query on the time field will partially match at most two segments, while every other segment will either fully match or not match at all. It also means that for segments that partially match a range query, the matching docs will likely be clustered together in a sequential range of doc IDs, i.e. there will be good data locality.

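For context, at the Lucene level the switch amounts to picking a different `MergePolicy` implementation on the `IndexWriterConfig`. The sketch below only illustrates the two policies involved; Elasticsearch wires this up internally, and the parameter values shown are illustrative rather than the defaults chosen by this PR.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicySketch {

    // Hypothetical helper: pick a merge policy depending on whether the index is time-based.
    static IndexWriterConfig configFor(boolean timeBasedIndex) {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        if (timeBasedIndex) {
            // LogByteSizeMergePolicy only merges adjacent segments, which preserves
            // non-overlapping time ranges when documents are indexed in time order.
            LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
            mergePolicy.setMergeFactor(16);       // illustrative value
            mergePolicy.setMaxMergeMB(5 * 1024);  // max size of a segment considered for merging
            config.setMergePolicy(mergePolicy);
        } else {
            // TieredMergePolicy may merge non-adjacent segments to find cheaper merges.
            TieredMergePolicy mergePolicy = new TieredMergePolicy();
            mergePolicy.setSegmentsPerTier(10);
            mergePolicy.setMaxMergedSegmentMB(5 * 1024);  // approximate max size of a merged segment
            config.setMergePolicy(mergePolicy);
        }
        return config;
    }
}
```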
@jpountz jpountz added >enhancement :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v8.7.0 labels Jan 4, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jan 4, 2023
@elasticsearchmachine
Collaborator

Hi @jpountz, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@henningandersen henningandersen (Contributor) left a comment

This looks good, left a couple of comments.

    }

    void setMaxMergedSegment(ByteSizeValue maxMergedSegment) {
        mergePolicy.setMaxMergedSegmentMB(maxMergedSegment.getMbFrac());
        tieredMergePolicy.setMaxMergedSegmentMB(maxMergedSegment.getMbFrac());
        // Note: max merge MB has different semantics on LogByteSizeMergePolicy: it's the maximum size for a segment to be considered for a
Contributor

I think this opens us up to degenerate cases at both ends, either now stopping at 2.5GB or going up to 25GB. I imagine this being provoked by different input styles (document size, frequency, etc.), i.e. some data streams would work well (hitting around 5GB) whereas others may suffer from worse search or indexing performance (due to the larger merges)?

I wonder if we could adapt the log merge policy to be closer to the tiered merge policy here. I.e., if it can merge together 2 adjacent segments but not 3, then merge the 2?

Contributor Author

Thanks for your question, it made me check the actual LogByteSizeMergePolicy behavior, and my comment is actually not right: LogByteSizeMergePolicy has worked the way you describe since May last year via apache/lucene#935, which introduced that behavior as part of fixing another problem with this merge policy. The fact that this merge policy packs segments together is even tested.

    @Override
    MergePolicy getMergePolicy(MergePolicyConfig config, boolean isTimeSeriesIndex) {
        if (isTimeSeriesIndex) {
            // TieredMergePolicy is better than LogByteSizeMergePolicy at computing cheaper merges, but it does so by allowing
Contributor

This is formulated as if it could, in practice, be (substantially) more expensive to use the log merge policy; perhaps we can elaborate here on why this is unlikely to be the case for data streams?

Contributor Author

Agreed, I pushed an update that makes it sound like a better trade-off.


jpountz commented Jan 9, 2023

I ran two Rally runs on the NYC taxis dataset with the following options to compare ingestion rates: /rally race --track=nyc_taxis --challenge=append-no-conflicts-index-only --preserve-install --car=4gheap. In both cases, the segment count is significantly lower in the end, which suggests that LogByteSizeMergePolicy merges more aggressively. I need to dig more into why.

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                        Metric |   Task |         Baseline |        Contender |        Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|-------:|-----------------:|-----------------:|------------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |        |    111.81        |    111.056       |    -0.75413 |    min |   -0.67% |
|             Min cumulative indexing time across primary shard |        |    111.81        |    111.056       |    -0.75413 |    min |   -0.67% |
|          Median cumulative indexing time across primary shard |        |    111.81        |    111.056       |    -0.75413 |    min |   -0.67% |
|             Max cumulative indexing time across primary shard |        |    111.81        |    111.056       |    -0.75413 |    min |   -0.67% |
|           Cumulative indexing throttle time of primary shards |        |      0           |      0           |     0       |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |        |      0           |      0           |     0       |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |        |      0           |      0           |     0       |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |        |      0           |      0           |     0       |    min |    0.00% |
|                       Cumulative merge time of primary shards |        |     40.9411      |     43.5569      |     2.61578 |    min |   +6.39% |
|                      Cumulative merge count of primary shards |        |     99           |    106           |     7       |        |   +7.07% |
|                Min cumulative merge time across primary shard |        |     40.9411      |     43.5569      |     2.61578 |    min |   +6.39% |
|             Median cumulative merge time across primary shard |        |     40.9411      |     43.5569      |     2.61578 |    min |   +6.39% |
|                Max cumulative merge time across primary shard |        |     40.9411      |     43.5569      |     2.61578 |    min |   +6.39% |
|              Cumulative merge throttle time of primary shards |        |      4.63548     |      2.23013     |    -2.40535 |    min |  -51.89% |
|       Min cumulative merge throttle time across primary shard |        |      4.63548     |      2.23013     |    -2.40535 |    min |  -51.89% |
|    Median cumulative merge throttle time across primary shard |        |      4.63548     |      2.23013     |    -2.40535 |    min |  -51.89% |
|       Max cumulative merge throttle time across primary shard |        |      4.63548     |      2.23013     |    -2.40535 |    min |  -51.89% |
|                     Cumulative refresh time of primary shards |        |      1.5524      |      1.75107     |     0.19867 |    min |  +12.80% |
|                    Cumulative refresh count of primary shards |        |     72           |     74           |     2       |        |   +2.78% |
|              Min cumulative refresh time across primary shard |        |      1.5524      |      1.75107     |     0.19867 |    min |  +12.80% |
|           Median cumulative refresh time across primary shard |        |      1.5524      |      1.75107     |     0.19867 |    min |  +12.80% |
|              Max cumulative refresh time across primary shard |        |      1.5524      |      1.75107     |     0.19867 |    min |  +12.80% |
|                       Cumulative flush time of primary shards |        |      2.26853     |      2.25813     |    -0.0104  |    min |   -0.46% |
|                      Cumulative flush count of primary shards |        |     31           |     32           |     1       |        |   +3.23% |
|                Min cumulative flush time across primary shard |        |      2.26853     |      2.25813     |    -0.0104  |    min |   -0.46% |
|             Median cumulative flush time across primary shard |        |      2.26853     |      2.25813     |    -0.0104  |    min |   -0.46% |
|                Max cumulative flush time across primary shard |        |      2.26853     |      2.25813     |    -0.0104  |    min |   -0.46% |
|                                       Total Young Gen GC time |        |     34.426       |     35.662       |     1.236   |      s |   +3.59% |
|                                      Total Young Gen GC count |        |   2247           |   2334           |    87       |        |   +3.87% |
|                                         Total Old Gen GC time |        |      0           |      0           |     0       |      s |    0.00% |
|                                        Total Old Gen GC count |        |      0           |      0           |     0       |        |    0.00% |
|                                                    Store size |        |     33.7444      |     26.326       |    -7.41844 |     GB |  -21.98% |
|                                                 Translog size |        |      5.12227e-08 |      5.12227e-08 |     0       |     GB |    0.00% |
|                                        Heap used for segments |        |      0           |      0           |     0       |     MB |    0.00% |
|                                      Heap used for doc values |        |      0           |      0           |     0       |     MB |    0.00% |
|                                           Heap used for terms |        |      0           |      0           |     0       |     MB |    0.00% |
|                                           Heap used for norms |        |      0           |      0           |     0       |     MB |    0.00% |
|                                          Heap used for points |        |      0           |      0           |     0       |     MB |    0.00% |
|                                   Heap used for stored fields |        |      0           |      0           |     0       |     MB |    0.00% |
|                                                 Segment count |        |     47           |     28           |   -19       |        |  -40.43% |
|                                   Total Ingest Pipeline count |        |      0           |      0           |     0       |        |    0.00% |
|                                    Total Ingest Pipeline time |        |      0           |      0           |     0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |        |      0           |      0           |     0       |        |    0.00% |
|                                                Min Throughput |  index | 149073           | 142754           | -6319.44    | docs/s |   -4.24% |
|                                               Mean Throughput |  index | 153128           | 146763           | -6365.24    | docs/s |   -4.16% |
|                                             Median Throughput |  index | 152939           | 146332           | -6606.65    | docs/s |   -4.32% |
|                                                Max Throughput |  index | 155999           | 149679           | -6320.11    | docs/s |   -4.05% |
|                                       50th percentile latency |  index |    427.853       |    439.047       |    11.1942  |     ms |   +2.62% |
|                                       90th percentile latency |  index |    516.156       |    561.514       |    45.3582  |     ms |   +8.79% |
|                                       99th percentile latency |  index |   2460.01        |   2598           |   137.99    |     ms |   +5.61% |
|                                     99.9th percentile latency |  index |   4652.12        |   4815.94        |   163.819   |     ms |   +3.52% |
|                                    99.99th percentile latency |  index |   5583.57        |   5876.35        |   292.782   |     ms |   +5.24% |
|                                      100th percentile latency |  index |   5859.37        |   7330.19        |  1470.82    |     ms |  +25.10% |
|                                  50th percentile service time |  index |    427.853       |    439.047       |    11.1942  |     ms |   +2.62% |
|                                  90th percentile service time |  index |    516.156       |    561.514       |    45.3582  |     ms |   +8.79% |
|                                  99th percentile service time |  index |   2460.01        |   2598           |   137.99    |     ms |   +5.61% |
|                                99.9th percentile service time |  index |   4652.12        |   4815.94        |   163.819   |     ms |   +3.52% |
|                               99.99th percentile service time |  index |   5583.57        |   5876.35        |   292.782   |     ms |   +5.24% |
|                                 100th percentile service time |  index |   5859.37        |   7330.19        |  1470.82    |     ms |  +25.10% |
|                                                    error rate |  index |      0           |      0           |     0       |      % |    0.00% |

|                                                        Metric |   Task |         Baseline |        Contender |       Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|-------:|-----------------:|-----------------:|-----------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |        |    107.493       |    107.775       |    0.28148 |    min |   +0.26% |
|             Min cumulative indexing time across primary shard |        |    107.493       |    107.775       |    0.28148 |    min |   +0.26% |
|          Median cumulative indexing time across primary shard |        |    107.493       |    107.775       |    0.28148 |    min |   +0.26% |
|             Max cumulative indexing time across primary shard |        |    107.493       |    107.775       |    0.28148 |    min |   +0.26% |
|           Cumulative indexing throttle time of primary shards |        |      0           |      0           |    0       |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |        |      0           |      0           |    0       |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |        |      0           |      0           |    0       |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |        |      0           |      0           |    0       |    min |    0.00% |
|                       Cumulative merge time of primary shards |        |     37.9238      |     39.9655      |    2.04177 |    min |   +5.38% |
|                      Cumulative merge count of primary shards |        |     95           |     97           |    2       |        |   +2.11% |
|                Min cumulative merge time across primary shard |        |     37.9238      |     39.9655      |    2.04177 |    min |   +5.38% |
|             Median cumulative merge time across primary shard |        |     37.9238      |     39.9655      |    2.04177 |    min |   +5.38% |
|                Max cumulative merge time across primary shard |        |     37.9238      |     39.9655      |    2.04177 |    min |   +5.38% |
|              Cumulative merge throttle time of primary shards |        |      3.38232     |      2.8477      |   -0.53462 |    min |  -15.81% |
|       Min cumulative merge throttle time across primary shard |        |      3.38232     |      2.8477      |   -0.53462 |    min |  -15.81% |
|    Median cumulative merge throttle time across primary shard |        |      3.38232     |      2.8477      |   -0.53462 |    min |  -15.81% |
|       Max cumulative merge throttle time across primary shard |        |      3.38232     |      2.8477      |   -0.53462 |    min |  -15.81% |
|                     Cumulative refresh time of primary shards |        |      1.91515     |      1.86215     |   -0.053   |    min |   -2.77% |
|                    Cumulative refresh count of primary shards |        |     71           |     71           |    0       |        |    0.00% |
|              Min cumulative refresh time across primary shard |        |      1.91515     |      1.86215     |   -0.053   |    min |   -2.77% |
|           Median cumulative refresh time across primary shard |        |      1.91515     |      1.86215     |   -0.053   |    min |   -2.77% |
|              Max cumulative refresh time across primary shard |        |      1.91515     |      1.86215     |   -0.053   |    min |   -2.77% |
|                       Cumulative flush time of primary shards |        |      2.98818     |      2.65518     |   -0.333   |    min |  -11.14% |
|                      Cumulative flush count of primary shards |        |     29           |     29           |    0       |        |    0.00% |
|                Min cumulative flush time across primary shard |        |      2.98818     |      2.65518     |   -0.333   |    min |  -11.14% |
|             Median cumulative flush time across primary shard |        |      2.98818     |      2.65518     |   -0.333   |    min |  -11.14% |
|                Max cumulative flush time across primary shard |        |      2.98818     |      2.65518     |   -0.333   |    min |  -11.14% |
|                                       Total Young Gen GC time |        |     35.336       |     35.723       |    0.387   |      s |   +1.10% |
|                                      Total Young Gen GC count |        |   2279           |   2328           |   49       |        |   +2.15% |
|                                         Total Old Gen GC time |        |      0           |      0           |    0       |      s |    0.00% |
|                                        Total Old Gen GC count |        |      0           |      0           |    0       |        |    0.00% |
|                                                    Store size |        |     28.2845      |     27.7215      |   -0.56301 |     GB |   -1.99% |
|                                                 Translog size |        |      5.12227e-08 |      5.12227e-08 |    0       |     GB |    0.00% |
|                                        Heap used for segments |        |      0           |      0           |    0       |     MB |    0.00% |
|                                      Heap used for doc values |        |      0           |      0           |    0       |     MB |    0.00% |
|                                           Heap used for terms |        |      0           |      0           |    0       |     MB |    0.00% |
|                                           Heap used for norms |        |      0           |      0           |    0       |     MB |    0.00% |
|                                          Heap used for points |        |      0           |      0           |    0       |     MB |    0.00% |
|                                   Heap used for stored fields |        |      0           |      0           |    0       |     MB |    0.00% |
|                                                 Segment count |        |     44           |     19           |  -25       |        |  -56.82% |
|                                   Total Ingest Pipeline count |        |      0           |      0           |    0       |        |    0.00% |
|                                    Total Ingest Pipeline time |        |      0           |      0           |    0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |        |      0           |      0           |    0       |        |    0.00% |
|                                                Min Throughput |  index | 142391           | 145147           | 2755.87    | docs/s |   +1.94% |
|                                               Mean Throughput |  index | 146888           | 148094           | 1205.95    | docs/s |   +0.82% |
|                                             Median Throughput |  index | 146300           | 147390           | 1090.09    | docs/s |   +0.75% |
|                                                Max Throughput |  index | 151926           | 152498           |  571.718   | docs/s |   +0.38% |
|                                       50th percentile latency |  index |    434.912       |    429.054       |   -5.85768 |     ms |   -1.35% |
|                                       90th percentile latency |  index |    588.696       |    565.785       |  -22.9107  |     ms |   -3.89% |
|                                       99th percentile latency |  index |   2536.5         |   2623.29        |   86.7887  |     ms |   +3.42% |
|                                     99.9th percentile latency |  index |   4399.82        |   4526.09        |  126.27    |     ms |   +2.87% |
|                                    99.99th percentile latency |  index |   6521.79        |   5968.96        | -552.833   |     ms |   -8.48% |
|                                      100th percentile latency |  index |   6912.31        |   6465.72        | -446.591   |     ms |   -6.46% |
|                                  50th percentile service time |  index |    434.912       |    429.054       |   -5.85768 |     ms |   -1.35% |
|                                  90th percentile service time |  index |    588.696       |    565.785       |  -22.9107  |     ms |   -3.89% |
|                                  99th percentile service time |  index |   2536.5         |   2623.29        |   86.7887  |     ms |   +3.42% |
|                                99.9th percentile service time |  index |   4399.82        |   4526.09        |  126.27    |     ms |   +2.87% |
|                               99.99th percentile service time |  index |   6521.79        |   5968.96        | -552.833   |     ms |   -8.48% |
|                                 100th percentile service time |  index |   6912.31        |   6465.72        | -446.591   |     ms |   -6.46% |
|                                                    error rate |  index |      0           |      0           |    0       |      % |    0.00% |


jpountz commented Jan 9, 2023

For reference, here are the final segment sizes in both cases:

TieredMergePolicy:

5.1gb
5.1gb
4.1gb
3.9gb
3.1gb
799.8mb
584mb
326.1mb
197.1mb
193.6mb
177.7mb
138.3mb
116.2mb
100.3mb
91.9mb
91.3mb
83.5mb
66.9mb
57.5mb
8.2mb
8.3mb
7mb
6.3mb
5.1mb
3.9mb
3.4mb
2.7mb
813kb
776.9kb

LogByteSizeMergePolicy:

4.8gb
4.2gb
4gb
2.1gb
2.1gb
2gb
1.8gb
1.2gb
1gb
104.7mb
100.1mb
65.2mb
55.5mb
51.2mb
41.9mb
30.7mb
26.1mb
9mb
1.7mb


jpountz commented Jan 9, 2023

I dug into this, and it looks like an expected property of LogByteSizeMergePolicy:

  • TieredMergePolicy computes a budget of number of segments given the total size of the shard, and computes the cheapest merge to run if the current number exceeds this budget.
  • LogByteSizeMergePolicy returns a merge as soon as it can find a sequence of mergeFactor segments where maxSegmentSize / minSegmentSize <= mergeFactor^0.75. For the default merge factor of 10, the threshold is ~5.6.

So, for example, in the final segment structure we get with TieredMergePolicy, LogByteSizeMergePolicy would merge the segments whose sizes are listed below, because they are adjacent and 326.1/83.5 ≈ 3.9 < 10^0.75 ≈ 5.6.

326.1mb
197.1mb
193.6mb
177.7mb
138.3mb
116.2mb
100.3mb
91.9mb
91.3mb
83.5mb

I ran some simulations using Lucene's BaseMergePolicyTestCase to see how this translates in terms of write amplification, and the increase looks modest overall. This seems to be confirmed by the fact that we're seeing similar merge counts and times.
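For illustration, here is a minimal, self-contained sketch of the selection rule described above, applied to the ten adjacent mid-sized segments listed earlier. It is not the actual Lucene implementation, just the maxSize/minSize <= mergeFactor^0.75 criterion on a window of adjacent segments:

```java
import java.util.List;

public class LogByteSizeMergeRuleSketch {

    /**
     * Simplified version of the rule described above: a window of mergeFactor
     * adjacent segments is merged as soon as maxSize / minSize <= mergeFactor^0.75.
     */
    static boolean wouldMerge(List<Double> windowSizesMb, int mergeFactor) {
        if (windowSizesMb.size() < mergeFactor) {
            return false;
        }
        double max = windowSizesMb.stream().mapToDouble(Double::doubleValue).max().orElseThrow();
        double min = windowSizesMb.stream().mapToDouble(Double::doubleValue).min().orElseThrow();
        return max / min <= Math.pow(mergeFactor, 0.75);
    }

    public static void main(String[] args) {
        // The 10 adjacent mid-sized segments from the TieredMergePolicy run above, in MB.
        List<Double> window = List.of(326.1, 197.1, 193.6, 177.7, 138.3, 116.2, 100.3, 91.9, 91.3, 83.5);
        // Prints true: 326.1 / 83.5 is about 3.9, which is below 10^0.75, about 5.6.
        System.out.println(wouldMerge(window, 10));
    }
}
```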


jpountz commented Jan 13, 2023

Another idea I've been wanting to play with is merging segments less aggressively for time-based data. So I ran the NYC taxis indexing benchmark again with LogByteSizeMergePolicy and a number of segments per tier equal to 16. The benchmark reports 62 merges that took 38 minutes overall, and 23 segments in the end. Interestingly, all three numbers are equal to or lower than what I was getting with TieredMergePolicy and the default of 10 segments per tier (~95 merges, ~38 minutes spent merging, ~45 segments).


jpountz commented Jan 23, 2023

New iteration: instead of trying to reuse TieredMergePolicy's "segments per tier" configuration option to configure the "merge factor" of LogByteSizeMergePolicy, I introduced a separate "merge factor" setting, which I defaulted to 20. Here is what Rally gives: the final segment count is in the same ballpark, fewer merges ran overall, and a similar amount of time was spent merging:

|                                                        Metric |   Task |         Baseline |        Contender |       Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|-------:|-----------------:|-----------------:|-----------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |        |    110.922       |    107.054       |   -3.8684  |    min |   -3.49% |
|             Min cumulative indexing time across primary shard |        |    110.922       |    107.054       |   -3.8684  |    min |   -3.49% |
|          Median cumulative indexing time across primary shard |        |    110.922       |    107.054       |   -3.8684  |    min |   -3.49% |
|             Max cumulative indexing time across primary shard |        |    110.922       |    107.054       |   -3.8684  |    min |   -3.49% |
|           Cumulative indexing throttle time of primary shards |        |      0           |      0           |    0       |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |        |      0           |      0           |    0       |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |        |      0           |      0           |    0       |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |        |      0           |      0           |    0       |    min |    0.00% |
|                       Cumulative merge time of primary shards |        |     40.5229      |     41.7068      |    1.18392 |    min |   +2.92% |
|                      Cumulative merge count of primary shards |        |     99           |     56           |  -43       |        |  -43.43% |
|                Min cumulative merge time across primary shard |        |     40.5229      |     41.7068      |    1.18392 |    min |   +2.92% |
|             Median cumulative merge time across primary shard |        |     40.5229      |     41.7068      |    1.18392 |    min |   +2.92% |
|                Max cumulative merge time across primary shard |        |     40.5229      |     41.7068      |    1.18392 |    min |   +2.92% |
|              Cumulative merge throttle time of primary shards |        |      4.17575     |      4.5495      |    0.37375 |    min |   +8.95% |
|       Min cumulative merge throttle time across primary shard |        |      4.17575     |      4.5495      |    0.37375 |    min |   +8.95% |
|    Median cumulative merge throttle time across primary shard |        |      4.17575     |      4.5495      |    0.37375 |    min |   +8.95% |
|       Max cumulative merge throttle time across primary shard |        |      4.17575     |      4.5495      |    0.37375 |    min |   +8.95% |
|                     Cumulative refresh time of primary shards |        |      1.5781      |      1.79573     |    0.21763 |    min |  +13.79% |
|                    Cumulative refresh count of primary shards |        |     72           |     84           |   12       |        |  +16.67% |
|              Min cumulative refresh time across primary shard |        |      1.5781      |      1.79573     |    0.21763 |    min |  +13.79% |
|           Median cumulative refresh time across primary shard |        |      1.5781      |      1.79573     |    0.21763 |    min |  +13.79% |
|              Max cumulative refresh time across primary shard |        |      1.5781      |      1.79573     |    0.21763 |    min |  +13.79% |
|                       Cumulative flush time of primary shards |        |      1.92293     |      2.7306      |    0.80767 |    min |  +42.00% |
|                      Cumulative flush count of primary shards |        |     30           |     40           |   10       |        |  +33.33% |
|                Min cumulative flush time across primary shard |        |      1.92293     |      2.7306      |    0.80767 |    min |  +42.00% |
|             Median cumulative flush time across primary shard |        |      1.92293     |      2.7306      |    0.80767 |    min |  +42.00% |
|                Max cumulative flush time across primary shard |        |      1.92293     |      2.7306      |    0.80767 |    min |  +42.00% |
|                                       Total Young Gen GC time |        |     35.158       |     34.629       |   -0.529   |      s |   -1.50% |
|                                      Total Young Gen GC count |        |   2321           |   2281           |  -40       |        |   -1.72% |
|                                         Total Old Gen GC time |        |      0           |      0           |    0       |      s |    0.00% |
|                                        Total Old Gen GC count |        |      0           |      0           |    0       |        |    0.00% |
|                                                    Store size |        |     34.7863      |     29.8789      |   -4.90745 |     GB |  -14.11% |
|                                                 Translog size |        |      5.12227e-08 |      5.12227e-08 |    0       |     GB |    0.00% |
|                                        Heap used for segments |        |      0           |      0           |    0       |     MB |    0.00% |
|                                      Heap used for doc values |        |      0           |      0           |    0       |     MB |    0.00% |
|                                           Heap used for terms |        |      0           |      0           |    0       |     MB |    0.00% |
|                                           Heap used for norms |        |      0           |      0           |    0       |     MB |    0.00% |
|                                          Heap used for points |        |      0           |      0           |    0       |     MB |    0.00% |
|                                   Heap used for stored fields |        |      0           |      0           |    0       |     MB |    0.00% |
|                                                 Segment count |        |     34           |     30           |   -4       |        |  -11.76% |
|                                   Total Ingest Pipeline count |        |      0           |      0           |    0       |        |    0.00% |
|                                    Total Ingest Pipeline time |        |      0           |      0           |    0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |        |      0           |      0           |    0       |        |    0.00% |
|                                                Min Throughput |  index | 150420           | 149587           | -833.572   | docs/s |   -0.55% |
|                                               Mean Throughput |  index | 153917           | 156610           | 2693.33    | docs/s |   +1.75% |
|                                             Median Throughput |  index | 153556           | 156535           | 2978.36    | docs/s |   +1.94% |
|                                                Max Throughput |  index | 156250           | 160905           | 4654.22    | docs/s |   +2.98% |
|                                       50th percentile latency |  index |    421.743       |    416.996       |   -4.74713 |     ms |   -1.13% |
|                                       90th percentile latency |  index |    506.589       |    526.923       |   20.3344  |     ms |   +4.01% |
|                                       99th percentile latency |  index |   2542.61        |   2476.12        |  -66.4871  |     ms |   -2.61% |
|                                     99.9th percentile latency |  index |   4755.45        |   4653.35        | -102.106   |     ms |   -2.15% |
|                                    99.99th percentile latency |  index |   5662.68        |   5773.84        |  111.153   |     ms |   +1.96% |
|                                      100th percentile latency |  index |   6126.71        |   7736.29        | 1609.58    |     ms |  +26.27% |
|                                  50th percentile service time |  index |    421.743       |    416.996       |   -4.74713 |     ms |   -1.13% |
|                                  90th percentile service time |  index |    506.589       |    526.923       |   20.3344  |     ms |   +4.01% |
|                                  99th percentile service time |  index |   2542.61        |   2476.12        |  -66.4871  |     ms |   -2.61% |
|                                99.9th percentile service time |  index |   4755.45        |   4653.35        | -102.106   |     ms |   -2.15% |
|                               99.99th percentile service time |  index |   5662.68        |   5773.84        |  111.153   |     ms |   +1.96% |
|                                 100th percentile service time |  index |   6126.71        |   7736.29        | 1609.58    |     ms |  +26.27% |
|                                                    error rate |  index |      0           |      0           |    0       |      % |    0.00% |

I'll do more analysis of segment distributions with these two different merge policies.


jpountz commented Jan 23, 2023

I simulated an index that gets loaded with segments whose size is anywhere between 0 and 2MB until it reaches 50GB, with both TieredMergePolicy (today's default) and LogByteSizeMergePolicy with the parameters suggested in this PR, in particular mergeFactor=20. Here's a video that shows the distribution of segment sizes as data gets loaded. For context,

  • amp is the write amplification due to merges, i.e. the number of bytes written to disk due to merges divided by the total index size.
  • segs is the average number of segments in the index since it was created.

Overall I find it very appealing: because LogByteSizeMergePolicy merges segments more aggressively, it allowed significantly increasing the number of segments that get merged together without substantially increasing the number of segments in the index, while massively decreasing write amplification.
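For reference, here is a minimal sketch of how the two metrics above can be tracked in such a simulation. It is an illustration of how the numbers are defined, not the actual harness used for the video:

```java
/** Tracks the two metrics shown in the video: write amplification and segment count. */
class MergeMetrics {
    private long bytesWrittenByMerges;  // bytes rewritten to disk because of merges
    private long segmentCountSum;       // sum of segment counts across point-in-time views
    private long pointInTimeViews;

    void onMerge(long mergedSegmentBytes) {
        bytesWrittenByMerges += mergedSegmentBytes;
    }

    void onPointInTimeView(int segmentCount) {
        segmentCountSum += segmentCount;
        pointInTimeViews++;
    }

    /** "amp": bytes written to disk due to merges divided by the total index size. */
    double writeAmplification(long totalIndexSizeBytes) {
        return (double) bytesWrittenByMerges / totalIndexSizeBytes;
    }

    /** "segs": average number of segments across the point-in-time views seen so far. */
    double averageSegmentCount() {
        return (double) segmentCountSum / pointInTimeViews;
    }
}
```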


jpountz commented Jan 24, 2023

The way the number of segments was computed, as the average since the beginning, gave too much weight to early point-in-time views of the index, so I changed it to be a moving average over the last 1,000 point-in-time views of the index and decreased the merge factor to 16. It's now essentially impossible to tell which one of the two merge policies has more segments on average, while the patch still has a much lower write amplification. Here is a new video.

I've had quite a few iterations on this PR so I'll try to summarize what this change is and the benefits:

  • Only doing adjacent merges helps speed up range queries on the @timestamp field if documents get ingested in (approximate) @timestamp order.
  • This would also play better with things like the frozen tier as documents with a similar @timestamp would naturally get colocated.
  • The new merge policy also has a lower write amplification without increasing the average number of segments in an index in a meaningful way, which should help speed up ingestion.

So what are the downsides?

  • While the average number of segments is approximately the same as before, the variance of the number of segments is going to be a bit higher.
  • The constraint of only merging adjacent segments makes this merge policy bad at handling deletes efficiently.
  • Segments at the max merged segment size (5GB by default) will not be as close to the max size limit as they were before, because the requirement to only merge adjacent segments prevents greedily adding small segments to the merge until it reaches 5GB.

All these downsides feel minor to me, especially in the context of time-based data.

@henningandersen henningandersen (Contributor) left a comment

I am not sure I follow/see this part:

The way the number of segments was computed, as the average since the beginning, gave too much weight to early point-in-time views of the index, so I changed it to be a moving average over the last 1,000 point-in-time views of the index and decreased the merge factor to 16.

Can you elaborate on where this is manifested in the code?

private final Logger logger;
private final boolean mergesEnabled;
private volatile Type mergePolicyType;

public static final double DEFAULT_EXPUNGE_DELETES_ALLOWED = 10d;
public static final ByteSizeValue DEFAULT_FLOOR_SEGMENT = new ByteSizeValue(2, ByteSizeUnit.MB);
public static final int DEFAULT_MAX_MERGE_AT_ONCE = 10;
public static final ByteSizeValue DEFAULT_MAX_MERGED_SEGMENT = new ByteSizeValue(5, ByteSizeUnit.GB);
Contributor

Should we consider raising this default "a bit" when using the log-byte-size policy, to get to a similar segment size as with the tiered merge policy?

I say this because I could imagine us wanting to merge up all the remaining small segments after rollover. And with just below 5GB, that may give 11 segments rather than 10, which seems wasteful. It would be good to have a ~5.1GB average large segment size, to match our default of 50GB shards.

Contributor Author

For the record, TieredMergePolicy may stop merging anywhere between 2.5GB and 5GB (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java#L375-L390). This is less frequent than with LogByteSizeMergePolicy because it can pack small segments together with large segments thanks to its ability to merge non-adjacent segments, but if one of the background merges produces a 4GB segment, then this segment won't get merged further because the merge would mostly consist of rewriting this large segment, which is wasteful.

In my opinion, what we should do depends on what we think is the best practice for time-based indices:

  • If we think that old indices should get force-merged to a single segment, it might make more sense to lower the max merged segment size to something like 1GB in order to save merging while the index is actively being written: since all segments will get rewritten in the end, it doesn't really matter where we end up with background merges, and leaning towards smaller segments would help decrease the merging overhead on ingestion.
  • If we think that it's a better practice to only merge the smaller segments and end up with segments in the order of 5GB on average, as you are suggesting, then it might indeed make more sense to set the max merged segment size to something like 6GB.
  • Or maybe we should not configure a maximum merged segment size at all on time-based indices, because the rollover size already bounds the maximum size of segments. If you have 10 1GB segments, why would you do two 5-segment merges to get two 5GB segments when a single 10-segment merge has roughly the same cost and results in fewer segments?

Contributor

I like the 3rd option, unless there are practical limits we could run into (heap and memory primarily). We may still want some "desired segment size" for tail merging, but that can be added as part of that effort (if we implement it). Given that the merge factor is now 16, it seems very likely that with any limit we would be doing several smaller merges to fit under the indicated roof; compared to the 5GB limit we have now, we should not see 16 of those being merged together. Correct me if I am wrong, but removing the roof is thus unlikely to lead to more actual merging than with the 5GB limit?
I suppose we might be able to ease the merging a bit with a 1GB roof. Perhaps it is worth trying out what the amplification would be in the two cases?

Contributor Author

There are no practical limits that I can think of, and I agree that there shouldn't be more merging than what we're seeing now, because 5GB is already within one mergeFactor of the 50GB rollover size.

Contributor Author

@henningandersen Here's a video that shows the difference between max segment size = 1GB and an unbounded max segment size: https://photos.app.goo.gl/cyWZeTmthhWwC3fZ9.

Contributor Author

Sorry, I forgot to mention that this video was created with 300kB flushes on average rather than 1MB.

Contributor

It would be good to have a new video then, comparing a 5GB max segment size to unbounded with the same flush size, just to verify our mental model around the write amplification effect of the max segment size.
I wonder if we'd still want something like a 100GB max segment size as protection against users using a non-default rollover size or hiccups in ILM and the rollover process.

Contributor Author

Here's a video that compares 5GB to unbounded with TieredMergePolicy: https://photos.app.goo.gl/JEDJdGFVcUAbXaKP9, and here with LogByteSizeMergePolicy (mergeFactor=16, same parameters as TieredMergePolicy otherwise): https://photos.app.goo.gl/cp3my5RzEbTHrAXSA. Very similar numbers of segments and write amplification, but the bigger segments are bigger when the max segment size is unbounded. You'd need to ingest more than 50GB into an index to start seeing a bigger difference.

Contributor

This looks good to me, I think a 100GB (or 50GB) roof should be our new default with log byte size merge policy.

Contributor Author

I pushed a change that sets a roof of 100GB on time-based data.


jpountz commented Jan 25, 2023

Can you elaborate on where this is manifested in the code?

Sorry, this comment was about the first video that I posted, not the code. The simulation adds about 50k segments to an index in sequence, checks after each segment whether merges need to run, and then looks at the number of segments in the index. The first video computed the number of segments as the average since the beginning. The second video uses a moving window over the last 1k segment flushes.
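To make the difference concrete, here is a small sketch of the moving-window average described above (an illustration based on this description, not code from the actual harness):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Average of the segment count over the last windowSize point-in-time views. */
class MovingAverageSegmentCount {
    private final int windowSize;
    private final Deque<Integer> window = new ArrayDeque<>();
    private long sum;

    MovingAverageSegmentCount(int windowSize) {
        this.windowSize = windowSize;
    }

    void onPointInTimeView(int segmentCount) {
        window.addLast(segmentCount);
        sum += segmentCount;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();  // drop the oldest view once the window is full
        }
    }

    double average() {
        return window.isEmpty() ? 0 : (double) sum / window.size();
    }
}
```

With a window size of 1,000, early point-in-time views stop dominating the average once enough flushes have happened.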


jpountz commented Jan 25, 2023

I also tested this change on the TSDB track to get another data point. Similar number of segments in the end, but a lower merge count. I'm unclear why the merge time is the same; I need to dig into it.

|                                                        Metric |   Task |        Baseline |       Contender |        Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|-------:|----------------:|----------------:|------------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |        |   594.616       |   616.01        |    21.3944  |    min |   +3.60% |
|             Min cumulative indexing time across primary shard |        |   594.616       |   616.01        |    21.3944  |    min |   +3.60% |
|          Median cumulative indexing time across primary shard |        |   594.616       |   616.01        |    21.3944  |    min |   +3.60% |
|             Max cumulative indexing time across primary shard |        |   594.616       |   616.01        |    21.3944  |    min |   +3.60% |
|           Cumulative indexing throttle time of primary shards |        |     1.98028     |     1.28493     |    -0.69535 |    min |  -35.11% |
|    Min cumulative indexing throttle time across primary shard |        |     1.98028     |     1.28493     |    -0.69535 |    min |  -35.11% |
| Median cumulative indexing throttle time across primary shard |        |     1.98028     |     1.28493     |    -0.69535 |    min |  -35.11% |
|    Max cumulative indexing throttle time across primary shard |        |     1.98028     |     1.28493     |    -0.69535 |    min |  -35.11% |
|                       Cumulative merge time of primary shards |        |   224.364       |   226.241       |     1.87725 |    min |   +0.84% |
|                      Cumulative merge count of primary shards |        |  1322           |   822           |  -500       |        |  -37.82% |
|                Min cumulative merge time across primary shard |        |   224.364       |   226.241       |     1.87725 |    min |   +0.84% |
|             Median cumulative merge time across primary shard |        |   224.364       |   226.241       |     1.87725 |    min |   +0.84% |
|                Max cumulative merge time across primary shard |        |   224.364       |   226.241       |     1.87725 |    min |   +0.84% |
|              Cumulative merge throttle time of primary shards |        |     0.0394167   |     0.0350833   |    -0.00433 |    min |  -10.99% |
|       Min cumulative merge throttle time across primary shard |        |     0.0394167   |     0.0350833   |    -0.00433 |    min |  -10.99% |
|    Median cumulative merge throttle time across primary shard |        |     0.0394167   |     0.0350833   |    -0.00433 |    min |  -10.99% |
|       Max cumulative merge throttle time across primary shard |        |     0.0394167   |     0.0350833   |    -0.00433 |    min |  -10.99% |
|                     Cumulative refresh time of primary shards |        |     5.40807     |     5.06647     |    -0.3416  |    min |   -6.32% |
|                    Cumulative refresh count of primary shards |        |   309           |   321           |    12       |        |   +3.88% |
|              Min cumulative refresh time across primary shard |        |     5.40807     |     5.06647     |    -0.3416  |    min |   -6.32% |
|           Median cumulative refresh time across primary shard |        |     5.40807     |     5.06647     |    -0.3416  |    min |   -6.32% |
|              Max cumulative refresh time across primary shard |        |     5.40807     |     5.06647     |    -0.3416  |    min |   -6.32% |
|                       Cumulative flush time of primary shards |        |    22.9169      |    21.5721      |    -1.34473 |    min |   -5.87% |
|                      Cumulative flush count of primary shards |        |   289           |   297           |     8       |        |   +2.77% |
|                Min cumulative flush time across primary shard |        |    22.9169      |    21.5721      |    -1.34473 |    min |   -5.87% |
|             Median cumulative flush time across primary shard |        |    22.9169      |    21.5721      |    -1.34473 |    min |   -5.87% |
|                Max cumulative flush time across primary shard |        |    22.9169      |    21.5721      |    -1.34473 |    min |   -5.87% |
|                                       Total Young Gen GC time |        |    62.985       |    69.324       |     6.339   |      s |  +10.06% |
|                                      Total Young Gen GC count |        |  3109           |  3434           |   325       |        |  +10.45% |
|                                         Total Old Gen GC time |        |     0           |     0           |     0       |      s |    0.00% |
|                                        Total Old Gen GC count |        |     0           |     0           |     0       |        |    0.00% |
|                                                    Store size |        |    32.1394      |    31.935       |    -0.20442 |     GB |   -0.64% |
|                                                 Translog size |        |     5.12227e-08 |     5.12227e-08 |     0       |     GB |    0.00% |
|                                        Heap used for segments |        |     0           |     0           |     0       |     MB |    0.00% |
|                                      Heap used for doc values |        |     0           |     0           |     0       |     MB |    0.00% |
|                                           Heap used for terms |        |     0           |     0           |     0       |     MB |    0.00% |
|                                           Heap used for norms |        |     0           |     0           |     0       |     MB |    0.00% |
|                                          Heap used for points |        |     0           |     0           |     0       |     MB |    0.00% |
|                                   Heap used for stored fields |        |     0           |     0           |     0       |     MB |    0.00% |
|                                                 Segment count |        |    36           |    35           |    -1       |        |   -2.78% |
|                                   Total Ingest Pipeline count |        |     0           |     0           |     0       |        |    0.00% |
|                                    Total Ingest Pipeline time |        |     0           |     0           |     0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |        |     0           |     0           |     0       |        |    0.00% |
|                                                Min Throughput |  index | 64763.4         | 64111           |  -652.394   | docs/s |   -1.01% |
|                                               Mean Throughput |  index | 68614.4         | 68457.6         |  -156.766   | docs/s |   -0.23% |
|                                             Median Throughput |  index | 67918.5         | 68519.4         |   600.897   | docs/s |   +0.88% |
|                                                Max Throughput |  index | 74873.7         | 72540.8         | -2332.91    | docs/s |   -3.12% |
|                                       50th percentile latency |  index |  1608.74        |  1662.19        |    53.4449  |     ms |   +3.32% |
|                                       90th percentile latency |  index |  2545.07        |  2330.09        |  -214.978   |     ms |   -8.45% |
|                                       99th percentile latency |  index |  3955.35        |  4626.4         |   671.043   |     ms |  +16.97% |
|                                     99.9th percentile latency |  index |  4756.7         |  6007.28        |  1250.58    |     ms |  +26.29% |
|                                    99.99th percentile latency |  index |  5497.25        |  6334.82        |   837.573   |     ms |  +15.24% |
|                                      100th percentile latency |  index |  5911.13        |  7862.44        |  1951.31    |     ms |  +33.01% |
|                                  50th percentile service time |  index |  1608.74        |  1662.19        |    53.4449  |     ms |   +3.32% |
|                                  90th percentile service time |  index |  2545.07        |  2330.09        |  -214.978   |     ms |   -8.45% |
|                                  99th percentile service time |  index |  3955.35        |  4626.4         |   671.043   |     ms |  +16.97% |
|                                99.9th percentile service time |  index |  4756.7         |  6007.28        |  1250.58    |     ms |  +26.29% |
|                               99.99th percentile service time |  index |  5497.25        |  6334.82        |   837.573   |     ms |  +15.24% |
|                                 100th percentile service time |  index |  5911.13        |  7862.44        |  1951.31    |     ms |  +33.01% |
|                                                    error rate |  index |     0           |     0           |     0       |      % |    0.00% |


cla-checker-service bot commented Feb 2, 2023

💚 CLA has been signed

@jpountz jpountz changed the base branch from main to 8.6 February 2, 2023 10:26
@jpountz jpountz changed the base branch from 8.6 to main February 2, 2023 10:26
@jpountz jpountz force-pushed the adjacent_merges_data_streams branch from 8185866 to 7cd99d6 Compare February 2, 2023 10:33
@henningandersen henningandersen (Contributor) left a comment

LGTM.

@miltonhultgren miltonhultgren (Contributor) left a comment

I don't understand why my team was pinged for a code owner review here; it would be great if someone could look into why that is happening. It could be due to the force push.

I'm approving so as not to block this PR, seeing that it's already approved.

@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023

jpountz commented Feb 9, 2023

@miltonhultgren I did something wrong in a local rebase which I only noticed after pushing, and that looks like what caused this code review ping, though I don't fully understand why. Sorry for the annoyance.

@jpountz jpountz merged commit f55360d into elastic:main Feb 9, 2023
@jpountz jpountz deleted the adjacent_merges_data_streams branch February 9, 2023 09:08
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Feb 25, 2023
This is a follow-up to elastic#92684. elastic#92684 switched from `TieredMergePolicy` to
`LogByteSizeMergePolicy` for time-based data, trying to retain similar numbers
of segments in shards. This change goes further, and takes advantage of the
fact that adjacent segment merging gives segments (mostly) non-overlapping time
ranges, to reduce merging overhead without hurting the efficiency of range
queries on the timestamp field.

In general the trade-off of this change is that it yields:
 - Faster ingestion thanks to reduced merging overhead.
 - Similar performance for range queries on the timestamp field.
 - Very slightly degraded performance of term queries due to the increased
   number of segments. This should be hardly noticeable in most cases.
 - Possibly degraded performance of fuzzy and wildcard queries, as well as range
   queries on fields other than the timestamp field.
elasticsearchmachine pushed a commit that referenced this pull request Mar 1, 2023
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.8.0