Netty4: switch to composite cumulator #49478
Conversation
The default merge cumulator used in the Netty transport leads to additional GC pressure and memory copying when a message that exceeds the chunk size is handled. This is especially a problem on G1 GC, since we get many "humongous" allocations and that can in theory cause the real memory circuit breaker to break unnecessarily.

Will add performance test details in separate comments.
Pinging @elastic/es-distributed (:Distributed/Network)
Did a few Rally experiments using geonames.
One noteworthy improvement is that with 6 clients and bulk size 200K, G1 could successfully complete the test with the change and failed with a circuit breaker exception without it. A couple of comparisons follow (baseline is 7.4, contender is 7.4 using the composite cumulator). CMS, standard bulk size and clients:
G1, standard bulk size and clients:
LGTM, and it's not even close IMO. We're doing this for HTTP messages anyway, and saving a bunch of memory is worth a lot more than saving some cycles on the IO loop (since we decode the full messages on the IO loop anyway and use bulk ByteBuf operations throughout everything now, this shouldn't be so bad relative to the former, or in absolute terms).
LGTM - as I mentioned when we talked, this is the one that conceptually makes sense for us. We would need evidence AGAINST making this change IMO. Your benchmarks seem to be slightly in favor of making the change.
Our current usage is essentially complete content aggregation, which should definitely be COMPOSITE. If we were only doing small frames or something, then MERGE might make more sense.
Thanks @original-brownbear and @tbrooks8.
I'm sorry, but I'd like to ask where I can see the GC optimization effect in the test report.
I think the Netty chunk size and -XX:G1HeapRegionSize decide whether "humongous objects" are generated, not Netty's composite cumulator. When a large number of bulk requests keep arriving, a large number of "humongous objects" will still be generated.
The MERGE cumulator forces the entire message we are cumulating to be in a single buffer. This means constant resizing and probably reallocating of the buffer. If the message is greater than 16 MB, it will not fit in a recyclable chunk, which means that the message becomes a non-recycled allocation. Using the composite cumulator means that chunks are aggregated in a collection. The allocations should consistently be recycled buffers.
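For illustration, here is a minimal sketch of what switching the cumulator looks like on a Netty `ByteToMessageDecoder`. The decoder class and its frame logic below are hypothetical, not the actual Elasticsearch transport handler; only the `setCumulator(COMPOSITE_CUMULATOR)` call is the relevant part.

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.ByteToMessageDecoder;

import java.util.List;

// Hypothetical decoder, used only to show the cumulator switch.
public class ExampleFrameDecoder extends ByteToMessageDecoder {

    public ExampleFrameDecoder() {
        // COMPOSITE_CUMULATOR links arriving ByteBufs into a CompositeByteBuf
        // instead of copying them into one ever-growing merged buffer, so the
        // pooled chunks stay recyclable even when the message exceeds 16 MB.
        setCumulator(COMPOSITE_CUMULATOR);
    }

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
        // Frame decoding omitted; `in` may now be a CompositeByteBuf, which is
        // why bulk ByteBuf operations are preferred over per-byte access.
    }
}
```

The tradeoff discussed above still applies: reads over a composite buffer can cost a few extra cycles on the IO loop, which is why the Rally runs were done to confirm there was no regression.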
@xjtushilei you cannot see the GC optimization effect of the change in that report. The main reason for performance testing this change was to ensure that performance did not degrade. The GC benefit mainly happens when receiving very large transport requests (or returning similarly large responses). Requests above 16MB (and in particular those that were much larger) would cause additional humongous allocations and garbage, which we now handle better with this change. It should be noted that provoking such large requests or responses is not recommended.
Hi @henningandersen @tbrooks8, thank you for your replies.
My advice is real and useful, and it has been verified: with -XX:G1HeapRegionSize=32M, you can set -Dio.netty.allocator.maxOrder=10, which reduces the Netty chunk size to 8MB, so no more "humongous objects" are generated.
I'm sorry, but maybe the "Netty4: switch to composite cumulator" optimization in Netty doesn't work.
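For reference, a minimal sketch of those two settings together, assuming they are passed as JVM options (for example via config/jvm.options or ES_JAVA_OPTS); the values are the ones suggested above:

```
# 32MB G1 regions -> allocations above 16MB (half a region) count as "humongous"
-XX:G1HeapRegionSize=32M
# Netty chunk size = pageSize (8KB) << maxOrder = 8MB, below the humongous threshold
-Dio.netty.allocator.maxOrder=10
```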
@tbrooks8 @henningandersen
@xjtushilei Our responses were not disagreeing with anything you were saying. We are aware that the default Netty chunk size is a humongous allocation. This PR was about removing a major source of large ad-hoc allocations that are not recycled. We are aware that the 16MB pooled chunks are humongous (forced into dedicated regions). We are still looking into whether this is something we want to mitigate.