
Netty4: switch to composite cumulator #49478

Merged

Conversation

henningandersen

The default merge cumulator used in the netty transport leads to additional GC pressure and memory copying when a message that exceeds the chunk size is handled. This is especially a problem on G1 GC, since we get many "humongous" allocations, which can in theory cause the real-memory circuit breaker to trip unnecessarily.

Will add performance test details in separate comments.

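For illustration, the cumulator is selected on Netty's ByteToMessageDecoder; below is a minimal sketch of that hook. The frame decoder, pipeline name, and frame length used here are illustrative placeholders, not the actual Elasticsearch transport wiring.

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.ByteToMessageDecoder;
import io.netty.handler.codec.LengthFieldBasedFrameDecoder;

// Illustrative channel initializer: a stand-in frame decoder is used here, but
// the cumulator is chosen the same way on any ByteToMessageDecoder subclass.
class TransportChannelInitializer extends ChannelInitializer<SocketChannel> {
    @Override
    protected void initChannel(SocketChannel ch) {
        // Hypothetical frame decoder with a 4-byte length prefix.
        ByteToMessageDecoder decoder =
                new LengthFieldBasedFrameDecoder(Integer.MAX_VALUE, 0, 4);
        // COMPOSITE_CUMULATOR keeps each incoming ByteBuf as a component of a
        // CompositeByteBuf instead of copying everything into one growing
        // contiguous buffer (the default MERGE_CUMULATOR behaviour).
        decoder.setCumulator(ByteToMessageDecoder.COMPOSITE_CUMULATOR);
        ch.pipeline().addLast("frame-decoder", decoder);
    }
}
```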
@elasticmachine

Pinging @elastic/es-distributed (:Distributed/Network)

@henningandersen

Did a few rally experiments using geonames append-no-conflicts-index-only, 4 GB heap, 3 nodes, 2 replicas, using both CMS and G1, all against 7.4. The results look near identical. Summary of tests:

  • Standard bulk size and clients.
  • Very large bulk size (200K) and 1-6 clients (most runs with G1 only).

One noteworthy improvement is that with 6 clients and a bulk size of 200K, G1 could successfully complete the test with the change, but failed with a circuit breaker exception without it.

A couple of comparisons (baseline is 7.4, contender is 7.4 using composite cumulator).

CMS, standard bulk size and clients:

|                                                        Metric |         Task |    Baseline |   Contender |     Diff |   Unit |
|--------------------------------------------------------------:|-------------:|------------:|------------:|---------:|-------:|
|                    Cumulative indexing time of primary shards |              |     28.2774 |     27.9363 | -0.34108 |    min |
|             Min cumulative indexing time across primary shard |              |     5.44917 |     5.53135 |  0.08218 |    min |
|          Median cumulative indexing time across primary shard |              |     5.59793 |     5.55233 |  -0.0456 |    min |
|             Max cumulative indexing time across primary shard |              |     5.85203 |     5.72773 |  -0.1243 |    min |
|           Cumulative indexing throttle time of primary shards |              |           0 |           0 |        0 |    min |
|    Min cumulative indexing throttle time across primary shard |              |           0 |           0 |        0 |    min |
| Median cumulative indexing throttle time across primary shard |              |           0 |           0 |        0 |    min |
|    Max cumulative indexing throttle time across primary shard |              |           0 |           0 |        0 |    min |
|                       Cumulative merge time of primary shards |              |     5.12782 |     5.23317 |  0.10535 |    min |
|                      Cumulative merge count of primary shards |              |          52 |          52 |        0 |        |
|                Min cumulative merge time across primary shard |              |    0.886667 |    0.832667 |   -0.054 |    min |
|             Median cumulative merge time across primary shard |              |     1.05418 |     1.06012 |  0.00593 |    min |
|                Max cumulative merge time across primary shard |              |     1.14142 |     1.22498 |  0.08357 |    min |
|              Cumulative merge throttle time of primary shards |              |     1.49637 |     1.43865 | -0.05772 |    min |
|       Min cumulative merge throttle time across primary shard |              |     0.21995 |     0.17105 |  -0.0489 |    min |
|    Median cumulative merge throttle time across primary shard |              |    0.316267 |      0.2976 | -0.01867 |    min |
|       Max cumulative merge throttle time across primary shard |              |       0.371 |    0.447033 |  0.07603 |    min |
|                     Cumulative refresh time of primary shards |              |     1.34108 |     1.35227 |  0.01118 |    min |
|                    Cumulative refresh count of primary shards |              |         128 |         129 |        1 |        |
|              Min cumulative refresh time across primary shard |              |    0.224733 |     0.24935 |  0.02462 |    min |
|           Median cumulative refresh time across primary shard |              |    0.274933 |     0.27105 | -0.00388 |    min |
|              Max cumulative refresh time across primary shard |              |    0.306917 |    0.281033 | -0.02588 |    min |
|                       Cumulative flush time of primary shards |              |    0.482683 |      0.4599 | -0.02278 |    min |
|                      Cumulative flush count of primary shards |              |           5 |           5 |        0 |        |
|                Min cumulative flush time across primary shard |              |   0.0873833 |   0.0256833 |  -0.0617 |    min |
|             Median cumulative flush time across primary shard |              |   0.0939333 |    0.111017 |  0.01708 |    min |
|                Max cumulative flush time across primary shard |              |      0.1056 |    0.115483 |  0.00988 |    min |
|                                            Total Young Gen GC |              |      32.801 |      34.034 |    1.233 |      s |
|                                              Total Old Gen GC |              |       1.362 |       1.721 |    0.359 |      s |
|                                                    Store size |              |     10.9549 |     11.0767 |   0.1218 |     GB |
|                                                 Translog size |              | 7.68341e-07 | 7.68341e-07 |        0 |     GB |
|                                        Heap used for segments |              |     5.83621 |     5.75584 | -0.08038 |     MB |
|                                      Heap used for doc values |              |   0.0857506 |   0.0898933 |  0.00414 |     MB |
|                                           Heap used for terms |              |     4.46203 |      4.3651 | -0.09692 |     MB |
|                                           Heap used for norms |              |    0.154541 |    0.154663 |  0.00012 |     MB |
|                                          Heap used for points |              |    0.284233 |    0.292837 |   0.0086 |     MB |
|                                   Heap used for stored fields |              |    0.849663 |     0.85334 |  0.00368 |     MB |
|                                                 Segment count |              |         197 |         197 |        0 |        |
|                                                Min Throughput | index-append |     55495.4 |     55185.1 | -310.291 | docs/s |
|                                             Median Throughput | index-append |     57400.6 |     57152.4 | -248.217 | docs/s |
|                                                Max Throughput | index-append |     58010.2 |       57997 | -13.2215 | docs/s |
|                                       50th percentile latency | index-append |     529.814 |     543.091 |  13.2766 |     ms |
|                                       90th percentile latency | index-append |     1203.99 |     1183.37 | -20.6185 |     ms |
|                                       99th percentile latency | index-append |     2363.94 |     2096.08 | -267.865 |     ms |
|                                      100th percentile latency | index-append |     3346.22 |     3446.65 |  100.433 |     ms |
|                                  50th percentile service time | index-append |     529.814 |     543.091 |  13.2766 |     ms |
|                                  90th percentile service time | index-append |     1203.99 |     1183.37 | -20.6185 |     ms |
|                                  99th percentile service time | index-append |     2363.94 |     2096.08 | -267.865 |     ms |
|                                 100th percentile service time | index-append |     3346.22 |     3446.65 |  100.433 |     ms |
|                                                    error rate | index-append |           0 |           0 |        0 |      % |

G1, standard bulk size and clients:

            
|                                                        Metric |         Task |    Baseline |   Contender |     Diff |   Unit |
|--------------------------------------------------------------:|-------------:|------------:|------------:|---------:|-------:|
|                    Cumulative indexing time of primary shards |              |     27.6803 |     28.0672 |  0.38697 |    min |
|             Min cumulative indexing time across primary shard |              |     5.34183 |     5.35855 |  0.01672 |    min |
|          Median cumulative indexing time across primary shard |              |     5.57365 |     5.40825 |  -0.1654 |    min |
|             Max cumulative indexing time across primary shard |              |     5.63128 |       5.969 |  0.33772 |    min |
|           Cumulative indexing throttle time of primary shards |              |           0 |           0 |        0 |    min |
|    Min cumulative indexing throttle time across primary shard |              |           0 |           0 |        0 |    min |
| Median cumulative indexing throttle time across primary shard |              |           0 |           0 |        0 |    min |
|    Max cumulative indexing throttle time across primary shard |              |           0 |           0 |        0 |    min |
|                       Cumulative merge time of primary shards |              |     5.29257 |     5.27422 | -0.01835 |    min |
|                      Cumulative merge count of primary shards |              |          52 |          50 |       -2 |        |
|                Min cumulative merge time across primary shard |              |      0.9769 |    0.965617 | -0.01128 |    min |
|             Median cumulative merge time across primary shard |              |     1.06222 |     1.01505 | -0.04717 |    min |
|                Max cumulative merge time across primary shard |              |     1.13682 |     1.25985 |  0.12303 |    min |
|              Cumulative merge throttle time of primary shards |              |     1.34913 |      1.4916 |  0.14247 |    min |
|       Min cumulative merge throttle time across primary shard |              |    0.219133 |     0.21845 | -0.00068 |    min |
|    Median cumulative merge throttle time across primary shard |              |    0.274583 |    0.269933 | -0.00465 |    min |
|       Max cumulative merge throttle time across primary shard |              |    0.297267 |    0.430717 |  0.13345 |    min |
|                     Cumulative refresh time of primary shards |              |     1.35573 |     1.39133 |   0.0356 |    min |
|                    Cumulative refresh count of primary shards |              |         128 |         124 |       -4 |        |
|              Min cumulative refresh time across primary shard |              |    0.254533 |    0.253667 | -0.00087 |    min |
|           Median cumulative refresh time across primary shard |              |      0.2806 |    0.279967 | -0.00063 |    min |
|              Max cumulative refresh time across primary shard |              |    0.283383 |    0.299233 |  0.01585 |    min |
|                       Cumulative flush time of primary shards |              |    0.417983 |    0.431833 |  0.01385 |    min |
|                      Cumulative flush count of primary shards |              |           5 |           5 |        0 |        |
|                Min cumulative flush time across primary shard |              |   0.0359667 |   0.0365833 |  0.00062 |    min |
|             Median cumulative flush time across primary shard |              |      0.0724 |     0.08775 |  0.01535 |    min |
|                Max cumulative flush time across primary shard |              |     0.12215 |      0.1184 | -0.00375 |    min |
|                                            Total Young Gen GC |              |      22.192 |      25.169 |    2.977 |      s |
|                                              Total Old Gen GC |              |           0 |           0 |        0 |      s |
|                                                    Store size |              |     10.7333 |     11.1128 |  0.37953 |     GB |
|                                                 Translog size |              | 7.68341e-07 | 7.68341e-07 |        0 |     GB |
|                                        Heap used for segments |              |     5.63861 |     5.73557 |  0.09696 |     MB |
|                                      Heap used for doc values |              |   0.0585403 |   0.0794907 |  0.02095 |     MB |
|                                           Heap used for terms |              |     4.29781 |     4.35829 |  0.06048 |     MB |
|                                           Heap used for norms |              |    0.135864 |    0.155579 |  0.01971 |     MB |
|                                          Heap used for points |              |    0.289915 |    0.286085 | -0.00383 |     MB |
|                                   Heap used for stored fields |              |    0.856476 |    0.856117 | -0.00036 |     MB |
|                                                 Segment count |              |         172 |         198 |       26 |        |
|                                                Min Throughput | index-append |     55521.6 |     54407.8 | -1113.73 | docs/s |
|                                             Median Throughput | index-append |     57376.4 |     56037.7 |  -1338.7 | docs/s |
|                                                Max Throughput | index-append |     57989.1 |     56442.4 | -1546.72 | docs/s |
|                                       50th percentile latency | index-append |     520.636 |     546.836 |     26.2 |     ms |
|                                       90th percentile latency | index-append |     1220.95 |     1221.08 |  0.13095 |     ms |
|                                       99th percentile latency | index-append |     2941.36 |     2332.48 | -608.881 |     ms |
|                                      100th percentile latency | index-append |      3152.5 |     3652.69 |  500.193 |     ms |
|                                  50th percentile service time | index-append |     520.636 |     546.836 |     26.2 |     ms |
|                                  90th percentile service time | index-append |     1220.95 |     1221.08 |  0.13095 |     ms |
|                                  99th percentile service time | index-append |     2941.36 |     2332.48 | -608.881 |     ms |
|                                 100th percentile service time | index-append |      3152.5 |     3652.69 |  500.193 |     ms |
|                                                    error rate | index-append |           0 |           0 |        0 |      % |

@original-brownbear left a comment


LGTM, and it's not even close IMO. We're already doing this for HTTP messages anyway, and saving a bunch of memory is worth a lot more than saving some cycles on the IO loop IMO (since we decode the full messages on the IO loop anyway and use bulk ByteBuf operations throughout everything now, this shouldn't be so bad relative to the former, or in absolute terms).

@Tim-Brooks left a comment


LGTM - as I mentioned when we talked, this is the one that conceptually makes sense for us. We would need evidence AGAINST making this change, IMO. Your benchmarks seem to be slightly in favor of making the change.

Our current usage is essentially complete content aggregation, which should definitely be COMPOSITE. If we were only doing small-frame decoding or something, then MERGE might make more sense.

@henningandersen

Thanks @original-brownbear and @tbrooks8.

@henningandersen merged commit 796cd00 into elastic:master Nov 22, 2019
henningandersen added a commit that referenced this pull request Nov 22, 2019
The default merge cumulator used in the netty transport leads to additional GC pressure and memory copying when a message that exceeds the chunk size is handled. This is especially a problem on G1 GC, since we get many "humongous" allocations, which can in theory cause the real-memory circuit breaker to trip unnecessarily.
henningandersen added a commit that referenced this pull request Nov 22, 2019
The default merge cumulator used in the netty transport leads to additional GC pressure and memory copying when a message that exceeds the chunk size is handled. This is especially a problem on G1 GC, since we get many "humongous" allocations, which can in theory cause the real-memory circuit breaker to trip unnecessarily.
@xjtushilei

(Quoting henningandersen's benchmark comment above in full.)

I'm sorry, but I'd like to ask where the GC optimization effect can be seen in the test report.

@xjtushilei

I think the netty chunk size and -XX:G1HeapRegionSize decide whether “humongous objects” are generated, not the netty composite cumulator.

When a large number of bulk requests keep arriving, a large number of “humongous objects” will still be generated.

@Tim-Brooks

I think the netty chunk size and -XX:G1HeapRegionSize decide whether “humongous objects” are generated, not the netty composite cumulator.

When a large number of bulk requests keep arriving, a large number of “humongous objects” will still be generated.

The MERGE cumulator forces the entire message we are cumulating to be in a single buffer. This means constant resizing and probably reallocation of the buffer. If the message is greater than 16 MB, it will not fit in a recyclable chunk, which means the message becomes a non-recycled allocation.

Using the composite cumulator means that chunks are aggregated in a collection. The allocations should consistently be recycled buffers.
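To illustrate the difference, here is a rough sketch of the two cumulation strategies. It is simplified, not Netty's actual MERGE_CUMULATOR/COMPOSITE_CUMULATOR code, which also handles edge cases such as read-only and already-composite inputs.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.CompositeByteBuf;

public class CumulatorSketch {

    // Roughly what MERGE-style cumulation does: keep one contiguous buffer and,
    // whenever the next chunk does not fit, allocate a bigger buffer and copy
    // the already-cumulated bytes into it.
    static ByteBuf mergeStyle(ByteBufAllocator alloc, ByteBuf cumulation, ByteBuf in) {
        if (cumulation.writableBytes() < in.readableBytes()) {
            ByteBuf bigger = alloc.buffer(cumulation.readableBytes() + in.readableBytes());
            bigger.writeBytes(cumulation);   // copy, plus a possibly very large new allocation
            cumulation.release();
            cumulation = bigger;
        }
        cumulation.writeBytes(in);
        in.release();
        return cumulation;
    }

    // Roughly what COMPOSITE-style cumulation does: keep each chunk as a
    // component of a CompositeByteBuf, so no copy and no single large
    // contiguous allocation are needed.
    static ByteBuf compositeStyle(CompositeByteBuf cumulation, ByteBuf in) {
        // addComponent(true, ...) takes ownership of 'in' and advances the writer index.
        return cumulation.addComponent(true, in);
    }
}
```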

@henningandersen

@xjtushilei you cannot see the GC optimization effect of the change in that report. The main reason for performance testing this change was to ensure that performance did not degrade.

The GC benefit mainly happens when receiving very large transport requests (or returning similarly large responses). Requests above 16 MB (and in particular those that are much larger) would cause additional humongous allocations and garbage, which we now handle better with this change. It should be noted that provoking such large requests or responses is not recommended.

xjtushilei referenced this pull request Dec 6, 2019
This commit reverts switching to the unpooled allocator (for now) to let
some benchmarks run to see if this is the source of an increase in GC
times.

Relates #22452
@xjtushilei

@henningandersen @tbrooks8 Hi, thank you for your replies.

I think the netty chunk size and -XX:G1HeapRegionSize decide whether “humongous objects” are generated, not the netty composite cumulator.

My suggestion is practical and has been verified: with -XX:G1HeapRegionSize=32M you can set -Dio.netty.allocator.maxOrder=10, so the chunk size becomes 8 MB and no more “humongous objects” are created.

When a large number of bulk requests keep arriving, a large number of “humongous objects” will still be generated.

I'm sorry, but maybe the "Netty4: switch to composite cumulator" optimization doesn't work. I tested Elasticsearch with the "switch to composite cumulator" change, then dumped the JVM heap and found there were still a lot of “humongous objects” generated by netty.
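For reference, a sketch of the arithmetic behind that tuning suggestion. The page size and max order used here reflect the netty 4.1 defaults of the time and should be treated as assumptions rather than Elasticsearch's exact configuration.

```java
public class NettyChunkSizeMath {
    public static void main(String[] args) {
        // Netty's pooled chunk size is pageSize << maxOrder.
        // Assumed defaults (netty 4.1 at the time): pageSize = 8 KiB, maxOrder = 11.
        int pageSize = 8 * 1024;
        int defaultChunk = pageSize << 11;  // 16 MiB
        // With -XX:G1HeapRegionSize=32m, any object of half a region (16 MiB) or
        // larger is humongous, so a 16 MiB chunk qualifies.
        // -Dio.netty.allocator.maxOrder=10 halves the chunk size:
        int tunedChunk = pageSize << 10;    // 8 MiB, below the humongous threshold
        System.out.println(defaultChunk + " bytes vs " + tunedChunk + " bytes");
    }
}
```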

@xjtushilei

@tbrooks8 @henningandersen
Hello, can you refute my point above? Or have you tried a test?

@Tim-Brooks

@xjtushilei Our responses were not disagreeing with anything you were saying. We are aware that the default netty chunk size is a humongous allocation. This PR was about removing a major source of large ad-hoc allocations that are not recycled.

We are aware that the 16 MB pooled chunks are humongous (forced into dedicated regions). We are still looking into whether this is something we want to mitigate.
