Netty4: switch to composite cumulator #49478
Conversation
The default merge cumulator used in the Netty transport leads to additional GC pressure and memory copying when a message that exceeds the chunk size is handled. This is especially a problem on G1 GC, since we get many "humongous" allocations and that can in theory cause the real memory circuit breaker to break unnecessarily.

Will add performance test details in separate comments.
Pinging @elastic/es-distributed (:Distributed/Network)
Did a few Rally experiments using geonames.
One noteworthy improvement is that with 6 clients and bulk size 200K, G1 could successfully complete the test with the change and failed with a circuit breaker exception without it. A couple of comparisons follow (baseline is 7.4, contender is 7.4 using the composite cumulator). CMS, standard bulk size and clients:
G1, standard bulk size and clients:
LGTM, and it's not even close IMO. We're doing this for HTTP messages anyway, and saving a bunch of memory is worth a lot more than saving some cycles on the IO loop (since we decode the full messages on the IO loop anyway and use bulk ByteBuf operations throughout everything now, this shouldn't be so bad relative to the former, or in absolute terms).
LGTM - as I mentioned when we talked, this is the one that conceptually makes sense for us. We would need evidence AGAINST making this change IMO. Your benchmarks seem to be slightly in favor of making the change.
Our current usage is essentially complete content aggregation, which should definitely be COMPOSITE. If we were only doing small frames or something, then MERGE might make more sense.
Thanks @original-brownbear and @tbrooks8.
I'm sorry, but I'd like to ask where I can see the GC optimization effect in the test report.
I think the Netty chunk size and -XX:G1HeapRegionSize decide whether "humongous objects" are generated, not Netty's composite cumulator. When a large number of bulk requests keep arriving, a large number of "humongous objects" will still be generated.
The MERGE cumulator forces the entire message we are cumulating to be in a single buffer. This means constant resizing and probably reallocating of the buffer. If the message is greater than 16 MB, it will not fit in a recyclable chunk, which means that the message becomes a non-recycled allocation. Using the composite cumulator means that chunks are aggregated in a collection. The allocations should consistently be recycled buffers.
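For illustration, here is a minimal sketch of what switching the cumulator looks like on a Netty `ByteToMessageDecoder`. The decoder class and its frame logic below are hypothetical, not the actual Elasticsearch transport handler; only the `setCumulator(COMPOSITE_CUMULATOR)` call is the relevant part.

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.ByteToMessageDecoder;

import java.util.List;

// Hypothetical decoder, used only to show the cumulator switch.
public class ExampleFrameDecoder extends ByteToMessageDecoder {

    public ExampleFrameDecoder() {
        // COMPOSITE_CUMULATOR links arriving ByteBufs into a CompositeByteBuf
        // instead of copying them into one ever-growing merged buffer, so the
        // pooled chunks stay recyclable even when the message exceeds 16 MB.
        setCumulator(COMPOSITE_CUMULATOR);
    }

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
        // Frame decoding omitted; `in` may now be a CompositeByteBuf, which is
        // why bulk ByteBuf operations are preferred over per-byte access.
    }
}
```

The tradeoff discussed above still applies: reads over a composite buffer can cost a few extra cycles on the IO loop, which is why the Rally runs were done to confirm there was no regression.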
@xjtushilei you cannot see the GC optimization effect of the change in that report. The main reason for performance testing this change was to ensure that performance did not degrade. The GC benefit mainly happens when receiving very large transport requests (or returning similarly large responses). Requests above 16MB (and in particular those that were much larger) would cause additional humongous allocations and garbage, which we now handle better with this change. It should be noted that provoking such large requests or responses is not recommended.
Hi @henningandersen @tbrooks8, thank you for your replies.
My advice is real and useful, and it has been verified: with -XX:G1HeapRegionSize=32M, you can set -Dio.netty.allocator.maxOrder=10, which reduces the Netty chunk size to 8MB, so no more "humongous objects" are generated.
I'm sorry, but maybe the "Netty4: switch to composite cumulator" optimization in Netty doesn't work.
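For reference, a minimal sketch of those two settings together, assuming they are passed as JVM options (for example via config/jvm.options or ES_JAVA_OPTS); the values are the ones suggested above:

```
# 32MB G1 regions -> allocations above 16MB (half a region) count as "humongous"
-XX:G1HeapRegionSize=32M
# Netty chunk size = pageSize (8KB) << maxOrder = 8MB, below the humongous threshold
-Dio.netty.allocator.maxOrder=10
```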
@tbrooks8 @henningandersen
@xjtushilei Our responses were not disagreeing with anything you were saying. We are aware that the default Netty chunk size is a humongous allocation. This PR was about removing a major source of large ad-hoc allocations that are not recycled. We are aware that the 16MB pooled chunks are humongous (forced into dedicated regions). We are still looking into whether this is something we want to mitigate.