Akka.Remote: improved write performance with DotNetty flush-batching #4106
Conversation
Looks like these changes are interfering with Akka.Remote clean shutdown at the moment - need to fix that.
public override Task WriteAsync(IChannelHandlerContext context, object message)
{
    var write = base.WriteAsync(context, message);
    if (++_currentPendingWrites == _maxPendingWrites)
A better design for doing this, potentially: since all of the messages being buffered into the Channel are IByteBufs, we can compute the total length of the writes buffered for this socket so far. Therefore, it might be a better idea to change our buffering strategy to flush when X bytes are pending, rather than counting the number of messages.

That being said - if the write rate is pretty high and the message length is consistently small, we don't want to unnecessarily buffer those for too long either, so we might need a strategy that is more adaptive, based on how the channel is actually used.
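For illustration, a minimal sketch of the measurement half of that idea (the class and helper names here are made up, not part of the PR): because every outbound message at this stage is a DotNetty buffer, its contribution to the pending byte count is just its ReadableBytes.

```csharp
using DotNetty.Buffers;

internal static class FlushAccounting
{
    // Hypothetical helper: how many bytes this outbound message adds to the pending batch.
    // A byte-based strategy would accumulate this per write and flush once the running
    // total crosses a configured limit, instead of counting messages.
    public static int OutboundSize(object message) =>
        message is IByteBuffer buf ? buf.ReadableBytes : 0;
}
```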
void ScheduleFlush(IChannelHandlerContext context)
{
    // Schedule a recurring flush - only fires when there's writable data
    var time = TimeSpan.FromMilliseconds(_maxPendingMillis);
By default, we check to see if messages need to be flushed every 40ms - this is designed for very low-volume systems, which will probably never hit the message-count / max-bytes thresholds I set earlier.
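To make the mechanism concrete, here is a rough sketch of how such a recurring flush could be scheduled on the channel's executor; it reuses the _maxPendingMillis and _currentPendingWrites names from the diff above, but it is only an illustration of the shape, not the PR's actual implementation.

```csharp
// Re-arms itself on the channel's event loop every _maxPendingMillis and only
// flushes when writes have actually accumulated since the last flush.
void ScheduleFlush(IChannelHandlerContext context)
{
    var time = TimeSpan.FromMilliseconds(_maxPendingMillis);
    context.Executor.Schedule(() =>
    {
        if (_currentPendingWrites > 0)
        {
            _currentPendingWrites = 0;
            context.Flush();
        }

        // Keep the timer going for the lifetime of the channel.
        ScheduleFlush(context);
    }, time);
}
```

Because the scheduled action runs on the channel's own event loop, the counter can be read and reset without additional synchronization.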
Going to pull some figures from #4108 momentarily.
Dev Benchmark Results

3 runs, all on the same machine (12 core Intel i7 2.6GHz Dell laptop).

Run 1
ProcessorCount: 12
Num clients, Total [msg], Msgs/sec, Total [ms]

Run 2
C:\Repositories\akka.net\src\benchmark\RemotePingPong [increase-RemotePingPong ≡]
Num clients, Total [msg], Msgs/sec, Total [ms]

Run 3
C:\Repositories\akka.net\src\benchmark\RemotePingPong [increase-RemotePingPong ≡]
Num clients, Total [msg], Msgs/sec, Total [ms]
On the
dotnetty-batching Results

Same machine as the dev benchmark runs.

Run 1
λ dotnet run -c Release --framework netcoreapp2.1
Num clients, Total [msg], Msgs/sec, Total [ms]

Run 2
C:\Repositories\akka.net\src\benchmark\RemotePingPong [dotnetty-batching ≡]
Num clients, Total [msg], Msgs/sec, Total [ms]

I only ran 2 benchmarks because the values were so consistent - during all of the "higher volume" tests, i.e. with more than a million request->response pairs, the system consistently ran between 142k and 145k msg/s. Performance was a bit lower for the smallest possible test value and higher for the ~1m sweet spot. The worst case performance of this build is about equal to the best case performance of the last one, it is much more consistent, and it is entirely unoptimized - I'm just using the arbitrary values I picked.

This PR works by grouping logical writes together into larger physical writes, taking advantage of DotNetty's pipeline to avoid flushing to the socket on every single write. Flushes are now done according to the following algorithm: for every write, if currentWrites >= maxPendingWrites || currentBytes >= maxPendingBytes, then flush.

I thought about writing some adaptive code to determine the optimal rate for flushing, but at the moment that seems unnecessary and complicated. Given the improvements from using these static batch values, I'm inclined to just merge this and push further optimization into a subsequent pull request.
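As a concrete sketch of how those two static thresholds might be combined in a single DotNetty outbound handler - the class name, fields, and structure below are illustrative, not the PR's exact code:

```csharp
using System.Threading.Tasks;
using DotNetty.Buffers;
using DotNetty.Transport.Channels;

public sealed class BatchingFlushHandler : ChannelHandlerAdapter
{
    private readonly int _maxPendingWrites;
    private readonly long _maxPendingBytes;
    private int _currentPendingWrites;
    private long _currentPendingBytes;

    public BatchingFlushHandler(int maxPendingWrites, long maxPendingBytes)
    {
        _maxPendingWrites = maxPendingWrites;
        _maxPendingBytes = maxPendingBytes;
    }

    public override Task WriteAsync(IChannelHandlerContext context, object message)
    {
        _currentPendingWrites++;
        if (message is IByteBuffer buf)
            _currentPendingBytes += buf.ReadableBytes;

        // Queue the write first so that, if the thresholds below trip, the message
        // we just counted is included in the flush.
        var write = base.WriteAsync(context, message);

        if (_currentPendingWrites >= _maxPendingWrites ||
            _currentPendingBytes >= _maxPendingBytes)
        {
            _currentPendingWrites = 0;
            _currentPendingBytes = 0;
            context.Flush();
        }

        return write;
    }
}
```

A handler like this only covers the high-volume path; the 40ms timer discussed above is what picks up whatever remains pending on quiet channels.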
Should help address #2378.
        return true;
    }
    return false;
}
-        private IByteBuffer ToByteBuffer(ByteString payload)
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        private static IByteBuffer ToByteBuffer(IChannel channel, ByteString payload)
         {
             //TODO: optimize DotNetty byte buffer usage
             // (maybe custom IByteBuffer working directly on ByteString?)
Currently, this is the best possible implementation until we support System.Memory in Akka.NET Core. I tested a bunch of other approaches.
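For readers following along, here is one straightforward shape this conversion can take - copy the protobuf ByteString into a managed array once and wrap it in an unpooled buffer. This is only a sketch of the approach being discussed; whether the real method instead writes into a buffer obtained from the channel's allocator is an implementation detail not asserted here, and the wrapper class name is invented for the example.

```csharp
using DotNetty.Buffers;
using Google.Protobuf;

internal static class ByteBufferConversion
{
    // Copy-once approach: pre-System.Memory there is no zero-copy path out of
    // ByteString, so the payload is copied into a byte[] and then wrapped
    // (WrappedBuffer does not copy again).
    public static IByteBuffer ToByteBuffer(ByteString payload) =>
        Unpooled.WrappedBuffer(payload.ToByteArray());
}
```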
public override Task WriteAsync(IChannelHandlerContext context, object message)
{
    var write = base.WriteAsync(context, message);
Might want to add a comment here - we need to complete the prior WriteAsync call before we call flush, otherwise the message currently being written may not be included in the flush even though it was counted against `_currentPendingWrites`.
Cluster.Sharding specs are failing regularly on this PR - need to investigate to see if that's related to the remoting changes in this branch. Looks like this won't make it into the v1.3.17 release.
I figured out the issue with both the NodeChurn spec and RemoteDeliverySpec failures in this instance - the problem is that both of these tests depend on a large volume of messages being delivered intermittently in request / response fashion, but below the default thresholds I programmed in:

akka.net/src/core/Akka.Remote.Tests.MultiNode/RemoteDeliverySpec.cs, lines 106 to 119 (at 1191a20)
We're not going to be able to hit these numbers by default, thus the "timer" stage, which runs every 40ms, is going to be what ultimately causes this data to be flushed over the wire. That adds a significant amount of overhead in this scenario. I'm going to make the batching stage fully configurable via HOCON so it can be performance-tuned on a case-by-case basis, and I think the max byte size should be smaller than 128k by default.
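To give a sense of what that HOCON surface might look like, here is a sketch using Akka.NET's ConfigurationFactory; the batching key names and values shown are hypothetical placeholders, not the final setting names.

```csharp
using Akka.Configuration;

// Hypothetical batching keys layered over the default remoting config so the
// thresholds and flush timer can be tuned per deployment.
var config = ConfigurationFactory.ParseString(@"
    akka.remote.dot-netty.tcp.batching {
        enabled = true
        max-pending-writes = 30   # flush after this many queued writes
        max-pending-bytes = 16k   # flush after this many queued bytes
        flush-interval = 40ms     # fallback timer for low-volume channels
    }");
```

The resulting Config would then be supplied to the ActorSystem alongside the rest of the node's configuration.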
Taken from one of the very first performance optimizations recommended here: http://normanmaurer.me/presentations/2014-facebook-eng-netty/slides.html