
Akka.Remote: improved write performance with DotNetty flush-batching #4106

Merged (26 commits) · Jan 21, 2020

Conversation

Aaronontheweb
Member

Taken from one of the very first performance optimizations recommended here: http://normanmaurer.me/presentations/2014-facebook-eng-netty/slides.html

@Aaronontheweb
Member Author

Looks like these changes are interfering with Akka.Remote clean shutdown at the moment - need to fix that.

public override Task WriteAsync(IChannelHandlerContext context, object message)
{
    var write = base.WriteAsync(context, message);
    if (++_currentPendingWrites == _maxPendingWrites)
Member Author

A potentially better design: since all of the messages being buffered into the Channel are IByteBuf instances, we can compute the total length of the writes buffered for this socket so far.

Therefore, it might be a better idea to change our buffering strategy to flush when a certain number of bytes are pending, rather than counting the number of messages.

That being said - if the write rate is high and the messages are consistently small, we don't want to buffer them for too long either, so we might need a more adaptive strategy based on how the channel is actually used.
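As a rough illustration of that idea (this is a sketch, not code from this PR - the handler name and fields are made up), a byte-count threshold could piggyback on IByteBuffer.ReadableBytes:

```csharp
using System.Threading.Tasks;
using DotNetty.Buffers;
using DotNetty.Transport.Channels;

// Hypothetical byte-threshold flush handler - illustrative only.
public class ByteThresholdFlushHandler : ChannelHandlerAdapter
{
    private readonly int _maxPendingBytes;
    private int _currentPendingBytes;

    public ByteThresholdFlushHandler(int maxPendingBytes)
    {
        _maxPendingBytes = maxPendingBytes;
    }

    public override Task WriteAsync(IChannelHandlerContext context, object message)
    {
        var write = base.WriteAsync(context, message);

        // Every outbound remoting payload here is an IByteBuffer, so we can track
        // roughly how many bytes are sitting unflushed in the outbound buffer.
        if (message is IByteBuffer buf)
            _currentPendingBytes += buf.ReadableBytes;

        if (_currentPendingBytes >= _maxPendingBytes)
        {
            context.Flush(); // one physical flush covering many logical writes
            _currentPendingBytes = 0;
        }

        return write;
    }
}
```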

void ScheduleFlush(IChannelHandlerContext context)
{
    // Schedule a recurring flush - only fires when there's writable data
    var time = TimeSpan.FromMilliseconds(_maxPendingMillis);
Member Author

By default, we check whether messages need to be flushed every 40ms - this is designed for very low-volume systems that will probably never hit the message-count / max-bytes thresholds I set earlier.
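For illustration only - assuming DotNetty's IEventExecutor exposes a Schedule(Action, TimeSpan) overload, and using made-up field names rather than the PR's exact code - the recurring check might look roughly like this:

```csharp
// Sketch: re-arm a fallback flush timer on the channel's event loop.
// _currentPendingWrites and _maxPendingMillis are assumed fields of the handler above.
void ScheduleFlush(IChannelHandlerContext context)
{
    context.Executor.Schedule(() =>
    {
        // Only flush if something has been buffered since the last flush.
        if (_currentPendingWrites > 0)
        {
            context.Flush();
            _currentPendingWrites = 0;
        }

        // Keep the timer running for the lifetime of the handler.
        ScheduleFlush(context);
    }, TimeSpan.FromMilliseconds(_maxPendingMillis));
}
```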

@Aaronontheweb marked this pull request as ready for review December 18, 2019 18:50
@Aaronontheweb
Member Author

Going to pull some figures from #4108 momentarily

@Aaronontheweb
Member Author

Dev Benchmark Results

3 runs, all on the same machine (12-core Intel i7 2.6GHz Dell laptop)

Run 1

ProcessorCount: 12
ClockSpeed: 0 MHZ
Actor Count: 24
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 125000, 1600.57
5, 1000000, 101031, 9898.05
10, 2000000, 43070, 46437.06
15, 3000000, 133559, 22462.01
20, 4000000, 33977, 117729.04
25, 5000000, 117889, 42413.91
30, 6000000, 118850, 50484.55
Done..

Run 2

C:\Repositories\akka.net\src\benchmark\RemotePingPong [increase-RemotePingPong ≡]
λ dotnet run -c Release --framework netcoreapp2.1
ProcessorCount: 12
ClockSpeed: 0 MHZ
Actor Count: 24
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 84818, 2358.84
5, 1000000, 100868, 9914.90
10, 2000000, 138351, 14456.42
15, 3000000, 37722, 79531.70
20, 4000000, 35562, 112481.36
25, 5000000, 32807, 152410.82

Run 3

C:\Repositories\akka.net\src\benchmark\RemotePingPong [increase-RemotePingPong ≡]
λ dotnet run -c Release --framework netcoreapp2.1
ProcessorCount: 12
ClockSpeed: 0 MHZ
Actor Count: 24
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 69736, 2868.60
5, 1000000, 141243, 7080.98
10, 2000000, 136771, 14623.27
15, 3000000, 38190, 78556.49
20, 4000000, 32401, 123454.60
25, 5000000, 33341, 149967.08
30, 6000000, 126093, 47584.92

@Aaronontheweb
Member Author

On the dev branch, running at 200k messages per connection, global throughput was anywhere from 32401 msg/s to 141243 msg/s - a lot of variance here. When I was poking around some of the remoting code earlier this week, I figured there was no way this could be a function of GC or serialization overhead; it had to be a system call that was responsible for this huge variation.

@Aaronontheweb
Member Author

dotnetty-batching Results

Same machine as the dev tests

Run 1

λ dotnet run -c Release --framework netcoreapp2.1
ProcessorCount: 12
ClockSpeed: 0 MHZ
Actor Count: 24
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 86656, 2308.78
5, 1000000, 161187, 6204.40
10, 2000000, 151780, 13177.96
15, 3000000, 148640, 20183.48
20, 4000000, 146280, 27345.40
25, 5000000, 145341, 34402.19
30, 6000000, 143958, 41679.71
Done..

Run 2

C:\Repositories\akka.net\src\benchmark\RemotePingPong [dotnetty-batching ≡]
λ dotnet run -c Release --framework netcoreapp2.1
ProcessorCount: 12
ClockSpeed: 0 MHZ
Actor Count: 24
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 106610, 1876.02
5, 1000000, 161031, 6210.60
10, 2000000, 145805, 13717.31
15, 3000000, 145209, 20660.47
20, 4000000, 143211, 27931.22
25, 5000000, 142621, 35058.92
30, 6000000, 143147, 41915.11
Done..

I only ran 2 benchmarks because the values were so consistent - during all of the "higher volume" tests, i.e. with more than a million request->response pairs, the system consistently ran between 142k and 145k msg/s. Performance was a bit lower for the smallest possible test value and higher for the ~1m sweet spot.

The worst-case performance of this build is about equal to the best-case performance of the dev branch, it's much more consistent, and it's entirely unoptimized - I'm just using arbitrary values I picked. This PR works by grouping logical writes together into larger physical writes, taking advantage of DotNetty's pipeline to avoid flushing to the socket on every single write.

Flushes are now done according to the following algorithm:

For int maxPendingWrites = 20, int maxPendingMillis = 40, int maxPendingBytes = 128000:

if currentWrites >= maxPendingWrites || currentBytes >= maxPendingBytes, then flush;
else wait for more writes, unless 40ms expires, in which case we flush anyway.
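In code form, that decision amounts to something like the following (a sketch with made-up names, not the literal BatchWriter implementation - only the default values come from this comment):

```csharp
using DotNetty.Buffers;

// Illustrative threshold tracker for the algorithm described above.
class FlushThresholds
{
    public const int MaxPendingWrites = 20;
    public const int MaxPendingBytes = 128000;
    public const int MaxPendingMillis = 40; // fallback timer interval

    private int _currentWrites;
    private int _currentBytes;

    // Called once per logical write; returns true when a physical flush is due.
    public bool MarkWrite(IByteBuffer payload)
    {
        _currentWrites++;
        _currentBytes += payload.ReadableBytes;
        return _currentWrites >= MaxPendingWrites || _currentBytes >= MaxPendingBytes;
    }

    // Reset after every flush, whether threshold-triggered or timer-triggered.
    public void Reset()
    {
        _currentWrites = 0;
        _currentBytes = 0;
    }
}
```

When MarkWrite returns false, nothing happens immediately; the recurring MaxPendingMillis timer is what eventually flushes whatever is left over.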

I thought about writing some adaptive code to determine the optimal rate for flushing, but at the moment that seems unnecessary and complicated. Given the improvements from these static batch values, I'm inclined to just merge this and push further optimization into a subsequent pull request.

@Aaronontheweb
Member Author

Should help address #2378

         return true;
     }
     return false;
 }
 
-private IByteBuffer ToByteBuffer(ByteString payload)
+[MethodImpl(MethodImplOptions.AggressiveInlining)]
+private static IByteBuffer ToByteBuffer(IChannel channel, ByteString payload)
 {
     //TODO: optimize DotNetty byte buffer usage
     // (maybe custom IByteBuffer working directly on ByteString?)
Member Author

Currently, this is the best implementation possible until we support System.Memory in the Akka.NET core. I tested a bunch of other approaches.
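For context, the copy-then-wrap shape being referred to looks roughly like this (a sketch assuming Google.Protobuf's ByteString and DotNetty's Unpooled, not necessarily the exact body of the method above):

```csharp
using DotNetty.Buffers;
using Google.Protobuf;

// Sketch: materialize the ByteString into a byte[] and wrap it in an IByteBuffer.
// The ToByteArray() copy is the step that System.Memory support could eliminate.
static IByteBuffer ToByteBuffer(ByteString payload)
{
    return Unpooled.WrappedBuffer(payload.ToByteArray());
}
```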


public override Task WriteAsync(IChannelHandlerContext context, object message)
{
    var write = base.WriteAsync(context, message);
Member Author

Might want to add a comment here - the prior WriteAsync call needs to complete first, before we call Flush; otherwise the message currently being written may not be included in the flush even though it was counted against `_currentPendingWrites`.
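Concretely, a comment along these lines (my wording, assumed field name) above the flush check would capture that constraint:

```csharp
// NOTE: base.WriteAsync must run first so the message is already queued in the
// outbound buffer when the flush thresholds are evaluated; otherwise a message
// already counted against _currentPendingWrites could be left out of this flush.
```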

@Aaronontheweb
Member Author

Cluster.Sharding specs are failing regularly on this PR - need to investigate whether that's related to the remoting changes in this branch. Looks like this won't make it into the v1.3.17 release.

@Aaronontheweb
Member Author

I figured out the issue behind both the NodeChurn spec and the RemoteDeliverySpec failures - the problem is that both of these tests depend on a large volume of messages being delivered intermittently in request/response fashion, but below the default thresholds I programmed into the BatchWriter:

for (var n = 1; n <= 500; n++)
{
    p1.Tell(new RemoteDeliveryMultiNetSpec.Letter(n, route));
    var letterNumber = n;
    ExpectMsg<RemoteDeliveryMultiNetSpec.Letter>(
        letter => letter.N == letterNumber && letter.Route.Count == 0,
        TimeSpan.FromSeconds(5));
    // if the loop count is increased, it's good to have some progress feedback
    if (n % 10000 == 0)
    {
        Log.Info("Passed [{0}]", n);
    }
}

We're not going to be able to hit those thresholds by default with this traffic pattern, so the "timer" stage, which runs every 40ms, is what ultimately causes this data to be flushed over the wire. That adds a significant amount of overhead in this scenario.

I'm going to make the batching stage fully configurable via HOCON so it can be performance-tuned on a case-by-case basis, and I think the max byte size should be smaller than 128k by default.
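As a sketch of what "configurable via HOCON" might look like (the class name, key names, and fallback defaults here are hypothetical, not the final shipped configuration), the settings could be bound from an Akka.Configuration.Config section:

```csharp
using System;
using Akka.Configuration;

// Hypothetical settings holder for the batching handler; key names are illustrative.
public class BatchWriterSettings
{
    public int MaxPendingWrites { get; }
    public int MaxPendingBytes { get; }
    public TimeSpan FlushInterval { get; }

    public BatchWriterSettings(Config config)
    {
        // Fall back to the current (pre-tuning) defaults when a key is missing.
        MaxPendingWrites = config.GetInt("max-pending-writes", 20);
        MaxPendingBytes = config.GetInt("max-pending-bytes", 128000);
        FlushInterval = config.GetTimeSpan("flush-interval", TimeSpan.FromMilliseconds(40));
    }
}
```

The transport could then hand an instance of this to the batching handler when it builds the channel pipeline.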
