
Akka.Remote: improved write performance with DotNetty flush-batching #4106

Merged (26 commits) · Jan 21, 2020

Conversation

Aaronontheweb
Member

Taken from one of the very first performance optimizations recommended here: http://normanmaurer.me/presentations/2014-facebook-eng-netty/slides.html

@Aaronontheweb
Member Author

Looks like these changes are interfering with Akka.Remote clean shutdown at the moment - need to fix that.

public override Task WriteAsync(IChannelHandlerContext context, object message)
{
    var write = base.WriteAsync(context, message);
    if (++_currentPendingWrites == _maxPendingWrites)
Member Author

A potentially better design: since all of the messages being buffered into the Channel are IByteBuf instances, we can compute the total length of the writes buffered for this socket so far.

Therefore, it might be a better idea to change our buffering strategy to flush when a certain number of bytes are pending, rather than counting the number of messages.

That being said - if the write rate is high and the messages are consistently small, we don't want to buffer them for too long either, so we might need a more adaptive strategy based on how the channel is actually used.
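As a rough illustration of that idea (this is a sketch, not code from this PR - the handler name and fields are made up), a byte-count threshold could piggyback on IByteBuffer.ReadableBytes:

```csharp
using System.Threading.Tasks;
using DotNetty.Buffers;
using DotNetty.Transport.Channels;

// Hypothetical byte-threshold flush handler - illustrative only.
public class ByteThresholdFlushHandler : ChannelHandlerAdapter
{
    private readonly int _maxPendingBytes;
    private int _currentPendingBytes;

    public ByteThresholdFlushHandler(int maxPendingBytes)
    {
        _maxPendingBytes = maxPendingBytes;
    }

    public override Task WriteAsync(IChannelHandlerContext context, object message)
    {
        var write = base.WriteAsync(context, message);

        // Every outbound remoting payload here is an IByteBuffer, so we can track
        // roughly how many bytes are sitting unflushed in the outbound buffer.
        if (message is IByteBuffer buf)
            _currentPendingBytes += buf.ReadableBytes;

        if (_currentPendingBytes >= _maxPendingBytes)
        {
            context.Flush(); // one physical flush covering many logical writes
            _currentPendingBytes = 0;
        }

        return write;
    }
}
```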

void ScheduleFlush(IChannelHandlerContext context)
{
    // Schedule a recurring flush - only fires when there's writable data
    var time = TimeSpan.FromMilliseconds(_maxPendingMillis);
Member Author

By default, we check whether messages need to be flushed every 40ms - this is designed for very low-volume systems that will probably never hit the message-count / max-bytes thresholds I set earlier.
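For illustration only - assuming DotNetty's IEventExecutor exposes a Schedule(Action, TimeSpan) overload, and using made-up field names rather than the PR's exact code - the recurring check might look roughly like this:

```csharp
// Sketch: re-arm a fallback flush timer on the channel's event loop.
// _currentPendingWrites and _maxPendingMillis are assumed fields of the handler above.
void ScheduleFlush(IChannelHandlerContext context)
{
    context.Executor.Schedule(() =>
    {
        // Only flush if something has been buffered since the last flush.
        if (_currentPendingWrites > 0)
        {
            context.Flush();
            _currentPendingWrites = 0;
        }

        // Keep the timer running for the lifetime of the handler.
        ScheduleFlush(context);
    }, TimeSpan.FromMilliseconds(_maxPendingMillis));
}
```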

@Aaronontheweb marked this pull request as ready for review December 18, 2019 18:50
@Aaronontheweb
Member Author

Going to pull some figures from #4108 momentarily

@Aaronontheweb
Member Author

Dev Benchmark Results

3 runs, all on the same machine (12-core Intel i7 2.6GHz Dell laptop)

Run 1

ProcessorCount: 12
ClockSpeed: 0 MHZ
Actor Count: 24
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 125000, 1600.57
5, 1000000, 101031, 9898.05
10, 2000000, 43070, 46437.06
15, 3000000, 133559, 22462.01
20, 4000000, 33977, 117729.04
25, 5000000, 117889, 42413.91
30, 6000000, 118850, 50484.55
Done..

Run 2

C:\Repositories\akka.net\src\benchmark\RemotePingPong [increase-RemotePingPong ≡]
λ dotnet run -c Release --framework netcoreapp2.1
ProcessorCount: 12
ClockSpeed: 0 MHZ
Actor Count: 24
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 84818, 2358.84
5, 1000000, 100868, 9914.90
10, 2000000, 138351, 14456.42
15, 3000000, 37722, 79531.70
20, 4000000, 35562, 112481.36
25, 5000000, 32807, 152410.82

Run 3

C:\Repositories\akka.net\src\benchmark\RemotePingPong [increase-RemotePingPong ≡]
λ dotnet run -c Release --framework netcoreapp2.1
ProcessorCount: 12
ClockSpeed: 0 MHZ
Actor Count: 24
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 69736, 2868.60
5, 1000000, 141243, 7080.98
10, 2000000, 136771, 14623.27
15, 3000000, 38190, 78556.49
20, 4000000, 32401, 123454.60
25, 5000000, 33341, 149967.08
30, 6000000, 126093, 47584.92

@Aaronontheweb
Member Author

On the dev branch, running at 200k messages per connection, global throughput was anywhere from 32401 msg/s to 141243 msg/s - a lot of variance here. When I was poking around some of the remoting code earlier this week, I figured there was no way this could be a function of GC or serialization overhead; it had to be a system call that was responsible for this huge variation.

@Aaronontheweb
Member Author

dotnetty-batching Results

Same machine as the dev tests

Run 1

λ dotnet run -c Release --framework netcoreapp2.1
ProcessorCount: 12
ClockSpeed: 0 MHZ
Actor Count: 24
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 86656, 2308.78
5, 1000000, 161187, 6204.40
10, 2000000, 151780, 13177.96
15, 3000000, 148640, 20183.48
20, 4000000, 146280, 27345.40
25, 5000000, 145341, 34402.19
30, 6000000, 143958, 41679.71
Done..

Run 2

C:\Repositories\akka.net\src\benchmark\RemotePingPong [dotnetty-batching ≡]
λ dotnet run -c Release --framework netcoreapp2.1
ProcessorCount: 12
ClockSpeed: 0 MHZ
Actor Count: 24
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 106610, 1876.02
5, 1000000, 161031, 6210.60
10, 2000000, 145805, 13717.31
15, 3000000, 145209, 20660.47
20, 4000000, 143211, 27931.22
25, 5000000, 142621, 35058.92
30, 6000000, 143147, 41915.11
Done..

I only ran 2 benchmarks because the values were so consistent - during all of the "higher volume" tests, i.e. with more than a million request->response pairs, the system consistently ran between 142k and 145k msg/s. Performance was a bit lower for the smallest possible test value and higher for the ~1m sweet spot.

The worst-case performance of this build is about equal to the best-case performance of the dev branch, it's much more consistent, and it's entirely unoptimized - I'm just using arbitrary values I picked. This PR works by grouping logical writes together into larger physical writes, taking advantage of DotNetty's pipeline to avoid flushing to the socket on every single write.

Flushes are now done according to the following algorithm:

For int maxPendingWrites = 20, int maxPendingMillis = 40, int maxPendingBytes = 128000:

if currentWrites >= maxPendingWrites || currentBytes >= maxPendingBytes, then flush;
else wait for more writes, unless 40ms expires, in which case we flush anyway.
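In code form, that decision amounts to something like the following (a sketch with made-up names, not the literal BatchWriter implementation - only the default values come from this comment):

```csharp
using DotNetty.Buffers;

// Illustrative threshold tracker for the algorithm described above.
class FlushThresholds
{
    public const int MaxPendingWrites = 20;
    public const int MaxPendingBytes = 128000;
    public const int MaxPendingMillis = 40; // fallback timer interval

    private int _currentWrites;
    private int _currentBytes;

    // Called once per logical write; returns true when a physical flush is due.
    public bool MarkWrite(IByteBuffer payload)
    {
        _currentWrites++;
        _currentBytes += payload.ReadableBytes;
        return _currentWrites >= MaxPendingWrites || _currentBytes >= MaxPendingBytes;
    }

    // Reset after every flush, whether threshold-triggered or timer-triggered.
    public void Reset()
    {
        _currentWrites = 0;
        _currentBytes = 0;
    }
}
```

When MarkWrite returns false, nothing happens immediately; the recurring MaxPendingMillis timer is what eventually flushes whatever is left over.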

I thought about writing some adaptive code to determine the optimal rate for flushing, but at the moment that seems unnecessary and complicated. Given the improvements from these static batch values, I'm inclined to just merge this and push further optimization into a subsequent pull request.

@Aaronontheweb
Member Author

Should help address #2378

         return true;
     }
     return false;
 }
 
-private IByteBuffer ToByteBuffer(ByteString payload)
+[MethodImpl(MethodImplOptions.AggressiveInlining)]
+private static IByteBuffer ToByteBuffer(IChannel channel, ByteString payload)
 {
     //TODO: optimize DotNetty byte buffer usage
     // (maybe custom IByteBuffer working directly on ByteString?)
Member Author

Currently, this is the best implementation possible until we support System.Memory in the Akka.NET core. I tested a bunch of other approaches.
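For context, the copy-then-wrap shape being referred to looks roughly like this (a sketch assuming Google.Protobuf's ByteString and DotNetty's Unpooled, not necessarily the exact body of the method above):

```csharp
using DotNetty.Buffers;
using Google.Protobuf;

// Sketch: materialize the ByteString into a byte[] and wrap it in an IByteBuffer.
// The ToByteArray() copy is the step that System.Memory support could eliminate.
static IByteBuffer ToByteBuffer(ByteString payload)
{
    return Unpooled.WrappedBuffer(payload.ToByteArray());
}
```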


public override Task WriteAsync(IChannelHandlerContext context, object message)
{
    var write = base.WriteAsync(context, message);
Member Author

Might want to add a comment here - the prior WriteAsync call needs to complete first, before we call Flush; otherwise the message currently being written may not be included in the flush even though it was counted against `_currentPendingWrites`.
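Concretely, a comment along these lines (my wording, assumed field name) above the flush check would capture that constraint:

```csharp
// NOTE: base.WriteAsync must run first so the message is already queued in the
// outbound buffer when the flush thresholds are evaluated; otherwise a message
// already counted against _currentPendingWrites could be left out of this flush.
```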

@Aaronontheweb
Member Author

Cluster.Sharding specs are failing regularly on this PR - need to investigate whether that's related to the remoting changes in this branch. Looks like this won't make it into the v1.3.17 release.

@Aaronontheweb
Member Author

I figured out the issue behind both the NodeChurn spec and the RemoteDeliverySpec failures - the problem is that both of these tests depend on a large volume of messages being delivered intermittently in request/response fashion, but below the default thresholds I programmed into the BatchWriter:

for (var n = 1; n <= 500; n++)
{
    p1.Tell(new RemoteDeliveryMultiNetSpec.Letter(n, route));
    var letterNumber = n;
    ExpectMsg<RemoteDeliveryMultiNetSpec.Letter>(
        letter => letter.N == letterNumber && letter.Route.Count == 0,
        TimeSpan.FromSeconds(5));
    // if the loop count is increased, it's good to have some progress feedback
    if (n % 10000 == 0)
    {
        Log.Info("Passed [{0}]", n);
    }
}

We're not going to be able to hit those thresholds by default with this traffic pattern, so the "timer" stage, which runs every 40ms, is what ultimately causes this data to be flushed over the wire. That adds a significant amount of overhead in this scenario.

I'm going to make the batching stage fully configurable via HOCON so it can be performance-tuned on a case-by-case basis, and I think the max byte size should be smaller than 128k by default.
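As a sketch of what "configurable via HOCON" might look like (the class name, key names, and fallback defaults here are hypothetical, not the final shipped configuration), the settings could be bound from an Akka.Configuration.Config section:

```csharp
using System;
using Akka.Configuration;

// Hypothetical settings holder for the batching handler; key names are illustrative.
public class BatchWriterSettings
{
    public int MaxPendingWrites { get; }
    public int MaxPendingBytes { get; }
    public TimeSpan FlushInterval { get; }

    public BatchWriterSettings(Config config)
    {
        // Fall back to the current (pre-tuning) defaults when a key is missing.
        MaxPendingWrites = config.GetInt("max-pending-writes", 20);
        MaxPendingBytes = config.GetInt("max-pending-bytes", 128000);
        FlushInterval = config.GetTimeSpan("flush-interval", TimeSpan.FromMilliseconds(40));
    }
}
```

The transport could then hand an instance of this to the batching handler when it builds the channel pipeline.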
