Make `Envelope` a reference type again #6137

Aaronontheweb · 2022-10-05T15:15:13Z

Changes

Converts Envelope back into a sealed class rather than a struct - believe it or not, this significantly reduces total memory allocation and mailbox write latency.

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

* This change follows the Akka.NET API Compatibility Guidelines.
* This change follows the Akka.NET Wire Compatibility Guidelines.
* I have reviewed my own pull request.
* Changes in public API reviewed, if any.

Latest `v1.4` Benchmarks

All benchmarks are running on .NET 6.

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-AMVGCN : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1

| Method | MsgCount | Mean | Error | StdDev | Median | Gen 0 | Allocated |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| EnqueuePerformance | 10000 | 215.7 μs | 4.10 μs | 5.75 μs | 214.2 μs | - | 385 KB |
| RunPerformance | 10000 | 2,604.8 μs | 301.89 μs | 890.12 μs | 3,027.6 μs | - | 12 KB |
| EnqueuePerformance | 100000 | 2,187.6 μs | 42.70 μs | 39.94 μs | 2,168.9 μs | - | 3,074 KB |
| RunPerformance | 100000 | 12,313.3 μs | 319.67 μs | 937.54 μs | 12,185.6 μs | - | 102 KB |
| EnqueuePerformance | 1000000 | 16,205.6 μs | 113.18 μs | 105.87 μs | 16,222.6 μs | - | 24,579 KB |
| RunPerformance | 1000000 | 106,720.7 μs | 2,118.54 μs | 3,298.31 μs | 106,311.6 μs | - | 1,009 KB |
| EnqueuePerformance | 10000000 | 163,270.1 μs | 860.33 μs | 671.69 μs | 163,038.3 μs | - | 245,765 KB |
| RunPerformance | 10000000 | 1,063,885.3 μs | 12,432.78 μs | 11,629.63 μs | 1,063,643.1 μs | 2000.0000 | 10,082 KB |

This PR's Benchmarks

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-ONWZPU : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1

| Method | MsgCount | Mean | Error | StdDev | Median | Gen 0 | Allocated |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| EnqueuePerformance | 10000 | 133.2 μs | 2.48 μs | 3.05 μs | 132.8 μs | - | 258 KB |
| RunPerformance | 10000 | 1,475.9 μs | 135.40 μs | 359.07 μs | 1,398.5 μs | - | 11 KB |
| EnqueuePerformance | 100000 | 1,138.8 μs | 19.96 μs | 50.44 μs | 1,122.1 μs | - | 2,051 KB |
| RunPerformance | 100000 | 11,379.5 μs | 329.55 μs | 971.69 μs | 11,166.7 μs | - | 102 KB |
| EnqueuePerformance | 1000000 | 11,090.3 μs | 172.91 μs | 153.28 μs | 11,124.0 μs | - | 16,387 KB |
| RunPerformance | 1000000 | 111,643.2 μs | 2,053.49 μs | 2,945.05 μs | 110,532.5 μs | - | 1,009 KB |
| EnqueuePerformance | 10000000 | 136,359.7 μs | 4,479.73 μs | 13,208.58 μs | 138,767.1 μs | - | 163,846 KB |
| RunPerformance | 10000000 | 1,041,018.8 μs | 15,412.80 μs | 14,417.15 μs | 1,044,527.0 μs | 2000.0000 | 10,083 KB |
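
As a quick cross-check of the two tables above, the sketch below computes the relative drop in the `Allocated` column for `EnqueuePerformance`. The class and method names are hypothetical helpers made up for this illustration; the input numbers are copied from the tables.

```csharp
using System;

public static class AllocationDelta
{
    // Percentage by which 'after' is smaller than 'before'.
    public static double ReductionPercent(double before, double after)
        => (before - after) / before * 100.0;

    public static void Print()
    {
        // (MsgCount, v1.4 Allocated in KB, this PR's Allocated in KB),
        // EnqueuePerformance rows only, copied from the tables above.
        var rows = new (int MsgCount, double Before, double After)[]
        {
            (10_000, 385, 258),
            (100_000, 3_074, 2_051),
            (1_000_000, 24_579, 16_387),
            (10_000_000, 245_765, 163_846),
        };

        foreach (var row in rows)
        {
            var pct = ReductionPercent(row.Before, row.After);
            Console.WriteLine($"{row.MsgCount,10}: {pct:F1}% less allocated");
        }
    }
}
```

Calling `AllocationDelta.Print()` shows every row landing at roughly a one-third reduction in allocated memory for this single-threaded enqueue benchmark.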

Aaronontheweb

So why does this change significantly improve performance (i.e. reduce memory allocation during mailbox writes by 33% and improve throughput by 10-20%)?

The reason is that the Envelope really works using referential semantics - but is implemented as a value type (struct) currently. As a result of the struct being passed between different scopes (Mailbox, ConcurrentQueue<Envelope>, dequeue into ActorCell, push into actor implementation) it is frequently copied over and over again. While this doesn't necessarily create pressure on the GC, it does create memory pressure and does incur latency / throughput overhead that exceeds the cost of GC itself.

Therefore, it's best to turn the Envelope back into a class and treat it as such. A struct is not a special incantation that magically improves performance; it has to be used in the right context (tightly scoped, or passed around via ref, which isn't feasible here).
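
To make the copy-versus-reference distinction concrete, here is a minimal sketch. `StructEnvelope` and `ClassEnvelope` are hypothetical stand-ins invented for this example (a plain `string` replaces the real sender reference); they are not the actual Akka.NET types, and the queue usage only approximates what a mailbox does.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-ins for illustration only; not the real Akka.NET Envelope.
public struct StructEnvelope
{
    public object Message;
    public string Sender;
}

public sealed class ClassEnvelope
{
    public object Message;
    public string Sender;
}

public static class CopySemanticsDemo
{
    // A struct is copied field by field at every hop:
    // into the queue, out of the queue, and into each method that takes it by value.
    public static bool StructIsCopied()
    {
        var queue = new ConcurrentQueue<StructEnvelope>();
        var original = new StructEnvelope { Message = "hello", Sender = "a" };

        queue.Enqueue(original);            // copy into the queue's storage
        queue.TryDequeue(out var dequeued); // copy back out of the queue
        dequeued.Sender = "b";              // changes only the local copy

        return original.Sender == "a";      // the original is untouched
    }

    // A class instance is allocated once on the heap;
    // each hop moves only a pointer-sized reference to it.
    public static bool ClassIsShared()
    {
        var queue = new ConcurrentQueue<ClassEnvelope>();
        var original = new ClassEnvelope { Message = "hello", Sender = "a" };

        queue.Enqueue(original);
        queue.TryDequeue(out var dequeued);

        return object.ReferenceEquals(original, dequeued);
    }
}
```

Both methods return `true`. The trade-off is that the class version pays for one heap allocation per message, while the struct version pays for a copy at every hop, so which one is cheaper depends on how many hops there are and how much GC pressure the workload can tolerate.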

Aaronontheweb · 2022-10-05T15:16:21Z

src/core/Akka/Actor/Message.cs

@@ -12,7 +12,7 @@ namespace Akka.Actor
    /// <summary>
    /// Envelope class, represents a message and the sender of the message.
    /// </summary>
-    public struct Envelope
+    public sealed class Envelope


This is technically not a "binary compatible change" but I don't think it will break anything for end users:

* Envelope, although a public type, is typically not accessed directly by users;
* Even if it were, none of the public signatures have changed, so this change is source-compatible.

to11mtm

As a (I think I'm at) tertiary curveball, did we see what happens if we make it a readonly struct? I'll admit I'm not exactly confident it will be better, but I still wonder about certain locality aspects around dispatch.

Arkatufus

LGTM

to11mtm · 2022-10-06T01:21:54Z

Can we get some tests with more threads?

I'm not opposed in the abstract, but the half-concern I have would be around whether there is an impact in/around locality of dereferencing classes vs. the cost of copying a struct. IME it's one of those things where trade-offs can be surprising (I've seen many cases where classes were better, but also edge cases where structs were happier).

Aaronontheweb · 2022-10-06T01:27:00Z

> Can we get some tests with more threads?

Let me see if the BDN PingPong benchmark has enough thread pressure. If that doesn't, I'll add a scatter-gather spec that does.

to11mtm · 2022-10-06T01:48:34Z

> > Can we get some tests with more threads?

> Let me see if the BDN PingPong benchmark has enough thread pressure. If that doesn't, I'll add a scatter-gather spec that does.

If you want to dig deeper, Linq2Db's perftests put a fair load on the GC (i.e. not torture-test-rude, but definitely a load), even if it might be more painful to measure. When I played with the Akka.Streams.TCP transport as well as linq2db, I found that stuffing test messages with random byte-loads led to a fairer test of the GC impact/locality at play (esp. since many benchmark messages are so small; I liked to test things with 512 or 1024 random bytes stuffed in). It was rare that improvements at those baselines didn't also improve the super-small-message path, though.

Aaronontheweb · 2022-10-06T01:49:39Z

BDN PingPong is definitely not "big" enough to generate that type of data (reports back only 55 bytes allocated... which... I don't believe...)

Aaronontheweb · 2022-10-06T01:52:41Z

The reference type version of BDN PingPong reports 87 bytes.... yeah, I'm thinking that particular benchmark might have some instrumentation issues.

to11mtm · 2022-10-06T01:58:36Z

As an additional bench point to consider...
May I suggest the Sharding benches? (The ones that tested asks vs buffering... can find issue/PR# if needed)

It's a different but still meaningful use case; same internal message being passed between a number of actors in a hierarchy. (Which is a semi-frequent pattern in the abstract.)

Aaronontheweb · 2022-10-06T01:59:54Z

> > > Can we get some tests with more threads?

> > Let me see if the BDN PingPong benchmark has enough thread pressure. If that doesn't, I'll add a scatter-gather spec that does.

> If you want to dig deeper, Linq2Db's perftests put a fair load on the GC (i.e. not torture-test-rude, but definitely a load), even if it might be more painful to measure. When I played with the Akka.Streams.TCP transport as well as linq2db, I found that stuffing test messages with random byte-loads led to a fairer test of the GC impact/locality at play (esp. since many benchmark messages are so small; I liked to test things with 512 or 1024 random bytes stuffed in). It was rare that improvements at those baselines didn't also improve the super-small-message path, though.

I'm signing off for the evening here, but I think we need to add a BDN "memory pressure" test for actor messaging to make this more accurate. I tend to use static message definitions that get fed into the .Tell method so I don't count the overhead of the original message content itself - just its effects in the pipeline. The other benchmarks I've added have had no trouble measuring that, just the PingPong BDN appears to have trouble.

Aaronontheweb · 2022-10-06T02:03:13Z

> As an additional bench point to consider... May I suggest the Sharding benches? (The ones that tested asks vs buffering... can find issue/PR# if needed)

> It's a different but still meaningful use case; same internal message being passed between a number of actors in a hierarchy. (Which is a semi-frequent pattern in the abstract.)

That's a good idea. Let me go run those and see if there's any measurable impact - my worry there is that those benchmarks are too "crowded" with other sources of memory allocation (i.e. remoting, persistence) to measure this clearly, but it's worth a shot.

Aaronontheweb · 2022-10-06T02:58:34Z

`v1.4` Branch

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-POHWCC : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1

| Method | StateMode | MsgCount | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| SingleRequestResponseToLocalEntity | Persistence | 10000 | 83.176 ms | 1.6634 ms | 2.3855 ms | 2000.0000 | - | 11,121,800 B |
| StreamingToLocalEntity | Persistence | 10000 | NA | NA | NA | - | - | - |
| SingleRequestResponseToRemoteEntity | Persistence | 10000 | 3,286.348 ms | 9.0707 ms | 7.5744 ms | 83000.0000 | 20000.0000 | 312,434,816 B |
| SingleRequestResponseToRemoteEntityWithLocalProxy | Persistence | 10000 | 3,694.924 ms | 50.9317 ms | 45.1497 ms | 91000.0000 | 29000.0000 | 343,669,064 B |
| StreamingToRemoteEntity | Persistence | 10000 | NA | NA | NA | - | - | - |
| SingleRequestResponseToLocalEntity | DData | 10000 | 85.800 ms | 1.6552 ms | 3.8032 ms | 2000.0000 | - | 11,121,744 B |
| StreamingToLocalEntity | DData | 10000 | 5.924 ms | 0.3302 ms | 0.9258 ms | - | - | 40,192 B |
| SingleRequestResponseToRemoteEntity | DData | 10000 | 3,264.181 ms | 19.7592 ms | 17.5160 ms | 83000.0000 | 20000.0000 | 312,896,224 B |
| SingleRequestResponseToRemoteEntityWithLocalProxy | DData | 10000 | 3,677.309 ms | 10.4486 ms | 9.2624 ms | 92000.0000 | 30000.0000 | 345,031,832 B |
| StreamingToRemoteEntity | DData | 10000 | 354.743 ms | 3.8728 ms | 3.2340 ms | 79000.0000 | 1000.0000 | 301,140,544 B |

Benchmarks with issues:
ShardMessageRoutingBenchmarks.StreamingToLocalEntity: Job-POHWCC(InvocationCount=1, UnrollFactor=1) [StateMode=Persistence, MsgCount=10000]
ShardMessageRoutingBenchmarks.StreamingToRemoteEntity: Job-POHWCC(InvocationCount=1, UnrollFactor=1) [StateMode=Persistence, MsgCount=10000]

This PR

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-GEKXWC : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1

| Method | StateMode | MsgCount | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| SingleRequestResponseToLocalEntity | Persistence | 10000 | 83.807 ms | 1.6179 ms | 1.7983 ms | 2000.0000 | - | 12,083,184 B |
| StreamingToLocalEntity | Persistence | 10000 | 5.424 ms | 0.2272 ms | 0.6664 ms | - | - | 675,824 B |
| SingleRequestResponseToRemoteEntity | Persistence | 10000 | 3,260.137 ms | 9.8890 ms | 8.7663 ms | 84000.0000 | 21000.0000 | 317,299,736 B |
| SingleRequestResponseToRemoteEntityWithLocalProxy | Persistence | 10000 | NA | NA | NA | - | - | - |
| StreamingToRemoteEntity | Persistence | 10000 | 354.818 ms | 5.0141 ms | 3.9147 ms | 81000.0000 | 1000.0000 | 307,740,648 B |
| SingleRequestResponseToLocalEntity | DData | 10000 | 86.820 ms | 1.6397 ms | 2.9146 ms | 3000.0000 | - | 12,722,136 B |
| StreamingToLocalEntity | DData | 10000 | 5.654 ms | 0.3449 ms | 1.0115 ms | - | - | 681,504 B |
| SingleRequestResponseToRemoteEntity | DData | 10000 | 3,254.340 ms | 11.5644 ms | 10.2515 ms | 84000.0000 | 20000.0000 | 317,795,128 B |
| SingleRequestResponseToRemoteEntityWithLocalProxy | DData | 10000 | 3,612.266 ms | 21.1919 ms | 18.7861 ms | 93000.0000 | 30000.0000 | 349,543,280 B |
| StreamingToRemoteEntity | DData | 10000 | 357.022 ms | 3.2794 ms | 3.0676 ms | 81000.0000 | 1000.0000 | 308,086,712 B |

Aaronontheweb · 2022-10-06T02:59:47Z

Yeah, I think the sharding benchmarks are too noisy - there's lots of other stuff going on in there besides messaging overhead.

Aaronontheweb · 2022-10-06T03:00:30Z

Although if I did have to make a decision, looks like overall allocations are higher on this PR

Aaronontheweb · 2022-10-06T15:50:13Z

Got what I wanted here: #6147

Aaronontheweb · 2022-10-06T16:01:31Z

The results are in - looks like struct still wins when we dial up threading pressure:

`v1.4`

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-WGPBOC : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1

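| Method | MsgCount | ActorCount | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| PushMsgs | 100000 | 10 | 1.287 s | 0.0067 s | 0.0063 s | 18000.0000 | 1000.0000 | 71 MB |
| PushMsgs | 100000 | 100 | 12.818 s | 0.1166 s | 0.1091 s | 176000.0000 | 1000.0000 | 689 MB |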

This PR

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-ZSWECQ : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1

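| Method | MsgCount | ActorCount | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Allocated |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| PushMsgs | 100000 | 10 | 1.247 s | 0.0733 s | 0.2160 s | 1.307 s | 26000.0000 | 1000.0000 | 104 MB |
| PushMsgs | 100000 | 100 | 12.923 s | 0.1532 s | 0.1433 s | 12.958 s | 255000.0000 | 1000.0000 | 997 MB |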

Next thing I'll try is @to11mtm 's suggestion of making Envelope a readonly struct.

Aaronontheweb · 2022-10-06T16:32:05Z

Updated numbers with readonly struct:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-OWNKCI : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1

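| Method | MsgCount | ActorCount | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Allocated |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| PushMsgs | 100000 | 10 | 1.206 s | 0.0855 s | 0.2520 s | 1.284 s | 18000.0000 | 1000.0000 | 71 MB |
| PushMsgs | 100000 | 100 | 12.726 s | 0.1286 s | 0.1140 s | 12.750 s | 176000.0000 | 1000.0000 | 689 MB |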

Didn't make any difference for memory allocation, but still the right thing to do.
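
For readers wondering why `readonly struct` is still worthwhile even though the allocation numbers match `v1.4`, here is a minimal sketch. `ReadonlyEnvelope` is a hypothetical stand-in (with a `string` sender) and not the real Akka.NET type. The point it illustrates is a general C# rule: when a struct is declared `readonly`, the compiler knows its members cannot mutate it, so it does not need to make defensive copies when members are called through an `in` parameter or a `readonly` field.

```csharp
using System;

// Hypothetical stand-in for illustration only; not the real Akka.NET Envelope.
public readonly struct ReadonlyEnvelope
{
    public ReadonlyEnvelope(object message, string sender)
    {
        Message = message;
        Sender = sender;
    }

    // Get-only properties: a readonly struct cannot expose mutable instance state.
    public object Message { get; }
    public string Sender { get; }

    public override string ToString() => $"<{Sender}>: {Message}";
}

public static class ReadonlyDemo
{
    // 'in' passes the struct by read-only reference. Because the struct is
    // readonly, calling ToString() here does not require a defensive copy.
    public static string Describe(in ReadonlyEnvelope envelope) => envelope.ToString();
}
```

For example, `ReadonlyDemo.Describe(new ReadonlyEnvelope("ping", "actor-1"))` returns `<actor-1>: ping`. The modifier does not change where the value lives or how often it is allocated, which is consistent with the unchanged `Allocated` column above; it documents immutability and removes a class of hidden copies.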

Aaronontheweb added 2 commits October 5, 2022 10:03

convert Envelope back into a reference type

1bca268

Merge remote-tracking branch 'akkadotnet/v1.4' into perf-envelope-value-type

44b09c5

Aaronontheweb added the akka-actor, perf, and akka.net v1.4 labels Oct 5, 2022

Aaronontheweb added this to the 1.4.44 milestone Oct 5, 2022

approved API changes

810705e

Arkatufus approved these changes Oct 5, 2022

Arkatufus enabled auto-merge (squash) October 5, 2022 16:18

Aaronontheweb disabled auto-merge October 5, 2022 18:03

Merge branch 'v1.4' into perf-envelope-value-type

9be437b

Merge branch 'v1.4' into perf-envelope-value-type

0e726ee

changed to readonly struct

6d5c806

fixed API approvals

0475004

Aaronontheweb merged commit a2b27a7 into akkadotnet:v1.4 Oct 6, 2022

Aaronontheweb deleted the perf-envelope-value-type branch October 6, 2022 18:31

Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this pull request Oct 8, 2022

Make Envelope a reference type again (akkadotnet#6137)

e308e28

* convert `Envelope` back into a reference type
* approved API changes
* changed to `readonly struct`
* fixed API approvals
