Make Envelope a reference type again #6137

Merged

Member

@Aaronontheweb Aaronontheweb commented Oct 5, 2022

Changes

Converts Envelope back into a sealed class rather than a struct - believe it or not, this significantly reduces total memory allocation and mailbox write latency.

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

Latest v1.4 Benchmarks

All benchmarks are running on .NET 6.

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-AMVGCN : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1  
| Method | MsgCount | Mean | Error | StdDev | Median | Gen 0 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| EnqueuePerformance | 10000 | 215.7 μs | 4.10 μs | 5.75 μs | 214.2 μs | - | 385 KB |
| RunPerformance | 10000 | 2,604.8 μs | 301.89 μs | 890.12 μs | 3,027.6 μs | - | 12 KB |
| EnqueuePerformance | 100000 | 2,187.6 μs | 42.70 μs | 39.94 μs | 2,168.9 μs | - | 3,074 KB |
| RunPerformance | 100000 | 12,313.3 μs | 319.67 μs | 937.54 μs | 12,185.6 μs | - | 102 KB |
| EnqueuePerformance | 1000000 | 16,205.6 μs | 113.18 μs | 105.87 μs | 16,222.6 μs | - | 24,579 KB |
| RunPerformance | 1000000 | 106,720.7 μs | 2,118.54 μs | 3,298.31 μs | 106,311.6 μs | - | 1,009 KB |
| EnqueuePerformance | 10000000 | 163,270.1 μs | 860.33 μs | 671.69 μs | 163,038.3 μs | - | 245,765 KB |
| RunPerformance | 10000000 | 1,063,885.3 μs | 12,432.78 μs | 11,629.63 μs | 1,063,643.1 μs | 2000.0000 | 10,082 KB |

This PR's Benchmarks

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-ONWZPU : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1  
| Method | MsgCount | Mean | Error | StdDev | Median | Gen 0 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| EnqueuePerformance | 10000 | 133.2 μs | 2.48 μs | 3.05 μs | 132.8 μs | - | 258 KB |
| RunPerformance | 10000 | 1,475.9 μs | 135.40 μs | 359.07 μs | 1,398.5 μs | - | 11 KB |
| EnqueuePerformance | 100000 | 1,138.8 μs | 19.96 μs | 50.44 μs | 1,122.1 μs | - | 2,051 KB |
| RunPerformance | 100000 | 11,379.5 μs | 329.55 μs | 971.69 μs | 11,166.7 μs | - | 102 KB |
| EnqueuePerformance | 1000000 | 11,090.3 μs | 172.91 μs | 153.28 μs | 11,124.0 μs | - | 16,387 KB |
| RunPerformance | 1000000 | 111,643.2 μs | 2,053.49 μs | 2,945.05 μs | 110,532.5 μs | - | 1,009 KB |
| EnqueuePerformance | 10000000 | 136,359.7 μs | 4,479.73 μs | 13,208.58 μs | 138,767.1 μs | - | 163,846 KB |
| RunPerformance | 10000000 | 1,041,018.8 μs | 15,412.80 μs | 14,417.15 μs | 1,044,527.0 μs | 2000.0000 | 10,083 KB |

@Aaronontheweb Aaronontheweb added the akka-actor, perf, and akka.net v1.4 labels Oct 5, 2022
@Aaronontheweb Aaronontheweb added this to the 1.4.44 milestone Oct 5, 2022
Member Author

@Aaronontheweb Aaronontheweb left a comment

So why does this change significantly improve performance (i.e. reduce memory allocation during mailbox writes by 33% and improve throughput by 10-20%)?

The reason is that the Envelope really works using referential semantics - but is implemented as a value type (struct) currently. As a result of the struct being passed between different scopes (Mailbox, ConcurrentQueue<Envelope>, dequeue into ActorCell, push into actor implementation) it is frequently copied over and over again. While this doesn't necessarily create pressure on the GC, it does create memory pressure and does incur latency / throughput overhead that exceeds the cost of GC itself.

Therefore, it's best to make the Envelope back into a class and treat it as such - struct is not a special incantation that magically improves performance; it has to be used in the right context (tightly scoped or passed around via ref which isn't feasible here).
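To make the copy-versus-reference distinction above concrete, here is a minimal sketch. `StructEnvelope`, `ClassEnvelope`, and `EnvelopeSizes` are hypothetical stand-ins, not the real Akka.NET types, and a plain `string` stands in for the sender reference. It only shows what gets copied at each hand-off; it says nothing about which variant is faster overall.

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.CompilerServices;

// A struct envelope is copied field by field at every hand-off.
var structQueue = new ConcurrentQueue<StructEnvelope>();
structQueue.Enqueue(new StructEnvelope("hello", "sender-1")); // copied into the queue slot
structQueue.TryDequeue(out var copy);                         // copied out again

// A class envelope is allocated once; only its reference moves around.
var classQueue = new ConcurrentQueue<ClassEnvelope>();
var shared = new ClassEnvelope("hello", "sender-1");
classQueue.Enqueue(shared);
classQueue.TryDequeue(out var same);

Console.WriteLine($"struct payload per hand-off: {EnvelopeSizes.StructSlot()} bytes ({copy.Message})");
Console.WriteLine($"class payload per hand-off:  {EnvelopeSizes.ClassSlot()} bytes");
Console.WriteLine($"same class instance after dequeue: {ReferenceEquals(shared, same)}");

// Hypothetical value-type envelope: two reference fields stored inline.
public struct StructEnvelope
{
    public StructEnvelope(object message, string sender) { Message = message; Sender = sender; }
    public object Message { get; }
    public string Sender { get; }
}

// Hypothetical reference-type envelope: same fields, but heap-allocated.
public sealed class ClassEnvelope
{
    public ClassEnvelope(object message, string sender) { Message = message; Sender = sender; }
    public object Message { get; }
    public string Sender { get; }
}

public static class EnvelopeSizes
{
    // Bytes copied when the struct is passed by value.
    public static int StructSlot() => Unsafe.SizeOf<StructEnvelope>();

    // For a reference type, Unsafe.SizeOf reports the size of the reference itself.
    public static int ClassSlot() => Unsafe.SizeOf<ClassEnvelope>();
}
```

The trade-off is that the class version pays for one heap object per message, while the struct version pays for a wider copy at each step. Which cost dominates depends on the workload.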

@@ -12,7 +12,7 @@ namespace Akka.Actor
 /// <summary>
 /// Envelope class, represents a message and the sender of the message.
 /// </summary>
-public struct Envelope
+public sealed class Envelope
Member Author

This is technically not a "binary compatible change" but I don't think it'll have any breaking changes on end-users:

  1. Envelope, although a public type, is not accessed directly by users typically;
  2. Even if it was, none of the public signatures have changed as this change is source-compatible.

Member

As a (I think I'm at) tertiary curveball, did we see what happened if we made it a readonly struct? Will admit I am not exactly confident it will be better but I still wonder about certain locality aspects around dispatch.

Contributor

@Arkatufus Arkatufus left a comment

LGTM

@Arkatufus Arkatufus enabled auto-merge (squash) October 5, 2022 16:18
@to11mtm
Member

to11mtm commented Oct 6, 2022

Can we get some tests with more threads?

I'm not opposed in the abstract, but the half-concern I have would be around whether there is an impact in/around locality of dereferencing classes vs cost of copying a struct. IME it's one of those things where trade-offs can be surprising (I've seen many cases where classes were better, but also edge cases where structs were happier).

@Aaronontheweb
Member Author

> Can we get some tests with more threads?

Let me see if the BDN PingPong benchmark has enough thread pressure. If that doesn't, I'll add a scatter-gather spec that does.

@to11mtm
Member

to11mtm commented Oct 6, 2022

> > Can we get some tests with more threads?
>
> Let me see if the BDN PingPong benchmark has enough thread pressure. If that doesn't, I'll add a scatter-gather spec that does.

If you want to dig deeper, Linq2Db's perftests put a fair (i.e. not torture-test-rude but we definitely put) load on the GC, even if it might be more painful to measure. When I played with the Akka.Streams.TCP transport as well as linq2db I found that stuffing test messages with random byte-loads led to fairer testing of the GC impact/locality at play (esp since many benchmark messages are so small... I liked to test things with 512 or 1024 random bytes stuffed in.) It was rare that improvements at those baselines didn't improve the super-small-message path though.

@Aaronontheweb
Member Author

BDN PingPong is definitely not "big" enough to generate that type of data (reports back only 55 bytes allocated... which... I don't believe...)

@Aaronontheweb
Member Author

The reference type version of BDN PingPong reports 87 bytes.... yeah, I'm thinking that particular benchmark might have some instrumentation issues.

@to11mtm
Member

to11mtm commented Oct 6, 2022

As an additional bench point to consider...
May I suggest the Sharding benches? (The ones that tested asks vs buffering... can find issue/PR# if needed)

It's a different but still meaningful use case; same internal message being passed between a number of actors in a hierarchy. (Which is a semi-frequent pattern in the abstract.)

@Aaronontheweb
Member Author

> > > Can we get some tests with more threads?
>
> > Let me see if the BDN PingPong benchmark has enough thread pressure. If that doesn't, I'll add a scatter-gather spec that does.
>
> If you want to dig deeper, Linq2Db's perftests put a fair (i.e. not torture-test-rude but we definitely put) load on the GC, even if it might be more painful to measure. When I played with the Akka.Streams.TCP transport as well as linq2db I found that stuffing test messages with random byte-loads led to fairer testing of the GC impact/locality at play (esp since many benchmark messages are so small... I liked to test things with 512 or 1024 random bytes stuffed in.) It was rare that improvements at those baselines didn't improve the super-small-message path though.

I'm signing off for the evening here, but I think we need to add a BDN "memory pressure" test for actor messaging to make this more accurate. I tend to use static message definitions that get fed into the .Tell method so I don't count the overhead of the original message content itself - just its effects in the pipeline. The other benchmarks I've added have had no trouble measuring that, just the PingPong BDN appears to have trouble.
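The idea of feeding a pre-built, static message into the pipeline so that only the envelope overhead is counted can be sketched outside of BenchmarkDotNet as well. The following is an illustration only: `AllocationProbe`, `StructEnvelope`, and `ClassEnvelope` are hypothetical names, a `ConcurrentQueue<T>` stands in for the mailbox, and the measurement uses `GC.GetAllocatedBytesForCurrentThread()` on a single thread, so it does not reproduce the threading pressure discussed in this thread.

```csharp
using System;
using System.Collections.Concurrent;

const int MsgCount = 100_000;
object staticMessage = "ping"; // built once, outside the measured region

long structBytes = AllocationProbe.Measure(() =>
{
    var queue = new ConcurrentQueue<StructEnvelope>();
    for (var i = 0; i < MsgCount; i++)
        queue.Enqueue(new StructEnvelope(staticMessage));
});

long classBytes = AllocationProbe.Measure(() =>
{
    var queue = new ConcurrentQueue<ClassEnvelope>();
    for (var i = 0; i < MsgCount; i++)
        queue.Enqueue(new ClassEnvelope(staticMessage));
});

Console.WriteLine($"struct envelopes: {structBytes:N0} bytes allocated");
Console.WriteLine($"class envelopes:  {classBytes:N0} bytes allocated");

// Hypothetical envelope variants that wrap the same pre-built message.
public struct StructEnvelope
{
    public StructEnvelope(object message) { Message = message; }
    public object Message { get; }
}

public sealed class ClassEnvelope
{
    public ClassEnvelope(object message) { Message = message; }
    public object Message { get; }
}

public static class AllocationProbe
{
    // Returns the bytes allocated on the calling thread while body runs.
    public static long Measure(Action body)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        body();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

Because the message object is created before measuring starts, whatever shows up in the totals comes from the queue's internal storage and, in the class case, from the envelope objects themselves.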

@Aaronontheweb
Member Author

> As an additional bench point to consider... May I suggest the Sharding benches? (The ones that tested asks vs buffering... can find issue/PR# if needed)
>
> It's a different but still meaningful use case; same internal message being passed between a number of actors in a hierarchy. (Which is a semi-frequent pattern in the abstract.)

That's a good idea. Let me go run those and see if there's any measurable impact - my worry there is that those benchmarks are too "crowded" with other sources of memory allocation (i.e. remoting, persistence) to measure this clearly, but it's worth a shot.

@Aaronontheweb
Member Author

v1.4 Branch

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-POHWCC : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1  
| Method | StateMode | MsgCount | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
|---|---|---:|---:|---:|---:|---:|---:|---:|
| SingleRequestResponseToLocalEntity | Persistence | 10000 | 83.176 ms | 1.6634 ms | 2.3855 ms | 2000.0000 | - | 11,121,800 B |
| StreamingToLocalEntity | Persistence | 10000 | NA | NA | NA | - | - | - |
| SingleRequestResponseToRemoteEntity | Persistence | 10000 | 3,286.348 ms | 9.0707 ms | 7.5744 ms | 83000.0000 | 20000.0000 | 312,434,816 B |
| SingleRequestResponseToRemoteEntityWithLocalProxy | Persistence | 10000 | 3,694.924 ms | 50.9317 ms | 45.1497 ms | 91000.0000 | 29000.0000 | 343,669,064 B |
| StreamingToRemoteEntity | Persistence | 10000 | NA | NA | NA | - | - | - |
| SingleRequestResponseToLocalEntity | DData | 10000 | 85.800 ms | 1.6552 ms | 3.8032 ms | 2000.0000 | - | 11,121,744 B |
| StreamingToLocalEntity | DData | 10000 | 5.924 ms | 0.3302 ms | 0.9258 ms | - | - | 40,192 B |
| SingleRequestResponseToRemoteEntity | DData | 10000 | 3,264.181 ms | 19.7592 ms | 17.5160 ms | 83000.0000 | 20000.0000 | 312,896,224 B |
| SingleRequestResponseToRemoteEntityWithLocalProxy | DData | 10000 | 3,677.309 ms | 10.4486 ms | 9.2624 ms | 92000.0000 | 30000.0000 | 345,031,832 B |
| StreamingToRemoteEntity | DData | 10000 | 354.743 ms | 3.8728 ms | 3.2340 ms | 79000.0000 | 1000.0000 | 301,140,544 B |

Benchmarks with issues:
ShardMessageRoutingBenchmarks.StreamingToLocalEntity: Job-POHWCC(InvocationCount=1, UnrollFactor=1) [StateMode=Persistence, MsgCount=10000]
ShardMessageRoutingBenchmarks.StreamingToRemoteEntity: Job-POHWCC(InvocationCount=1, UnrollFactor=1) [StateMode=Persistence, MsgCount=10000]

This PR

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-GEKXWC : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1  
| Method | StateMode | MsgCount | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
|---|---|---:|---:|---:|---:|---:|---:|---:|
| SingleRequestResponseToLocalEntity | Persistence | 10000 | 83.807 ms | 1.6179 ms | 1.7983 ms | 2000.0000 | - | 12,083,184 B |
| StreamingToLocalEntity | Persistence | 10000 | 5.424 ms | 0.2272 ms | 0.6664 ms | - | - | 675,824 B |
| SingleRequestResponseToRemoteEntity | Persistence | 10000 | 3,260.137 ms | 9.8890 ms | 8.7663 ms | 84000.0000 | 21000.0000 | 317,299,736 B |
| SingleRequestResponseToRemoteEntityWithLocalProxy | Persistence | 10000 | NA | NA | NA | - | - | - |
| StreamingToRemoteEntity | Persistence | 10000 | 354.818 ms | 5.0141 ms | 3.9147 ms | 81000.0000 | 1000.0000 | 307,740,648 B |
| SingleRequestResponseToLocalEntity | DData | 10000 | 86.820 ms | 1.6397 ms | 2.9146 ms | 3000.0000 | - | 12,722,136 B |
| StreamingToLocalEntity | DData | 10000 | 5.654 ms | 0.3449 ms | 1.0115 ms | - | - | 681,504 B |
| SingleRequestResponseToRemoteEntity | DData | 10000 | 3,254.340 ms | 11.5644 ms | 10.2515 ms | 84000.0000 | 20000.0000 | 317,795,128 B |
| SingleRequestResponseToRemoteEntityWithLocalProxy | DData | 10000 | 3,612.266 ms | 21.1919 ms | 18.7861 ms | 93000.0000 | 30000.0000 | 349,543,280 B |
| StreamingToRemoteEntity | DData | 10000 | 357.022 ms | 3.2794 ms | 3.0676 ms | 81000.0000 | 1000.0000 | 308,086,712 B |

@Aaronontheweb
Member Author

Yeah, I think the sharding benchmarks are too noisy - lots of other stuff going on in there other than messaging overhead

@Aaronontheweb
Member Author

Although if I did have to make a decision, looks like overall allocations are higher on this PR

@Aaronontheweb
Member Author

Got what I wanted here: #6147

@Aaronontheweb
Member Author

The results are in - looks like struct still wins when we dial up threading pressure:

v1.4

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-WGPBOC : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1  
| Method | MsgCount | ActorCount | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| PushMsgs | 100000 | 10 | 1.287 s | 0.0067 s | 0.0063 s | 18000.0000 | 1000.0000 | 71 MB |
| PushMsgs | 100000 | 100 | 12.818 s | 0.1166 s | 0.1091 s | 176000.0000 | 1000.0000 | 689 MB |

This PR

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-ZSWECQ : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1  
| Method | MsgCount | ActorCount | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| PushMsgs | 100000 | 10 | 1.247 s | 0.0733 s | 0.2160 s | 1.307 s | 26000.0000 | 1000.0000 | 104 MB |
| PushMsgs | 100000 | 100 | 12.923 s | 0.1532 s | 0.1433 s | 12.958 s | 255000.0000 | 1000.0000 | 997 MB |
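For a sense of scale, the allocation gap between the two PushMsgs runs can be worked out from the Allocated column (71 MB vs 104 MB with 10 actors, 689 MB vs 997 MB with 100 actors). The small helper below is a hypothetical convenience for that arithmetic, not part of the benchmark suite.

```csharp
using System;

// Allocated MB taken from the v1.4 (struct) and PR (class) PushMsgs rows.
Console.WriteLine($"10 actors:  {Stats.RelativeChangePercent(71, 104):F1}% more allocated by the class version");
Console.WriteLine($"100 actors: {Stats.RelativeChangePercent(689, 997):F1}% more allocated by the class version");

public static class Stats
{
    // Percentage change of candidate relative to baseline.
    public static double RelativeChangePercent(double baseline, double candidate)
        => (candidate - baseline) / baseline * 100.0;
}
```

That works out to roughly 46% and 45% more memory allocated by the class version, while the mean run times differ by only a few percent in either direction.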

Next thing I'll try is @to11mtm 's suggestion of making Envelope a readonly struct.

@Aaronontheweb
Member Author

Updated numbers with readonly struct:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  Job-OWNKCI : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT

InvocationCount=1  UnrollFactor=1  
| Method | MsgCount | ActorCount | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| PushMsgs | 100000 | 10 | 1.206 s | 0.0855 s | 0.2520 s | 1.284 s | 18000.0000 | 1000.0000 | 71 MB |
| PushMsgs | 100000 | 100 | 12.726 s | 0.1286 s | 0.1140 s | 12.750 s | 176000.0000 | 1000.0000 | 689 MB |

Didn't make any difference for memory allocation, but still the right thing to do.
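One plausible reading of "still the right thing to do": marking a struct `readonly` documents that it is immutable and lets the C# compiler skip the defensive copies it would otherwise make when members are called through a `readonly` field or an `in` parameter. The sketch below uses hypothetical `ReadonlyEnvelope` and `EnvelopeReader` types, not the actual Akka.NET `Envelope`, with a `string` standing in for the sender.

```csharp
using System;

var envelope = new ReadonlyEnvelope("ping", "sender-1");
Console.WriteLine(EnvelopeReader.Describe(in envelope));

// Hypothetical immutable envelope; every field is implicitly readonly.
public readonly struct ReadonlyEnvelope
{
    public ReadonlyEnvelope(object message, string sender) { Message = message; Sender = sender; }
    public object Message { get; }
    public string Sender { get; }
    public override string ToString() => $"<{Sender}>: {Message}";
}

public static class EnvelopeReader
{
    // 'in' passes the struct by readonly reference. Because the struct is
    // declared readonly, calling ToString() here needs no defensive copy.
    public static string Describe(in ReadonlyEnvelope e) => e.ToString();
}
```

None of this changes how many bytes are allocated, which fits the unchanged Allocated column above.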

@Aaronontheweb Aaronontheweb merged commit a2b27a7 into akkadotnet:v1.4 Oct 6, 2022
@Aaronontheweb Aaronontheweb deleted the perf-envelope-value-type branch October 6, 2022 18:31
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this pull request Oct 8, 2022
* convert `Envelope` back into a reference type

* approved API changes

* changed to `readonly struct`

* fixed API approvals