-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make Envelope
a reference type again
#6137
Make Envelope
a reference type again
#6137
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So why does this change significantly improve performance (i.e. reduces memory allocation during mailbox writes by 33%, improves throughput by 10-20%) ?
The reason is that the Envelope
really works using referential semantics - but is implemented as a value type (struct
) currently. As a result of the struct being passed between different scopes (Mailbox
, ConcurrentQueue<Envelope>
, dequeue into ActorCell
, push into actor implementation) it is frequently copied over and over again. While this doesn't necessarily create pressure on the GC, it does create memory pressure and does incur latency / throughput overhead that exceeds the cost of GC itself.
Therefore, it's best to make the Envelope
back into a class
and treat it as such - struct
is not a special incantation that magically improves performance; it has to be used in the right context (tightly scoped or passed around via ref
which isn't feasible here).
src/core/Akka/Actor/Message.cs
Outdated
@@ -12,7 +12,7 @@ namespace Akka.Actor | |||
/// <summary> | |||
/// Envelope class, represents a message and the sender of the message. | |||
/// </summary> | |||
public struct Envelope | |||
public sealed class Envelope |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is technically not a "binary compatible change" but I don't think it'll have any breaking changes on end-users:
Envelope
, although apublic
type, is not accessed directly by users typically;- Even if it was, none of the public signatures have changed as this change is source-compatible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a (I think I'm at) tertiary curveball, did we see what happened if we made it a readonly struct
? Will admit I am not exactly confident it will be better but I still wonder about certain locality aspects around dispatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Can we get some tests with more threads? I'm not opposed in the abstract, but the half-concern I have would be around whether there is an impact in/around locality of dereferencing classes vs cost of copying a struct. IME it's one of those things where trade-offs can be surprising (I've seen many cases where classes were better, but also edge cases where structs were happier. |
Let me see if the BDN |
If you want to dig deeper, Linq2Db's perftests put a fair (i.e. not torture-test-rude but we definitely put) load on the GC, even if it might be more painful to measure. When I played with the Akka.Streams.TCP transport as well as linq2db I found that stuffing test messages with random byte-loads lead to a more fair testing of GC impact/locality at play (esp since many benchmark messages are so small... I liked to test things with 512 or 1024 random bytes stuffed in.) It was rare that improvements at those baselines didn't improve the super-small-message path though. |
BDN |
The reference type version of BDN |
As an additional bench point to consider... It's a different but still meaningful use case; same internal message being passed between a number of actors in a hierarchy. (Which is a semi-frequent pattern in the abstract.) |
I'm signing off for the evening here, but I think we need to add a BDN "memory pressure" test for actor messaging to make this more accurate. I tend to use |
That's a good idea. Let me go run those and see if there's any measurable impact - my worry there is that those benchmarks are too "crowded" with other sources of memory allocation (i.e. remoting, persistence) to measure this clearly, but it's worth a shot. |
|
Method | StateMode | MsgCount | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
---|---|---|---|---|---|---|---|---|
SingleRequestResponseToLocalEntity | Persistence | 10000 | 83.176 ms | 1.6634 ms | 2.3855 ms | 2000.0000 | - | 11,121,800 B |
StreamingToLocalEntity | Persistence | 10000 | NA | NA | NA | - | - | - |
SingleRequestResponseToRemoteEntity | Persistence | 10000 | 3,286.348 ms | 9.0707 ms | 7.5744 ms | 83000.0000 | 20000.0000 | 312,434,816 B |
SingleRequestResponseToRemoteEntityWithLocalProxy | Persistence | 10000 | 3,694.924 ms | 50.9317 ms | 45.1497 ms | 91000.0000 | 29000.0000 | 343,669,064 B |
StreamingToRemoteEntity | Persistence | 10000 | NA | NA | NA | - | - | - |
SingleRequestResponseToLocalEntity | DData | 10000 | 85.800 ms | 1.6552 ms | 3.8032 ms | 2000.0000 | - | 11,121,744 B |
StreamingToLocalEntity | DData | 10000 | 5.924 ms | 0.3302 ms | 0.9258 ms | - | - | 40,192 B |
SingleRequestResponseToRemoteEntity | DData | 10000 | 3,264.181 ms | 19.7592 ms | 17.5160 ms | 83000.0000 | 20000.0000 | 312,896,224 B |
SingleRequestResponseToRemoteEntityWithLocalProxy | DData | 10000 | 3,677.309 ms | 10.4486 ms | 9.2624 ms | 92000.0000 | 30000.0000 | 345,031,832 B |
StreamingToRemoteEntity | DData | 10000 | 354.743 ms | 3.8728 ms | 3.2340 ms | 79000.0000 | 1000.0000 | 301,140,544 B |
Benchmarks with issues:
ShardMessageRoutingBenchmarks.StreamingToLocalEntity: Job-POHWCC(InvocationCount=1, UnrollFactor=1) [StateMode=Persistence, MsgCount=10000]
ShardMessageRoutingBenchmarks.StreamingToRemoteEntity: Job-POHWCC(InvocationCount=1, UnrollFactor=1) [StateMode=Persistence, MsgCount=10000]
This PR
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
[Host] : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
Job-GEKXWC : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
InvocationCount=1 UnrollFactor=1
Method | StateMode | MsgCount | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
---|---|---|---|---|---|---|---|---|
SingleRequestResponseToLocalEntity | Persistence | 10000 | 83.807 ms | 1.6179 ms | 1.7983 ms | 2000.0000 | - | 12,083,184 B |
StreamingToLocalEntity | Persistence | 10000 | 5.424 ms | 0.2272 ms | 0.6664 ms | - | - | 675,824 B |
SingleRequestResponseToRemoteEntity | Persistence | 10000 | 3,260.137 ms | 9.8890 ms | 8.7663 ms | 84000.0000 | 21000.0000 | 317,299,736 B |
SingleRequestResponseToRemoteEntityWithLocalProxy | Persistence | 10000 | NA | NA | NA | - | - | - |
StreamingToRemoteEntity | Persistence | 10000 | 354.818 ms | 5.0141 ms | 3.9147 ms | 81000.0000 | 1000.0000 | 307,740,648 B |
SingleRequestResponseToLocalEntity | DData | 10000 | 86.820 ms | 1.6397 ms | 2.9146 ms | 3000.0000 | - | 12,722,136 B |
StreamingToLocalEntity | DData | 10000 | 5.654 ms | 0.3449 ms | 1.0115 ms | - | - | 681,504 B |
SingleRequestResponseToRemoteEntity | DData | 10000 | 3,254.340 ms | 11.5644 ms | 10.2515 ms | 84000.0000 | 20000.0000 | 317,795,128 B |
SingleRequestResponseToRemoteEntityWithLocalProxy | DData | 10000 | 3,612.266 ms | 21.1919 ms | 18.7861 ms | 93000.0000 | 30000.0000 | 349,543,280 B |
StreamingToRemoteEntity | DData | 10000 | 357.022 ms | 3.2794 ms | 3.0676 ms | 81000.0000 | 1000.0000 | 308,086,712 B |
Yeah I think the sharding PRs are too noisy - lots of other stuff going on in there other than messaging overhead |
Although if I did have to make a decision, looks like overall allocations are higher on this PR |
Got what I wanted here: #6147 |
The results are in - looks like
|
Method | MsgCount | ActorCount | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
---|---|---|---|---|---|---|---|---|
PushMsgs | 100000 | 10 | 1.287 s | 0.0067 s | 0.0063 s | 18000.0000 | 1000.0000 | 71 MB |
PushMsgs | 100000 | 100 | 12.818 s | 0.1166 s | 0.1091 s | 176000.0000 | 1000.0000 | 689 MB |
This PR
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
[Host] : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
Job-ZSWECQ : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
InvocationCount=1 UnrollFactor=1
Method | MsgCount | ActorCount | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Allocated |
---|---|---|---|---|---|---|---|---|---|
PushMsgs | 100000 | 10 | 1.247 s | 0.0733 s | 0.2160 s | 1.307 s | 26000.0000 | 1000.0000 | 104 MB |
PushMsgs | 100000 | 100 | 12.923 s | 0.1532 s | 0.1433 s | 12.958 s | 255000.0000 | 1000.0000 | 997 MB |
Next thing I'll try is @to11mtm 's suggestion of making Envelope
a readonly struct
.
Updated numbers with BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2006 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
[Host] : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
Job-OWNKCI : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
InvocationCount=1 UnrollFactor=1
Didn't make any difference for memory allocation, but still the right thing to do. |
* convert `Envelope` back into a reference type * approved API changes * changed to `readonly struct` * fixed API approvals
Changes
Converts
Envelope
back into asealed class
rather than astruct
- believe it or not, this significantly reduces total memory allocation and mailbox write latency.Checklist
For significant changes, please ensure that the following have been completed (delete if not relevant):
Latest
v1.4
BenchmarksAll benchmarks are running on .NET 6.
This PR's Benchmarks