[WIP] Performance optimize Ask
#4965
Conversation
Going to use the baseline data from #4962 to measure the impact.
Baseline numbers. No idea why these throughput numbers are lower than the gains I posted earlier.

```
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.928 (2004/?/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.201
  [Host]     : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
  DefaultJob : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
```
Got a cleaner baseline this time.

```
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.928 (2004/?/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.201
  [Host]     : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
  DefaultJob : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
```
Introduced object pooling.

```
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.928 (2004/?/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.201
  [Host]     : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
  DefaultJob : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
```

Throughput seems pretty much the same, but there's definitely a drop in Gen 1 allocations and total memory.
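For context on where the Gen 0/1/2 and Allocated columns in these runs come from: they are produced by BenchmarkDotNet's `[MemoryDiagnoser]`. A minimal harness sketch follows; the measured body is a placeholder, not the actual Ask benchmark from #4962.

```csharp
// Minimal BenchmarkDotNet harness sketch. [MemoryDiagnoser] adds the
// Gen 0/1/2 and Allocated columns seen in the result tables above.
// NOTE: NameGenerationBenchmark and its body are placeholders, not the
// real benchmark used in this PR.
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class NameGenerationBenchmark
{
    [Benchmark]
    public string Baseline() => System.Guid.NewGuid().ToString("N");
}

public static class Program
{
    public static void Main(string[] args) =>
        BenchmarkRunner.Run<NameGenerationBenchmark>();
}
```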
Interestingly, this change will also affect how quickly actors can spawn, since actor spawning uses the same method I'm pooling.
```csharp
internal static class PooledObject
{
    public static readonly ObjectPool<StringBuilder> StringBuilderPool =
        new DefaultObjectPoolProvider().CreateStringBuilderPool(512, 2048);
}
```
Are we expecting to re-use this pool elsewhere? These sizes may be a bit much otherwise.
I think there are several other places where we can use it, but the sizes I chose are arbitrary.
I was mostly curious whether there was significant overhead in getting and returning a pooled StringBuilder vs. Gen 0-ing one.
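To make the get/return cost concrete, here is a sketch of how the pooled StringBuilder would typically be rented and returned (this uses Microsoft.Extensions.ObjectPool and mirrors the PR's `PooledObject` class; `Example.BuildName` and its name format are hypothetical, not code from the PR):

```csharp
// Sketch of renting and returning a pooled StringBuilder.
// PooledObject mirrors the class shown in the diff above;
// Example.BuildName is a hypothetical caller for illustration.
using System.Text;
using Microsoft.Extensions.ObjectPool;

internal static class PooledObject
{
    public static readonly ObjectPool<StringBuilder> StringBuilderPool =
        new DefaultObjectPoolProvider().CreateStringBuilderPool(512, 2048);
}

internal static class Example
{
    public static string BuildName(long id)
    {
        var sb = PooledObject.StringBuilderPool.Get();
        try
        {
            sb.Append('$').Append(id); // hypothetical name format
            return sb.ToString();
        }
        finally
        {
            // Return() clears the builder, and the policy discards it if it
            // grew past the 2048-char retain limit, so the pool never hoards
            // oversized buffers.
            PooledObject.StringBuilderPool.Return(sb);
        }
    }
}
```

The 512/2048 arguments are the initial capacity of each new builder and the maximum capacity a returned builder may have and still be kept in the pool.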
```csharp
    encodedBytes[writeIndex] = Base64Chars[index];
    next = next >> 6;
    writeIndex++;
} while (next != 0);
```
I'm not an expert on akka.net internals, but here are my two cents:

You can unroll the loop instead of looping: since this is a fixed-size data type, unrolling should increase performance at least fourfold.

Also, the prefix is preventing micro-optimization, so you could split the method into variants (with any prefix / with just one / without a prefix) to get the optimization on a case-by-case basis. You can also skip the stackalloc this way (stackalloc saves an allocation, but you have to copy the memory block anyway; skipping it means one less copy and much better performance).

Just a side note, and I don't think this is important, but I think we may have an endianness issue here.
I'm up for trying all of the above.

> just a side note, I don't think this is important but I think we may have endianness issue here

Ah, you mean I'm doing it big endian? Whoops.
> ah, you mean I'm doing it big endian? Whoops.

Meh, I'm not saying you're doing anything wrong. I also don't think anyone will bit-convert a long to base64 and do whatever you're doing here themselves between systems/cluster nodes of different architectures. I'm saying this only because I have zero knowledge of the internals, so it's just a side note.
We're just using this to help generate random actor names, which is something we have to do for the temporary one-off actors that get used to power Ask and GracefulStop. When those actor paths get translated over the network it's all done as a string, so it's probably of little consequence, but I still don't like doing things wrong :p
Then I don't think it matters at all. You're doing it just fine 👍
Well, I'm doing this for fun, and... I got quite a confusing result while validating:
```
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i9-9900K CPU 3.60GHz (Coffee Lake), 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.202
  [Host]     : .NET Core 5.0.5 (CoreCLR 5.0.521.16609, CoreFX 5.0.521.16609), X64 RyuJIT
  DefaultJob : .NET Core 5.0.5 (CoreCLR 5.0.521.16609, CoreFX 5.0.521.16609), X64 RyuJIT
```
| Method | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| Benchmark | 28.82 ns | 0.262 ns | 0.245 ns | 1.00 | 0.0057 | - | - | 48 B |
| BenchmarkOptimization1 | 12.68 ns | 0.207 ns | 0.184 ns | 0.44 | 0.0057 | - | - | 48 B |
And here's my code:
```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe string Base64EncodeOptimization1(long value)
{
    // disable range checking
    fixed (char* base64 = &base64Table[0])
    {
        Span<char> encodedBytes = stackalloc char[11];
        var next = value;
        encodedBytes[0] = base64[(int) ((next >> (6 * 0)) & 63)];
        encodedBytes[1] = base64[(int) ((next >> (6 * 1)) & 63)];
        encodedBytes[2] = base64[(int) ((next >> (6 * 2)) & 63)];
        encodedBytes[3] = base64[(int) ((next >> (6 * 3)) & 63)];
        encodedBytes[4] = base64[(int) ((next >> (6 * 4)) & 63)];
        encodedBytes[5] = base64[(int) ((next >> (6 * 5)) & 63)];
        encodedBytes[6] = base64[(int) ((next >> (6 * 6)) & 63)];
        encodedBytes[7] = base64[(int) ((next >> (6 * 7)) & 63)];
        encodedBytes[8] = base64[(int) ((next >> (6 * 8)) & 63)];
        encodedBytes[9] = base64[(int) ((next >> (6 * 9)) & 63)];
        encodedBytes[10] = base64[(int) ((next >> (6 * 10)) & 63)];
        return new string(encodedBytes);
    }
}
```
Here's the catch:
1. This is actually base63 (no '='). I think this is intended, but it made me scratch my head.
2. The original code loses all top bits if they are zero (edge case: 0). Well, if that's intended, okay.
3. The chance of a hash conflict is higher than I thought because of 1) and 2).
4. I think my micro-optimization is pointless unless you use this in a tight loop (the two differ by only nanoseconds).
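To make points 1 and 2 concrete, here is a self-contained sketch of the loop-based encoder. The alphabet string is a hypothetical stand-in for Akka.NET's internal Base64Chars table, and the unsigned shift is my own addition so negative inputs terminate:

```csharp
// Self-contained sketch of the loop-based encoder discussed above.
// Base64Chars here is a HYPOTHETICAL 64-char alphabet, not necessarily
// the table Akka.NET actually uses.
using System;

internal static class Base64Demo
{
    private const string Base64Chars =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+~";

    public static string Encode(long value)
    {
        Span<char> buffer = stackalloc char[11]; // ceil(64 / 6) = 11 chars max
        var next = value;
        var writeIndex = 0;
        do
        {
            var index = (int)(next & 63);
            buffer[writeIndex] = Base64Chars[index];
            // Unsigned shift (my addition) so negative inputs reach zero.
            next = (long)((ulong)next >> 6);
            writeIndex++;
        } while (next != 0);
        // Only the written prefix: leading zero groups are dropped,
        // so Encode(0) collapses to a single character.
        return new string(buffer.Slice(0, writeIndex));
    }
}
```

Note how `Encode(0)` emits just the character for index 0 and how no padding '=' ever appears in the output, which is exactly the behavior points 1 and 2 describe.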
Hi @chris-sung - sorry I missed this!

> This is actually base63 (no '='), I think this is intended it but made me scratch my head

That's correct - I believe we can't use that due to Akka.NET Uri encoding restrictions. This function is primarily used to generate random actor names.

> and original code loses all top bits if all zero (edge case: 0), well, if it's intended, okay.

That should also be fine.

> I think my micro-optimization is pointless unless you use this in a tight loop (those two are only nano-second differences..)

We don't use this in a tight loop - most of the benefit here comes from eliminating the StringBuilder allocation and reducing GC overhead. This function typically gets invoked when we're generating actor names, so the area where you'll see the biggest performance benefit is when a large number of Ask<T> operations are being created all at once.

Maybe we should still incorporate this in v1.4.21 though?
Thanks for the explanation. What do you mean by "incorporate this" exactly?
I mean use it instead of my current implementation!