-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix condition to align methods to 32 bytes boundary #42909
Conversation
@dotnet/jit-contrib , @AndyAyersMS |
Expected impact is
Would be good to highlight one or two benchmarks we really expect to stabilize with this (benchmarks with very simple timing payloads). Note because of the way BenchmarkDotNet operates the main "driving" loop for most benchmark runs (which invokes a delegate that runs the method to benchmark) will be jitted and has a loop, so bypasses Tiering. One hypothesis we're exploring is that the varying alignment of this loop over time contributes to some of our observed day to day perf fluctuations. Beyond that, once this is in, it would be good to do a retrospective study of benchmarks after we've gotten a few days worth of data, and see if we really did move the needle on stability. I had hoped to do that with #2249 but lab collection wasn't in great shape back then; looking back now it doesn't seem to have really had much impact, perhaps because it only affected Tier1 methods. Any idea what a typical delta this causes on the overall size of the code segment? |
I picked up LoopReturn benchmark which was recently found by BenchmarkDotNet to have 22% regression but is Another thing that I experimented along with method alignment is the effect on the benchmark timing because of loop alignment within the method. With the help of Andy's NopPadding branch, I added various padding of All the measurements are done on StabilityWithout the no-ops, the loop starts at offset Benchmark measurements for default alignment
Benchmark measurements for 32-byte alignment
PerformanceI also tried measuring the performance between default vs. 32-byte alignment and I see some segment of my observation that I saw in stability. As soon as the loop gets aligned to 32-byte boundary (after adding 3+ nops), the performance with 32 byte alignment becomes better. So perhaps, to improve performance as Andy pointed out multiple times, it will be good to align hot loops at 32-byte boundary which in combination with method alignment will give us both stability and performance wins.
|
@adamsitnik , @danmosemsft |
Pretty nice if we get perf for customers as well as stability for us. i didn't expect that: I assumed we would just move perf around when we made alignment stable. So great! |
If I remember correctly, @AndreyAkinshin has added a manual loop unrolling to this particular loop to minimize the effects of the alignment. If you pass private void WorkloadActionUnroll(System.Int64 invokeCount)
{
for (System.Int64 i = 0; i < invokeCount; i++)
{
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
consumer.Consume(workloadDelegate());
}
} to disable it, you need to pass private void WorkloadActionUnroll(System.Int64 invokeCount)
{
for (System.Int64 i = 0; i < invokeCount; i++)
{
consumer.Consume(workloadDelegate());
}
} |
@kunalspathak this looks and sounds awesome!
this would be perfect!! FWIW here are some benchmarks that I remember are very dependent on the loop alignment:
No loops, but super dependent on alignment on
|
I added logging around the change to see how many methods get 32-byte aligned vs. the default alignment and then ran pmi on .NET libraries. There is definitely duplicate methods that might get counted in either of them, but on an average, 532,667 methods continued getting default alignment while 70,573 methods got 32-bytes alignment i.e. out of 603,240 total methods, 70,573 methods got 32-byte aligned which is approx. 11% of them. This is close to what @AndyAyersMS anticipated in #2249 (comment).
I tried gathering the results for Fib benchmark (the only one which is bimodel and has no-loops) and seeing mixed results. master:
PR:
|
For x86 as well, I don't see much difference for Fib benchmark on my local machine. master:
PR:
|
If we assume half of these would have been 32 byte aligned anyways, then we have an additional 35K instances of 16 bytes wasted in the code segment, so around 560K bytes of extra allocation. I'd like to know how this compares to the total amount of code. What was the total code size for the 603,240 methods? |
That makes sense.
The total code size is 51377607 bytes or around 51MB. |
So perhaps as much as 1% of the code space is now padding. I suppose it's possible the code heap can fit some smaller methods into the larger padding voids. Would be nice at some point to actually measure the heap sizes instead of trying to do these back of the envelope estimations. Let's take this change and keep an eye on the perf lab runs to see if we do indeed see increased stability. |
Yes, I wanted to get the heap size number for this one and spent some time debugging the allocCode and allocCodeRaw but in order to get accurate numbers from the appropriate heap (specially around |
This is pretty cool! 😄 @AndyAyersMS Do you think this could be related to the .NET perf regession from #36907? |
This work is a prerequisite for aligning branches -- we plan to explore doing that soon. I'd expect this change might stabilize performance of your code, but would not necessarily optimize it. |
Ah, I see, thank you for the clarification! 😊 Was just curious whether this was at least in part related to that, happy to hear branch alignment work is planned soon! |
In dotnet#2249, we started doing alignment of methods to 32-byte boundary for Tier1. However, a method having loops bypass tiering and hence this condition was never executed. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.
In #2249, we started doing alignment of methods to 32-byte boundary for
Tier1
. However, a method having loops bypass tiering and hence this condition was never executed. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.