
Fix condition to align methods to 32 bytes boundary #42909

Merged: 1 commit merged into dotnet:master on Oct 7, 2020

Conversation

@kunalspathak (Member)

In #2249, we started aligning methods to a 32-byte boundary for Tier1. However, methods with loops bypass tiering, so this condition was never executed. Fixed the condition to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.

@Dotnet-GitSync-Bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Sep 30, 2020
@kunalspathak (Member Author)

@dotnet/jit-contrib , @AndyAyersMS

@AndyAyersMS (Member)

Expected impact is

  • (short term) perf stability for benchmarks that run jitted code with loops
  • (medium term) ability to optimize perf by adjusting alignment of code within methods

Would be good to highlight one or two benchmarks we really expect to stabilize with this (benchmarks with very simple timing payloads).

Note that because of the way BenchmarkDotNet operates, the main "driving" loop for most benchmark runs (which invokes a delegate that runs the method being benchmarked) will be jitted and has a loop, so it bypasses tiering. One hypothesis we're exploring is that the varying alignment of this loop over time contributes to some of our observed day-to-day perf fluctuations.

Beyond that, once this is in, it would be good to do a retrospective study of benchmarks after we've gotten a few days' worth of data, and see if we really did move the needle on stability. I had hoped to do that with #2249, but lab collection wasn't in great shape back then; looking back now, it doesn't seem to have had much impact, perhaps because it only affected Tier1 methods.

Any idea what a typical delta this causes on the overall size of the code segment?

@kunalspathak (Member Author) commented Oct 1, 2020

I picked the LoopReturn benchmark, which was recently found by BenchmarkDotNet to have a 22% regression and is a bimodal test. The benchmark itself has a tight loop that I can experiment on. For the sake of argument, I removed the only condition in the loop code to eliminate any branch-prediction effects. The loop is 17 bytes long, so a single fetch of a 32-byte line should bring the entire loop into the decoder, to be cached in the DSB.

Another thing I experimented with, along with method alignment, is the effect of loop alignment within the method on benchmark timing. With the help of Andy's NopPadding branch, I added varying amounts of nop padding to move the loop header through alignments from 0x0 to 0xF, to see whether fixing the method alignment to a 32-byte boundary improves the overall execution time and stability of the benchmark regardless of where the loop is aligned. The results are positive.

All the measurements are done on Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores.

Stability

Without the nops, the loop starts at offset 0x3D, so adding 1 nop moves the loop to 0x3E, and so forth. In the graph below, the X-axis is the number of nops padded into the method and the Y-axis is the standard deviation of benchmark timings measured over 5 iterations. The key observation is that there is much less deviation with 32-byte alignment than with the default (16-byte) alignment. With the default alignment, the benchmark is stable for some nop counts, but in general it deviates whenever the loop's alignment is not favorable to the DSB line. By contrast, 32-byte method alignment provides stable timings with very little deviation, no matter what the loop alignment is.

For 32-byte method alignment, once the loop's offset reaches 0x40 (after adding 3 through 15 nops), performance even improves. My theory is that once 3 or more nops are added, the loop (17 bytes long) gets loaded entirely in a single fetch. For example, with 2 nops the loop starts at 0x3F, which is not on a 32-byte boundary, so two fetches may be needed to bring the loop into the decoder. With, say, 5 nops, the loop sits at 0x42, but I assume the fetch starts at 0x40 (because it is a 32-byte boundary), so the entire loop is loaded in one fetch even though the non-loop uops at 0x40~0x41 are fetched along with it. This performance side effect is guaranteed only because the method itself is 32-byte aligned. So yes, I think this is a step in the right direction.
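The fetch-line arithmetic above can be sketched as follows (a minimal illustration, not from the PR; it assumes 32-byte instruction-fetch lines, the 17-byte loop described in this comment, and a method starting on a 32-byte boundary; the helper name is made up):

```python
def fetch_lines_touched(loop_offset: int, loop_size: int = 17, line_size: int = 32) -> int:
    """Count how many 32-byte fetch lines a loop spans, given the loop's byte
    offset from the start of a method that begins on a line_size boundary."""
    first_line = loop_offset // line_size
    last_line = (loop_offset + loop_size - 1) // line_size
    return last_line - first_line + 1

# With no nop padding the loop starts at 0x3D and straddles two 32-byte lines.
print(fetch_lines_touched(0x3D))  # 2
# After 3 nops the loop starts at 0x40, a 32-byte boundary: a single fetch.
print(fetch_lines_touched(0x40))  # 1
# Even at 0x42 (5 nops) the 17-byte loop still fits inside the 0x40 line.
print(fetch_lines_touched(0x42))  # 1
```

This matches the observation above: for 3 through 15 nops, the loop start stays within 0x40~0x4C, so the whole 17-byte loop fits in one fetch line.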

[Graph: standard deviation of benchmark timings vs. number of nops, for default and 32-byte method alignment]

Benchmark measurements for default alignment:

| Number of nops | iter #1 | iter #2 | iter #3 | iter #4 | iter #5 | Median | Average | Deviation | Loop offset |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | 45.18 | 45.36 | 45.19 | 45.19 | 45.11 | 45.19 | 45.206 | 0.082608 | 0x3D |
| 1 | 45.03 | 45.03 | 45.58 | 45.28 | 45.29 | 45.28 | 45.242 | 0.203902 | 0x3E |
| 2 | 46.06 | 45.4 | 45.27 | 45.33 | 45.39 | 45.39 | 45.49 | 0.288791 | 0x3F |
| 3 | 52.59 | 52.79 | 52.59 | 44.78 | 44.94 | 52.59 | 49.538 | 3.820604 | 0x40 |
| 4 | 45.1 | 53.73 | 46.99 | 45.37 | 45.41 | 45.41 | 47.32 | 3.273408 | 0x41 |
| 5 | 52.29 | 52.55 | 45.33 | 52.36 | 52.33 | 52.33 | 50.972 | 2.822413 | 0x42 |
| 6 | 50.85 | 52.47 | 55.56 | 52.3 | 52.46 | 52.46 | 52.728 | 1.540693 | 0x43 |
| 7 | 52.32 | 46.1 | 47.64 | 52.44 | 46.09 | 47.64 | 48.918 | 2.88271 | 0x44 |
| 8 | 45.19 | 45.98 | 52.6 | 45.98 | 46.12 | 45.98 | 47.174 | 2.732761 | 0x45 |
| 9 | 53.38 | 46.77 | 52.3 | 52.5 | 45.4 | 52.3 | 50.07 | 3.302508 | 0x46 |
| 10 | 46.01 | 52.41 | 46.19 | 51.9 | 52.91 | 51.9 | 49.884 | 3.10661 | 0x47 |
| 11 | 52.38 | 52.36 | 52.38 | 52.29 | 52.33 | 52.36 | 52.348 | 0.034293 | 0x48 |
| 12 | 52.6 | 52.6 | 52.81 | 52.61 | 52.61 | 52.61 | 52.646 | 0.082122 | 0x49 |
| 13 | 52.55 | 52.74 | 52.77 | 52.67 | 52.61 | 52.67 | 52.668 | 0.081093 | 0x4A |
| 14 | 52.18 | 52.52 | 52.18 | 45.01 | 45.01 | 52.18 | 49.38 | 3.570249 | 0x4B |
| 15 | 44.99 | 52.32 | 44.97 | 52.55 | 52.49 | 52.32 | 49.464 | 3.661954 | 0x4C |
Benchmark measurements for 32-byte alignment:

| Number of nops | iter #1 | iter #2 | iter #3 | iter #4 | iter #5 | Median | Average | Deviation | Loop offset |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | 52.62 | 52.15 | 52.21 | 52.43 | 52.15 | 52.21 | 52.32666667 | 0.185299757 | 0x3D |
| 1 | 52.07 | 52.18 | 52.01 | 52.15 | 52.25 | 52.07 | 52.08666667 | 0.084 | 0x3E |
| 2 | 51.96 | 51.97 | 52.41 | 52.23 | 52.01 | 51.97 | 52.11333333 | 0.176816289 | 0x3F |
| 3 | 45.91 | 45.85 | 46.47 | 46.09 | 45.99 | 45.91 | 46.07666667 | 0.219308003 | 0x40 |
| 4 | 46.04 | 45.84 | 45.87 | 45.85 | 45.91 | 45.87 | 45.91666667 | 0.073047929 | 0x41 |
| 5 | 45.2 | 45.31 | 45.36 | 45.23 | 45.27 | 45.31 | 45.29 | 0.056780278 | 0x42 |
| 6 | 45.94 | 45.87 | 45.78 | 45.88 | 45.81 | 45.87 | 45.86333333 | 0.056071383 | 0x43 |
| 7 | 45.18 | 45.2 | 45.16 | 45.21 | 45.3 | 45.18 | 45.18 | 0.048166378 | 0x44 |
| 8 | 45.39 | 45.06 | 45.27 | 45.2 | 45.21 | 45.27 | 45.24 | 0.107070071 | 0x45 |
| 9 | 45.21 | 45.27 | 45.27 | 45.27 | 45.51 | 45.27 | 45.25 | 0.104613575 | 0x46 |
| 10 | 45.28 | 45.19 | 45.22 | 45.2 | 45.21 | 45.22 | 45.23 | 0.031622777 | 0x47 |
| 11 | 45.52 | 45.24 | 45.32 | 45.31 | 45.24 | 45.32 | 45.36 | 0.102683981 | 0x48 |
| 12 | 45.25 | 45.16 | 45.22 | 45.18 | 45.19 | 45.22 | 45.21 | 0.031622777 | 0x49 |
| 13 | 45.65 | 45.54 | 45.23 | 45.57 | 45.72 | 45.54 | 45.47333333 | 0.168214149 | 0x4A |
| 14 | 46.87 | 45.6 | 45.44 | 45.55 | 44.98 | 45.6 | 45.97 | 0.630345937 | 0x4B |
| 15 | 45.28 | 45.45 | 45.28 | 45.22 | 45.6 | 45.28 | 45.33666667 | 0.139942845 | 0x4C |
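As a sanity check on the tables above, the Deviation column appears to be the population standard deviation over the five iterations (a sketch, not part of the PR; the first row of the default-alignment table is used as the example):

```python
import statistics

# First row of the default-alignment table (0 nops, loop offset 0x3D).
timings = [45.18, 45.36, 45.19, 45.19, 45.11]

print(round(statistics.mean(timings), 3))    # 45.206, the Average column
print(round(statistics.pstdev(timings), 4))  # 0.0826, the Deviation column
```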

Performance

I also measured performance for default vs. 32-byte alignment, and the results mirror part of what I observed for stability. As soon as the loop gets aligned to a 32-byte boundary (after adding 3+ nops), performance with 32-byte alignment becomes better. So perhaps, to improve performance as Andy has pointed out multiple times, it would be good to align hot loops at a 32-byte boundary, which in combination with method alignment would give us both stability and performance wins.

| Number of nops | Median with default alignment | Median with 32-byte alignment | % difference |
|---:|---:|---:|---:|
| 0 | 45.18 | 52.21 | 16% |
| 1 | 45.03 | 52.07 | 15% |
| 2 | 46.06 | 51.97 | 14% |
| 3 | 52.59 | 45.91 | -13% |
| 4 | 45.1 | 45.87 | 1% |
| 5 | 52.29 | 45.31 | -13% |
| 6 | 50.85 | 45.87 | -13% |
| 7 | 52.32 | 45.18 | -5% |
| 8 | 45.19 | 45.27 | -2% |
| 9 | 53.38 | 45.27 | -13% |
| 10 | 46.01 | 45.22 | -13% |
| 11 | 52.38 | 45.32 | -13% |
| 12 | 52.6 | 45.22 | -14% |
| 13 | 52.55 | 45.54 | -14% |
| 14 | 52.18 | 45.6 | -13% |
| 15 | 44.99 | 45.28 | -13% |

@kunalspathak
Copy link
Member Author

@adamsitnik , @danmosemsft

@AndyAyersMS
Copy link
Member

LoopReturn seems like a reasonable choice, though I am also interested in cases where the benchmark payload does not have a loop. Maybe we can find one?

Here's the historical record showing the bimodal behavior, seems like most of the time it gets a bad alignment and is slow.

[Graph: historical LoopReturn timings showing the bimodal behavior]

@danmoseley (Member)

Pretty nice if we get perf for customers as well as stability for us. I didn't expect that: I assumed we would just move perf around when we made alignment stable. So great!

@adamsitnik (Member)

> Note because of the way BenchmarkDotNet operates the main "driving" loop for most benchmark runs (which invokes a delegate that runs the method to benchmark) will be jitted and has a loop, so bypasses Tiering. One hypothesis we're exploring is that the varying alignment of this loop over time contributes to some of our observed day to day perf fluctuations.

If I remember correctly, @AndreyAkinshin added manual loop unrolling to this particular loop to minimize the effects of alignment.

If you pass --keepFiles and open the auto-generated *.notcs file, you should see something like this:

```csharp
private void WorkloadActionUnroll(System.Int64 invokeCount)
{
    for (System.Int64 i = 0; i < invokeCount; i++)
    {
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
    }
}
```

To disable it, pass --unrollFactor 1:

```csharp
private void WorkloadActionUnroll(System.Int64 invokeCount)
{
    for (System.Int64 i = 0; i < invokeCount; i++)
    {
        consumer.Consume(workloadDelegate());
    }
}
```

@adamsitnik (Member)

> The key observation is that there is much less deviation with 32-byte alignment as compared to the default (16-byte alignment).

@kunalspathak this looks and sounds awesome!

> it will be good to align hot loops at 32-byte boundary which in combination with method alignment will give us both stability and performance wins

this would be perfect!!

FWIW, here are some benchmarks that I remember being very dependent on loop alignment:

  • System.Memory.Span<Char>.IndexOfValue
  • System.Memory.Span<Byte>.BinarySearch

No loops, but super dependent on alignment on x86:

  • Benchstone.BenchI.Fib.Test

@kunalspathak (Member Author)

> Any idea what a typical delta this causes on the overall size of the code segment?

I added logging around the change to see how many methods get 32-byte alignment vs. the default alignment, and then ran PMI on the .NET libraries. There are certainly duplicate methods that may be counted in both buckets, but on average 532,667 methods kept the default alignment while 70,573 methods got 32-byte alignment. In other words, out of 603,240 total methods, 70,573 (approx. 11%) were 32-byte aligned. This is close to what @AndyAyersMS anticipated in #2249 (comment).

> LoopReturn seems like a reasonable choice, though I am also interested in cases where the benchmark payload does not have a loop. Maybe we can find one?

I tried gathering results for the Fib benchmark (the only one that is bimodal and has no loops) and I'm seeing mixed results.

master:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---:|---:|---:|---:|---:|---:|
| Fib | 136.0 us | 0.61 us | 0.51 us | 135.8 us | 135.3 us | 137.0 us |
| Fib | 142.2 us | 0.55 us | 0.51 us | 142.2 us | 141.4 us | 143.4 us |
| Fib | 142.3 us | 0.79 us | 0.70 us | 142.3 us | 141.5 us | 143.9 us |
| Fib | 142.2 us | 0.70 us | 0.65 us | 142.0 us | 141.5 us | 143.6 us |
| Fib | 143.2 us | 1.94 us | 1.72 us | 142.8 us | 141.5 us | 147.1 us |

PR:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---:|---:|---:|---:|---:|---:|
| Fib | 142.1 us | 0.82 us | 0.73 us | 141.9 us | 141.3 us | 144.1 us |
| Fib | 142.0 us | 0.51 us | 0.45 us | 142.0 us | 141.4 us | 143.0 us |
| Fib | 142.1 us | 0.54 us | 0.48 us | 142.3 us | 141.3 us | 142.9 us |
| Fib | 142.2 us | 0.70 us | 0.62 us | 142.0 us | 141.5 us | 143.6 us |
| Fib | 149.5 us | 3.73 us | 4.15 us | 147.7 us | 143.2 us | 158.1 us |

@kunalspathak (Member Author)

For x86 as well, I don't see much difference for Fib benchmark on my local machine.

master:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---:|---:|---:|---:|---:|---:|
| Fib | 142.4 us | 1.04 us | 0.87 us | 142.4 us | 141.3 us | 144.2 us |
| Fib | 145.7 us | 2.79 us | 2.99 us | 144.8 us | 142.5 us | 151.9 us |
| Fib | 142.6 us | 0.91 us | 0.81 us | 142.5 us | 141.7 us | 144.5 us |
| Fib | 142.7 us | 0.79 us | 0.70 us | 142.5 us | 141.4 us | 143.8 us |
| Fib | 142.5 us | 0.53 us | 0.49 us | 142.4 us | 141.7 us | 143.4 us |

PR:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---:|---:|---:|---:|---:|---:|
| Fib | 142.3 us | 0.77 us | 0.72 us | 142.4 us | 141.3 us | 143.5 us |
| Fib | 142.5 us | 0.96 us | 0.85 us | 142.0 us | 141.6 us | 144.2 us |
| Fib | 143.7 us | 1.34 us | 1.25 us | 143.4 us | 141.7 us | 145.9 us |
| Fib | 142.1 us | 0.45 us | 0.38 us | 142.1 us | 141.5 us | 142.9 us |
| Fib | 144.3 us | 2.08 us | 1.94 us | 143.7 us | 142.1 us | 148.1 us |

@AndyAyersMS (Member)

> out of 603,240 total methods, 70,573 methods got 32-byte aligned

If we assume half of these would have been 32-byte aligned anyway, then we have an additional ~35K instances of 16 bytes wasted in the code segment, so around 560K bytes of extra allocation.

I'd like to know how this compares to the total amount of code. What was the total code size for the 603,240 methods?

@kunalspathak (Member Author)

> > out of 603,240 total methods, 70,573 methods got 32-byte aligned
>
> If we assume half of these would have been 32-byte aligned anyway, then we have an additional ~35K instances of 16 bytes wasted in the code segment, so around 560K bytes of extra allocation.

That makes sense.

> I'd like to know how this compares to the total amount of code. What was the total code size for the 603,240 methods?

The total code size is 51,377,607 bytes, or around 51 MB.
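Andy's back-of-the-envelope estimate can be reproduced from the numbers quoted in this thread (a sketch; the "half would have aligned anyway" assumption is his, the variable names are mine):

```python
# Figures quoted in this thread (PMI run over the .NET libraries).
methods_aligned_32 = 70_573
total_code_bytes = 51_377_607  # ~51 MB

# Assume half of the 32-byte-aligned methods would have landed on a
# 32-byte boundary anyway, and each of the rest wastes ~16 bytes of padding.
extra_padding = (methods_aligned_32 // 2) * 16
print(extra_padding)  # 564576 bytes, i.e. around 560 KB of extra allocation

# Fraction of the code segment that is now padding, in percent.
print(round(100 * extra_padding / total_code_bytes, 2))  # 1.1
```

This is consistent with the "perhaps as much as 1% of the code space is now padding" estimate below.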

@AndyAyersMS (Member)

So perhaps as much as 1% of the code space is now padding. I suppose it's possible the code heap can fit some smaller methods into the larger padding voids.

Would be nice at some point to actually measure the heap sizes instead of trying to do these back of the envelope estimations.

Let's take this change and keep an eye on the perf lab runs to see if we do indeed see increased stability.

@kunalspathak (Member Author)

> Would be nice at some point to actually measure the heap sizes instead of trying to do these back of the envelope estimations.

Yes, I wanted to get the heap size numbers for this change. I spent some time debugging allocCode and allocCodeRaw, but getting accurate numbers from the appropriate heap (especially around allocCodeRaw and UnlockedAllocMemForCode_NoThrow) needs some extra special-casing, because those methods are also called from the runtime. I deferred that since I could get a rough ballpark the other way, but getting the heap size is on my TODO list.

@kunalspathak kunalspathak merged commit 03a0931 into dotnet:master Oct 7, 2020
@Sergio0694 (Contributor)

This is pretty cool! 😄

@AndyAyersMS Do you think this could be related to the .NET perf regression from #36907?
Since your last hypothesis was about those switch branches in the hot path possibly not being aligned on .NET 5?

@AndyAyersMS (Member)

This work is a prerequisite for aligning branches -- we plan to explore doing that soon.

I'd expect this change might stabilize performance of your code, but would not necessarily optimize it.

@Sergio0694 (Contributor)

Ah, I see, thank you for the clarification! 😊

Was just curious whether this was at least in part related to that, happy to hear branch alignment work is planned soon!

kunalspathak added a commit to kunalspathak/runtime that referenced this pull request Oct 23, 2020
In dotnet#2249, we started aligning methods to a 32-byte boundary for Tier1. However, methods with loops bypass tiering, so this condition was never executed. Fixed the condition to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020