
Fix condition to align methods to 32 bytes boundary #42909

Merged: 1 commit merged into dotnet:master on Oct 7, 2020

Conversation

@kunalspathak (Member)

In #2249, we started aligning methods to a 32-byte boundary for Tier1. However, methods with loops bypass tiering, so this condition was never executed. Fixed the condition to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.

@Dotnet-GitSync-Bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Sep 30, 2020
@kunalspathak (Member Author)

@dotnet/jit-contrib , @AndyAyersMS

@AndyAyersMS (Member)

Expected impact is

  • (short term) perf stability for benchmarks that run jitted code with loops
  • (medium term) ability to optimize perf by adjusting alignment of code within methods

Would be good to highlight one or two benchmarks we really expect to stabilize with this (benchmarks with very simple timing payloads).

Note that because of the way BenchmarkDotNet operates, the main "driving" loop for most benchmark runs (which invokes a delegate that runs the method being benchmarked) will be jitted and has a loop, so it bypasses tiering. One hypothesis we're exploring is that the varying alignment of this loop over time contributes to some of our observed day-to-day perf fluctuations.

Beyond that, once this is in, it would be good to do a retrospective study of benchmarks after we've gotten a few days' worth of data, and see if we really did move the needle on stability. I had hoped to do that with #2249, but lab collection wasn't in great shape back then; looking back now, it doesn't seem to have had much impact, perhaps because it only affected Tier1 methods.

Any idea what a typical delta this causes on the overall size of the code segment?

@kunalspathak (Member Author) commented Oct 1, 2020

I picked the LoopReturn benchmark, which was recently found by BenchmarkDotNet to have a 22% regression and is a bimodal test. The benchmark itself has a tight loop that I can experiment on. For the sake of argument, I removed the only condition in the loop code to eliminate any branch-prediction effects. The loop is 17 bytes long, so a single fetch of a 32-byte line should bring the entire loop into the decoder, to be cached in the DSB.

Another thing I experimented with, along with method alignment, is the effect of loop alignment within the method on benchmark timing. With the help of Andy's NopPadding branch, I added varying amounts of nop padding to move the loop header through alignments from 0x0 to 0xF, to see whether fixing the method alignment to a 32-byte boundary improves the overall execution time and stability of the benchmark regardless of where the loop is aligned. The results are positive.

All the measurements are done on Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores.

Stability

Without the nops, the loop starts at offset 0x3D, so adding 1 nop moves the loop to 0x3E, and so forth. In the graph below, the X-axis is the number of nops padded into the method and the Y-axis is the standard deviation of benchmark timings measured over 5 iterations. The key observation is that there is much less deviation with 32-byte alignment than with the default (16-byte) alignment. With the default alignment, the benchmark is stable for some nop counts, but in general it deviates whenever the loop's alignment is not favorable to the DSB line. By contrast, 32-byte method alignment provides stable timings with very little deviation, no matter what the loop alignment is.

For 32-byte method alignment, once the loop's offset reaches 0x40 (after adding 3 through 15 nops), performance even improves. My theory is that once 3 or more nops are added, the loop (17 bytes long) gets loaded entirely in a single fetch. For example, with 2 nops the loop starts at 0x3F, which is not on a 32-byte boundary, so two fetches may be needed to bring the loop into the decoder. With, say, 5 nops, the loop sits at 0x42, but I assume the fetch starts at 0x40 (because it is a 32-byte boundary), so the entire loop is loaded in one fetch even though the non-loop uops at 0x40~0x41 are fetched along with it. This performance side effect is guaranteed only because the method itself is 32-byte aligned. So yes, I think this is a step in the right direction.
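The fetch-line arithmetic above can be sketched as follows (a minimal illustration, not from the PR; it assumes 32-byte instruction-fetch lines, the 17-byte loop described in this comment, and a method starting on a 32-byte boundary; the helper name is made up):

```python
def fetch_lines_touched(loop_offset: int, loop_size: int = 17, line_size: int = 32) -> int:
    """Count how many 32-byte fetch lines a loop spans, given the loop's byte
    offset from the start of a method that begins on a line_size boundary."""
    first_line = loop_offset // line_size
    last_line = (loop_offset + loop_size - 1) // line_size
    return last_line - first_line + 1

# With no nop padding the loop starts at 0x3D and straddles two 32-byte lines.
print(fetch_lines_touched(0x3D))  # 2
# After 3 nops the loop starts at 0x40, a 32-byte boundary: a single fetch.
print(fetch_lines_touched(0x40))  # 1
# Even at 0x42 (5 nops) the 17-byte loop still fits inside the 0x40 line.
print(fetch_lines_touched(0x42))  # 1
```

This matches the observation above: for 3 through 15 nops, the loop start stays within 0x40~0x4C, so the whole 17-byte loop fits in one fetch line.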

[Graph: standard deviation of benchmark timings vs. number of nops, for default and 32-byte method alignment]

Benchmark measurements for default alignment:

| Number of nops | iter #1 | iter #2 | iter #3 | iter #4 | iter #5 | Median | Average | Deviation | Loop offset |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | 45.18 | 45.36 | 45.19 | 45.19 | 45.11 | 45.19 | 45.206 | 0.082608 | 0x3D |
| 1 | 45.03 | 45.03 | 45.58 | 45.28 | 45.29 | 45.28 | 45.242 | 0.203902 | 0x3E |
| 2 | 46.06 | 45.4 | 45.27 | 45.33 | 45.39 | 45.39 | 45.49 | 0.288791 | 0x3F |
| 3 | 52.59 | 52.79 | 52.59 | 44.78 | 44.94 | 52.59 | 49.538 | 3.820604 | 0x40 |
| 4 | 45.1 | 53.73 | 46.99 | 45.37 | 45.41 | 45.41 | 47.32 | 3.273408 | 0x41 |
| 5 | 52.29 | 52.55 | 45.33 | 52.36 | 52.33 | 52.33 | 50.972 | 2.822413 | 0x42 |
| 6 | 50.85 | 52.47 | 55.56 | 52.3 | 52.46 | 52.46 | 52.728 | 1.540693 | 0x43 |
| 7 | 52.32 | 46.1 | 47.64 | 52.44 | 46.09 | 47.64 | 48.918 | 2.88271 | 0x44 |
| 8 | 45.19 | 45.98 | 52.6 | 45.98 | 46.12 | 45.98 | 47.174 | 2.732761 | 0x45 |
| 9 | 53.38 | 46.77 | 52.3 | 52.5 | 45.4 | 52.3 | 50.07 | 3.302508 | 0x46 |
| 10 | 46.01 | 52.41 | 46.19 | 51.9 | 52.91 | 51.9 | 49.884 | 3.10661 | 0x47 |
| 11 | 52.38 | 52.36 | 52.38 | 52.29 | 52.33 | 52.36 | 52.348 | 0.034293 | 0x48 |
| 12 | 52.6 | 52.6 | 52.81 | 52.61 | 52.61 | 52.61 | 52.646 | 0.082122 | 0x49 |
| 13 | 52.55 | 52.74 | 52.77 | 52.67 | 52.61 | 52.67 | 52.668 | 0.081093 | 0x4A |
| 14 | 52.18 | 52.52 | 52.18 | 45.01 | 45.01 | 52.18 | 49.38 | 3.570249 | 0x4B |
| 15 | 44.99 | 52.32 | 44.97 | 52.55 | 52.49 | 52.32 | 49.464 | 3.661954 | 0x4C |
Benchmark measurements for 32-byte alignment:

| Number of nops | iter #1 | iter #2 | iter #3 | iter #4 | iter #5 | Median | Average | Deviation | Loop offset |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | 52.62 | 52.15 | 52.21 | 52.43 | 52.15 | 52.21 | 52.32666667 | 0.185299757 | 0x3D |
| 1 | 52.07 | 52.18 | 52.01 | 52.15 | 52.25 | 52.07 | 52.08666667 | 0.084 | 0x3E |
| 2 | 51.96 | 51.97 | 52.41 | 52.23 | 52.01 | 51.97 | 52.11333333 | 0.176816289 | 0x3F |
| 3 | 45.91 | 45.85 | 46.47 | 46.09 | 45.99 | 45.91 | 46.07666667 | 0.219308003 | 0x40 |
| 4 | 46.04 | 45.84 | 45.87 | 45.85 | 45.91 | 45.87 | 45.91666667 | 0.073047929 | 0x41 |
| 5 | 45.2 | 45.31 | 45.36 | 45.23 | 45.27 | 45.31 | 45.29 | 0.056780278 | 0x42 |
| 6 | 45.94 | 45.87 | 45.78 | 45.88 | 45.81 | 45.87 | 45.86333333 | 0.056071383 | 0x43 |
| 7 | 45.18 | 45.2 | 45.16 | 45.21 | 45.3 | 45.18 | 45.18 | 0.048166378 | 0x44 |
| 8 | 45.39 | 45.06 | 45.27 | 45.2 | 45.21 | 45.27 | 45.24 | 0.107070071 | 0x45 |
| 9 | 45.21 | 45.27 | 45.27 | 45.27 | 45.51 | 45.27 | 45.25 | 0.104613575 | 0x46 |
| 10 | 45.28 | 45.19 | 45.22 | 45.2 | 45.21 | 45.22 | 45.23 | 0.031622777 | 0x47 |
| 11 | 45.52 | 45.24 | 45.32 | 45.31 | 45.24 | 45.32 | 45.36 | 0.102683981 | 0x48 |
| 12 | 45.25 | 45.16 | 45.22 | 45.18 | 45.19 | 45.22 | 45.21 | 0.031622777 | 0x49 |
| 13 | 45.65 | 45.54 | 45.23 | 45.57 | 45.72 | 45.54 | 45.47333333 | 0.168214149 | 0x4A |
| 14 | 46.87 | 45.6 | 45.44 | 45.55 | 44.98 | 45.6 | 45.97 | 0.630345937 | 0x4B |
| 15 | 45.28 | 45.45 | 45.28 | 45.22 | 45.6 | 45.28 | 45.33666667 | 0.139942845 | 0x4C |
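As a sanity check on the tables above, the Deviation column appears to be the population standard deviation over the five iterations (a sketch, not part of the PR; the first row of the default-alignment table is used as the example):

```python
import statistics

# First row of the default-alignment table (0 nops, loop offset 0x3D).
timings = [45.18, 45.36, 45.19, 45.19, 45.11]

print(round(statistics.mean(timings), 3))    # 45.206, the Average column
print(round(statistics.pstdev(timings), 4))  # 0.0826, the Deviation column
```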

Performance

I also measured performance for default vs. 32-byte alignment, and the results mirror part of what I observed for stability. As soon as the loop gets aligned to a 32-byte boundary (after adding 3+ nops), performance with 32-byte alignment becomes better. So perhaps, to improve performance as Andy has pointed out multiple times, it would be good to align hot loops at a 32-byte boundary, which in combination with method alignment would give us both stability and performance wins.

| Number of nops | Median with default alignment | Median with 32-byte alignment | % difference |
|---:|---:|---:|---:|
| 0 | 45.18 | 52.21 | 16% |
| 1 | 45.03 | 52.07 | 15% |
| 2 | 46.06 | 51.97 | 14% |
| 3 | 52.59 | 45.91 | -13% |
| 4 | 45.1 | 45.87 | 1% |
| 5 | 52.29 | 45.31 | -13% |
| 6 | 50.85 | 45.87 | -13% |
| 7 | 52.32 | 45.18 | -5% |
| 8 | 45.19 | 45.27 | -2% |
| 9 | 53.38 | 45.27 | -13% |
| 10 | 46.01 | 45.22 | -13% |
| 11 | 52.38 | 45.32 | -13% |
| 12 | 52.6 | 45.22 | -14% |
| 13 | 52.55 | 45.54 | -14% |
| 14 | 52.18 | 45.6 | -13% |
| 15 | 44.99 | 45.28 | -13% |

@kunalspathak
Copy link
Member Author

@adamsitnik , @danmosemsft

@AndyAyersMS
Copy link
Member

LoopReturn seems like a reasonable choice, though I am also interested in cases where the benchmark payload does not have a loop. Maybe we can find one?

Here's the historical record showing the bimodal behavior, seems like most of the time it gets a bad alignment and is slow.

[Graph: historical LoopReturn timings showing the bimodal behavior]

@danmoseley (Member)

Pretty nice if we get perf for customers as well as stability for us. I didn't expect that: I assumed we would just move perf around when we made alignment stable. So great!

@adamsitnik (Member)

> Note because of the way BenchmarkDotNet operates the main "driving" loop for most benchmark runs (which invokes a delegate that runs the method to benchmark) will be jitted and has a loop, so bypasses Tiering. One hypothesis we're exploring is that the varying alignment of this loop over time contributes to some of our observed day to day perf fluctuations.

If I remember correctly, @AndreyAkinshin added manual loop unrolling to this particular loop to minimize the effects of alignment.

If you pass --keepFiles and open the auto-generated *.notcs file, you should see something like this:

```csharp
private void WorkloadActionUnroll(System.Int64 invokeCount)
{
    for (System.Int64 i = 0; i < invokeCount; i++)
    {
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
        consumer.Consume(workloadDelegate());
    }
}
```

To disable it, pass --unrollFactor 1:

```csharp
private void WorkloadActionUnroll(System.Int64 invokeCount)
{
    for (System.Int64 i = 0; i < invokeCount; i++)
    {
        consumer.Consume(workloadDelegate());
    }
}
```

@adamsitnik (Member)

> The key observation is that there is much less deviation with 32-byte alignment as compared to the default (16-byte alignment).

@kunalspathak this looks and sounds awesome!

> it will be good to align hot loops at 32-byte boundary which in combination with method alignment will give us both stability and performance wins

this would be perfect!!

FWIW, here are some benchmarks that I remember being very dependent on loop alignment:

  • System.Memory.Span<Char>.IndexOfValue
  • System.Memory.Span<Byte>.BinarySearch

No loops, but super dependent on alignment on x86:

  • Benchstone.BenchI.Fib.Test

@kunalspathak (Member Author)

> Any idea what a typical delta this causes on the overall size of the code segment?

I added logging around the change to see how many methods get 32-byte alignment vs. the default alignment, and then ran PMI on the .NET libraries. There are certainly duplicate methods that may be counted in both buckets, but on average 532,667 methods kept the default alignment while 70,573 methods got 32-byte alignment. In other words, out of 603,240 total methods, 70,573 (approx. 11%) were 32-byte aligned. This is close to what @AndyAyersMS anticipated in #2249 (comment).

> LoopReturn seems like a reasonable choice, though I am also interested in cases where the benchmark payload does not have a loop. Maybe we can find one?

I tried gathering results for the Fib benchmark (the only one that is bimodal and has no loops) and I'm seeing mixed results.

master:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---:|---:|---:|---:|---:|---:|
| Fib | 136.0 us | 0.61 us | 0.51 us | 135.8 us | 135.3 us | 137.0 us |
| Fib | 142.2 us | 0.55 us | 0.51 us | 142.2 us | 141.4 us | 143.4 us |
| Fib | 142.3 us | 0.79 us | 0.70 us | 142.3 us | 141.5 us | 143.9 us |
| Fib | 142.2 us | 0.70 us | 0.65 us | 142.0 us | 141.5 us | 143.6 us |
| Fib | 143.2 us | 1.94 us | 1.72 us | 142.8 us | 141.5 us | 147.1 us |

PR:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---:|---:|---:|---:|---:|---:|
| Fib | 142.1 us | 0.82 us | 0.73 us | 141.9 us | 141.3 us | 144.1 us |
| Fib | 142.0 us | 0.51 us | 0.45 us | 142.0 us | 141.4 us | 143.0 us |
| Fib | 142.1 us | 0.54 us | 0.48 us | 142.3 us | 141.3 us | 142.9 us |
| Fib | 142.2 us | 0.70 us | 0.62 us | 142.0 us | 141.5 us | 143.6 us |
| Fib | 149.5 us | 3.73 us | 4.15 us | 147.7 us | 143.2 us | 158.1 us |

@kunalspathak (Member Author)

For x86 as well, I don't see much difference for Fib benchmark on my local machine.

master:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---:|---:|---:|---:|---:|---:|
| Fib | 142.4 us | 1.04 us | 0.87 us | 142.4 us | 141.3 us | 144.2 us |
| Fib | 145.7 us | 2.79 us | 2.99 us | 144.8 us | 142.5 us | 151.9 us |
| Fib | 142.6 us | 0.91 us | 0.81 us | 142.5 us | 141.7 us | 144.5 us |
| Fib | 142.7 us | 0.79 us | 0.70 us | 142.5 us | 141.4 us | 143.8 us |
| Fib | 142.5 us | 0.53 us | 0.49 us | 142.4 us | 141.7 us | 143.4 us |

PR:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---:|---:|---:|---:|---:|---:|
| Fib | 142.3 us | 0.77 us | 0.72 us | 142.4 us | 141.3 us | 143.5 us |
| Fib | 142.5 us | 0.96 us | 0.85 us | 142.0 us | 141.6 us | 144.2 us |
| Fib | 143.7 us | 1.34 us | 1.25 us | 143.4 us | 141.7 us | 145.9 us |
| Fib | 142.1 us | 0.45 us | 0.38 us | 142.1 us | 141.5 us | 142.9 us |
| Fib | 144.3 us | 2.08 us | 1.94 us | 143.7 us | 142.1 us | 148.1 us |

@AndyAyersMS (Member)

> out of 603,240 total methods, 70,573 methods got 32-byte aligned

If we assume half of these would have been 32-byte aligned anyway, then we have an additional ~35K instances of 16 bytes wasted in the code segment, so around 560K bytes of extra allocation.

I'd like to know how this compares to the total amount of code. What was the total code size for the 603,240 methods?

@kunalspathak (Member Author)

> > out of 603,240 total methods, 70,573 methods got 32-byte aligned
>
> If we assume half of these would have been 32-byte aligned anyway, then we have an additional ~35K instances of 16 bytes wasted in the code segment, so around 560K bytes of extra allocation.

That makes sense.

> I'd like to know how this compares to the total amount of code. What was the total code size for the 603,240 methods?

The total code size is 51,377,607 bytes, or around 51 MB.
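Andy's back-of-the-envelope estimate can be reproduced from the numbers quoted in this thread (a sketch; the "half would have aligned anyway" assumption is his, the variable names are mine):

```python
# Figures quoted in this thread (PMI run over the .NET libraries).
methods_aligned_32 = 70_573
total_code_bytes = 51_377_607  # ~51 MB

# Assume half of the 32-byte-aligned methods would have landed on a
# 32-byte boundary anyway, and each of the rest wastes ~16 bytes of padding.
extra_padding = (methods_aligned_32 // 2) * 16
print(extra_padding)  # 564576 bytes, i.e. around 560 KB of extra allocation

# Fraction of the code segment that is now padding, in percent.
print(round(100 * extra_padding / total_code_bytes, 2))  # 1.1
```

This is consistent with the "perhaps as much as 1% of the code space is now padding" estimate below.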

@AndyAyersMS (Member)

So perhaps as much as 1% of the code space is now padding. I suppose it's possible the code heap can fit some smaller methods into the larger padding voids.

Would be nice at some point to actually measure the heap sizes instead of trying to do these back of the envelope estimations.

Let's take this change and keep an eye on the perf lab runs to see if we do indeed see increased stability.

@kunalspathak (Member Author)

> Would be nice at some point to actually measure the heap sizes instead of trying to do these back of the envelope estimations.

Yes, I wanted to get the heap size numbers for this change. I spent some time debugging allocCode and allocCodeRaw, but getting accurate numbers from the appropriate heap (especially around allocCodeRaw and UnlockedAllocMemForCode_NoThrow) needs some extra special-casing, because those methods are also called from the runtime. I deferred that since I could get a rough ballpark the other way, but getting the heap size is on my TODO list.

@kunalspathak kunalspathak merged commit 03a0931 into dotnet:master Oct 7, 2020
@Sergio0694 (Contributor)

This is pretty cool! 😄

@AndyAyersMS Do you think this could be related to the .NET perf regression from #36907?
Since your last hypothesis was about those switch branches in the hot path possibly not being aligned on .NET 5?

@AndyAyersMS (Member)

This work is a prerequisite for aligning branches -- we plan to explore doing that soon.

I'd expect this change might stabilize performance of your code, but would not necessarily optimize it.

@Sergio0694 (Contributor)

Ah, I see, thank you for the clarification! 😊

Was just curious whether this was at least in part related to that, happy to hear branch alignment work is planned soon!

kunalspathak added a commit to kunalspathak/runtime that referenced this pull request Oct 23, 2020
In dotnet#2249, we started aligning methods to a 32-byte boundary for Tier1. However, methods with loops bypass tiering, so this condition was never executed. Fixed the condition to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020