
Enforce code alignment #756

Closed · adamsitnik opened this issue May 15, 2018 · 18 comments

Comments

@adamsitnik (Member)

Repro: https://github.com/Mike-EEE/DotNetCore.CodeAlignment
Discussion: https://github.com/dotnet/coreclr/issues/17932

AFAIK this is not possible as of today. At least I don't know yet how to enforce it.

@adamsitnik adamsitnik added this to the v0.11.x milestone May 15, 2018
@mattwarren (Contributor)

I've been following the same issue.

I was thinking that instead of enforcing code alignment, maybe it would be better for BDN to report on it, as part of the analysis. For example, "the code for this method is NOT aligned, which may cause performance issues".

@adamsitnik (Member, Author)

We could report it today; I know that some of our users use DisassemblyDiagnoser to find out the start address of a given method.

@mikedn commented May 15, 2018

At best you can report method alignment but that's not very relevant as what matters is loop head alignment (and sometimes the more general branch target alignment).

@svick (Contributor) commented May 15, 2018

Would running the benchmark several times, each time in a new process, help? That way, the final result should roughly represent the "average" alignment, at least assuming the CLR is sufficiently random about where it places code.

@mikedn commented May 15, 2018

> at least assuming the CLR is sufficiently random about where it places code.

If only it were random 😁. Experiments show that it can be pretty stubborn about placing code: it's all determined by the order in which methods are compiled. You'd need to introduce some randomness into this, possibly by having a particular BDN method that is called or not depending on a random value.
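A minimal sketch of that randomness idea (all names invented for illustration; this is not an existing BDN feature): jitting a do-nothing padding method before the benchmark, or skipping it, based on a random value, shifts where everything compiled afterwards is placed.

```csharp
using System;
using System.Runtime.CompilerServices;

static class LayoutShuffler
{
    static readonly Random Rng = new Random();

    // A do-nothing method whose only purpose is to occupy code memory
    // if and when it gets jitted.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void PaddingMethod() { }

    // Call this before jitting the benchmark: compiling PaddingMethod
    // (or not) perturbs the placement of every method jitted after it.
    public static void MaybeShiftLayout()
    {
        if (Rng.Next(2) == 0)
            PaddingMethod();
    }
}
```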

@Mike-E-angelo

I am all for reporting! That would be amazing. At least something to show awareness so that someone like me doesn't go off and spend eternity chasing perceived ghosts. 😆

If we can somehow do better than that, then even better. 👍 But reporting for me would work very well in the interim in that I would know there is some funny business going on. I could then test on the other machines that I have to sort of get a better picture.

To provide more context in my case: I am using baselines to compare results of similar methods. When this issue doesn't impact results, I see differences of 1-2%. When it does, that jumps to 10-20% (!).

If I was able to see a significant difference, along with a warning of some sort, then I would know to not only curb my enthusiasm (😄) but to also try other machines to see what they say as well.

So the Scaled column and getting an accurate sense of relative performance comparison is of primary importance to me here, rather than the Mean column for absolute comparison (although that is valuable, too, of course).

@adamsitnik (Member, Author)

> At best you can report method alignment but that's not very relevant as what matters is loop head alignment (and sometimes the more general branch target alignment).

I tried to run some benchmarks with the following config (you can pass env vars with BDN):

```csharp
class Program
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Program).GetTypeInfo().Assembly).Run(args,
            DefaultConfig.Instance
                .With(Job.Default
                    .With(CsProjCoreToolchain.NetCoreApp21)
                    .With(new[] { new EnvironmentVariable("COMPlus_JitAlignLoops", "1") })
                    .WithId("1"))
                .With(Job.Default
                    .With(CsProjCoreToolchain.NetCoreApp21)
                    .With(new[] { new EnvironmentVariable("COMPlus_JitAlignLoops", "0") })
                    .WithId("0")));
    }
}

[DisassemblyDiagnoser]
public class Jit_LoopUnrolling
{
    [Benchmark]
    public int Sum()
    {
        int sum = 0;
        for (int i = 0; i < 1024; i++)
            sum += i;
        return sum;
    }
}
```

I saw the difference in the disassembly:

```
00007fff`71aecaf0 BenchmarkDotNet.Samples.JIT.Jit_LoopUnrolling.Sum()
00007fff`71aecaf0 33c0            xor     eax,eax
00007fff`71aecaf2 33d2            xor     edx,edx
00007fff`71aecaf4 03c2            add     eax,edx
00007fff`71aecaf6 ffc2            inc     edx
00007fff`71aecaf8 81fa00040000    cmp     edx,400h
00007fff`71aecafe 7cf4            jl      00007fff`71aecaf4
00007fff`71aecb00 c3              ret

00007fff`71b0cc50 BenchmarkDotNet.Samples.JIT.Jit_LoopUnrolling.Sum()
00007fff`71b0cc50 33c0            xor     eax,eax
00007fff`71b0cc52 33d2            xor     edx,edx
00007fff`71b0cc54 0f1f4000        nop     dword ptr [rax]
00007fff`71b0cc58 0f1f840000000000 nop     dword ptr [rax+rax]
00007fff`71b0cc60 03c2            add     eax,edx
00007fff`71b0cc62 ffc2            inc     edx
00007fff`71b0cc64 81fa00040000    cmp     edx,400h
00007fff`71b0cc6a 7cf4            jl      00007fff`71b0cc60
00007fff`71b0cc6c c3              ret
```

But the difference was very small (~4 ns):

| Method | Job | EnvironmentVariables | Mean | Error | StdDev |
|--------|-----|--------------------------|---------:|---------:|---------:|
| Sum    | 0   | COMPlus_JitAlignLoops=0  | 281.6 ns | 5.338 ns | 4.993 ns |
| Sum    | 1   | COMPlus_JitAlignLoops=1  | 285.3 ns | 5.484 ns | 5.632 ns |

@adamsitnik (Member, Author)

> Would running the benchmark several times, each time in a new process, help?

It's possible today with job.WithLaunchCount(nrOfProcesses).
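For example, a sketch of such a config (assuming the 0.11-era BenchmarkDotNet API, reusing the Jit_LoopUnrolling class from the sample above):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

public class MultiLaunchProgram
{
    public static void Main() =>
        // Run the benchmark in 5 separate processes; each launch gets
        // its own (potentially different) code layout.
        BenchmarkRunner.Run<Jit_LoopUnrolling>(
            DefaultConfig.Instance.With(
                Job.Default.WithLaunchCount(5)));
}
```

This is a config fragment; the reported Mean then aggregates iterations across all five layouts.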

> Experiments show that it can be pretty stubborn about placing code - it's all determined by the order methods are being compiled.

It's an advantage for us, because we can get very similar results for the same code.

> You'd need to introduce some randomness in this, possibly by having a particular BDN method that is called or not depending on a random value.

This would definitely be possible; however, I am not sure people would be willing to pay the price (waiting much longer for the results).

@adamsitnik (Member, Author)

@AndreyAkinshin what do you think? Maybe we should expose a mode for running the benchmarks twice: once with COMPlus_JitAlignLoops=1 and once with COMPlus_JitAlignLoops=0?

Btw, the Stabilizer project mentioned by @AndyAyersMS went even one step further:

> Programs built with Stabilizer run with randomly-placed functions, stack frames, and heap objects. Functions and stack frames are moved repeatedly during execution. A random memory layout eliminates the effect of layout on performance, and repeated randomization leads to normally-distributed execution times.

@Mike-E-angelo

> I am not sure if people would be willing to pay the price (waiting much longer for the results).

I for one would certainly not mind waiting if it is a setting/attribute that we opt-in to enable. Along such lines...

> Maybe we should expose a mode for running the benchmarks twice: once for COMPlus_JitAlignLoops=1, once COMPlus_JitAlignLoops=0

THIS! So, I think this setting should:

(Btw totally open to feedback here, and writing off the top of my head):

  • Have the ability to opt-in to enable.
  • Have an Automatic or Detect mode as the default value, where BenchmarkDotNet detects whether there is an alignment issue; if so, it performs two different runs as suggested. Otherwise, it behaves like an Always mode.
  • Have a way to select the output to render: Fastest chooses the faster of the two results, Slowest chooses the slower, Average averages them, and Both shows both results. The trick is that Both obviously shows two rows, vs. one row for everything else.

I have spent the past hour on this, and it looks like we have caught our gigantic ghost of a whale. Using the configuration that @adamsitnik provided and dropping the in-process mode as @mikedn suggested, I was able to consistently tease apart our two modes using the code from the repro repo:

| Method | EnvironmentVariables | Mean | Error | StdDev |
|--------|--------------------------|---------:|----------:|----------:|
| First  | COMPlus_JitAlignLoops=0  | 21.83 us | 13.686 us | 0.7733 us |
| First  | COMPlus_JitAlignLoops=1  | 19.19 us | 4.377 us  | 0.2473 us |

| Method | EnvironmentVariables | Mean | Error | StdDev |
|--------|--------------------------|---------:|---------:|----------:|
| Second | COMPlus_JitAlignLoops=0  | 19.66 us | 1.356 us | 0.0766 us |
| Second | COMPlus_JitAlignLoops=1  | 21.88 us | 2.699 us | 0.1525 us |

| Method | EnvironmentVariables | Mean | Error | StdDev | Scaled | ScaledSD |
|--------|--------------------------|---------:|---------:|----------:|-------:|---------:|
| First  | COMPlus_JitAlignLoops=0  | 21.81 us | 3.170 us | 0.1791 us | 1.00   | 0.00     |
| Second | COMPlus_JitAlignLoops=0  | 19.94 us | 5.967 us | 0.3372 us | 0.91   | 0.01     |
| First  | COMPlus_JitAlignLoops=1  | 19.31 us | 5.633 us | 0.3183 us | 1.00   | 0.00     |
| Second | COMPlus_JitAlignLoops=1  | 21.20 us | 1.586 us | 0.0896 us | 1.10   | 0.02     |

SO PRETTY! ❤️ Notice the Scaled column, as I was discussing earlier. There is a huge difference here between the results, considering.

In any case, this is a much better alternative in the interim. And here I was about to throw in the towel! 🤣

https://youtu.be/qorFS7X1i4Q?t=2m21s

(Feeling very much like this due to having seen things that aren't really there. 😉)

@Mike-E-angelo

(BTW, I realize that I just went a little crazy with that feature request, but I wanted to underscore here that I am finally unblocked from this issue using the configuration above -- immediately. As far as I am concerned, anything else is an added bonus here. Thanks again to everyone for your help. 👍)

@Mike-E-angelo

FWIW I was also able to verify this behavior is consistent with my alternate repro suite which features additional cases. All four of my available machines were able to produce the two different values by way of COMPlus_JitAlignLoops per above.

(Guess I should have done this before I started dancing atop the heads of people. 😆)

@AndyAyersMS (Member)

For occasional one-off experiments, COMPlus_JitAlignLoops can be helpful in trying to confirm or refute that code alignment is the cause of unexpected performance fluctuations.

But I would not advocate using COMPlus_JitAlignLoops as part of a regular performance program, or as any kind of default setting, just yet: it's something we never test, it may not always work as expected, it will be ignored on some architectures, it has some inherent limitations (we don't constrain method entry alignment and don't recognize all loops), it does not use optimally encoded NOP sequences, it won't fix issues in prejitted code, and it is only supported by RyuJIT.

@adamsitnik (Member, Author)

@AndyAyersMS thanks for the warning!

Before we introduce any new JIT-related changes to BDN, I will first ask you to validate the idea.

@AndreyAkinshin (Member)

I completely agree with @AndyAyersMS: it's not a good idea to enable COMPlus_JitAlignLoops by default. However, we have a problem that should be solved. I have seen perf fluctuation problems caused by the "random" alignment algorithm in RyuJIT before, but I didn't know about COMPlus_JitAlignLoops. Some of these problems were solved via UnrollFactor=16 for nanobenchmarks (I believe such cases also suffer from different branch prediction strategies on Haswell+, but that's another topic for discussion). It seems that unrolling doesn't solve all the problems, so we have to come up with something else.

BenchmarkDotNet should help people write reproducible benchmarks. Unfortunately, we can't guarantee it because of some JIT "features." But we can tell people about these problems. I suggest doing the following:

  1. Detect cases where this phenomenon significantly affects results. E.g., if we know that the benchmarks use RyuJIT AND there are two benchmarks with a small difference between them AND (probably some other diagnostics), we can print a warning under the summary table and suggest using COMPlus_JitAlignLoops in order to stabilize results.
    @adamsitnik, what do you think? Does it make sense?
  2. Since JitAlignLoops is very important for performance, it makes sense to include information about it in the summary info (e.g., RyuJIT[AlignLoops]). We can use the same mechanism for other important flags (e.g., COMPLUS_EXPERIMENTAL_TieredCompilation / COMPlus_TieredCompilation).
    @adamsitnik, @AndyAyersMS, what is the most reliable way to check these features? I know that we can manually check environment variables (we probably also have to check registry flags, runtimeconfig.json, MSBuild properties, or something else). However, it would be better to have a reliable API for that.
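The simplest (and least reliable) of those checks is reading the environment variable directly; a sketch of that stopgap, with a made-up helper name, would be:

```csharp
using System;

static class JitFlagProbe
{
    // Simplest possible probe: read the raw environment variable.
    // This misses the other configuration channels mentioned above
    // (registry flags, runtimeconfig.json, MSBuild properties), which
    // is exactly why a proper runtime API would be preferable.
    public static bool IsLoopAlignmentRequested() =>
        Environment.GetEnvironmentVariable("COMPlus_JitAlignLoops") == "1";
}
```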

@adamsitnik (Member, Author)

I am currently working on the .NET Core 2.2 vs 3.0 comparison, and loop alignment has hit me again.

I was thinking about the Stabilizer project that @AndyAyersMS mentioned to me a long time ago, and I was wondering whether re-jitting the benchmark method after every iteration would help.

@AndyAyersMS: would it be possible to implement and expose an API similar to RuntimeHelpers.PrepareMethod that would always re-compile the provided method at Tier 1 and store it at a different address?

@AndyAyersMS (Member)

It might be possible to force rejitting, sure. One simpleminded way to do this would be to attach as a profiler and pretend to modify the IL (note this might have other undesirable impacts). Getting the transitive closure of all methods invoked by the benchmark to rejit would be tricky.

If you run each trial in a new process you should get some code and heap jittering via ASLR. I don't know how well-distributed this would end up being.

We could also force 32-byte alignment of Tier 1 methods (on x86/x64), which should reduce some of the variability. Also, from what I understand, these code alignment issues aren't as common on Arm64, so it would also be interesting (once we have enough data) to look at how variance levels differ across Arm64/x64.

One randomization thing you could do fairly easily is jitter the stack. You could have the benchmark invoker go through a proxy layer that does smallish, different-sized locallocs before invoking the method, and make sure that proxy layer is not an initlocals method (so the localloc region is not zeroed). Or, say, have a bunch of different proxies that have local structs of different sizes.

Stack-address-related perf artifacts are somewhat rare, but they do exist (e.g., 4K aliasing between, say, a stack slot and a static).
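The proxy-layer idea above could be sketched like this (all names invented for illustration; stackalloc stands in for localloc, and real proxies would want many more size variants):

```csharp
using System;
using System.Runtime.CompilerServices;

static class StackJitter
{
    // Each proxy reserves a differently sized stack buffer before
    // invoking the benchmark, so the benchmark's frame starts at a
    // different stack offset depending on which proxy was chosen.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Proxy64(Func<int> benchmark)
    {
        Span<byte> pad = stackalloc byte[64];
        pad[0] = 1; // touch the buffer so it is not optimized away
        return benchmark();
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Proxy256(Func<int> benchmark)
    {
        Span<byte> pad = stackalloc byte[256];
        pad[0] = 1;
        return benchmark();
    }

    // Pick a proxy at random per invocation to jitter stack addresses.
    public static int Invoke(Func<int> benchmark, Random rng) =>
        rng.Next(2) == 0 ? Proxy64(benchmark) : Proxy256(benchmark);
}
```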

@adamsitnik (Member, Author)

We have all agreed that BenchmarkDotNet should not use any custom settings or force code alignment on its own.

Currently, the JIT Team is experimenting with method and loop alignment for .NET 6 to stabilize the benchmark results. This work can be tracked at dotnet/runtime#43227

I am going to close this issue as there are no actionable work items on the BDN side.

Thank you all for the great discussion!
