
Enforce code alignment #756

Closed · adamsitnik opened this issue May 15, 2018 · 18 comments

Comments

@adamsitnik (Member)

Repro: https://github.com/Mike-EEE/DotNetCore.CodeAlignment
Discussion: https://github.com/dotnet/coreclr/issues/17932

AFAIK this is not possible as of today. At least I don't know yet how to enforce it.

@adamsitnik adamsitnik added this to the v0.11.x milestone May 15, 2018
@mattwarren (Contributor)

I've been following the same issue.

I was thinking that instead of enforcing code alignment, maybe it would be better for BDN to report on it, as part of the analysis. For example, "the code for this method is NOT aligned, which may cause performance issues".

@adamsitnik (Member, Author)

We could report it today; I know that some of our users use DisassemblyDiagnoser to find out the start address of a given method.

@mikedn commented May 15, 2018

At best you can report method alignment but that's not very relevant as what matters is loop head alignment (and sometimes the more general branch target alignment).

@svick (Contributor) commented May 15, 2018

Would running the benchmark several times, each time in a new process, help? That way, the final result should roughly represent the "average" alignment, at least assuming the CLR is sufficiently random about where it places code.

@mikedn commented May 15, 2018

> at least assuming the CLR is sufficiently random about where it places code.

If only it were random 😁. Experiments show that it can be pretty stubborn about placing code: it's all determined by the order in which methods are compiled. You'd need to introduce some randomness into this, possibly by having a particular BDN method that is called or not depending on a random value.
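A minimal sketch of that randomness idea (all names invented for illustration; this is not an existing BDN feature): jitting a do-nothing padding method before the benchmark, or skipping it, based on a random value, shifts where everything compiled afterwards is placed.

```csharp
using System;
using System.Runtime.CompilerServices;

static class LayoutShuffler
{
    static readonly Random Rng = new Random();

    // A do-nothing method whose only purpose is to occupy code memory
    // if and when it gets jitted.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void PaddingMethod() { }

    // Call this before jitting the benchmark: compiling PaddingMethod
    // (or not) perturbs the placement of every method jitted after it.
    public static void MaybeShiftLayout()
    {
        if (Rng.Next(2) == 0)
            PaddingMethod();
    }
}
```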

@Mike-E-angelo

I am all for reporting! That would be amazing. At least something to show awareness so that someone like me doesn't go off and spend eternity chasing perceived ghosts. 😆

If we can somehow do better than that, then even better. 👍 But reporting for me would work very well in the interim in that I would know there is some funny business going on. I could then test on the other machines that I have to sort of get a better picture.

To provide more context in my case: I am using baselines to compare results of similar methods. When this issue doesn't impact results, I see differences of 1-2%. When it does, that jumps to 10-20% (!).

If I was able to see a significant difference, along with a warning of some sort, then I would know to not only curb my enthusiasm (😄) but to also try other machines to see what they say as well.

So the Scaled column and getting an accurate sense of relative performance comparison is of primary importance to me here, rather than the Mean column for absolute comparison (although that is valuable, too, of course).

@adamsitnik (Member, Author)

> At best you can report method alignment but that's not very relevant as what matters is loop head alignment (and sometimes the more general branch target alignment).

I tried to run some benchmarks with the following config (you can pass env vars with BDN):

```csharp
class Program
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Program).GetTypeInfo().Assembly).Run(args,
            DefaultConfig.Instance
                .With(Job.Default
                    .With(CsProjCoreToolchain.NetCoreApp21)
                    .With(new[] { new EnvironmentVariable("COMPlus_JitAlignLoops", "1") })
                    .WithId("1"))
                .With(Job.Default
                    .With(CsProjCoreToolchain.NetCoreApp21)
                    .With(new[] { new EnvironmentVariable("COMPlus_JitAlignLoops", "0") })
                    .WithId("0")));
    }
}

[DisassemblyDiagnoser]
public class Jit_LoopUnrolling
{
    [Benchmark]
    public int Sum()
    {
        int sum = 0;
        for (int i = 0; i < 1024; i++)
            sum += i;
        return sum;
    }
}
```

I saw the difference in the disassembly:

```
00007fff`71aecaf0 BenchmarkDotNet.Samples.JIT.Jit_LoopUnrolling.Sum()
00007fff`71aecaf0 33c0            xor     eax,eax
00007fff`71aecaf2 33d2            xor     edx,edx
00007fff`71aecaf4 03c2            add     eax,edx
00007fff`71aecaf6 ffc2            inc     edx
00007fff`71aecaf8 81fa00040000    cmp     edx,400h
00007fff`71aecafe 7cf4            jl      00007fff`71aecaf4
00007fff`71aecb00 c3              ret

00007fff`71b0cc50 BenchmarkDotNet.Samples.JIT.Jit_LoopUnrolling.Sum()
00007fff`71b0cc50 33c0            xor     eax,eax
00007fff`71b0cc52 33d2            xor     edx,edx
00007fff`71b0cc54 0f1f4000        nop     dword ptr [rax]
00007fff`71b0cc58 0f1f840000000000 nop     dword ptr [rax+rax]
00007fff`71b0cc60 03c2            add     eax,edx
00007fff`71b0cc62 ffc2            inc     edx
00007fff`71b0cc64 81fa00040000    cmp     edx,400h
00007fff`71b0cc6a 7cf4            jl      00007fff`71b0cc60
00007fff`71b0cc6c c3              ret
```

But the difference was very small (~4 ns):

| Method | Job | EnvironmentVariables | Mean | Error | StdDev |
|--------|-----|--------------------------|---------:|---------:|---------:|
| Sum    | 0   | COMPlus_JitAlignLoops=0  | 281.6 ns | 5.338 ns | 4.993 ns |
| Sum    | 1   | COMPlus_JitAlignLoops=1  | 285.3 ns | 5.484 ns | 5.632 ns |

@adamsitnik (Member, Author)

> Would running the benchmark several times, each time in a new process, help?

It's possible today with job.WithLaunchCount(nrOfProcesses).
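For example, a sketch of such a config (assuming the 0.11-era BenchmarkDotNet API, reusing the Jit_LoopUnrolling class from the sample above):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

public class MultiLaunchProgram
{
    public static void Main() =>
        // Run the benchmark in 5 separate processes; each launch gets
        // its own (potentially different) code layout.
        BenchmarkRunner.Run<Jit_LoopUnrolling>(
            DefaultConfig.Instance.With(
                Job.Default.WithLaunchCount(5)));
}
```

This is a config fragment; the reported Mean then aggregates iterations across all five layouts.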

> Experiments show that it can be pretty stubborn about placing code - it's all determined by the order methods are being compiled.

It's an advantage for us, because we can get very similar results for the same code.

> You'd need to introduce some randomness in this, possibly by having a particular BDN method that is called or not depending on a random value.

This would definitely be possible; however, I am not sure people would be willing to pay the price (waiting much longer for the results).

@adamsitnik (Member, Author)

@AndreyAkinshin what do you think? Maybe we should expose a mode for running the benchmarks twice: once with COMPlus_JitAlignLoops=1 and once with COMPlus_JitAlignLoops=0?

Btw, the Stabilizer project mentioned by @AndyAyersMS went even one step further:

> Programs built with Stabilizer run with randomly-placed functions, stack frames, and heap objects. Functions and stack frames are moved repeatedly during execution. A random memory layout eliminates the effect of layout on performance, and repeated randomization leads to normally-distributed execution times.

@Mike-E-angelo

> I am not sure if people would be willing to pay the price (waiting much longer for the results).

I for one would certainly not mind waiting if it is a setting/attribute that we opt-in to enable. Along such lines...

> Maybe we should expose a mode for running the benchmarks twice: once for COMPlus_JitAlignLoops=1, once COMPlus_JitAlignLoops=0

THIS! So, I think this setting should:

(Btw totally open to feedback here, and writing off the top of my head):

  • Have the ability to opt-in to enable.
  • Have an Automatic or Detect mode as the default value, where BenchmarkDotNet detects whether there is an alignment issue; if so, it performs two different runs as suggested. Otherwise, it behaves like an Always mode.
  • Have a way to select the output to render: Fastest chooses the faster of the two results, Slowest chooses the slower, Average averages them, and Both shows both results. The trick is that Both obviously shows two rows, vs. one row for everything else.

I have spent the past hour on this, and it looks like we have caught our gigantic ghost of a whale. Using the configuration that @adamsitnik provided and dropping the in-process mode as @mikedn suggested, I was able to consistently tease apart our two modes using the code from the repro repo:

| Method | EnvironmentVariables | Mean | Error | StdDev |
|--------|--------------------------|---------:|----------:|----------:|
| First  | COMPlus_JitAlignLoops=0  | 21.83 us | 13.686 us | 0.7733 us |
| First  | COMPlus_JitAlignLoops=1  | 19.19 us | 4.377 us  | 0.2473 us |

| Method | EnvironmentVariables | Mean | Error | StdDev |
|--------|--------------------------|---------:|---------:|----------:|
| Second | COMPlus_JitAlignLoops=0  | 19.66 us | 1.356 us | 0.0766 us |
| Second | COMPlus_JitAlignLoops=1  | 21.88 us | 2.699 us | 0.1525 us |

| Method | EnvironmentVariables | Mean | Error | StdDev | Scaled | ScaledSD |
|--------|--------------------------|---------:|---------:|----------:|-------:|---------:|
| First  | COMPlus_JitAlignLoops=0  | 21.81 us | 3.170 us | 0.1791 us | 1.00   | 0.00     |
| Second | COMPlus_JitAlignLoops=0  | 19.94 us | 5.967 us | 0.3372 us | 0.91   | 0.01     |
| First  | COMPlus_JitAlignLoops=1  | 19.31 us | 5.633 us | 0.3183 us | 1.00   | 0.00     |
| Second | COMPlus_JitAlignLoops=1  | 21.20 us | 1.586 us | 0.0896 us | 1.10   | 0.02     |

SO PRETTY! ❤️ Notice the Scaled column, as I was discussing earlier. There is a huge difference here between the results, considering.

In any case, this is a much better alternative in the interim. And here I was about to throw in the towel! 🤣

https://youtu.be/qorFS7X1i4Q?t=2m21s

(Feeling very much like this due to having seen things that aren't really there. 😉)

@Mike-E-angelo

(BTW, I realize that I just went a little crazy with that feature request, but I wanted to underscore here that I am finally unblocked from this issue using the configuration above -- immediately. As far as I am concerned, anything else is an added bonus here. Thanks again to everyone for your help. 👍)

@Mike-E-angelo

FWIW I was also able to verify this behavior is consistent with my alternate repro suite which features additional cases. All four of my available machines were able to produce the two different values by way of COMPlus_JitAlignLoops per above.

(Guess I should have done this before I started dancing atop the heads of people. 😆)

@AndyAyersMS (Member)

For occasional one-off experiments, COMPlus_JitAlignLoops can be helpful in trying to confirm or refute that code alignment is the cause of unexpected performance fluctuations.

But I would not advocate using COMPlus_JitAlignLoops as part of a regular performance program, or as any kind of default setting, just yet: it's something we never test, it may not always work as expected, it will be ignored on some architectures, it has some inherent limitations (we don't constrain method entry alignment and don't recognize all loops), it does not use optimally encoded NOP sequences, it won't fix issues in prejitted code, and it is only supported by RyuJIT.

@adamsitnik (Member, Author)

@AndyAyersMS thanks for the warning!

Before we introduce any new JIT-related changes to BDN, I will first ask you to validate the idea.

@AndreyAkinshin (Member)

I completely agree with @AndyAyersMS: it's not a good idea to enable COMPlus_JitAlignLoops by default. However, we have a problem that should be solved. I have seen perf fluctuation problems caused by the "random" alignment algorithm in RyuJIT before, but I didn't know about COMPlus_JitAlignLoops. Some of these problems were solved via UnrollFactor=16 for nanobenchmarks (I believe such cases also suffer from different branch prediction strategies on Haswell+, but that's another topic for discussion). It seems that unrolling doesn't solve all the problems, so we have to come up with something else.

BenchmarkDotNet should help people write reproducible benchmarks. Unfortunately, we can't guarantee it because of some JIT "features." But we can tell people about these problems. I suggest doing the following:

  1. Detect cases where this phenomenon significantly affects results. E.g., if we know that the benchmarks use RyuJIT AND there are two benchmarks with a small difference between them AND (probably some other diagnostics), we can print a warning under the summary table and suggest using COMPlus_JitAlignLoops in order to stabilize results.
    @adamsitnik, what do you think? Does it make sense?
  2. Since JitAlignLoops is very important for performance, it makes sense to include information about it in the summary info (e.g., RyuJIT[AlignLoops]). We can use the same mechanism for other important flags (e.g., COMPLUS_EXPERIMENTAL_TieredCompilation / COMPlus_TieredCompilation).
    @adamsitnik, @AndyAyersMS, what is the most reliable way to check these features? I know that we can manually check environment variables (we probably also have to check registry flags, runtimeconfig.json, MSBuild properties, or something else). However, it would be better to have a reliable API for that.
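The simplest (and least reliable) of those checks is reading the environment variable directly; a sketch of that stopgap, with a made-up helper name, would be:

```csharp
using System;

static class JitFlagProbe
{
    // Simplest possible probe: read the raw environment variable.
    // This misses the other configuration channels mentioned above
    // (registry flags, runtimeconfig.json, MSBuild properties), which
    // is exactly why a proper runtime API would be preferable.
    public static bool IsLoopAlignmentRequested() =>
        Environment.GetEnvironmentVariable("COMPlus_JitAlignLoops") == "1";
}
```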

@adamsitnik (Member, Author)

I am currently working on the .NET Core 2.2 vs 3.0 comparison, and loop alignment has hit me again.

I was thinking about the Stabilizer project that @AndyAyersMS mentioned to me a long time ago, and I was wondering whether re-jitting the benchmark method after every iteration would help.

@AndyAyersMS: would it be possible to implement and expose an API similar to RuntimeHelpers.PrepareMethod that would always re-compile the provided method at Tier 1 and store it at a different address?

@AndyAyersMS (Member)

It might be possible to force rejitting, sure. One simpleminded way to do this would be to attach as a profiler and pretend to modify the IL (note this might have other undesirable impacts). Getting the transitive closure of all methods invoked by the benchmark to rejit would be tricky.

If you run each trial in a new process you should get some code and heap jittering via ASLR. I don't know how well-distributed this would end up being.

We could also force 32-byte alignment of Tier 1 methods (on x86/x64), which should reduce some of the variability. Also, from what I understand, these code alignment issues aren't as common on Arm64, so it would also be interesting (once we have enough data) to look at how variance levels differ across Arm64/x64.

One randomization thing you could do fairly easily is jitter the stack. You could have the benchmark invoker go through a proxy layer that does smallish, different-sized locallocs before invoking the method, and make sure that proxy layer is not an initlocals method (so the localloc region is not zeroed). Or, say, have a bunch of different proxies that have local structs of different sizes.

Stack-address-related perf artifacts are somewhat rare, but they do exist (e.g., 4K aliasing between, say, a stack slot and a static).
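The proxy-layer idea above could be sketched like this (all names invented for illustration; stackalloc stands in for localloc, and real proxies would want many more size variants):

```csharp
using System;
using System.Runtime.CompilerServices;

static class StackJitter
{
    // Each proxy reserves a differently sized stack buffer before
    // invoking the benchmark, so the benchmark's frame starts at a
    // different stack offset depending on which proxy was chosen.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Proxy64(Func<int> benchmark)
    {
        Span<byte> pad = stackalloc byte[64];
        pad[0] = 1; // touch the buffer so it is not optimized away
        return benchmark();
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Proxy256(Func<int> benchmark)
    {
        Span<byte> pad = stackalloc byte[256];
        pad[0] = 1;
        return benchmark();
    }

    // Pick a proxy at random per invocation to jitter stack addresses.
    public static int Invoke(Func<int> benchmark, Random rng) =>
        rng.Next(2) == 0 ? Proxy64(benchmark) : Proxy256(benchmark);
}
```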

@adamsitnik (Member, Author)

We have all agreed that BenchmarkDotNet should not use any custom settings or force code alignment on its own.

Currently, the JIT Team is experimenting with method and loop alignment for .NET 6 to stabilize the benchmark results. This work can be tracked at dotnet/runtime#43227

I am going to close this issue as there are no actionable work items on the BDN side.

Thank you all for the great discussion!
