[ARM64] Performance regression: PerfLabTests.CastingPerf2.CastingPerf.IntObj #41706

adamsitnik · 2020-09-01T21:32:32Z

After running benchmarks for 3.1 vs 5.0 on the "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I found that PerfLabTests.CastingPerf2.CastingPerf.IntObj has regressed by roughly 2x.

This looks like an ARM64-specific regression; I was not able to reproduce it on ARM (the 32-bit variant).

Repro

git clone https://github.com/dotnet/performance.git
py ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 16.04
Unknown processor
  [Host]     : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-PVNQZA : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-PXIHWO : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT

Method	Toolchain	Mean	Ratio
IntObj	netcoreapp3.1	466.4 us	1.00
IntObj	netcoreapp5.0	1,001.1 us	2.15

cc @kunalspathak

category:cq
theme:needs-triage
skill-level:expert
cost:large

JulieLeeMSFT · 2020-09-01T23:24:52Z

@CarolEidt please help look into this.

CC @dotnet/jit-contrib

CarolEidt · 2020-09-02T00:19:58Z

@adamsitnik - could you clarify how to run these on one of the Ubuntu arm64 systems? The command you give above appears to be using Windows pathnames, and if I try this:

python scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'

I get a syntax error on line 39

AndyAyersMS · 2020-09-02T00:31:47Z

Assuming you have the right builds of 3.1 and 5.0 installed, dotnet is on your path, and you have cloned the perf repo (to, say, ~/repos/performance), then

cd ~/repos/performance/src/benchmarks/micro
dotnet run -c Release -f net5.0 -- -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'

should work. Add -d to get the BDN produced disassembly.

CarolEidt · 2020-09-02T00:32:21Z

Thanks @AndyAyersMS !

AndyAyersMS · 2020-09-02T00:33:03Z

Also this may be touching on the relatively new cast caching, so cc @VSadov.

VSadov · 2020-09-02T03:18:24Z

If this is a simple unbox, it should be jit-inlined.
Even if not inlined, it should not hit the cast cache, since unbox requires an exact type match*. It could, however, be calling a managed helper now, and there is a small penalty due to tiering/R2R indirection. That is not a lot, but it may be noticeable on super fast casts.

I will take a look.

*enums match with underlying types as well and thus have some special handling, but that still does not need to use cache.

adamsitnik · 2020-09-02T08:42:39Z

> could you clarify how to run these on one of the Ubuntu arm64 systems? The command you give above appears to be using windows pathname

Please excuse me for that; I must have copy-pasted it from a previous Windows-specific issue. I've fixed the description.

> I get a syntax error on line 39

If you append 3 to python it should use python 3.x and work:

python3 scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'

> Add -d to get the BDN produced disassembly.

Unfortunately, the BDN disassembler does not support ARM. Some time ago we switched to the Iced library, which does not support ARM yet (https://github.com/0xd4d/iced/issues/79, https://github.com/0xd4d/iced/issues/80).

CarolEidt · 2020-09-02T23:15:24Z

Looking at the code generation, the only difference is that in 5.0 several address constants have been CSE'd. The loop goes from 108 bytes down to 64 bytes, but in both cases is small enough that alignment could make a big difference. Unless there's objection I would suggest we close this, and reference #8108 once again. (Perhaps we should add support for a specified loop alignment, even if only for use in benchmarking to validate or invalidate hypotheses such as this.)

BEFORE:

G_M590_IG03:        ; offs=000028H, size=003CH, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz

IN0008: 000028      movz    x1, #0xd1ffab1e
IN0009: 00002C      movk    x1, #0xd1ffab1e LSL #16
IN000a: 000030      movk    x1, #0xd1ffab1e LSL #32
IN000b: 000034      ldr     x20, [x1]
IN000c: 000038      ldr     x1, [x20]
IN000d: 00003C      movz    x0, #0xd1ffab1e
IN000e: 000040      movk    x0, #0xd1ffab1e LSL #16
IN000f: 000044      movk    x0, #0xd1ffab1e LSL #32
IN0010: 000048      cmp     x1, x0
IN0011: 00004C      beq     G_M590_IG04
IN0012: 000050      mov     x1, x20
IN0013: 000054      movz    x0, #0xd1ffab1e
IN0014: 000058      movk    x0, #0xd1ffab1e LSL #16
IN0015: 00005C      movk    x0, #0xd1ffab1e LSL #32
IN0016: 000060      bl      CORINFO_HELP_UNBOX

G_M590_IG04:        ; offs=000064H, size=0030H, gcrefRegs=100000 {x20}, byrefRegs=0000 {}, byref, isz

IN0017: 000064      ldr     w0, [x20,#8]
IN0018: 000068      movz    x1, #0xd1ffab1e
IN0019: 00006C      movk    x1, #0xd1ffab1e LSL #16
IN001a: 000070      movk    x1, #0xd1ffab1e LSL #32
IN001b: 000074      str     w0, [x1]
IN001c: 000078      add     w19, w19, #1
IN001d: 00007C      movz    x0, #0xd1ffab1e
IN001e: 000080      movk    x0, #0xd1ffab1e LSL #16
IN001f: 000084      movk    x0, #0xd1ffab1e LSL #32
IN0020: 000088      ldr     w0, [x0]
IN0021: 00008C      cmp     w19, w0
IN0022: 000090      blt     G_M590_IG03

AFTER:

G_M50398_IG03:        ; offs=000038H, size=001CH, bbWeight=4    PerfScore 36.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz

IN000b: 000038      movz    x1, #0xd1ffab1e
IN000c: 00003C      movk    x1, #0xd1ffab1e LSL #16
IN000d: 000040      movk    x1, #0xd1ffab1e LSL #32
IN000e: 000044      ldr     x22, [x1]
IN000f: 000048      ldr     x1, [x22]
IN0010: 00004C      cmp     x1, x21
IN0011: 000050      beq     G_M50398_IG05

G_M50398_IG04:        ; offs=000054H, size=000CH, bbWeight=1    PerfScore 2.00, gcrefRegs=400000 {x22}, byrefRegs=0000 {}, byref

IN0012: 000054      mov     x1, x22
IN0013: 000058      mov     x0, x21
IN0014: 00005C      bl      CORINFO_HELP_UNBOX

G_M50398_IG05:        ; offs=000060H, size=0018H, bbWeight=4    PerfScore 36.00, gcrefRegs=400000 {x22}, byrefRegs=0000 {}, byref, isz

IN0015: 000060      ldr     w0, [x22,#8]
IN0016: 000064      str     w0, [x20,#8]
IN0017: 000068      add     w19, w19, #1
IN0018: 00006C      ldr     w0, [x20]
IN0019: 000070      cmp     w19, w0
IN001a: 000074      blt     G_M50398_IG03
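As a cross-check of the loop sizes quoted above, here is a minimal Python sketch that sums the instruction-group sizes from the IG header comments in the two listings. The offsets and sizes are copied from the listings; the helper name `loop_extent` is made up for illustration, and the 32-byte granule is only an assumption (a plausible branch-target alignment unit, not a measured property of the test machine). Offsets are relative to the start of the method, whose own alignment is not known here.

```python
# Instruction groups that make up each loop, as (offset, size) pairs copied
# from the "; offs=..., size=..." comments in the listings above.
BEFORE = [(0x28, 0x3C), (0x64, 0x30)]               # 3.1: IG03, IG04
AFTER = [(0x38, 0x1C), (0x54, 0x0C), (0x60, 0x18)]  # 5.0: IG03, IG04, IG05

GRANULE = 32  # ASSUMPTION: alignment granule in bytes, not a measured value


def loop_extent(groups, granule=GRANULE):
    """Return (start, size, misalignment, granules_touched) for a loop."""
    start = groups[0][0]
    end = groups[-1][0] + groups[-1][1]  # one past the last byte of the loop
    size = end - start
    misalignment = start % granule
    granules = (end - 1) // granule - start // granule + 1
    return start, size, misalignment, granules


for name, groups in (("BEFORE", BEFORE), ("AFTER", AFTER)):
    start, size, mis, granules = loop_extent(groups)
    print(f"{name}: start=0x{start:02x} size={size} bytes "
          f"({size // 4} instructions), start % {GRANULE} = {mis}, "
          f"spans {granules} granules")
```

Under these assumptions the sizes come out to 108 and 64 bytes, matching the figures above, and neither loop head falls on a 32-byte boundary relative to the method start. The 64-byte loop would fit in two 32-byte granules if aligned, but at offset 0x38 it touches three.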

AndyAyersMS · 2020-09-02T23:20:39Z

Did something change in the way CORINFO_HELP_UNBOX is implemented? Do you have a differential profile?

VSadov · 2020-09-02T23:23:50Z

The benchmark does a trivial unbox of an (object)1 to int:

https://github.com/dotnet/performance/blob/8aed638c9ee65c034fe0cca4ea2bdc3a68d2a6b5/src/benchmarks/micro/runtime/perflab/CastingPerf2.cs#L186

I think it should not call the helper and do the unbox completely inline.
It could be indeed an effect of loop alignment on a highly sensitive benchmark.

VSadov · 2020-09-02T23:31:12Z

CORINFO_HELP_UNBOX did actually change. The helper has been moved to managed code.

It should not be called in simple cases though. It basically should handle cast failures (throw) or rare cases such as unboxing an enum to underlying type.

AndyAyersMS · 2020-09-02T23:31:27Z

I can't find the comment just now, but I recall @sdmaclea saying that ARM64 didn't have the same kind of code alignment penalties that we see on xArch.

sdmaclea · 2020-09-02T23:48:33Z

I wouldn't expect significant alignment penalties. Instructions are always 4 bytes and 4 byte aligned by definition. If a loop crosses a cache line or page boundary there might be some perf difference, but I wouldn't expect much.

The branch predictor could possibly be affected by the hash of the branch PC, but given there are only two branches in the loop I wouldn't expect that to be the issue.

The unboxing seems more likely to be the issue.

CarolEidt · 2020-09-03T17:00:57Z

I've not been able to figure out how to debug the benchmark as run by the perf harness, but I've extracted the benchmark method, and verified that the helper is never called. Unless somehow different code is generated, there must be something else going on.

CarolEidt · 2020-09-03T18:36:48Z

Based on the results here it looks like there is some modality that seems likely to be microarchitectural. I note that the oscillation seemed to increase around the time of #39096 which enabled the CSE of these large constants, though it doesn't seem to coincide precisely.

sdmaclea · 2020-09-03T18:38:28Z

I would guess those branches are not well predicted. So minimizing branch mispredict recovery time would be important.

I am wondering if the uArch modality has to do with fetch group alignment... Like you first suggested...

sdmaclea · 2020-09-03T18:41:48Z

I haven't looked at the source, but looking at the disassembly, it looks to me like a lot of G_M50398_IG03 is const and could be hoisted out of the loop. Especially if the register allocator could free another preserved register for the dst of IN000f: 000048 ldr x1, [x22].

CarolEidt · 2020-09-03T23:04:21Z

> it looks to me like a lot of G_M50398_IG03 is const and could be hoisted out of the loop.

We improved the CSE'ing of constants in .NET 5, so the "AFTER" loop has only one large constant. It's not hoisted out of the loop because it is marked as being dependent on the class constructor and therefore not hoistable. This is the address of a class static, so it's unclear to me why it's not hoistable. @briansull @AndyAyersMS - can you enlighten me why such a constant would not be hoistable?

CarolEidt · 2020-09-03T23:05:41Z

@TamarChristinaArm - can you shed any light on what might cause the above "AFTER" loop to be slower than the "BEFORE" loop (e.g. loop alignment, cache effects, etc.)?

CarolEidt · 2020-09-04T00:06:41Z

I should also note that when I extracted the benchmark method and added a timer using Stopwatch, the performance was basically the same between 3.1 and 5.0.

TamarChristinaArm · 2020-09-04T11:06:59Z

> @TamarChristinaArm - can you shed any light on what might cause the above "AFTER" loop to be slower than the "BEFORE" loop (e.g. loop alignment, cache effects, etc.)?

@CarolEidt that looks to me like what you suspected: loop alignment. Most uArchs have alignment requirements for branch targets (for performance, not correctness), and newer Cortex-A cores generally prefer 32-byte alignment for branch targets; see for instance the Neoverse-N1 optimization guide, section 4.8 "Branch instruction alignment", for some of the requirements. I believe your CIs run XGene? In GCC these are the alignment requirements we have for it.

What we've observed is that for small loops the alignment makes a big difference, and while both loops have misaligned targets, the second loop is much smaller and so would be more sensitive.

CarolEidt · 2020-09-04T14:11:47Z

Thanks so much @TamarChristinaArm - I propose that we either close this or mark it "Future" and take it into consideration when/if we address #8108.

Thoughts @adamsitnik ?

AndyAyersMS · 2020-09-04T16:49:17Z

I can try to replicate what I did for #2249 to confirm that some of the issues we're seeing now are indeed alignment. In the meantime you might be able to use perf stat to confirm that what you're seeing are IPC issues and not additional instructions. See #41741 (comment) for one example of this.

In the short run we should figure out the proper method entry alignment -- we can't do anything about internal alignments until we fix this. Then we can at least enable the (optional) loop alignment for arm (implement emitLoopAlign and any supporting bits). Perhaps extending #2249 to arm64 is the simplest thing?

@TamarChristinaArm is there a recommended set of NOP sequences of varying length, or is there some other practice to ensure alignment?

CarolEidt · 2020-09-04T20:30:44Z

@AndyAyersMS - it would be interesting to try implementing #2249 for arm64.
We've marked this 6.0.0, though, as it seems that modifying alignment would require significant perf analysis to ensure that we understand the impact.

CarolEidt · 2020-09-04T21:06:43Z

Here's the output from perf stat -e "branch-misses,cache-misses,cpu-cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend" (the first one is 5.0 and the second is 3.1):

Performance counter stats for '/home/robox/cteidt/performance/tools/dotnet/arm64/dotnet run --project /home/robox/cteidt/performance/src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework netcoreapp5.0 --no-restore --no-build -- --filter PerfLabTests.CastingPerf2.CastingPerf.IntObj --packages /home/robox/cteidt/performance/artifacts/packages --runtimes netcoreapp5.0 --cli /home/robox/cteidt/performance/tools/dotnet/arm64/dotnet':

       927,236,914      branch-misses
       435,682,900      cache-misses
    90,726,403,850      cpu-cycles
    59,721,463,264      instructions              #    0.66  insn per cycle
                                                  #    0.67  stalled cycles per insn
    40,144,534,076      stalled-cycles-frontend   #   44.25% frontend cycles idle
    26,028,533,828      stalled-cycles-backend    #   28.69% backend cycles idle

      36.046207216 seconds time elapsed

Performance counter stats for '/home/robox/cteidt/performance/tools/dotnet/arm64/dotnet run --project /home/robox/cteidt/performance/src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework netcoreapp5.0 --no-restore --no-build -- --filter PerfLabTests.CastingPerf2.CastingPerf.IntObj --packages /home/robox/cteidt/performance/artifacts/packages --runtimes netcoreapp3.1 --cli /home/robox/cteidt/performance/tools/dotnet/arm64/dotnet':

       746,009,625      branch-misses
       347,766,489      cache-misses
    76,174,119,876      cpu-cycles
    84,227,660,637      instructions              #    1.11  insn per cycle
                                                  #    0.38  stalled cycles per insn
    32,418,821,468      stalled-cycles-frontend   #   42.56% frontend cycles idle
    14,048,337,909      stalled-cycles-backend    #   18.44% backend cycles idle
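To make the comparison between the two runs easier to read, here is a small Python sketch that recomputes instructions per cycle from the counters above and prints the 5.0/3.1 ratios. The numbers are copied from the perf stat output; the dictionary layout and the helper name `ipc` are just for illustration. Note that these counters cover the whole `dotnet run` process (harness, JIT, warmup), not only the benchmark loop, so this is a rough indicator rather than a loop-level measurement.

```python
# Counter values copied from the two perf stat runs above.
RUNS = {
    "5.0": {
        "branch-misses": 927_236_914,
        "cache-misses": 435_682_900,
        "cpu-cycles": 90_726_403_850,
        "instructions": 59_721_463_264,
    },
    "3.1": {
        "branch-misses": 746_009_625,
        "cache-misses": 347_766_489,
        "cpu-cycles": 76_174_119_876,
        "instructions": 84_227_660_637,
    },
}


def ipc(counters):
    """Instructions retired per CPU cycle."""
    return counters["instructions"] / counters["cpu-cycles"]


for name, counters in RUNS.items():
    print(f"{name}: IPC = {ipc(counters):.2f}")

# Ratio of each counter, 5.0 relative to 3.1 (above 1.0 means 5.0 is higher).
for key in RUNS["5.0"]:
    ratio = RUNS["5.0"][key] / RUNS["3.1"][key]
    print(f"{key}: 5.0 / 3.1 = {ratio:.2f}")
```

The recomputed IPC values agree with what perf reports (0.66 and 1.11). The 5.0 run retires roughly 29% fewer instructions yet spends roughly 19% more cycles, which is consistent with an IPC problem rather than extra instructions.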

TamarChristinaArm · 2020-09-07T13:24:31Z

> @TamarChristinaArm is there a recommended set of NOP sequences of varying length, or is there some other practice to ensure alignment?

No, we just emit multiple NOPs. For the larger alignment constraints like 32 we only align it if it means adding less than 16 bytes of padding. For the smaller ones we generally always align it.
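The padding policy described here can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this comment, not GCC's or the JIT's actual code: the function name `nop_padding` and the `pad_limit` parameter are hypothetical, and the example offsets assume the method itself starts on a suitably aligned address, which has not been established for the JIT in this thread.

```python
INSTRUCTION_BYTES = 4  # every A64 instruction, including NOP, is 4 bytes


def nop_padding(offset, alignment, pad_limit=None):
    """Number of NOPs to emit before `offset` to reach `alignment`.

    If `pad_limit` is given, alignment is skipped (0 NOPs) whenever the
    required padding is not strictly less than `pad_limit` bytes.
    `offset` is assumed to be relative to an aligned method start.
    """
    if offset % INSTRUCTION_BYTES:
        raise ValueError("A64 code offsets must be 4-byte aligned")
    pad = (-offset) % alignment  # bytes needed to reach the next boundary
    if pad_limit is not None and pad >= pad_limit:
        return 0  # too much padding: leave the target unaligned
    return pad // INSTRUCTION_BYTES


# Loop heads from the listings earlier in the thread, 32-byte alignment,
# padding only when it costs less than 16 bytes.
print(nop_padding(0x38, 32, pad_limit=16))  # 5.0 loop head
print(nop_padding(0x28, 32, pad_limit=16))  # 3.1 loop head
```

With these inputs the 5.0 loop head at 0x38 would get two NOPs (8 bytes), while the 3.1 loop head at 0x28 would need 24 bytes and so would be left unaligned under the "less than 16 bytes" rule.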

BruceForstall · 2020-11-10T21:32:32Z

It appears the consensus here is the issue is loop alignment, for which we already have linked issues tracking the work, so I'm going to close this. If that's incorrect, then feel free to re-open with a clear note about what unique work/issue this will address.

kunalspathak · 2020-11-10T22:37:10Z

On my Windows x64 machine, when I tested this benchmark with my changes in #44370, I did see this benchmark improve.

Faster	base/diff	Base Median (ns)	Diff Median (ns)	Modality
PerfLabTests.CastingPerf.IntObj	1.04	368774.36	355509.29

adamsitnik · 2020-11-12T09:26:20Z

> On my Windows x64

@kunalspathak This particular regression was specific to ARM64 (not x64). Is there any chance you could check the ARM results?

kunalspathak · 2020-11-12T19:59:32Z

> > On my Windows x64
>
> @kunalspathak This particular regression was specific to ARM64 (not x64). Is there any chance you could check the ARM results?

Ah, that's true. Currently my loop alignment is for xarch, but once I get it for arm, I will verify this.

adamsitnik added arch-arm64 tenet-performance Performance related issue area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Sep 1, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Sep 1, 2020

JulieLeeMSFT assigned CarolEidt Sep 1, 2020

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 1, 2020

JulieLeeMSFT added this to the 5.0.0 milestone Sep 1, 2020

adamsitnik mentioned this issue Sep 4, 2020

.NET 5.0 Microbenchmarks Performance Study Report #41871

Closed

21 tasks

JulieLeeMSFT modified the milestones: 5.0.0, 6.0.0 Sep 4, 2020

adamsitnik mentioned this issue Sep 9, 2020

Ensure building dotnet/runtime works on Windows ARM64 #42008

Closed

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 10, 2020

BruceForstall closed this as completed Nov 10, 2020

ghost locked as resolved and limited conversation to collaborators Dec 12, 2020

JulieLeeMSFT added this to .NET Core CodeGen Jun 5, 2024

JulieLeeMSFT moved this to Done in .NET Core CodeGen Jun 5, 2024
