-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ARM64] Performance regression: PerfLabTests.CastingPerf2.CastingPerf.IntObj #41706
Comments
@CarolEidt please help look into this. CC @dotnet/jit-contrib |
@adamsitnik - could you clarify how to run these on one of the Ubuntu arm64 systems? The command you give above appears to be using windows pathnames, and if I try this:
I get a syntax error on line 39 |
Assuming you have the right builds of 3.1 and 5.0 installed and
should work. Add |
Thanks @AndyAyersMS ! |
Also this may be touching on the relatively new cast caching, so cc @VSadov. |
If this is a simple unbox, it should be jit-inlined. I will take a look. *enums match with underlying types as well and thus have some special handling, but that still does not need to use cache. |
please excuse me for that, I must have copy-pasted it from previous Windows specifc issue. I've fixed the description
If you append 3 to python3 scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'PerfLabTests.CastingPerf2.CastingPerf.IntObj'
Unfortunately, the BDN disassembler does not support ARM. Some time ago we have switched to use |
Looking at the code generation, the only difference is that in 5.0 several address constants have been CSE'd. The loop goes from 108 bytes down to 64 bytes, but in both cases is small enough that alignment could make a big difference. Unless there's objection I would suggest we close this, and reference #8108 once again. (Perhaps we should add support for a specified loop alignment, even if only for use in benchmarking to validate or invalidate hypotheses such as this.)
|
Did something change in the way |
The benchmark does a trivial unbox of an I think it should not call the helper and do the unbox completely inline. |
It should not be called in simple cases though. It basically should handle cast failures (throw) or rare cases such as unboxing an enum to underlying type. |
I can't find the comment just now, but I recall @sdmaclea saying that ARM64 didn't have the same kind of code alignment penalties that we see on xArch. |
I wouldn't expect significant alignment penalties. Instructions are always 4 bytes and 4 byte aligned by definition. If a loop crosses a cache line or page boundary there might be some perf difference, but I wouldn't expect much. The branch predictor could possible be affected by the hash of the branch PC, but given there are only two branches in the loop I wouldn't expect that to be the issue. The unboxing seems more likely to be the issue. |
I've not been able to figure out how to debug the benchmark as run by the perf harness, but I've extracted the benchmark method, and verified that the helper is never called. Unless somehow different code is generated, there must be something else going on. |
I would guess those branches are not well predicted. So minimizing branch mispredict recovery time would be important. I am wondering if the uArch modality has to do with fetch group alignment... Like you first suggested... |
I haven't looked at the source but looking at the disassembly, it looks to me like a lot of |
We improved the CSE'ing of constants in .NET 5, so the "AFTER" loop has only one large constant. It's not hoisted out of the loop because it is marked as being dependent on the class constructor and therefore not hoistable. This is the address of a class static, so it's unclear to me why it's not hoistable. @briansull @AndyAyersMS - can you enlighten me why such a constant would not be hoistable? |
@TamarChristinaArm - can you shed any light on what might case the above "AFTER" loop to be slower than the "BEFORE" loop? (e.g. loop alignment, cache effects, etc.)? |
I should also note that, when I extracted the benchmark method and added a timer using |
@CarolEidt that to me looks like what you suspected being loop alignment. Most uArch will have alignment requirement for branch targets (for performance not correctness), newer Cortex-A cores generally prefer 32-byte alignments for branch targets, see for instance Neoverse-N1 optimization guide section What we've observed is that for small loops the alignment makes a big difference and while both loops have misaligned targets the second loop size is much smaller so would be more sensitive. |
Thanks so much @TamarChristinaArm - I propose that we either close this or mark it "Future" and take it into consideration when/if we address #8108. Thoughts @adamsitnik ? |
I can try and replicate what I did for #2249 to confirm some of the issues we're seeing now are indeed alignment. In the mean time you might be able to use In the short run we should figure out the proper method entry alignment -- we can't do anything about internal alignments until we fix this. Then we can at least enable the (optional) loop alignment for arm (implement @TamarChristinaArm is there a recommended set of NOP sequences of varying length, or is there some other practice to ensure alignment? |
@AndyAyersMS - it would be interesting to try implementing #2249 for arm64. |
Here's the output from
|
No, we just emit multiple |
It appears the consensus here is the issue is loop alignment, for which we already have linked issues tracking the work, so I'm going to close this. If that's incorrect, then feel free to re-open with a clear note about what unique work/issue this will address. |
On my Windows x64, when I tested this benchmark with my changes in #44370, I did see this benchmark improved.
|
@kunalspathak This particular regression was specific to ARM64 (not x64). Is there any chance you could check the ARM results? |
Ah, that's true. Currently my loop alignment is for xarch, but once I get it for arm, I will verify this. |
After running benchmarks for 3.1 vs 5.0 using "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found out that
PerfLabTests.CastingPerf2.CastingPerf.IntObj
has regressed x2.It looks like these are ARM64 specific regressions, I was not able to reproduce it for ARM (the 32-bit variant).
Repro
cc @kunalspathak
category:cq
theme:needs-triage
skill-level:expert
cost:large
The text was updated successfully, but these errors were encountered: