[Arm64] Planned JIT work in .NET 6 #43629
Labels: arch-arm64, area-CodeGen-coreclr, Bottom Up Work, User Story
Background
In .NET 5, the .NET team made a non-trivial effort to bring Arm64 platform support to parity with X86. For example, we added 384 methods to System.Runtime.Intrinsics.Arm that allow our customers to use Advanced SIMD instructions on Arm64, optimized library code using these intrinsics, and made Arm64-targeted performance improvements in the CodeGen.
In .NET 6 we will continue the effort. In particular, as a part of .NET 6 planning the JIT team identified the following items as our next short-term goals:
Conditional instructions/branch elimination
One example of such a code transformation can be found in LLVM, which transforms cbz/cbnz/tbz/tbnz instructions into a conditional branch (b.cond). For example, you can compare the outputs of the latest clang compiling the C++ snippet with this optimization disabled (-O2 -mllvm -aarch64-enable-cond-br-tune=false) and with the optimization enabled (-O2 -mllvm -aarch64-enable-cond-br-tune=true). With the optimization enabled, the sequence "and w10, w9, w8; cbz w10, .LBB0_2" has been replaced with "tst w9, w8; b.eq .LBB0_2", which frees the w10 register.
The JIT team will research this optimization area and decide which optimizations can be implemented in .NET 6.
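To make the pattern concrete, here is a minimal sketch of the kind of C++ source in which the result of a bitwise AND is needed only for a zero test. It is not the snippet linked from the original text; the names check_overlap, on_overlap and overlap_count are hypothetical, and the exact instructions a given clang version emits for it are not guaranteed.

```cpp
#include <cstdint>

// Hypothetical counter, present only to give the branch an observable effect.
static int overlap_count = 0;

// Hypothetical helper called on the taken path.
void on_overlap() { ++overlap_count; }

// The value of (a & b) is used only to decide the branch. A compiler may
// therefore use a flag-setting test (tst) followed by b.eq instead of
// materializing the AND result in a register and branching with cbz.
void check_overlap(std::uint32_t a, std::uint32_t b) {
    if ((a & b) != 0) {
        on_overlap();
    }
}
```

Compiling a function like this for AArch64 with each of the two flag sets quoted above is one way to look for the difference in the generated code.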
Some related issues:
Presumably, some parts of the analysis can be implemented in platform agnostic way and benefit both Arm64 and X86 platforms.
Next steps:
Hardware Intrinsics on Arm64
Consider optimizing more intrinsics that have move/copy semantics, as is done in 833aaba (#40489). Stretch Goal
Investigate unnecessary vector register moves around helper calls (#33975). As one potential solution, we can implement custom helpers in assembly and guarantee calling conventions that do not alter the upper 64 bits of SIMD registers (see #33975 (comment)). Stretch Goal
JIT Hardware Intrinsics low compiler throughput due to inefficient intrinsic identification algorithm during import phase (#13617). Stretch Goal
Use cmeq, cmge, cmgt (zero) when one of the operands is Vector64/128<T>.Zero (#33972). Stretch Goal
TableVectorLookup and TableVectorExtension intrinsics (multiple register). Moved to 7.0
LoadPairVector64 and LoadPairVector128. Moved to 7.0. [Arm64] AdvSIMD LoadPairVector64 and LoadPairVector128 #45020, superseded by [Arm64] Implement LoadPairVector64 and LoadPairVector128 #52424
MultiplyHigh. Closed by [Arm64] Implement MultiplyHigh #47362
LoadVector64, LoadVector128 and Store with multiple registers (ld1-4, st1-4). Note that implementing these and the multiple register variants of the TableVectorLookup and TableVectorExtension intrinsics would require extensive changes to LSRA. Moved to 7.0
ARMv8.3-CompNum, ARMv8.2-I8MM and ARMv8.2-FP16 ISAs. Stretch Goal
Atomic instructions
Currently, the JIT emits ARMv8.1-LSE atomic instructions in the following cases:
CodeGen::genLockedInstructions(GenTreeOp* treeNode) for the Interlocked.Add and Interlocked.Exchange methods
CodeGen::genCodeForCmpXchg(GenTreeCmpXchg* treeNode) for the Interlocked.CompareExchange method
Interlocked class: "Consider making Interlocked.And and friends into JIT intrinsics" (#32239). As per #32239 (comment), there is potential to generate better code by using Armv8.1. Closed by [RyuJIT] Implement Interlocked.And and Interlocked.Or for arm64-v8.1 #46253 - thank you @EgorBo!
Another potential work item is to support ARMv8.4-LSE atomic instructions in the JIT.
Examples of Arm64 specific JIT backlog issues
In .NET 5 we implemented the stack probing procedure on all platforms except Arm64 (coreclr#26807 and coreclr#27184, both titled "Implement stack probing using helpers"). This solved some of the issues with stack unwinding (#11495 "Cannot unwind stack when stack probing hits the stack limit on Unix") and allowed us to implement #32167 "Display stack trace at stack overflow". In .NET 6 we should close the gap on Arm64 and address #13519 "[Arm64] Implement stack probing using helper".
Moved to 7.0. Blocked by [Arm64] Extend Compiler::lvaFrameAddress() and JIT to allow using SP as base register #47810
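As a rough illustration of the general idea behind stack probing (not of the actual runtime helper), the sketch below lists one address per page that would be touched while the stack pointer moves down, so that pages are reached in order rather than skipped. The function name probe_addresses, the 4 KiB default page size, and the choice to return addresses instead of touching memory are all assumptions made for this example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model: returns the page base addresses a probing loop would
// touch when the stack pointer moves down from old_sp to new_sp. It starts
// one page below the page containing old_sp and ends at the page containing
// new_sp. No memory is accessed; this only shows the address arithmetic.
std::vector<std::uint64_t> probe_addresses(std::uint64_t old_sp,
                                           std::uint64_t new_sp,
                                           std::uint64_t page_size = 4096) {
    std::vector<std::uint64_t> probes;
    if (page_size == 0 || new_sp >= old_sp) {
        return probes;  // nothing to probe: the stack does not grow
    }
    std::uint64_t page = old_sp - (old_sp % page_size);      // page of old_sp
    std::uint64_t last_page = new_sp - (new_sp % page_size); // page of new_sp
    // Both values are multiples of page_size, so the loop ends exactly at
    // last_page and cannot underflow.
    while (page > last_page) {
        page -= page_size;
        probes.push_back(page);
    }
    return probes;
}
```

A frame that stays within the current page yields an empty list, which corresponds to the case where no probing is needed.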
We saw a huge improvement in .NET 5 from Ben's work in #32538 "Use xmm for stack prolog zeroing rather than rep stos".
We should consider implementing a similar idea and use SIMD registers for prolog zeroing on Arm64. We can employ the fact that the AdvSimd st1 instruction can store up to four 128-bit SIMD registers to memory, effectively allowing us to write up to 64 bytes of zeroes to memory in one instruction. This work is tracked in [Arm64] Use stp and str (SIMD) for stack prolog zeroing #43789. Closed by [Arm64] Use SIMD register to zero init frame #46609
Use stp (SIMD) in genCodeForInitBlkUnroll and genCodeForCpBlkUnroll (#48934). Stretch Goal
Stretch goal
Peephole optimization opportunities:
As a follow up to [RyuJIT] Implement Interlocked.And and Interlocked.Or for arm64-v8.1 #46253, we should measure the performance impact of the 8.4 interlocked instructions going forward in .NET 7 and see if we can benefit from using those in .NET.
Note: For all the above peephole work items, there is a prerequisite work item needed to enable the codegen to update a previously emitted instruction. There is no separate tracking issue for it, so the first of these optimizations that we implement will have to do that infrastructure work first.
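The prerequisite in the note above, being able to revisit an instruction after it has been emitted, can be sketched with a toy emitter. This is not RyuJIT code: Instr, ToyEmitter, emitCbz and the regDeadAfter flag are hypothetical names, and the rewrite shown is the and/cbz to tst/b.eq example from the Conditional instructions section. The sketch assumes the caller knows the register is dead after the branch and that the condition flags are free to clobber; it ignores labels between the two instructions.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy instruction record; not a RyuJIT data structure.
struct Instr {
    std::string op;                 // mnemonic, e.g. "and", "tst", "cbz"
    std::vector<std::string> args;  // operands in assembly order
};

// Hypothetical toy emitter that keeps emitted instructions in a buffer so
// the most recent one can still be inspected and rewritten.
class ToyEmitter {
public:
    void emit(const std::string& op, std::vector<std::string> args) {
        code_.push_back(Instr{op, std::move(args)});
    }

    // Emits "cbz reg, label". If the previous instruction is
    // "and reg, a, b" and the caller states that reg is dead after the
    // branch, the pair is rewritten to "tst a, b; b.eq label", so reg is
    // never written. Assumes the flags are not live at this point.
    void emitCbz(const std::string& reg, const std::string& label,
                 bool regDeadAfter) {
        if (regDeadAfter && !code_.empty()) {
            Instr& last = code_.back();
            if (last.op == "and" && last.args.size() == 3 &&
                last.args[0] == reg) {
                // Update the previously emitted instruction in place.
                last = Instr{"tst", {last.args[1], last.args[2]}};
                emit("b.eq", {label});
                return;
            }
        }
        emit("cbz", {reg, label});
    }

    const std::vector<Instr>& code() const { return code_; }

private:
    std::vector<Instr> code_;
};
```

The point of the sketch is the in-place update of code_.back(): without access to the last emitted instruction, the emitter could only append cbz and the and result would still occupy a register.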
@dotnet/jit-contrib @TamarChristinaArm @tannergooding
category:planning
theme:planning
skill-level:expert
cost:large