[Arm64] Planned JIT work in .NET 6 #43629

echesakov · 2020-10-20T02:20:34Z

Background

In .NET 5, the .NET team made a non-trivial effort to bring Arm64 support to parity with X86. For example, we added 384 methods to System.Runtime.Intrinsics.Arm that allow our customers to use Advanced SIMD instructions on Arm64, optimized library code using these intrinsics, and made Arm64-targeted performance improvements in CodeGen.

In .NET 6 we will continue the effort. In particular, as part of .NET 6 planning, the JIT team identified the following items as our next short-term goals:

Conditional instructions/branch elimination

One example of such a code transformation can be found in LLVM, which transforms cbz/cbnz/tbz/tbnz instructions into a conditional branch (b.cond). For example, compare the output of the latest clang when compiling the following C++ snippet

void TransformsIntoCondBr(int& op1, int& op2) {
    if (op1 & op2) {
        op1 = op2;
    } else {
        op2 = op1;
    }
}

with the optimization disabled
-O2 -mllvm -aarch64-enable-cond-br-tune=false

TransformsIntoCondBr(int&, int&):           // @TransformsIntoCondBr(int&, int&)
        ldr     w8, [x0]
        ldr     w9, [x1]
        and     w10, w9, w8
        cbz     w10, .LBB0_2
        str     w9, [x0]
        ret
.LBB0_2:
        str     w8, [x1]
        ret

and with the optimization enabled
-O2 -mllvm -aarch64-enable-cond-br-tune=true

TransformsIntoCondBr(int&, int&):           // @TransformsIntoCondBr(int&, int&)
        ldr     w8, [x0]
        ldr     w9, [x1]
        tst     w9, w8
        b.eq    .LBB0_2
        str     w9, [x0]
        ret
.LBB0_2:
        str     w8, [x1]
        ret

The sequence and w10, w9, w8; cbz w10, .LBB0_2 has been replaced with tst w9, w8; b.eq .LBB0_2, which frees the w10 register.

The JIT team will research this optimization area and decide which optimizations can be implemented in .NET 6.

Some related issues:

RyuJit: avoid conditional jumps using cmov and similar instructions #6749
Branchless code generation for ternaries #32632
RyuJIT: Optimize "X / POW2_CNS" via cmovns #41549
[RyuJIT][arm64] Optimize "x<0" and "x>=0" #43440

Presumably, some parts of the analysis can be implemented in a platform-agnostic way and benefit both Arm64 and X86 platforms.
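As a rough source-level illustration of what the branch-elimination issues above are asking for, the sketch below writes the same selection two ways: with an explicit branch and in a branchless, mask-based form. The function names are hypothetical, and this says nothing about what the JIT or any particular compiler actually emits for either form. It only shows that the two forms compute the same result.

```cpp
#include <cstdint>

// Hypothetical helper: selects between a and b with an explicit branch.
int32_t SelectBranchy(bool cond, int32_t a, int32_t b) {
    if (cond) {
        return a;
    }
    return b;
}

// Hypothetical helper: same result without a source-level branch.
// mask is all ones when cond is true and all zeros otherwise.
int32_t SelectBranchless(bool cond, int32_t a, int32_t b) {
    const uint32_t mask = 0u - static_cast<uint32_t>(cond);
    const uint32_t ua = static_cast<uint32_t>(a);
    const uint32_t ub = static_cast<uint32_t>(b);
    // Converting back to int32_t assumes two's complement wrap-around,
    // which mainstream compilers provide (and C++20 guarantees).
    return static_cast<int32_t>((ua & mask) | (ub & ~mask));
}
```

A code generator that recognizes the first form can, in principle, emit a conditional-select style instruction instead of a jump, which is the kind of transformation the related issues discuss.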

Next steps:

Identify the optimizations and estimate their potential impact
Determine what can be implemented in a platform-agnostic way and do that first
Implement Arm64-specific optimizations

Hardware Intrinsics on Arm64

We need to address the known inefficiencies/suboptimal code generation:

Consider optimizing more intrinsics that have move/copy semantics #40489, as was done in 833aaba (Stretch Goal)
Investigate unnecessary vector register moves around helper calls #33975. As one potential solution, we can implement custom helpers in assembly and guarantee calling conventions that do not alter the upper 64 bits of SIMD registers (see #33975 (comment)) (Stretch Goal)
JIT Hardware Intrinsics low compiler throughput due to inefficient intrinsic identification algorithm during import phase #13617 (Stretch Goal)
Use cmeq, cmge, cmgt (zero) when one of the operands is Vector64/128<T>.Zero #33972 (Stretch Goal)

Implementation of new APIs is also on the table. The following are some instances of the proposed work:

API Proposal: Arm TableVectorLookup and TableVectorExtension intrinsics #1277 (multiple register) (Moved to 7.0)
[Arm64] LoadPairVector64 and LoadPairVector128 #39243 (Moved to 7.0). ~~[Arm64] AdvSIMD LoadPairVector64 and LoadPairVector128 #45020~~ Superseded by [Arm64] Implement LoadPairVector64 and LoadPairVector128 #52424
[Arm64] MultiplyHigh #43106. Closed by [Arm64] Implement MultiplyHigh #47362
LoadVector64, LoadVector128 and Store with multiple registers (ld1-4, st1-4). Note that implementing these and the multiple-register variants of the TableVectorLookup and TableVectorExtension intrinsics would require extensive changes to LSRA. (Moved to 7.0)
We are considering implementing Hardware Intrinsics for the ARMv8.3-CompNum, ARMv8.2-I8MM and ARMv8.2-FP16 ISAs (Stretch Goal)

Atomic instructions

Currently, the JIT emits ARMv8.1-LSE atomic instructions in the following cases:

CodeGen::genLockedInstructions(GenTreeOp* treeNode) for Interlocked.Add and Interlocked.Exchange methods
CodeGen::genCodeForCmpXchg(GenTreeCmpXchg* treeNode) for Interlocked.CompareExchange method

There is a proposal to extend this functionality to other methods in the Interlocked class: Consider making Interlocked.And and friends into JIT intrinsics #32239. As per #32239 (comment), there is potential to generate better code by using Armv8.1 instructions. Closed by [RyuJIT] Implement Interlocked.And and Interlocked.Or for arm64-v8.1 #46253 - thank you @EgorBo!

Another potential work item is to support ARMv8.4-LSE atomic instructions in the JIT.
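For readers less familiar with the operations in this section, the C++ sketch below shows the semantics of an atomic AND written two ways: as a compare-exchange retry loop and as a single read-modify-write call. This is an analogy using std::atomic, not the JIT's implementation, and the function names are hypothetical. Both helpers return the value stored before the operation.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: atomic AND built from a compare-exchange retry loop.
int32_t AndWithCasLoop(std::atomic<int32_t>& location, int32_t value) {
    int32_t observed = location.load();
    // On failure, compare_exchange_weak refreshes `observed` with the
    // current contents, so the loop simply retries with the new value.
    while (!location.compare_exchange_weak(observed, observed & value)) {
    }
    return observed;
}

// Hypothetical helper: the same operation as one read-modify-write call.
int32_t AndWithFetchAnd(std::atomic<int32_t>& location, int32_t value) {
    return location.fetch_and(value);
}
```

The second form gives a code generator the chance to use a single atomic instruction where the target supports one, while the first form spells out the retry loop explicitly.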

Examples of Arm64 specific JIT backlog issues

In .NET 5 we implemented the stack probing procedure on all platforms except Arm64 (Implement stack probing using helpers coreclr#26807 and coreclr#27184). This solved some of the stack unwinding issues (Cannot unwind stack when stack probing hits the stack limit on Unix #11495) and made it possible to implement Display stack trace at stack overflow #32167. In .NET 6 we should close the gap on Arm64 and address [Arm64] Implement stack probing using helper #13519.
Moved to 7.0. Blocked by [Arm64] Extend Compiler::lvaFrameAddress() and JIT to allow using SP as base register #47810
We saw a huge improvement in .NET 5 from Ben's work in Use xmm for stack prolog zeroing rather than rep stos #32538.
We should consider implementing a similar idea and using SIMD registers for prolog zeroing on Arm64. We can exploit the fact that the AdvSimd st1 instruction can store up to four 128-bit SIMD registers to memory, which allows writing up to 64 bytes of zeroes in one instruction. This work is tracked in [Arm64] Use stp and str (SIMD) for stack prolog zeroing #43789. Closed by [Arm64] Use SIMD register to zero init frame #46609
[Arm64] Use stp (SIMD) in genCodeForInitBlkUnroll and genCodeForCpBlkUnroll #48934 (Stretch Goal)
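The prolog-zeroing idea above (up to 64 bytes written per store) can be pictured with a scalar sketch: zero a region in 64-byte chunks, then handle whatever remains. The names below are hypothetical, std::memset merely stands in for a wide SIMD store, and none of this is the JIT's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// 4 registers x 128 bits = 64 bytes per (imagined) wide store.
constexpr std::size_t kChunkBytes = 64;

// Hypothetical helper: zeroes `length` bytes and returns how many
// full 64-byte chunk stores were issued.
std::size_t ZeroInChunks(unsigned char* buffer, std::size_t length) {
    assert(buffer != nullptr || length == 0);
    std::size_t chunks = 0;
    std::size_t offset = 0;
    // Main loop: one wide store per 64-byte chunk.
    while (length - offset >= kChunkBytes) {
        std::memset(buffer + offset, 0, kChunkBytes);
        offset += kChunkBytes;
        ++chunks;
    }
    // Tail: fewer than 64 bytes left, handled with a narrower store.
    if (offset < length) {
        std::memset(buffer + offset, 0, length - offset);
    }
    return chunks;
}
```

For a 150-byte region this issues two full chunk stores and one tail store for the remaining 22 bytes.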

Stretch goal

Peephole optimization opportunities:
- ARM64: Optimize redundant memory loads with mov #35141
- ARM64 Redundant load/stores for methods that operates/returns structs #35071
- ARM64: Optimize pair of "str wzr, [reg]" to "str xzr" #35136
- ARM64: Optimize pair of "str reg, [fp]" to stp #35134
- ARM64: Optimize pair of "str reg, [reg]" to stp #35133
- ARM64: Optimize pair of "ldr reg, [reg]" to ldp #35132
- ARM64: Optimize pair of "ldr reg, [fp]" to ldp #35130
- ARM64: Redundant movs done for zero extend the register #35254
As a follow-up to [RyuJIT] Implement Interlocked.And and Interlocked.Or for arm64-v8.1 #46253, we should measure the performance impact of the 8.4 interlocked instructions going forward in .NET 7 and see whether .NET can benefit from using them.

Note: All of the above peephole work items share a prerequisite: the codegen must be able to update a previously emitted instruction. There is no separate tracking issue for this, so whichever of these optimizations we implement first will have to include that infrastructure work.
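To make the prerequisite concrete, here is a toy sketch of an emitter that can look back at the last emitted instruction and rewrite it, merging two adjacent 64-bit str instructions into one stp. Everything here (Instr, ToyEmitter, EmitStr) is hypothetical and heavily simplified. It ignores encoding limits such as the stp offset range and does not reflect the JIT's real emitter data structures.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified instruction record.
struct Instr {
    std::string opcode;  // "str" or "stp"
    int reg1;            // first register stored
    int reg2;            // second register (stp only), -1 otherwise
    int baseReg;         // base address register
    int offset;          // byte offset from baseReg
};

class ToyEmitter {
public:
    // Emits "str x<reg>, [x<baseReg>, #offset]" for a 64-bit store.
    void EmitStr(int reg, int baseReg, int offset) {
        assert(offset % 8 == 0);
        if (!instrs_.empty()) {
            Instr& prev = instrs_.back();
            // Peephole: the previous instruction stored to the 8-byte slot
            // directly below this one, using the same base register.
            if (prev.opcode == "str" && prev.baseReg == baseReg &&
                prev.offset + 8 == offset) {
                // Rewrite the already emitted str into an stp in place.
                prev.opcode = "stp";
                prev.reg2 = reg;
                return;
            }
        }
        instrs_.push_back({"str", reg, -1, baseReg, offset});
    }

    const std::vector<Instr>& Instructions() const { return instrs_; }

private:
    std::vector<Instr> instrs_;
};
```

Emitting stores at offsets 16 and 24 from the same base yields a single stp, while a following store at offset 40 stays a separate str. The key point is the in-place update of `prev`, which is exactly the capability the note says the codegen currently lacks.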

@dotnet/jit-contrib @TamarChristinaArm @tannergooding

category:planning
theme:planning
skill-level:expert
cost:large


JulieLeeMSFT · 2021-01-14T19:50:15Z

Performance improvement work on the TechEmpower Cached Queries benchmark:
#46970

echesakov · 2021-07-08T20:27:10Z

Closing the epic as we are getting closer to the .NET 6 feature-complete date.

I opened #55364 and #55365 to track future work for the following two sets of items specified here: "Conditional instructions/branch elimination" and "Peephole optimization opportunities".

I also made sure that all the hardware intrinsics work items mentioned here belong to the Hardware Intrinsics GitHub project.

echesakov added the arch-arm64, area-CodeGen-coreclr, and Team Epic labels Oct 20, 2020

echesakov added this to the 6.0.0 milestone Oct 20, 2020

echesakov self-assigned this Oct 20, 2020

Dotnet-GitSync-Bot added the untriaged label Oct 20, 2020

echesakov removed the untriaged label Oct 20, 2020

JulieLeeMSFT added the User Story and Bottom Up Work labels and removed the Team Epic label Nov 16, 2020

JulieLeeMSFT mentioned this issue Jan 28, 2021

What's new in .NET 6 Preview 1 dotnet/core#5853

Closed

JulieLeeMSFT mentioned this issue Feb 24, 2021

What's new in .NET 6 Preview 2 dotnet/core#5889

Closed

echesakov closed this as completed Jul 8, 2021

ghost locked as resolved and limited conversation to collaborators Aug 7, 2021

JulieLeeMSFT added this to .NET Core CodeGen Jun 5, 2024

JulieLeeMSFT moved this to Done in .NET Core CodeGen Jun 5, 2024
