Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement BinaryPrimitives.ReverseEndianness for arm64 using rev #34617

Merged
merged 5 commits into from
Apr 24, 2020

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Apr 7, 2020

@Gnbrkm41 noticed it's not currently optimized to rev.

static uint Test1(uint a) => BinaryPrimitives.ReverseEndianness(a);
static ulong Test2(ulong a) => BinaryPrimitives.ReverseEndianness(a);

Old codegen:

; Assembly listing for method Test1(int):int
G_M29289_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
G_M29289_IG02:
        12009C01          and     w1, w0, #0xff00ff
        13812021          ror     w1, w1, #8
        12089C00          and     w0, w0, #0xff00ff00
        13806000          ror     w0, w0, #24
        0B000020          add     w0, w1, w0
G_M29289_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr



; Assembly listing for method Test2(long):long
G_M43753_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
G_M43753_IG02:
        2A0003E1          mov     w1, w0
        12009C22          and     w2, w1, #0xff00ff
        13822042          ror     w2, w2, #8
        12089C21          and     w1, w1, #0xff00ff00
        13816021          ror     w1, w1, #24
        0B010041          add     w1, w2, w1
        2A0103E1          mov     w1, w1
        D3607C21          lsl     x1, x1, #32
        D360FC00          lsr     x0, x0, #32
        12009C02          and     w2, w0, #0xff00ff
        13822042          ror     w2, w2, #8
        12089C00          and     w0, w0, #0xff00ff00
        13806000          ror     w0, w0, #24
        0B000040          add     w0, w2, w0
        2A0003E0          mov     w0, w0
        8B010000          add     x0, x0, x1
G_M43753_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr

New codegen:

; Assembly listing for method Test1(int):int
G_M29289_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
G_M29289_IG02:
        5AC00800          rev     w0, w0
G_M29289_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr



; Assembly listing for method Test2(long):long
G_M43753_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
G_M43753_IG02:
        DAC00C00          rev     x0, x0
G_M43753_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr

int16 is not implemented yet, perhaps should be implemented in C# instead? To call the int-overload and apply a shift.

Tests already exist: https://github.com/dotnet/runtime/blob/master/src/coreclr/tests/src/JIT/Intrinsics/BinaryPrimitivesReverseEndianness.cs

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 7, 2020
@tannergooding
Copy link
Member

int16 is not implemented yet, perhaps should be implemented in C# instead? To call the int-overload and apply a shift.

Would REV + shift or just a REV16 be better?

@TamarChristinaArm, could you advise? Is there also a good location to find this information or cycle approximations in general?

@EgorBo
Copy link
Member Author

EgorBo commented Apr 7, 2020

int16 is not implemented yet, perhaps should be implemented in C# instead? To call the int-overload and apply a shift.

Would REV + shift or just a REV16 be better?

@TamarChristinaArm, could you advise? Is there also a good location to find this information or cycle approximations in general?

Was confused by the clang's codegen: https://godbolt.org/z/w_8LUC

@EgorBo
Copy link
Member Author

EgorBo commented Apr 7, 2020

New codegen for ushort overload:

static ushort Foo1(ushort a) => BinaryPrimitives.ReverseEndianness(a);
; Assembly listing for method Program:Foo1(ushort):ushort
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )  ushort  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M15320_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
                                                ;; bbWeight=1    PerfScore 1.50
G_M15320_IG02:
        53003C00          uxth    w0, w0
        5AC00400          rev16   w0, w0
        53003C00          uxth    w0, w0
                                                ;; bbWeight=1    PerfScore 1.50
G_M15320_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
                                                ;; bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 7.80, (MethodHash=a63ec427) for method Program:Foo1(ushort):ushort
; ============================================================

@EgorBo
Copy link
Member Author

EgorBo commented Apr 7, 2020

Here is what Mono-LLVM-JIT (LLVM 6.0) emits on my Cortex A57 machine:

static ushort Foo1(ushort a) => BinaryPrimitives.ReverseEndianness(a);
define monocc i16 @"Program:Foo1 (uint16)"(i16 %arg_a) #0 {
BB0:
  %rev = call i16 @llvm.bswap.i16(i16 %arg_a)
  ret i16 %rev
}
   0:	5ac00808 	rev	w8, w0
   4:	53107d00 	lsr	w0, w8, #16
   8:	d65f03c0 	ret

@TamarChristinaArm
Copy link
Contributor

int16 is not implemented yet, perhaps should be implemented in C# instead? To call the int-overload and apply a shift.

Would REV + shift or just a REV16 be better?

The REV16 is better, REV and REV16 have the same latency so adding a shift makes that sequence slower.

@TamarChristinaArm, could you advise? Is there also a good location to find this information or cycle approximations in general?

You can find these in the optimization guides, the one for Cortex-A55 is https://developer.arm.com/docs/epm128372/30/arm-cortex-a55-software-optimization-guide

@EgorBo
Copy link
Member Author

EgorBo commented Apr 13, 2020

I guess it's ready for review then

@tannergooding
Copy link
Member

CC. @CarolEidt, @echesakovMSFT, @dotnet/jit-contrib

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@echesakov echesakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@EgorBo
Copy link
Member Author

EgorBo commented Apr 18, 2020

Don't know how to restart the "in progress" leg 😕

@sandreenko
Copy link
Contributor

AzDO shows that it has passed: https://dev.azure.com/dnceng/public/_build/results?buildId=605791&view=results

@EgorBo
Copy link
Member Author

EgorBo commented Apr 24, 2020

@BruceForstall can this be merged?

@echesakov
Copy link
Contributor

It looks that we never gonna see runtime (Installer Build and Test Windows_NT_arm Debug) leg completed :)

@BruceForstall BruceForstall merged commit 89d08de into dotnet:master Apr 24, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants