Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Redundant fmov's on arm64 for a simple function #58954

Closed
Tracked by #64820
EgorBo opened this issue Sep 10, 2021 · 9 comments · Fixed by #75202
Closed
Tracked by #64820

JIT: Redundant fmov's on arm64 for a simple function #58954

EgorBo opened this issue Sep 10, 2021 · 9 comments · Fixed by #75202
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@EgorBo
Copy link
Member

EgorBo commented Sep 10, 2021

Repro:

static float Lerp(float v0, float v1, float t) =>
    MathF.FusedMultiplyAdd(t, v1, 
        MathF.FusedMultiplyAdd(-t, v0, v0));

arm64:

; Method Program:Lerp(float,float,float):float
G_M22020_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50

G_M22020_IG02:              ;; offset=0008H
        1E204010          fmov    s16, s0
        1E204000          fmov    s0, s0
        1E204051          fmov    s17, s2
        1F11C000          fmsub   s0, s0, s17, s16
        1E204000          fmov    s0, s0
        1E204021          fmov    s1, s1
        1E204042          fmov    s2, s2
        1F020020          fmadd   s0, s1, s2, s0
						;; bbWeight=1    PerfScore 9.00

G_M22020_IG03:              ;; offset=0028H
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00
; Total bytes of code: 48

namely:

        1E204000          fmov    s0, s0
        1E204021          fmov    s1, s1
        1E204042          fmov    s2, s2

Expected codegen: https://godbolt.org/z/9e91zE3j3

@EgorBo EgorBo added the tenet-performance Performance related issue label Sep 10, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Sep 10, 2021
@ghost
Copy link

ghost commented Sep 10, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Repro:

static float Lerp(float v0, float v1, float t) =>
    MathF.FusedMultiplyAdd(t, v1, 
        MathF.FusedMultiplyAdd(-t, v0, v0));

arm64:

; Method Program:Lerp(float,float,float):float
G_M22020_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50

G_M22020_IG02:              ;; offset=0008H
        1E204010          fmov    s16, s0
        1E204000          fmov    s0, s0
        1E204051          fmov    s17, s2
        1F11C000          fmsub   s0, s0, s17, s16
        1E204000          fmov    s0, s0
        1E204021          fmov    s1, s1
        1E204042          fmov    s2, s2
        1F020020          fmadd   s0, s1, s2, s0
						;; bbWeight=1    PerfScore 9.00

G_M22020_IG03:              ;; offset=0028H
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00
; Total bytes of code: 48

namely:

        1E204000          fmov    s0, s0
        1E204021          fmov    s1, s1
        1E204042          fmov    s2, s2
Author: EgorBo
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, untriaged

Milestone: -

@EgorBo EgorBo added this to the 7.0.0 milestone Sep 10, 2021
@EgorBo
Copy link
Member Author

EgorBo commented Sep 10, 2021

Looks like should be handled in emitter::IsRedundantMov

@EgorBo EgorBo added help wanted [up-for-grabs] Good issue for external contributors and removed help wanted [up-for-grabs] Good issue for external contributors labels Sep 10, 2021
@tannergooding
Copy link
Member

This is likely because fmov has a "side effect". In particular, float, double, and Vector64<T> are all the lower bits of a Vector128<T> and fmov will zero extend to fill in the remaining bits.

This is somewhat unfortunate as the side effect doesn't always matter, it only really matters when the user is explicitly aware they are operating on a Vector128<T> such as via the HWIntrinsics.

Because of this, I imagine the fix is slightly more complicated as it likely requires knowing that the source and destination are the same size vs knowing when you are using a hardware intrinsic instead.

@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 10, 2021
@am11
Copy link
Member

am11 commented Feb 26, 2022

@EgorBo, sorry I don't mean to distract you from fmov case, but I noticed something strange about IsRedundantMov. Do you know if/how it is possible that IsRedundantMov returns true in some cases but emitter still gives us redundant moves in optimized codegen? While looking at a nameof(Token) case, I noticed it happening on x64 with:

using System.Runtime.CompilerServices;
class C
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static string M1() => "C";
}

In this case, IsRedundantMov returns from:


but we still get:

; Assembly listing for method C:M1():System.String
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;# V00 OutArgs      [V00    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M9818_IG01:              ;; offset=0000H
                                                ;; bbWeight=1    PerfScore 0.00
G_M9818_IG02:              ;; offset=0000H
       48B8B0110068927F0000 mov      rax, 0x7F92680011B0      ; string handle
       488B00               mov      rax, gword ptr [rax]
                                                ;; bbWeight=1    PerfScore 2.25
G_M9818_IG03:              ;; offset=000DH
       C3                   ret      
                                                ;; bbWeight=1    PerfScore 1.00

; Total bytes of code 14, prolog size 0, PerfScore 4.65, instruction count 3, allocated bytes for code 14 (MethodHash=28ffd9a5) for method C:M1():System.String
; ============================================================

@EgorBo
Copy link
Member Author

EgorBo commented Feb 26, 2022

@EgorBo, sorry I don't mean to distract you from fmov case,

haha, feel free to ask any questions if you're interested in contributing to JIT 🙂

but we still get:

not sure I understand, I don't see any redundant mov in the output codegen
it is essentially

uint64_t rax = 0x7F92680011B0ULL;
return *((uint64_t*)(void*)rax);

@am11
Copy link
Member

am11 commented Feb 26, 2022

Maybe it's just the encoding difference, but I was expecting one mov like:

mov eax, [0x12e391a0]
ret

or

movabs  rax, gword ptr [ds:0x7F92680011B0]
ret

@EgorBo
Copy link
Member Author

EgorBo commented Feb 26, 2022

Maybe it's just the encoding difference, but I was expecting one mov like:

mov eax, [0x12e391a0]
ret

that can only be done (we call it "contained") when the constant (imm) fits into 4byte integer, see https://godbolt.org/z/v81ozcEKd (and even less on arm64)
Another alternative is RIP-relative addressing, not sure why it wasn't applied here.

@kunalspathak
Copy link
Member

Moving this to .NET 8

@kunalspathak kunalspathak modified the milestones: 7.0.0, 8.0.0 Jun 16, 2022
@kunalspathak
Copy link
Member

@TIHan - assigning this to you.

@kunalspathak kunalspathak assigned TIHan and unassigned kunalspathak Jun 16, 2022
@TIHan TIHan modified the milestones: 8.0.0, Future Jul 5, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 7, 2022
a74nh added a commit to a74nh/runtime that referenced this issue Sep 8, 2022
For every FMOV emitted by the hardware intrinsics that is currently
marked as not skippable: if the src and dest registers are the same
and the types match, then the instruction will have no effect and
can be safely marked as skippable.
kunalspathak pushed a commit that referenced this issue Sep 12, 2022
* Arm64: Skip redundant hwintrinsic float movs (#58954)

For every FMOV emitted by the hardware intrinsics that is currently
marked as not skippable: if the src and dest registers are the same
and the types match, then the instruction will have no effect and
can be safely marked as skippable.

* Use hardcoded canSkip values
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 12, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Oct 12, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants