Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RyuJIT] lack of escape analysis makes redundant memory-access of Vector128/256<T> fields #10767

Open
fiigii opened this issue Jul 25, 2018 · 3 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Milestone

Comments

@fiigii
Copy link
Contributor

fiigii commented Jul 25, 2018

Related to https://github.com/dotnet/coreclr/issues/19116

The C# source code below is from dotnet/coreclr#18839

internal class VectorPacket256
{
    public Vector256<float> Xs;
    public Vector256<float> Ys;
    public Vector256<float> Zs;
    public Vector256<float> Lengths
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            return Sqrt(DotProduct(this, this));
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector256<float> DotProduct(VectorPacket256 left, VectorPacket256 right)
    {
        var x2 = Multiply(left.Xs, right.Xs);
        var y2 = Multiply(left.Ys, right.Ys);
        var z2 = Multiply(left.Zs, right.Zs);
        return Add(Add(x2, y2), z2);
    }
    ...
}

get_Lengths will be compiled to

vmovupd ymm0, ymmword ptr [rax+0x8]
vmulps ymm0, ymm0, ymmword ptr [rax+0x8]
vmovupd ymm1, ymmword ptr [rax+0x28]
vmulps ymm1, ymm1, ymmword ptr [rax+0x28]
vmovupd ymm2, ymmword ptr [rax+0x48]
mov qword ptr [rbp-0x370], rax
vmulps ymm2, ymm2, ymmword ptr [rax+0x48]
vaddps ymm0, ymm0, ymm1
vaddps ymm0, ymm0, ymm2
vsqrtps ymm0, ymm0

Multiply(this.Xs, this.Xs) generates redundant memory access instructions for Vector256<float> field access. Actually, the codgen could be

vmovupd ymm0, ymmword ptr [rax+0x8]
vmulps ymm0, ymm0, ymm0
vmovupd ymm1, ymmword ptr [rax+0x28]
vmulps ymm1, ymm1, ymm1
vmovupd ymm2, ymmword ptr [rax+0x48]
vmulps ymm2, ymm2, ymm2
vaddps ymm0, ymm0, ymm1
vaddps ymm0, ymm0, ymm2
vsqrtps ymm0, ymm0

This redundant could be eliminated by escape analysis and unwarping VectorPacket256.

A worse example with Vector256<float> field assignments. In the code below, difColor is a local VectorPacket256 object (but dif is returned by another function)

                difColor.Xs = BlendVariable(difColor.Xs, dif.Xs, thingMask);
                difColor.Ys = BlendVariable(difColor.Ys, dif.Ys, thingMask);
                difColor.Zs = BlendVariable(difColor.Zs, dif.Zs, thingMask);

that is compiled to

mov r8, qword ptr [rsp+0x70]
vmovupd ymm0, ymmword ptr [r8+0x8]
vblendvps ymm0, ymm0, ymmword ptr [rax+0x8], ymm11
vmovupd ymmword ptr [r8+0x8], ymm0
vmovupd ymm0, ymmword ptr [r8+0x28]
vblendvps ymm0, ymm0, ymmword ptr [rax+0x28], ymm11
vmovupd ymmword ptr [r8+0x28], ymm0
vmovupd ymm0, ymmword ptr [r8+0x48]
vblendvps ymm0, ymm0, ymmword ptr [rax+0x48], ymm11
mov qword ptr [rsp+0x70], r8
vmovupd ymmword ptr [r8+0x48], ymm0

Ideally, this could be optimized to

vblendvps ymm1, ymm1, ymmword ptr [rax+0x8], ymm11
vblendvps ymm2, ymm2, ymmword ptr [rax+0x28], ymm11
vblendvps ymm3, ymm3, ymmword ptr [rax+0x48], ymm11

category:cq
theme:vector-codegen
skill-level:expert
cost:large

@fiigii
Copy link
Contributor Author

fiigii commented Jul 25, 2018

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 24, 2020
@AndyAyersMS
Copy link
Member

Codegen for get_Lengths -- redundant loads are gone:

; Method VectorPacket256:get_Lengths():System.Runtime.Intrinsics.Vector256`1[float]:this (FullOpts)
G_M26293_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M26293_IG02:  ;; offset=0x0000
       vmovups  ymm0, ymmword ptr [rcx]
       vmovaps  ymm1, ymm0
       vmovups  ymm2, ymmword ptr [rcx+0x20]
       vmovaps  ymm3, ymm2
       vmovups  ymm4, ymmword ptr [rcx+0x40]
       vmovaps  ymm5, ymm4
       vmulps   ymm0, ymm1, ymm0
       vmulps   ymm1, ymm3, ymm2
       vaddps   ymm0, ymm1, ymm0
       vmulps   ymm1, ymm5, ymm4
       vaddps   ymm0, ymm1, ymm0
       vsqrtps  ymm0, ymm0
       vmovups  ymmword ptr [rdx], ymm0
       mov      rax, rdx
						;; size=57 bbWeight=1 PerfScore 44.00

G_M26293_IG03:  ;; offset=0x0039
       vzeroupper 
       ret      
						;; size=4 bbWeight=1 PerfScore 2.00
; Total bytes of code: 61

However we've still got a few unnecessary moves.

@AndyAyersMS
Copy link
Member

Not sure where the second example (vblendvps) is in the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Projects
None yet
Development

No branches or pull requests

4 participants