Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the OBJ(ADDR) wrapping created for calls by the importer #51569

Closed
SingleAccretion opened this issue Apr 20, 2021 · 11 comments
Closed

Remove the OBJ(ADDR) wrapping created for calls by the importer #51569

SingleAccretion opened this issue Apr 20, 2021 · 11 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@SingleAccretion
Copy link
Contributor

SingleAccretion commented Apr 20, 2021

Reproduction:

public static int ObjAddrWrapping(Vector<int> d)
{
    d = Vector<int>.Zero;

    if (d == Vector<int>.Zero)
        return 2;

    return 3;
}

The nodes in question:

    [ 2]  13 (0x00d) call 0A000006
In Compiler::impImportCall: opcode is call, kind=0, callRetType is bool, structSize is 0
Named Intrinsic System.Numerics.Vector`1.op_Equality: Recognized
    [ 1]  18 (0x012) brfalse.s

STMT00001 (IL 0x007...  ???)
               [000011] ------------              *  JTRUE     void  
               [000010] ------------              \--*  EQ        int   
               [000008] ------------                 +--*  HWINTRINSIC bool   int op_Equality
               [000007] n-----------                 |  +--*  OBJ       simd16<System.Numerics.Vector`1[Int32]>
               [000006] ------------                 |  |  \--*  ADDR      byref 
               [000004] -------N----                 |  |     \--*  LCL_VAR   simd16<System.Numerics.Vector`1[Int32]> V00 arg0         
               [000005] ------------                 |  \--*  HWINTRINSIC simd16 int get_Zero
               [000009] ------------                 \--*  CNS_INT   int    0

This happens because impSIMDPopStack calls impNormStructVal which wraps local variables in the OBJ(ADDR) sequence, presumably to avoid calls with mistyped arguments leading to bad codegen. The supposition of this issue is that this wrapping can be avoided when possible, as it blocks downstream optimizations.

category:cq
theme:importer
skill-level:expert
cost:medium
impact:small

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Apr 20, 2021
@SingleAccretion SingleAccretion changed the title Importer creates OBJ(ADDR(LCL_VAR)) trees when importing some static intrinsics. Importer creates OBJ(ADDR(LCL_VAR)) trees when importing some static intrinsics Apr 20, 2021
@AndyAyersMS
Copy link
Member

@sandreenko perhaps this is something you can investigate?

@sandreenko sandreenko self-assigned this Apr 22, 2021
@sandreenko sandreenko added this to the Future milestone Apr 22, 2021
@sandreenko sandreenko removed the untriaged New issue has not been triaged by the area owner label Apr 22, 2021
@sandreenko
Copy link
Contributor

I have investigated this in the past, there are 2 issues:

  1. arm64 put arg is kind of primitive and can work only with "OBJ" nodes, not very hard to fix;
  2. morph calls "morphArguments" before it sees which are passed by ref so for:
ASG(LCL_VAR 16byte struct V01, CNST_INT 0)
CALL(LCL_VAR 16 byte struct V01)

if we don't have OBJ(ADDR()) sugar we do const propagation:

CALL(CNST_INT 0)

and then there is no handling that the argument has to be passed by reference, a right solution was described here #1231 (comment) but has not been implemented yet

what we want after assertion prop:

CALL(INIT_VAL(CNS_INT(0)))

see that it is passed by ref, spill it:

ASG(LCL_VAR 16 byte temp V02, INIT_VAL(CNS_INT(0)))
CALL(ADDR(V02))

of course, we can change fgMorphArgs to catch this case without introducing INIT_VAL(CNS_INT(0)) but fgMorphArgs is kind of fragile nowadays.

@SingleAccretion could you please show IR after morph and what optimization does not happen because of OBJ(ADDR())?

@SingleAccretion
Copy link
Contributor Author

SingleAccretion commented Apr 22, 2021

could you please show IR after morph

@sandreenko Of course.

*************** Finishing PHASE Morph - Global
Trees after Morph - Global

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       [000..014)-> BB03 ( cond )                     i 
BB02 [0001]  1       BB01                  1       [014..016)        (return)                     i 
BB03 [0002]  1       BB01                  1       [016..018)        (return)                     i 
-----------------------------------------------------------------------------------------------------------------------------------------

------------ BB01 [000..014) -> BB03 (cond), preds={} succs={BB02,BB03}

***** BB01
STMT00000 (IL 0x000...0x005)
               [000003] -A---+------              *  ASG       simd16 (copy)
               [000001] D----+-N----              +--*  LCL_VAR   simd16<System.Numerics.Vector`1[Int32]> V00 arg0         
               [000000] -----+------              \--*  HWINTRINSIC simd16 int get_Zero

***** BB01
STMT00001 (IL 0x007...0x012)
               [000011] -----+------              *  JTRUE     void  
               [000010] J----+-N----              \--*  EQ        int   
               [000008] -----+------                 +--*  HWINTRINSIC bool   int op_Equality
               [000007] n----+------                 |  +--*  OBJ       simd16<System.Numerics.Vector`1[Int32]>
               [000006] -----+------                 |  |  \--*  ADDR      byref 
               [000004] -----+-N----                 |  |     \--*  LCL_VAR   simd16<System.Numerics.Vector`1[Int32]> V00 arg0         
               [000005] -----+------                 |  \--*  HWINTRINSIC simd16 int get_Zero
               [000009] -----+------                 \--*  CNS_INT   int    0

------------ BB02 [014..016) (return), preds={BB01} succs={}

***** BB02
STMT00003 (IL 0x014...0x015)
               [000015] -----+------              *  RETURN    int   
               [000014] -----+------              \--*  CNS_INT   int    2

------------ BB03 [016..018) (return), preds={BB01} succs={}

***** BB03
STMT00002 (IL 0x016...0x017)
               [000013] -----+------              *  RETURN    int   
               [000012] -----+------              \--*  CNS_INT   int    3

-------------------------------------------------------------------------------------------------------------------

what optimization does not happen because of OBJ(ADDR())?

As far as I can tell, this doesn't impact anything today, because we don't really do folding for vectors, but it does mean the trees become somewhat opaque to VN.

N001 [000004]   LCL_VAR   V00 arg0         u:2 (last use) => $100 {HWI_Vector128_get_Zero()}
N002 [000006]   ADDR      => $180 {PtrToLoc($40, $0)}
N003 [000007]   OBJ       => $141 {141}
N004 [000005]   HWINTRINSIC => $100 {HWI_Vector128_get_Zero()}
N005 [000008]   HWINTRINSIC => $1c0 {HWI_Vector128_op_Equality($141, $100)}
N006 [000009]   CNS_INT   0 => $40 {IntCns 0}
N007 [000010]   EQ        => $200 {EQ($1c0, $40)}

***** BB01, STMT00001(after)
N008 ( 15, 13) [000011] ------------              *  JTRUE     void  
N007 ( 13, 11) [000010] J------N----              \--*  EQ        int    $200
N005 ( 11,  9) [000008] ------------                 +--*  HWINTRINSIC bool   int op_Equality $1c0
N003 (  9,  7) [000007] n-----------                 |  +--*  OBJ       simd16<System.Numerics.Vector`1[Int32]> $141
N002 (  3,  3) [000006] ------------                 |  |  \--*  ADDR      byref  $180
N001 (  3,  2) [000004] -------N----                 |  |     \--*  LCL_VAR   simd16<System.Numerics.Vector`1[Int32]> V00 arg0         u:2 (last use) $100
N004 (  1,  1) [000005] ------------                 |  \--*  HWINTRINSIC simd16 int get_Zero $100
N006 (  1,  1) [000009] ------------                 \--*  CNS_INT   int    0 $40

I think the issues you describe are much more general than this one, because it is quite specific to the exact setup I had for the reproduction (sorry for not calling this out, I did not have a good handle on this myself :(): parameter as a local + Unix ABI (so no implicit byrefs in the picture).

There is no such wrapping for locals:

public static int ObjAddrWrapping(int a)
{
    var c = Vector<int>.Zero;

    if (a is 0)
        return 0;
 
    if (c == Vector<int>.Zero)
        return 2;

    return 3;
}

is imported as one would expect:

    [ 2]  17 (0x011) call 0A000006
In Compiler::impImportCall: opcode is call, kind=0, callRetType is bool, structSize is 0
Named Intrinsic System.Numerics.Vector`1.op_Equality: Recognized
    [ 1]  22 (0x016) brfalse.s

STMT00002 (IL 0x00B...  ???)
               [000013] ------------              *  JTRUE     void  
               [000012] ------------              \--*  EQ        int   
               [000010] ------------                 +--*  HWINTRINSIC bool   int op_Equality
               [000008] ------------                 |  +--*  LCL_VAR   simd16<System.Numerics.Vector`1[Int32]> V01 loc0         
               [000009] ------------                 |  \--*  HWINTRINSIC simd16 int get_Zero
               [000011] ------------                 \--*  CNS_INT   int    0

And so it is really me noticing this unexpected IR shape.

ASG(LCL_VAR 16byte struct V01, CNST_INT 0)

By the way, I have been wondering about this (as it is the IR I am looking at in #51500) - what is the desirable shape (that we want to have for it moving forward) for this tree specifically when we have TYP_SIMD nodes on the lhs? GT_SIMD (that's what morph does today), HWINTRINSIC get_Zero, something else?

@tannergooding
Copy link
Member

This looks to also be the root cause of #50939

Take code such as:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static float Distance(Vector2 value1, Vector2 value2)
{
    float distanceSquared = DistanceSquared(value1, value2);
    return MathF.Sqrt(distanceSquared);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static float Distance2(Vector2 value1, Vector2 value2)
{
    Vector2 difference = value1 - value2;
    return difference.Length();
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static float DistanceSquared(Vector2 value1, Vector2 value2)
{
    Vector2 difference = value1 - value2;
    return difference.LengthSquared();
}

[MethodImpl(MethodImplOptions.AggressiveOptimization)]
static void Main(string[] args)
{
    var Vector2Value1 = new Vector2(-1.0f, 1.0f);
    var Vector2ValueInverted1 = new Vector2(1.0f, -1.0f);

    var result1 = Distance(Vector2Value1, Vector2ValueInverted1);
    Console.WriteLine(result1);

    var Vector2Value2 = new Vector2(-2.0f, 2.0f);
    var Vector2ValueInverted2 = new Vector2(2.0f, -2.0f);

    var result2 = Distance2(Vector2Value2, Vector2ValueInverted2);
    Console.WriteLine(result2);
}

JitDump.txt

Distance results in:

IN000e: 000049 vmovsd   qword ptr [V06 rsp+30H], xmm1
IN000f: 00004F vmovsd   qword ptr [V07 rsp+28H], xmm0
IN0010: 000055 vmovsd   xmm0, qword ptr [V06 rsp+30H]
IN0011: 00005B vmovsd   xmm1, qword ptr [V07 rsp+28H]
IN0012: 000061 vsubps   xmm0, xmm0, xmm1
IN0013: 000065 vdpps    xmm0, xmm0, xmm0, 49
IN0014: 00006B vsqrtss  xmm0, xmm0

While Distance2 results in:

IN0021: 0000AA vsubps   xmm0, xmm2, xmm0
IN0022: 0000AE vdpps    xmm0, xmm0, xmm0, 49
IN0023: 0000B4 vsqrtss  xmm0, xmm0

The ultimate difference between the two is that the former results in OBJ(ADDR(LCL_VAR)) nodes while the latter preserves everything as just LCL_VAR, which hinders our ability to write helper methods that are dependent on other helper methods.

@tannergooding
Copy link
Member

tannergooding commented Apr 23, 2021

@sandreenko, at least in the case I just listed, this looks to come down to ASG nodes being inserted across the inlining boundaries.

We then end up promoting the SIMD locals because we allow that for Vector2/3/4 in particular. It isn't clear to me why we allow promotion for Vector2/3/4 rather than keeping it in register and preferring extractps (or shufps on older hardware) via impSIMDGetFixed.

However, it also isn't clear why we aren't marking GT_ASG where the target is TYP_SIMD* as SetOpLclRelatedToSIMDIntrinsic. Because, for all intents and purposes simd = simd is a SIMD intrinsic, it is a movaps or movups (or movss for scalars).

@tannergooding
Copy link
Member

Inserting:

#if FEATURE_SIMD
    if (varTypeIsSIMD(dst->gtType))
    {
        // We want to track SIMD assignments as being intrinsics since they
        // are functionally SIMD `mov` instructions and are more efficient
        // when we don't promote, particularly when it occurs due to inlining

        SetOpLclRelatedToSIMDIntrinsic(dst);
        SetOpLclRelatedToSIMDIntrinsic(src);
    }
#endif // FEATURE_SIMD

in gtNewAssignNode looks to resolve the issue I just mentioned. It doesn't resolve the issue @SingleAccretion mentioned since OBJ(ADDR(LCL)) still exists.

We are currently generating OBJ(ADDR(LCL)) for args since they are passed "byref" (this even looks to be the case on Linux, where some of them are actually meant to be passed "byval") and for this.
Seemingly, we should be able to replace the OBJ(ADDR(LCL)) with a direct copy in many scenarios (particularly for inlining), so we preserve the "byval" semantics, but remove the indirection that exists here.

@tannergooding
Copy link
Member

tannergooding commented Apr 23, 2021

Seemingly, we should be able to replace the OBJ(ADDR(LCL)) with a direct copy in many scenarios (particularly for inlining), so we preserve the "byval" semantics, but remove the indirection that exists here.

One potential idea here is to modify impPopCallArgs to insert a copy, rather than normalizing to OBJ(ADDR(LCL)). We can then make the appropriate ABI decision (actually byval vs byref) later if the call isn't inlined and we need to actually pass the value. Otherwise, the required copy has already been made (preserving the "pass by value" semantics that are important across call boundaries) and we never end up with an indirection that may harm downstream optimizations.

@SingleAccretion SingleAccretion changed the title Importer creates OBJ(ADDR(LCL_VAR)) trees when importing some static intrinsics Remove the OBJ(ADDR) wrapping created for calls by the importer Dec 23, 2021
@SingleAccretion
Copy link
Contributor Author

Essentially, this is a subset of #11057, the source of this wrapping is impNormStructVal, the primary purpose - carrying the signature type to morph. There are interesting side effects too, for example, it prevents morph from creating R-value TYP_STRUCT LCL_FLD (not completely though - they can still be generated under GT_RETURNs).

@tannergooding
Copy link
Member

This was generally fixed by #81636

@SingleAccretion, do you want this kept open to track other OBJ wrapping we might be introducing?

@SingleAccretion
Copy link
Contributor Author

No, we can close this assuming the performance impact of #81636 has been evaluated as acceptable.

@tannergooding
Copy link
Member

Don't believe we've actually seen regressions in the perf triage yet. Will close and reopen if anything crops up.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

4 participants