JIT doesn't allow method prologues to have more than one instruction group #104585

Closed
tannergooding opened this issue Jul 9, 2024 · 6 comments
@tannergooding
Member

The JIT currently restricts the method prologue to a single instruction group (IG). This is unlike funclets or the method epilogue, which can extend across several.

This is normally not a problem. However, there are many scenarios in which the method prologue can extend past the limits of a single group, since a single group can hold only a finite number of instructions.

An example of this is the following program:

using System;
using System.Numerics.Tensors;
using System.Runtime.CompilerServices;

internal class Program
{
    private static void Main(string[] args)
    {
        ReadOnlySpan<ulong> x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16];
        Console.WriteLine(Invoke(x, x));
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static ulong Invoke(ReadOnlySpan<ulong> x, ReadOnlySpan<ulong> y)
    {
        return TensorPrimitives.ProductOfDifferences<ulong>(x, y);
    }
}

If this is run under a checked JIT with DOTNET_ReadyToRun=0, DOTNET_TieredCompilation=0, and DOTNET_JitStressRegs=0x80 then it will trigger the following assert: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/emit.cpp#L9670-L9672

    /* Right now we don't allow multi-IG prologs */

    assert(emitCurIG != emitPrologIG);

This happens because we set genUseBlockInit = (genInitStkLclCnt > 4) and then, in genZeroInitFrame, use that to determine whether we zero using SIMD or using what are basically sizeof(void*)-sized stores from a native general-purpose register.

That itself seems "bad" from a performance perspective, since it doesn't account for how big these 4 locals are and therefore whether block or scalar zeroing is "better". Independently, though, it means this code path is broken whenever the total number of store instructions required exceeds the limits of a single IG, as occurs if you have 4x TYP_SIMD64, for example.
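To make the instruction-count concern concrete, here is a minimal sketch of the arithmetic, assuming scalar zeroing emits one pointer-sized store per 8 bytes as on x64. `ScalarZeroStoreCount` is a hypothetical helper written for illustration; it is not a JIT function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a JIT API): how many scalar stores of
// `storeSizeBytes` are needed to zero `localCount` locals that are each
// `localSizeBytes` bytes, rounding a partial store up to a whole one.
std::size_t ScalarZeroStoreCount(std::size_t localCount,
                                 std::size_t localSizeBytes,
                                 std::size_t storeSizeBytes)
{
    assert(storeSizeBytes != 0);
    std::size_t storesPerLocal = (localSizeBytes + storeSizeBytes - 1) / storeSizeBytes;
    return localCount * storesPerLocal;
}
```

With 8-byte stores, `ScalarZeroStoreCount(4, 8, 8)` is 4 while `ScalarZeroStoreCount(4, 64, 8)` is 32, even though both cases involve the same number of locals.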

The JIT needs to be updated to support prologues that extend past one group, so that we are robust when a prologue needs more than EMIT_MAX_IG_INS_COUNT instructions. The effective limit can be lower in practice for large instrDesc instructions: in the failure above, we hit the limit at 61 instructions out of the maximum 256.
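The remark that the limit can be lower in practice can be pictured with a toy capacity model. This is a sketch under assumptions, not the emitter's actual bookkeeping: it supposes a group is full once it reaches either an instruction-count cap (standing in for EMIT_MAX_IG_INS_COUNT) or a fixed byte budget for instruction descriptors. The function name, the byte budget, and the descriptor sizes are all made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model (an assumption, not the JIT's data structure): returns how many
// descriptors of `descBytes` each fit in a group that is limited both by an
// instruction-count cap and by a descriptor byte budget.
std::size_t InstructionsThatFit(std::size_t maxInstructions,
                                std::size_t bufferBytes,
                                std::size_t descBytes)
{
    assert(descBytes != 0);
    // Whichever limit is reached first determines the capacity.
    return std::min(maxInstructions, bufferBytes / descBytes);
}
```

With these invented numbers, `InstructionsThatFit(256, 2048, 4)` is capped by the count at 256, while `InstructionsThatFit(256, 2048, 32)` exhausts the byte budget after 64 instructions, well short of the cap.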

Additionally, it would probably be beneficial to have zeroing pick the optimal strategy based on the number of bytes that need to be zeroed rather than the number of locals.
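One way to picture a byte-based choice is the sketch below. It is only an illustration of the idea: `ZeroStrategy`, `ChooseByBytes`, and the threshold parameter are hypothetical names, and no particular threshold value is being proposed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class ZeroStrategy { ScalarStores, BlockInit };

// Hypothetical heuristic: sum the sizes of the locals that must be zeroed and
// switch to block initialization once the total exceeds a byte threshold.
ZeroStrategy ChooseByBytes(const std::vector<std::size_t>& localSizesBytes,
                           std::size_t blockThresholdBytes)
{
    assert(blockThresholdBytes > 0);
    std::size_t totalBytes = 0;
    for (std::size_t size : localSizesBytes)
    {
        totalBytes += size;
    }
    return totalBytes > blockThresholdBytes ? ZeroStrategy::BlockInit
                                            : ZeroStrategy::ScalarStores;
}
```

Under an assumed 64-byte threshold, two 8-byte locals stay on scalar stores, while four 64-byte locals select block initialization regardless of how many locals there are.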

dotnet-issue-labeler bot added the area-CodeGen-coreclr label Jul 9, 2024
dotnet-policy-service bot added the untriaged label Jul 9, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member Author

The failure, with locals, looks like:

# compCycleEstimate =    781, compSizeEstimate =   786 System.Numerics.Tensors.TensorPrimitives:<Aggregate>g__Vectorized512|68_3[ulong,System.Numerics.Tensors.TensorPrimitives+SubtractOperator`1[ulong],System.Numerics.Tensors.TensorPrimitives+MultiplyOperator`1[ulong]](byref,byref,ulong):ulong
; Final local variable assignments
;
;  V00 arg0         [V00,T03] ( 14,  9   )   byref  ->  rcx
;  V01 arg1         [V01,T04] ( 14,  9   )   byref  ->  rdx
;  V02 arg2         [V02,T02] ( 29, 29.50)    long  ->   r8
;  V03 loc0         [V03,T11] ( 33, 30   )  simd64  ->  mm19         <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V04 loc1         [V04,T42] (  5,  5   )  simd64  ->  mm20         <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V05 loc2         [V05,T47] (  5,  3   )  simd64  ->  [rsp+0x560]  <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V06 loc3         [V06,T05] (  7,  5   )    long  ->  r10
;  V07 loc4         [V07,T09] (  4,  2.94)    long  ->  r14
;* V08 loc5         [V08    ] (  0,  0   )    long  ->  zero-ref    single-def
;* V09 loc6         [V09    ] (  0,  0   )    long  ->  zero-ref    single-def
;  V10 loc7         [V10    ] (  2,  1   )   byref  ->  [rsp+0x558]  must-init pinned
;  V11 loc8         [V11    ] (  2,  1   )   byref  ->  [rsp+0x550]  must-init pinned
;  V12 loc9         [V12,T00] ( 14, 42   )    long  ->  [rsp+0x548]
;  V13 loc10        [V13,T01] ( 13, 41.50)    long  ->  rbx
;  V14 loc11        [V14,T12] (  6, 24   )  simd64  ->  mm22         <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V15 loc12        [V15,T13] (  6, 24   )  simd64  ->  mm23         <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V16 loc13        [V16,T14] (  6, 24   )  simd64  ->  mm24         <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V17 loc14        [V17,T15] (  6, 24   )  simd64  ->  mm25         <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V18 loc15        [V18    ] (  0,  0   )    long  ->  zero-ref    single-def
;  V19 loc16        [V19,T65] (  3,  1.50)  simd64  ->  [rsp+0x4E0]  <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V20 loc17        [V20,T66] (  3,  1.50)  simd64  ->  mm16         <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V21 loc18        [V21,T67] (  3,  1.50)  simd64  ->  [rsp+0x4A0]  spill-single-def <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V22 loc19        [V22,T68] (  3,  1.50)  simd64  ->  [rsp+0x460]  spill-single-def <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V23 loc20        [V23,T69] (  3,  1.50)  simd64  ->  [rsp+0x420]  spill-single-def <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V24 loc21        [V24,T70] (  3,  1.50)  simd64  ->  [rsp+0x3E0]  spill-single-def <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V25 loc22        [V25,T71] (  3,  1.50)  simd64  ->  [rsp+0x3A0]  spill-single-def <System.Runtime.Intrinsics.Vector512`1[ulong]>
;# V26 OutArgs      [V26    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V27 tmp1         [V27,T43] (  2,  4   )  simd64  ->  mm15         "impAppendStmt"
;* V28 tmp2         [V28    ] (  0,  0   )  struct (16) zero-ref    "dup spill" <System.ValueTuple`2[ulong,ulong]>
;* V29 tmp3         [V29    ] (  0,  0   )    long  ->  zero-ref    single-def
;  V30 tmp4         [V30,T48] (  2,  2   )  simd64  ->  mm17         "impAppendStmt"
;* V31 tmp5         [V31    ] (  0,  0   )  simd64  ->  zero-ref    "impAppendStmt"
;* V32 tmp6         [V32    ] (  0,  0   )  simd64  ->  zero-ref    "spilled call-like call argument"
;* V33 tmp7         [V33    ] (  0,  0   )  simd64  ->  zero-ref    "impAppendStmt"
;* V34 tmp8         [V34    ] (  0,  0   )  simd64  ->  zero-ref    "spilled call-like call argument"
;* V35 tmp9         [V35    ] (  0,  0   )  simd64  ->  zero-ref    "impAppendStmt"
;* V36 tmp10        [V36    ] (  0,  0   )  simd64  ->  zero-ref    "spilled call-like call argument"
;* V37 tmp11        [V37    ] (  0,  0   )  simd64  ->  zero-ref    "impAppendStmt"
;* V38 tmp12        [V38    ] (  0,  0   )  simd64  ->  zero-ref    "spilled call-like call argument"
;* V39 tmp13        [V39    ] (  0,  0   )  simd64  ->  zero-ref    "impAppendStmt"
;* V40 tmp14        [V40    ] (  0,  0   )  simd64  ->  zero-ref    "spilled call-like call argument"
;* V41 tmp15        [V41    ] (  0,  0   )  simd64  ->  zero-ref    "impAppendStmt"
;* V42 tmp16        [V42    ] (  0,  0   )  simd64  ->  zero-ref    "spilled call-like call argument"
;* V43 tmp17        [V43    ] (  0,  0   )  simd64  ->  zero-ref    "impAppendStmt"
;* V44 tmp18        [V44    ] (  0,  0   )  simd64  ->  zero-ref    "spilled call-like call argument"
;* V45 tmp19        [V45    ] (  0,  0   )  simd64  ->  zero-ref    "impAppendStmt"
;* V46 tmp20        [V46    ] (  0,  0   )  simd64  ->  zero-ref    "spilled call-like call argument"
;* V47 tmp21        [V47    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V48 tmp22        [V48    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V49 tmp23        [V49    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V50 tmp24        [V50    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V51 tmp25        [V51    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V52 tmp26        [V52    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V53 tmp27        [V53    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V54 tmp28        [V54    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V55 tmp29        [V55    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V56 tmp30        [V56    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V57 tmp31        [V57,T16] (  4, 16   )  simd64  ->  [rsp+0x360]  must-init ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V58 tmp32        [V58,T23] (  2, 16   )  simd32  ->  mm26         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V59 tmp33        [V59,T24] (  2, 16   )  simd32  ->  mm27         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V60 tmp34        [V60,T17] (  4, 16   )  simd64  ->  [rsp+0x320]  must-init ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V61 tmp35        [V61,T25] (  2, 16   )  simd32  ->  mm28         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V62 tmp36        [V62,T26] (  2, 16   )  simd32  ->  mm29         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V63 tmp37        [V63,T18] (  4, 16   )  simd64  ->  [rsp+0x2E0]  must-init ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V64 tmp38        [V64,T27] (  2, 16   )  simd32  ->  mm30         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V65 tmp39        [V65,T28] (  2, 16   )  simd32  ->  mm31         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V66 tmp40        [V66,T19] (  4, 16   )  simd64  ->  [rsp+0x2A0]  must-init ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V67 tmp41        [V67,T29] (  2, 16   )  simd32  ->  mm6         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V68 tmp42        [V68,T30] (  2, 16   )  simd32  ->  mm7         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;* V69 tmp43        [V69    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V70 tmp44        [V70    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V71 tmp45        [V71    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V72 tmp46        [V72    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V73 tmp47        [V73    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V74 tmp48        [V74    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V75 tmp49        [V75    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V76 tmp50        [V76    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V77 tmp51        [V77,T20] (  4, 16   )  simd64  ->  [rsp+0x260]  must-init ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V78 tmp52        [V78,T31] (  2, 16   )  simd32  ->  mm8         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V79 tmp53        [V79,T32] (  2, 16   )  simd32  ->  mm9         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V80 tmp54        [V80,T21] (  4, 16   )  simd64  ->  [rsp+0x220]  must-init ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V81 tmp55        [V81,T33] (  2, 16   )  simd32  ->  mm10         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V82 tmp56        [V82,T34] (  2, 16   )  simd32  ->  mm11         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V83 tmp57        [V83,T22] (  4, 16   )  simd64  ->  [rsp+0x1E0]  must-init ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V84 tmp58        [V84,T35] (  2, 16   )  simd32  ->  mm12         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V85 tmp59        [V85,T36] (  2, 16   )  simd32  ->  mm13         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V86 tmp60        [V86,T39] (  3, 12   )  simd64  ->  [rsp+0x1A0]  must-init ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V87 tmp61        [V87,T37] (  2, 16   )  simd32  ->  mm14         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V88 tmp62        [V88,T38] (  2, 16   )  simd32  ->  mm21         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;* V89 tmp63        [V89    ] (  0,  0   )  struct (16) zero-ref    "spilled call-like call argument" <System.ReadOnlySpan`1[ulong]>
;* V90 tmp64        [V90    ] (  0,  0   )     int  ->  zero-ref    "Inlining Arg"
;* V91 tmp65        [V91    ] (  0,  0   )  struct (16) zero-ref    "ReadOnlySpan<T> for CreateSpan<T>" <System.ReadOnlySpan`1[ulong]>
;* V92 tmp66        [V92    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg" <System.ReadOnlySpan`1[ulong]>
;* V93 tmp67        [V93    ] (  0,  0   )  simd64  ->  zero-ref    ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V94 tmp68        [V94,T44] (  2,  4   )  simd32  ->  mm0         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V95 tmp69        [V95,T45] (  2,  4   )  simd32  ->  mm1         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V96 tmp70        [V96,T07] (  3,  3   )    long  ->  rsi         single-def "Inline stloc first use temp"
;* V97 tmp71        [V97    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "NewObj constructor temp" <System.ValueTuple`2[ulong,ulong]>
;* V98 tmp72        [V98    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V99 tmp73        [V99    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V100 tmp74       [V100    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V101 tmp75       [V101    ] (  0,  0   )  simd64  ->  zero-ref    ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V102 tmp76       [V102,T49] (  2,  2   )  simd32  ->  [rsp+0x180]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V103 tmp77       [V103,T50] (  2,  2   )  simd32  ->  [rsp+0x160]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;* V104 tmp78       [V104    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V105 tmp79       [V105    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V106 tmp80       [V106    ] (  0,  0   )  simd64  ->  zero-ref    ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V107 tmp81       [V107,T51] (  2,  2   )  simd32  ->  [rsp+0x140]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V108 tmp82       [V108,T52] (  2,  2   )  simd32  ->  mm16         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;* V109 tmp83       [V109    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V110 tmp84       [V110    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V111 tmp85       [V111    ] (  0,  0   )  simd64  ->  zero-ref    ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V112 tmp86       [V112,T53] (  2,  2   )  simd32  ->  [rsp+0x120]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V113 tmp87       [V113,T54] (  2,  2   )  simd32  ->  [rsp+0x100]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;* V114 tmp88       [V114    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V115 tmp89       [V115    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V116 tmp90       [V116    ] (  0,  0   )  simd64  ->  zero-ref    ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V117 tmp91       [V117,T55] (  2,  2   )  simd32  ->  [rsp+0xE0]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V118 tmp92       [V118,T56] (  2,  2   )  simd32  ->  [rsp+0xC0]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;* V119 tmp93       [V119    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V120 tmp94       [V120    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V121 tmp95       [V121    ] (  0,  0   )  simd64  ->  zero-ref    ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V122 tmp96       [V122,T57] (  2,  2   )  simd32  ->  [rsp+0xA0]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V123 tmp97       [V123,T58] (  2,  2   )  simd32  ->  [rsp+0x80]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;* V124 tmp98       [V124    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V125 tmp99       [V125    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V126 tmp100      [V126    ] (  0,  0   )  simd64  ->  zero-ref    ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V127 tmp101      [V127,T59] (  2,  2   )  simd32  ->  [rsp+0x60]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V128 tmp102      [V128,T60] (  2,  2   )  simd32  ->  [rsp+0x40]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;* V129 tmp103      [V129    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V130 tmp104      [V130    ] (  0,  0   )  simd64  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;* V131 tmp105      [V131    ] (  0,  0   )  simd64  ->  zero-ref    ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V132 tmp106      [V132,T61] (  2,  2   )  simd32  ->  [rsp+0x20]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V133 tmp107      [V133,T62] (  2,  2   )  simd32  ->  [rsp+0x00]  "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;* V134 tmp108      [V134    ] (  0,  0   )  struct (16) zero-ref    "spilled call-like call argument" <System.ReadOnlySpan`1[ulong]>
;* V135 tmp109      [V135    ] (  0,  0   )     int  ->  zero-ref    "Inlining Arg"
;* V136 tmp110      [V136    ] (  0,  0   )  struct (16) zero-ref    "ReadOnlySpan<T> for CreateSpan<T>" <System.ReadOnlySpan`1[ulong]>
;* V137 tmp111      [V137    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg" <System.ReadOnlySpan`1[ulong]>
;* V138 tmp112      [V138    ] (  0,  0   )  simd64  ->  zero-ref    ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector512`1[ulong]>
;  V139 tmp113      [V139,T63] (  2,  2   )  simd32  ->  mm3         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V140 tmp114      [V140,T64] (  2,  2   )  simd32  ->  mm4         "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;  V141 tmp115      [V141,T40] (  3,  6   )  simd32  ->  mm2         "spilled call-like call argument"
;* V142 tmp116      [V142    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;* V143 tmp117      [V143    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector256`1[ulong]>
;* V144 tmp118      [V144    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
;* V145 tmp119      [V145    ] (  0,  0   )  simd16  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ulong]>
;* V146 tmp120      [V146    ] (  0,  0   )  simd16  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ulong]>
;  V147 tmp121      [V147,T41] (  3,  6   )  simd16  ->  mm3         "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ulong]>
;* V148 tmp122      [V148    ] (  0,  0   )  simd16  ->  zero-ref    "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ulong]>
;* V149 tmp123      [V149    ] (  0,  0   )    long  ->  zero-ref    single-def "field V28.Item1 (fldOffset=0x0)" P-INDEP
;* V150 tmp124      [V150    ] (  0,  0   )    long  ->  zero-ref    single-def "field V28.Item2 (fldOffset=0x8)" P-INDEP
;* V151 tmp125      [V151    ] (  0,  0   )   byref  ->  zero-ref    "field V89._reference (fldOffset=0x0)" P-INDEP
;* V152 tmp126      [V152    ] (  0,  0   )     int  ->  zero-ref    "field V89._length (fldOffset=0x8)" P-INDEP
;* V153 tmp127      [V153    ] (  0,  0   )   byref  ->  zero-ref    single-def "field V91._reference (fldOffset=0x0)" P-INDEP
;* V154 tmp128      [V154    ] (  0,  0   )     int  ->  zero-ref    single-def "field V91._length (fldOffset=0x8)" P-INDEP
;* V155 tmp129      [V155    ] (  0,  0   )   byref  ->  zero-ref    single-def "field V92._reference (fldOffset=0x0)" P-INDEP
;* V156 tmp130      [V156    ] (  0,  0   )     int  ->  zero-ref    "field V92._length (fldOffset=0x8)" P-INDEP
;* V157 tmp131      [V157    ] (  0,  0   )    long  ->  zero-ref    single-def "field V97.Item1 (fldOffset=0x0)" P-INDEP
;  V158 tmp132      [V158,T10] (  3,  2.50)    long  ->  rbp         single-def "field V97.Item2 (fldOffset=0x8)" P-INDEP
;* V159 tmp133      [V159    ] (  0,  0   )   byref  ->  zero-ref    "field V134._reference (fldOffset=0x0)" P-INDEP
;* V160 tmp134      [V160    ] (  0,  0   )     int  ->  zero-ref    "field V134._length (fldOffset=0x8)" P-INDEP
;* V161 tmp135      [V161    ] (  0,  0   )   byref  ->  zero-ref    single-def "field V136._reference (fldOffset=0x0)" P-INDEP
;* V162 tmp136      [V162    ] (  0,  0   )     int  ->  zero-ref    single-def "field V136._length (fldOffset=0x8)" P-INDEP
;* V163 tmp137      [V163    ] (  0,  0   )   byref  ->  zero-ref    single-def "field V137._reference (fldOffset=0x0)" P-INDEP
;* V164 tmp138      [V164    ] (  0,  0   )     int  ->  zero-ref    "field V137._length (fldOffset=0x8)" P-INDEP
;  V165 tmp139      [V165,T06] (  4,  4   )    long  ->   r9         "Cast away GC"
;  V166 tmp140      [V166,T08] (  3,  3   )    long  ->  r11         "Cast away GC"
;  V167 cse0        [V167,T46] (  4,  3.50)  simd64  ->  mm18         "CSE #02: aggressive"
;
; Lcl frame size = 1632
*************** Before prolog / epilog generation
G_M4303_IG01:        ; func=00, offs=0x000000, size=0x0000, bbWeight=1, gcrefRegs=0000 {} <-- Prolog IG
G_M4303_IG02:        ; offs=0x000000, size=0x0044, bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB01 [0000], byref
G_M4303_IG03:        ; offs=0x000044, size=0x0050, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB02 [0001], BB03 [0002], byref
G_M4303_IG04:        ; offs=0x000094, size=0x0019, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB04 [0248], byref, align
G_M4303_IG05:        ; offs=0x0000AD, size=0x01BD, bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB05 [0249], byref
G_M4303_IG06:        ; offs=0x00026A, size=0x00A0, bbWeight=4, loop=IG05, BB05 [0249], extend
G_M4303_IG07:        ; offs=0x00030A, size=0x0010, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB06 [0250], byref
G_M4303_IG08:        ; offs=0x00031A, size=0x0008, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB06 [0250], byref
G_M4303_IG09:        ; offs=0x000322, size=0x00A3, bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB07 [0006], byref
G_M4303_IG10:        ; offs=0x0003C5, size=0x000A, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB08 [0008], byref
G_M4303_IG11:        ; offs=0x0003CF, size=0x001B, bbWeight=0.44, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB18 [0251], byref
G_M4303_IG12:        ; offs=0x0003EA, size=0x0050, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB09 [0010], byref
G_M4303_IG13:        ; offs=0x00043A, size=0x0050, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB10 [0011], byref
G_M4303_IG14:        ; offs=0x00048A, size=0x0069, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB11 [0012], byref
G_M4303_IG15:        ; offs=0x0004F3, size=0x0069, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB12 [0013], byref
G_M4303_IG16:        ; offs=0x00055C, size=0x0069, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB13 [0014], byref
G_M4303_IG17:        ; offs=0x0005C5, size=0x0069, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB14 [0015], byref
G_M4303_IG18:        ; offs=0x00062E, size=0x0069, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0006 {rcx rdx}, BB15 [0016], byref
G_M4303_IG19:        ; offs=0x000697, size=0x0074, bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB16 [0218], byref
G_M4303_IG20:        ; offs=0x00070B, size=0x0034, bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB17 [0018], byref
G_M4303_IG21:        ; epilog placeholder, next placeholder=<END>, BB17 [0018], epilog, extend <-- First placeholder <-- Last placeholder
                     ;   PrevGCVars=00000000000000000000000000000000 {}, PrevGCrefRegs=0000 {}, PrevByrefRegs=0000 {}
                     ;   InitGCVars=00000000000000000000000000000000 {}, InitGCrefRegs=0000 {}, InitByrefRegs=0000 {}
Recording Var Locations at start of BB01
  V02(r8)  V00(rcx)  V01(rdx)  V57(mm0)  V60(mm1)  V63(mm2)  V66(mm3)  V77(mm4)  V80(mm5)  V83(mm16)  V86(mm17)
*************** In genFnProlog()
Added IP mapping to front: PROLOG (G_M4303_IG01,ins#0,ofs#0) label

__prolog:
Debug: New V00 debug range: first
Debug: New V01 debug range: first
Debug: New V02 debug range: first
Found 4 lvMustInit int-sized stack slots, frame offsets -1360 through -1376
IN0126:        push     r15
IN0127:        push     r14
IN0128:        push     r13
IN0129:        push     rdi
IN012a:        push     rsi
IN012b:        push     rbp
IN012c:        push     rbx
IN012d:        sub      rsp, 0x660
IN012e:        vmovaps  xmmword ptr [rsp+0x650], xmm6
IN012f:        vmovaps  xmmword ptr [rsp+0x640], xmm7
IN0130:        vmovaps  xmmword ptr [rsp+0x630], xmm8
IN0131:        vmovaps  xmmword ptr [rsp+0x620], xmm9
IN0132:        vmovaps  xmmword ptr [rsp+0x610], xmm10
IN0133:        vmovaps  xmmword ptr [rsp+0x600], xmm11
IN0134:        vmovaps  xmmword ptr [rsp+0x5F0], xmm12
IN0135:        vmovaps  xmmword ptr [rsp+0x5E0], xmm13
IN0136:        vmovaps  xmmword ptr [rsp+0x5D0], xmm14
IN0137:        vmovaps  xmmword ptr [rsp+0x5C0], xmm15
IN0138:        xor      rax, rax
IN0139:        mov      qword ptr [V10 rsp+0x558], rax
IN013a:        mov      qword ptr [V11 rsp+0x550], rax
IN013b:        mov      qword ptr [V57 rsp+0x360], rax
IN013c:        mov      qword ptr [V57+0x8 rsp+0x368], rax
IN013d:        mov      qword ptr [V57+0x10 rsp+0x370], rax
IN013e:        mov      qword ptr [V57+0x18 rsp+0x378], rax
IN013f:        mov      qword ptr [V57+0x20 rsp+0x380], rax
IN0140:        mov      qword ptr [V57+0x28 rsp+0x388], rax
IN0141:        mov      qword ptr [V57+0x30 rsp+0x390], rax
IN0142:        mov      qword ptr [V57+0x38 rsp+0x398], rax
IN0143:        mov      qword ptr [V60 rsp+0x320], rax
IN0144:        mov      qword ptr [V60+0x8 rsp+0x328], rax
IN0145:        mov      qword ptr [V60+0x10 rsp+0x330], rax
IN0146:        mov      qword ptr [V60+0x18 rsp+0x338], rax
IN0147:        mov      qword ptr [V60+0x20 rsp+0x340], rax
IN0148:        mov      qword ptr [V60+0x28 rsp+0x348], rax
IN0149:        mov      qword ptr [V60+0x30 rsp+0x350], rax
IN014a:        mov      qword ptr [V60+0x38 rsp+0x358], rax
IN014b:        mov      qword ptr [V63 rsp+0x2E0], rax
IN014c:        mov      qword ptr [V63+0x8 rsp+0x2E8], rax
IN014d:        mov      qword ptr [V63+0x10 rsp+0x2F0], rax
IN014e:        mov      qword ptr [V63+0x18 rsp+0x2F8], rax
IN014f:        mov      qword ptr [V63+0x20 rsp+0x300], rax
IN0150:        mov      qword ptr [V63+0x28 rsp+0x308], rax
IN0151:        mov      qword ptr [V63+0x30 rsp+0x310], rax
IN0152:        mov      qword ptr [V63+0x38 rsp+0x318], rax
IN0153:        mov      qword ptr [V66 rsp+0x2A0], rax
IN0154:        mov      qword ptr [V66+0x8 rsp+0x2A8], rax
IN0155:        mov      qword ptr [V66+0x10 rsp+0x2B0], rax
IN0156:        mov      qword ptr [V66+0x18 rsp+0x2B8], rax
IN0157:        mov      qword ptr [V66+0x20 rsp+0x2C0], rax
IN0158:        mov      qword ptr [V66+0x28 rsp+0x2C8], rax
IN0159:        mov      qword ptr [V66+0x30 rsp+0x2D0], rax
IN015a:        mov      qword ptr [V66+0x38 rsp+0x2D8], rax
IN015b:        mov      qword ptr [V77 rsp+0x260], rax
IN015c:        mov      qword ptr [V77+0x8 rsp+0x268], rax
IN015d:        mov      qword ptr [V77+0x10 rsp+0x270], rax
IN015e:        mov      qword ptr [V77+0x18 rsp+0x278], rax
IN015f:        mov      qword ptr [V77+0x20 rsp+0x280], rax
IN0160:        mov      qword ptr [V77+0x28 rsp+0x288], rax
IN0161:        mov      qword ptr [V77+0x30 rsp+0x290], rax
IN0162:        mov      qword ptr [V77+0x38 rsp+0x298], rax
ISSUE: <ASSERT> #186430 D:\Users\tagoo\source\repos\runtime2\src\coreclr\jit\emit.cpp (9672) - Assertion failed 'emitCurIG != emitPrologIG' in 'System.Numerics.Tensors.TensorPrimitives:<Aggregate>g__Vectorized512|68_3[ulong,System.Numerics.Tensors.TensorPrimitives+SubtractOperator`1[ulong],System.Numerics.Tensors.TensorPrimitives+MultiplyOperator`1[ulong]](byref,byref,ulong):ulong' during 'Generate code' (IL size 1535; hash 0xf258ef30; FullOpts)

ERROR: SuperPMI: Assert Failure (PID 1386868, Thread 1343412/147fb4)
Assertion failed 'emitCurIG != emitPrologIG' in 'System.Numerics.Tensors.TensorPrimitives:<Aggregate>g__Vectorized512|68_3[ulong,System.Numerics.Tensors.TensorPrimitives+SubtractOperator`1[ulong],System.Numerics.Tensors.TensorPrimitives+MultiplyOperator`1[ulong]](byref,byref,ulong):ulong' during 'Generate code' (IL size 1535; hash 0xf258ef30; FullOpts)


D:\Users\tagoo\source\repos\runtime2\src\coreclr\jit\emit.cpp, Line: 9672

This is triggering in some SPMI replay runs.

@tannergooding
Member Author

It looks like part of the issue is that most of the locals aren't impacting initStkLclCnt.

For example, V57 has compInitMem, varDsc->lvMustInit, varDsc->lvTracked, and varDsc->lvOnFrame all set. However, we don't factor it in because varDsc->lvIsInReg() is true and varDsc->lvLiveInOutOfHndlr is false.

genZeroInitFrame, on the other hand, just zeroes everything marked varDsc->lvMustInit and varDsc->lvOnFrame. So we should probably either factor all of those into the must-init size or otherwise ensure they're correctly marked under JitStressRegs, so that the "right things" happen.
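The mismatch described here can be sketched with two predicates over a simplified local. This is a reading of the comment, not the JIT's code: `ToyLocal` is a stand-in for the flags named above, and `CountedForInit` and `ZeroedInProlog` are hypothetical names for the counting side and the zeroing side.

```cpp
#include <cassert>

// Simplified stand-in for the per-local flags named in the comment; this is
// not the JIT's real LclVarDsc type.
struct ToyLocal
{
    bool mustInit;           // models varDsc->lvMustInit
    bool onFrame;            // models varDsc->lvOnFrame
    bool inReg;              // models varDsc->lvIsInReg()
    bool liveInOutOfHandler; // models varDsc->lvLiveInOutOfHndlr
};

// Counting side, as described: a local that lives in a register and is not
// live in/out of a handler is left out of the count.
bool CountedForInit(const ToyLocal& v)
{
    if (!v.mustInit || !v.onFrame)
    {
        return false;
    }
    return !v.inReg || v.liveInOutOfHandler;
}

// Zeroing side, as described: everything marked must-init and on-frame.
bool ZeroedInProlog(const ToyLocal& v)
{
    return v.mustInit && v.onFrame;
}

// Debug-style check that both sides agree for a given local.
void AssertAccountingInSync(const ToyLocal& v)
{
    assert(CountedForInit(v) == ZeroedInProlog(v));
}

// A local with the flag combination reported for V57.
ToyLocal MakeV57Like()
{
    return ToyLocal{true, true, true, false};
}
```

For the V57-like combination, `ZeroedInProlog` returns true while `CountedForInit` returns false, so the prolog emits stores that the count never anticipated. A check like `AssertAccountingInSync` would flag exactly that case.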

@jakobbotsch
Member

jakobbotsch commented Jul 9, 2024

> This is normally not a problem. However, there are many scenarios in which the method prologue can extend past the limits of a single group, since a single group can hold only a finite number of instructions.

What are those many scenarios? Normally the prolog size is limited naturally by various factors, hence we never needed this support. It would still be nice to support it, but more so due to the flexibility – see #12302. But I think that's a separate work item and not going to happen in .NET 9.

> This happens because we set genUseBlockInit = (genInitStkLclCnt > 4) and then, in genZeroInitFrame, use that to determine whether we zero using SIMD or using what are basically sizeof(void*)-sized stores from a native general-purpose register.
>
> That itself seems "bad" from a performance perspective, since it doesn't account for how big these 4 locals are and therefore whether block or scalar zeroing is "better". Independently, though, it means this code path is broken whenever the total number of store instructions required exceeds the limits of a single IG, as occurs if you have 4x TYP_SIMD64, for example.

genInitStkLclCnt is just poorly named -- it is actually the number of stack slots that we need to zero.

I think the bug here is just around the accounting that you mention, and not related to multi IG support for the prolog, which I think will not be necessary with the bug fixed.

@jakobbotsch
Member

#104593 should fix the issue by making sure the "needs zeroing" logic is synced in the two places. Thanks for investigating!

JulieLeeMSFT removed the untriaged label Jul 11, 2024
JulieLeeMSFT added this to the 9.0.0 milestone Jul 11, 2024
@jakobbotsch
Member

Going to close this as a duplicate of #12302.

jakobbotsch closed this as not planned Jul 12, 2024
github-actions bot locked and limited conversation to collaborators Aug 11, 2024