Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Support delegate GDV guards in loop cloning #75140

Merged
merged 12 commits into from
Oct 20, 2022

Conversation

jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented Sep 6, 2022

Add support for cloning loops based on delegate GDV guards. Mark delegate address loads as invariant to allow VN and CSE of them.

For:

[MethodImpl(MethodImplOptions.NoInlining)]
private static long Foo(Func<long, long> f)
{
    long result = 0;
    for (long i = 0; i < 100000; i++)
        result += f(i);

    return result;
}

Before (GDV guard for the delegate target on every iteration):

; Assembly listing for method Program:Foo(System.Func`2[long,long]):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 3
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T03] (  5, 66783   )     ref  ->  rsi         class-hnd single-def
;  V01 loc0         [V01,T01] (  4,133563.67)    long  ->  rdi
;  V02 loc1         [V02,T00] (  6,267125   )    long  ->  rbx
;  V03 OutArgs      [V03    ] (  1,     1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V04 tmp1         [V04,T02] (  3,133562   )    long  ->  rbp         "guarded devirt return temp"
;* V05 tmp2         [V05    ] (  0,     0   )     ref  ->  zero-ref    class-hnd "guarded devirt this exact temp"
;
; Lcl frame size = 40

G_M6926_IG01:              ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC28             sub      rsp, 40
       488BF1               mov      rsi, rcx
                                                ;; size=11 bbWeight=1    PerfScore 4.50
G_M6926_IG02:              ;; offset=000BH
       33FF                 xor      edi, edi
       33DB                 xor      ebx, ebx
                                                ;; size=4 bbWeight=1    PerfScore 0.50
G_M6926_IG03:              ;; offset=000FH
       48B8084A3BB2FD7F0000 mov      rax, 0x7FFDB23B4A08      ; code for <>c:<Main>b__0_0
       48394618             cmp      qword ptr [rsi+18H], rax
       751F                 jne      SHORT G_M6926_IG07
       486BEB2A             imul     rbp, rbx, 42
                                                ;; size=20 bbWeight=66781    PerfScore 417381.25
G_M6926_IG04:              ;; offset=0023H
       4803FD               add      rdi, rbp
       48FFC3               inc      rbx
       4881FBA0860100       cmp      rbx, 0x186A0
       7CDD                 jl       SHORT G_M6926_IG03
                                                ;; size=15 bbWeight=66781    PerfScore 116866.75
G_M6926_IG05:              ;; offset=0032H
       488BC7               mov      rax, rdi
                                                ;; size=3 bbWeight=0.67 PerfScore 0.17
G_M6926_IG06:              ;; offset=0035H
       4883C428             add      rsp, 40
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret
                                                ;; size=9 bbWeight=0.67 PerfScore 2.17
G_M6926_IG07:              ;; offset=003EH
       488B4E08             mov      rcx, gword ptr [rsi+08H]
       488BD3               mov      rdx, rbx
       FF5618               call     [rsi+18H]System.Func`2[long,long]:Invoke(long):long:this
       488BE8               mov      rbp, rax
       EBD6                 jmp      SHORT G_M6926_IG04
                                                ;; size=15 bbWeight=0    PerfScore 0.00

; Total bytes of code 77, prolog size 11, PerfScore 534263.03, instruction count 28, allocated bytes for code 77 (MethodHash=f7dde4f1) for method Program:Foo(System.Func`2[long,long]):long
; ============================================================

After (loop cloned with no GDV guard checks inside the loop):

; Assembly listing for method Program:Foo(System.Func`2[long,long]):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 3
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T03] (  7,   671.83)     ref  ->  rsi         class-hnd single-def
;  V01 loc0         [V01,T01] (  6,133567.00)    long  ->  rdi
;  V02 loc1         [V02,T00] ( 10,267131.67)    long  ->  rbx
;  V03 OutArgs      [V03    ] (  1,     1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V04 tmp1         [V04,T02] (  5,133565.33)    long  ->  rbp         "guarded devirt return temp"
;* V05 tmp2         [V05    ] (  0,     0   )     ref  ->  zero-ref    class-hnd "guarded devirt this exact temp"
;
; Lcl frame size = 40

G_M6926_IG01:              ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC28             sub      rsp, 40
       488BF1               mov      rsi, rcx
                                                ;; size=11 bbWeight=1    PerfScore 4.50
G_M6926_IG02:              ;; offset=000BH
       33FF                 xor      edi, edi
       33DB                 xor      ebx, ebx
       4885F6               test     rsi, rsi
       7425                 je       SHORT G_M6926_IG05
       48B8084A3CB2FD7F0000 mov      rax, 0x7FFDB23C4A08      ; code for <>c:<Main>b__0_0
       48394618             cmp      qword ptr [rsi+18H], rax
       7515                 jne      SHORT G_M6926_IG05
                                                ;; size=25 bbWeight=1    PerfScore 6.00
G_M6926_IG03:              ;; offset=0024H
       486BEB2A             imul     rbp, rbx, 42
       4803FD               add      rdi, rbp
       48FFC3               inc      rbx
       4881FBA0860100       cmp      rbx, 0x186A0
       7CED                 jl       SHORT G_M6926_IG03
                                                ;; size=19 bbWeight=66114.84 PerfScore 247930.65
G_M6926_IG04:              ;; offset=0037H
       EB23                 jmp      SHORT G_M6926_IG07
                                                ;; size=2 bbWeight=1    PerfScore 2.00
G_M6926_IG05:              ;; offset=0039H
       48B8084A3CB2FD7F0000 mov      rax, 0x7FFDB23C4A08      ; code for <>c:<Main>b__0_0
       48394618             cmp      qword ptr [rsi+18H], rax
       751F                 jne      SHORT G_M6926_IG09
       486BEB2A             imul     rbp, rbx, 42
                                                ;; size=20 bbWeight=667.83 PerfScore 4173.92
G_M6926_IG06:              ;; offset=004DH
       4803FD               add      rdi, rbp
       48FFC3               inc      rbx
       4881FBA0860100       cmp      rbx, 0x186A0
       7CDD                 jl       SHORT G_M6926_IG05
                                                ;; size=15 bbWeight=667.83 PerfScore 1168.70
G_M6926_IG07:              ;; offset=005CH
       488BC7               mov      rax, rdi
                                                ;; size=3 bbWeight=0.67 PerfScore 0.17
G_M6926_IG08:              ;; offset=005FH
       4883C428             add      rsp, 40
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret
                                                ;; size=9 bbWeight=0.67 PerfScore 2.17
G_M6926_IG09:              ;; offset=0068H
       488B4E08             mov      rcx, gword ptr [rsi+08H]
       488BD3               mov      rdx, rbx
       FF5618               call     [rsi+18H]System.Func`2[long,long]:Invoke(long):long:this
       488BE8               mov      rbp, rax
       EBD6                 jmp      SHORT G_M6926_IG06
                                                ;; size=15 bbWeight=0    PerfScore 0.00

; Total bytes of code 119, prolog size 11, PerfScore 253300.00, instruction count 39, allocated bytes for code 119 (MethodHash=f7dde4f1) for method Program:Foo(System.Func`2[long,long]):long
; ============================================================

Note that we are not eliminating the GDV check in the cold loop. In cases where we the only cloning condition we have is the GDV guard we should do that. But I'm not sure if I should fold that into this PR too.

Micro benchmark:

[Benchmark]
public long Bench()
{
    long sum = 0;
    for (int i = 0; i < 30; i++)
        sum += SumVals(j => j * 42);

    return sum;
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static long SumVals(Func<long, long> f)
{
    long result = 0;
    for (long i = 0; i < 10000000; i++)
        result += f(i);

    return result;
}
Method Job EnvironmentVariables Toolchain Mean Error StdDev Ratio
Bench Job-LKWCTI Empty main 443.48 ms 0.382 ms 0.319 ms 3.50
Bench Job-MGSBRD DOTNET_TieredPGO=1 main 126.67 ms 0.331 ms 0.309 ms 1.00
Bench Job-RKOAHT DOTNET_TieredPGO=1 pr 62.92 ms 0.287 ms 0.255 ms 0.50

cc @dotnet/jit-contrib @davidfowl

* Support cloning loops based on delegate GDV guards
* Do flow-graph opts directly in loop cloning instead of relying on RBO
  to clean it up (for both type and delegate GDV)
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 6, 2022
@ghost ghost assigned jakobbotsch Sep 6, 2022
@ghost
Copy link

ghost commented Sep 6, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details
  • Support cloning loops based on delegate GDV guards
  • Do flow-graph opts directly in loop cloning instead of relying on RBO
    to clean it up (for both type and delegate GDV)

For:

[MethodImpl(MethodImplOptions.NoInlining)]
private static long Foo(Func<long, long> f)
{
    long result = 0;
    for (long i = 0; i < 100000; i++)
        result += f(i);

    return result;
}

Before:

; Assembly listing for method Program:Foo(System.Func`2[long,long]):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 3
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T03] (  5, 66783   )     ref  ->  rsi         class-hnd single-def
;  V01 loc0         [V01,T01] (  4,133563.67)    long  ->  rdi
;  V02 loc1         [V02,T00] (  6,267125   )    long  ->  rbx
;  V03 OutArgs      [V03    ] (  1,     1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V04 tmp1         [V04,T02] (  3,133562   )    long  ->  rbp         "guarded devirt return temp"
;* V05 tmp2         [V05    ] (  0,     0   )     ref  ->  zero-ref    class-hnd "guarded devirt this exact temp"
;
; Lcl frame size = 40

G_M6926_IG01:              ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC28             sub      rsp, 40
       488BF1               mov      rsi, rcx
                                                ;; size=11 bbWeight=1    PerfScore 4.50
G_M6926_IG02:              ;; offset=000BH
       33FF                 xor      edi, edi
       33DB                 xor      ebx, ebx
                                                ;; size=4 bbWeight=1    PerfScore 0.50
G_M6926_IG03:              ;; offset=000FH
       48B8084A3BB2FD7F0000 mov      rax, 0x7FFDB23B4A08      ; code for <>c:<Main>b__0_0
       48394618             cmp      qword ptr [rsi+18H], rax
       751F                 jne      SHORT G_M6926_IG07
       486BEB2A             imul     rbp, rbx, 42
                                                ;; size=20 bbWeight=66781    PerfScore 417381.25
G_M6926_IG04:              ;; offset=0023H
       4803FD               add      rdi, rbp
       48FFC3               inc      rbx
       4881FBA0860100       cmp      rbx, 0x186A0
       7CDD                 jl       SHORT G_M6926_IG03
                                                ;; size=15 bbWeight=66781    PerfScore 116866.75
G_M6926_IG05:              ;; offset=0032H
       488BC7               mov      rax, rdi
                                                ;; size=3 bbWeight=0.67 PerfScore 0.17
G_M6926_IG06:              ;; offset=0035H
       4883C428             add      rsp, 40
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret
                                                ;; size=9 bbWeight=0.67 PerfScore 2.17
G_M6926_IG07:              ;; offset=003EH
       488B4E08             mov      rcx, gword ptr [rsi+08H]
       488BD3               mov      rdx, rbx
       FF5618               call     [rsi+18H]System.Func`2[long,long]:Invoke(long):long:this
       488BE8               mov      rbp, rax
       EBD6                 jmp      SHORT G_M6926_IG04
                                                ;; size=15 bbWeight=0    PerfScore 0.00

; Total bytes of code 77, prolog size 11, PerfScore 534263.03, instruction count 28, allocated bytes for code 77 (MethodHash=f7dde4f1) for method Program:Foo(System.Func`2[long,long]):long
; ============================================================

After:

; Assembly listing for method Program:Foo(System.Func`2[long,long]):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 3
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T03] (  7,   671.83)     ref  ->  rsi         class-hnd single-def
;  V01 loc0         [V01,T01] (  6,133567.00)    long  ->  rdi
;  V02 loc1         [V02,T00] ( 10,267131.67)    long  ->  rbx
;  V03 OutArgs      [V03    ] (  1,     1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V04 tmp1         [V04,T02] (  5,133565.33)    long  ->  rbp         "guarded devirt return temp"
;* V05 tmp2         [V05    ] (  0,     0   )     ref  ->  zero-ref    class-hnd "guarded devirt this exact temp"
;
; Lcl frame size = 40

G_M6926_IG01:              ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC28             sub      rsp, 40
       488BF1               mov      rsi, rcx
                                                ;; size=11 bbWeight=1    PerfScore 4.50
G_M6926_IG02:              ;; offset=000BH
       33FF                 xor      edi, edi
       33DB                 xor      ebx, ebx
       4885F6               test     rsi, rsi
       7425                 je       SHORT G_M6926_IG05
       48B8084A3CB2FD7F0000 mov      rax, 0x7FFDB23C4A08      ; code for <>c:<Main>b__0_0
       48394618             cmp      qword ptr [rsi+18H], rax
       7515                 jne      SHORT G_M6926_IG05
                                                ;; size=25 bbWeight=1    PerfScore 6.00
G_M6926_IG03:              ;; offset=0024H
       486BEB2A             imul     rbp, rbx, 42
       4803FD               add      rdi, rbp
       48FFC3               inc      rbx
       4881FBA0860100       cmp      rbx, 0x186A0
       7CED                 jl       SHORT G_M6926_IG03
                                                ;; size=19 bbWeight=66114.84 PerfScore 247930.65
G_M6926_IG04:              ;; offset=0037H
       EB23                 jmp      SHORT G_M6926_IG07
                                                ;; size=2 bbWeight=1    PerfScore 2.00
G_M6926_IG05:              ;; offset=0039H
       48B8084A3CB2FD7F0000 mov      rax, 0x7FFDB23C4A08      ; code for <>c:<Main>b__0_0
       48394618             cmp      qword ptr [rsi+18H], rax
       751F                 jne      SHORT G_M6926_IG09
       486BEB2A             imul     rbp, rbx, 42
                                                ;; size=20 bbWeight=667.83 PerfScore 4173.92
G_M6926_IG06:              ;; offset=004DH
       4803FD               add      rdi, rbp
       48FFC3               inc      rbx
       4881FBA0860100       cmp      rbx, 0x186A0
       7CDD                 jl       SHORT G_M6926_IG05
                                                ;; size=15 bbWeight=667.83 PerfScore 1168.70
G_M6926_IG07:              ;; offset=005CH
       488BC7               mov      rax, rdi
                                                ;; size=3 bbWeight=0.67 PerfScore 0.17
G_M6926_IG08:              ;; offset=005FH
       4883C428             add      rsp, 40
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret
                                                ;; size=9 bbWeight=0.67 PerfScore 2.17
G_M6926_IG09:              ;; offset=0068H
       488B4E08             mov      rcx, gword ptr [rsi+08H]
       488BD3               mov      rdx, rbx
       FF5618               call     [rsi+18H]System.Func`2[long,long]:Invoke(long):long:this
       488BE8               mov      rbp, rax
       EBD6                 jmp      SHORT G_M6926_IG06
                                                ;; size=15 bbWeight=0    PerfScore 0.00

; Total bytes of code 119, prolog size 11, PerfScore 253300.00, instruction count 39, allocated bytes for code 119 (MethodHash=f7dde4f1) for method Program:Foo(System.Func`2[long,long]):long
; ============================================================

Note that we are not eliminating the GDV check in the cold loop. In cases where we the only cloning condition we have is the GDV guard we should do that. But I'm not sure if I should fold that into this PR too.

Micro benchmark:

[Benchmark]
public long Bench()
{
    long sum = 0;
    for (int i = 0; i < 30; i++)
        sum += SumVals(j => j * 42);

    return sum;
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static long SumVals(Func<long, long> f)
{
    long result = 0;
    for (long i = 0; i < 10000000; i++)
        result += f(i);

    return result;
}
Method Job EnvironmentVariables Toolchain Mean Error StdDev Ratio
Bench Job-LKWCTI Empty main 443.48 ms 0.382 ms 0.319 ms 3.50
Bench Job-MGSBRD DOTNET_TieredPGO=1 main 126.67 ms 0.331 ms 0.309 ms 1.00
Bench Job-RKOAHT DOTNET_TieredPGO=1 pr 62.92 ms 0.287 ms 0.255 ms 0.50

cc @dotnet/jit-contrib @davidfowl

Author: jakobbotsch
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@@ -594,7 +594,7 @@ class IndirectCallTransformer
// TODO-GDV: Consider duplicating the store at the end of the
// cold case for the previous GDV. Then we can reuse the target
// if the second check of a chained GDV fails.
bool reuseTarget = (origCall->gtCallMoreFlags & GTF_CALL_M_GUARDED_DEVIRT_CHAIN) == 0;
bool reuseTarget = false; //(origCall->gtCallMoreFlags & GTF_CALL_M_GUARDED_DEVIRT_CHAIN) == 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be disabled just for method GDV?

Copy link
Member Author

@jakobbotsch jakobbotsch Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean for vtable GDV only, and not for delegate GDV? This code does not run for class-based GDV.

As you can probably guess it's necessary because otherwise we cannot recognize the test in loop cloning. However, it does introduce an extra load on the cold path. Actually this seems fine on xarch as we can contain the load in the hot compare, so in a simple case of:

private static Func<long, long> s_f;

[MethodImpl(MethodImplOptions.NoInlining)]
private static long Foo()
{
    long result = 0;
    for (long i = 0; i < 100000; i++)
        result += s_f(i);

    return result;
}

The diff looks like:

@@ -33,19 +32,18 @@ G_M54051_IG02:
        mov      rbx, 0xD1FFAB1E      ; data for Program:s_f
                                                 ;; size=14 bbWeight=1    PerfScore 0.75
 G_M54051_IG03:
-       mov      rcx, gword ptr [rbx]
-       mov      r8, qword ptr [rcx+18H]
+       mov      r8, gword ptr [rbx]
        mov      rax, 0xD1FFAB1E      ; code for <>c:<Main>b__0_0
-       cmp      r8, rax
+       cmp      qword ptr [r8+18H], rax
        jne      SHORT G_M54051_IG07
        imul     rbp, rdi, 42

 G_M54051_IG04:
        add      rsi, rbp
        inc      rdi
        cmp      rdi, 0x186A0
        jl       SHORT G_M54051_IG03

 G_M54051_IG05:
        mov      rax, rsi
                                                 ;; size=3 bbWeight=0.67 PerfScore 0.17
@@ -58,12 +56,12 @@ G_M54051_IG06:
        ret
                                                 ;; size=9 bbWeight=0.67 PerfScore 2.17
 G_M54051_IG07:
-       mov      rcx, gword ptr [rcx+08H]
+       mov      rcx, gword ptr [r8+08H]
        mov      rdx, rdi
-       call     r8
+       call     [r8+18H]System.Func`2[long,long]:Invoke(long):long:this
        mov      rbp, rax
        jmp      SHORT G_M54051_IG04
-                                                ;; size=15 bbWeight=0    PerfScore 0.00
+                                                ;; size=16 bbWeight=0    PerfScore 0.00

On arm it's worse (depending on register pressure) since we have to load the delegate address into a register anyway, so we just stop reusing this register on the cold path:

@@ -43,19 +42,19 @@ G_M54051_IG02:
             movk    x23, #0xD1FFAB1E LSL #32
                                                 ;; size=40 bbWeight=1    PerfScore 5.00
 G_M54051_IG03:
-            ldr     x1, [x22]
-            ldr     x2, [x1, #0x18]
-            cmp     x2, x23
+            ldr     x2, [x22]
+            ldr     x0, [x2, #0x18]
+            cmp     x0, x23
             bne     G_M54051_IG07
             mov     x0, #42
             mul     x24, x20, x0

 G_M54051_IG04:
             add     x19, x19, x24
             add     x20, x20, #1
             cmp     x20, x21
             blt     G_M54051_IG03

 G_M54051_IG05:
             mov     x0, x19
                                                 ;; size=4 bbWeight=0.67 PerfScore 0.33
@@ -67,12 +66,13 @@ G_M54051_IG06:
             ret     lr
                                                 ;; size=20 bbWeight=0.67 PerfScore 3.33
 G_M54051_IG07:
-            ldr     x0, [x1, #0x08]
+            ldr     x0, [x2, #0x08]
             mov     x1, x20
+            ldr     x2, [x2, #0x18]
             blr     x2
             mov     x24, x0
             b       G_M54051_IG04
-                                                ;; size=20 bbWeight=0    PerfScore 0.00
+                                                ;; size=24 bbWeight=0    PerfScore 0.00

My opinion here is that this should ultimately be left up to CSE to decide. We may want to look at whether there is some CSE heuristics we should change for ARM given that less containment is possible there, so creating a CSE def in a hot block for a CSE use in a cold block may be more attractive (edit: opened #75253 for this).

This also does put delegate GDV in line with virtual class GDV. We do not reuse the method table indir in the cold case for virtual class GDVs either, even though we could.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.

@jakobbotsch jakobbotsch added this to the 8.0.0 milestone Sep 15, 2022
@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch
Copy link
Member Author

Lots of libraries-pgo test failures, but I am not sure if they are related. Looks like we have a bunch of the same failures in main that need to be sorted through.

@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr libraries-pgo

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Oct 18, 2022

libraries-pgo failures look like #77144, #77143 and #75827. NAOT build failure is #76801 according to build analysis.

@jakobbotsch jakobbotsch marked this pull request as ready for review October 18, 2022 09:30
@jakobbotsch
Copy link
Member Author

cc @dotnet/jit-contrib PTAL @AndyAyersMS

@AndyAyersMS
Copy link
Member

Are any of the SPMI diffs showing new cloning? I would have expected we have enough data there to inspire at least a few cases.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM.

@BruceForstall you might want to review as well.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Oct 19, 2022

Are any of the SPMI diffs showing new cloning? I would have expected we have enough data there to inspire at least a few cases.

This is the one example I found:

foreach (TSource element in source)
{
d.Add(keySelector(element), elementSelector(element));
}

Loop cloning triggers for both a type test and the two delegate calls, but RBO is not able to clean up any of the cases:

Dominator BB24 of BB30 has relop with same liberal VN
N006 (  8, 15) [000595] J--X---N---NE        int    $14f
N004 (  4,  4) [000593] #--X-------                         ├──▌  IND       long   $202
N003 (  2,  2) [000592] -------N---                         │  └──▌  ADD       byref  $341
N001 (  1,  1) [000590] -----------                         │     ├──▌  LCL_VAR   ref    V03 arg2         u:1 $c2
N002 (  1,  1) [000591] -----------                         │     └──▌  CNS_INT   long   24 $300
N005 (  3, 10) [000594] H----------                         └──▌  CNS_INT(h) long   0x7fff74602f00 ftn $2c2
 Redundant compare; current relop:
N006 (  8, 15) [000197] J--X---N---NE        int    $14f
N004 (  4,  4) [000195] #--X-------                         ├──▌  IND       long   $202
N003 (  2,  2) [000194] -------N---                         │  └──▌  ADD       byref  $341
N001 (  1,  1) [000192] -----------                         │     ├──▌  LCL_VAR   ref    V03 arg2         u:1 $c2
N002 (  1,  1) [000193] -----------                         │     └──▌  CNS_INT   long   24 $300
N005 (  3, 10) [000196] H----------                         └──▌  CNS_INT(h) long   0x7fff74602f00 ftn $2c2
Fall through successor BB25 of BB24 reaches, relop [000197] must be false
Current relop has exception side effect and is in a try, so we won't optimize

...

Dominator BB23 of BB27 has relop with same liberal VN
N006 (  8, 15) [000588] J--X---N---NE        int    $14d
N004 (  4,  4) [000586] #--X-------                         ├──▌  IND       long   $201
N003 (  2,  2) [000585] -------N---                         │  └──▌  ADD       byref  $340
N001 (  1,  1) [000583] -----------                         │     ├──▌  LCL_VAR   ref    V02 arg1         u:1 $c1
N002 (  1,  1) [000584] -----------                         │     └──▌  CNS_INT   long   24 $300
N005 (  3, 10) [000587] H----------                         └──▌  CNS_INT(h) long   0x7fff74602ee8 ftn $2c1
 Redundant compare; current relop:
N006 (  8, 15) [000171] J--X---N---NE        int    $14d
N004 (  4,  4) [000169] #--X-------                         ├──▌  IND       long   $201
N003 (  2,  2) [000168] -------N---                         │  └──▌  ADD       byref  $340
N001 (  1,  1) [000166] -----------                         │     ├──▌  LCL_VAR   ref    V02 arg1         u:1 $c1
N002 (  1,  1) [000167] -----------                         │     └──▌  CNS_INT   long   24 $300
N005 (  3, 10) [000170] H----------                         └──▌  CNS_INT(h) long   0x7fff74602ee8 ftn $2c1
Fall through successor BB24 of BB23 reaches, relop [000171] must be false
Current relop has exception side effect and is in a try, so we won't optimize

...

Dominator BB25 of BB34 has relop with same liberal VN
N004 (  7, 13) [000600] J--X---N---NE        int    $151
N002 (  3,  2) [000598] #--X-------                         ├──▌  IND       long   $203
N001 (  1,  1) [000597] -----------                         │  └──▌  LCL_VAR   ref    V10 loc5         u:1 $c6
N003 (  3, 10) [000599] H----------                         └──▌  CNS_INT(h) long   0x7fff7466ce98 class $2c3
 Redundant compare; current relop:
N004 (  7, 13) [000224] J--X---N---NE        int    $151
N002 (  3,  2) [000222] #--X-------                         ├──▌  IND       long   $203
N001 (  1,  1) [000221] -----------                         │  └──▌  LCL_VAR   ref    V10 loc5         u:1 $c6
N003 (  3, 10) [000223] H----------                         └──▌  CNS_INT(h) long   0x7fff7466ce98 class $2c3
Fall through successor BB26 of BB25 reaches, relop [000224] must be false
Current relop has exception side effect and is in a try, so we won't optimize

I'm not sure if the intention is to rely on early prop to remove these flags, but I guess the simplest way is to just have loop cloning remove these flags on its own.

This prevents RBO from optimizing them away if the loop is inside an EH
handler (e.g. foreach loop).
@jakobbotsch
Copy link
Member Author

@AndyAyersMS Can you please look at the latest commit, which makes the indirs nonfaulting in the "hot" loop for both type tests and delegate address tests?

Also, I checked via an assert and there are 13 instances that trigger loop cloning with delegate address guards in the aspnet collection.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Oct 19, 2022

An interesting case we get in the asp.net collection are the two inner loops in
System.Collections.Generic.ArraySortHelper`1[int]:PickPivotAndPartition(System.Span`1[int],System.Comparison`1[int]):int:

while (comparer(keys[++left], pivot) < 0) ;
while (comparer(pivot, keys[--right]) < 0) ;

We hit missing information after the cloning, so I cannot see the exact diff. I'll try to measure how it affects performance of sorting.

@AndyAyersMS
Copy link
Member

I'm not sure if the intention is to rely on early prop to remove these flags, but I guess the simplest way is to just have loop cloning remove these flags on its own

Fixing it here is good, but do we know why early prop can't do this too?

@jakobbotsch
Copy link
Member Author

Fixing it here is good, but do we know why early prop can't do this too?

Early prop only looks for GT_NULLCHECK nodes, I think. It does not know about the dominating compare that cloning has added.

@jakobbotsch
Copy link
Member Author

If we get #71707 in and start running early prop on all blocks we might be able to expand it to look for the relops too.

@jakobbotsch
Copy link
Member Author

Effect on sorting seems pretty much within variance:

private static readonly int[] s_nums = Enumerable.Range(0, 1000000).ToArray();

[Benchmark]
public int Random()
{
    Random rand = new Random(5);
    int[] nums = s_nums;
    for (int i = 0; i < nums.Length; i++)
        nums[i] = rand.Next(50);

    Array.Sort(nums, (a, b) => b - a);
    return nums.Length;
}

[Benchmark]
public int Ascending()
{
    int[] nums = s_nums;
    for (int i = 0; i < nums.Length; i++)
        nums[i] = i;

    Array.Sort(nums, (a, b) => b - a);
    return nums.Length;
}

[Benchmark]
public int Descending()
{
    int[] nums = s_nums;
    for (int i = 0; i < nums.Length; i++)
        nums[i] = nums.Length - i;

    Array.Sort(nums, (a, b) => b - a);
    return nums.Length;
}

[Benchmark]
public int Equal()
{
    int[] nums = s_nums;
    for (int i = 0; i < nums.Length; i++)
        nums[i] = 0;

    Array.Sort(nums, (a, b) => b - a);
    return nums.Length;
}
Method Job EnvironmentVariables Toolchain Mean Error StdDev Ratio
Random Job-VJODMV Empty main 45.048 ms 0.2455 ms 0.2297 ms 1.00
Random Job-GFIHSH DOTNET_TieredPGO=1,DOTNET_ReadyToRun=0 main-FullPGO 30.161 ms 0.2018 ms 0.1789 ms 0.67
Random Job-ZZPKHX DOTNET_TieredPGO=1,DOTNET_ReadyToRun=0 pr-FullPGO 30.692 ms 0.1856 ms 0.1736 ms 0.68
Ascending Job-VJODMV Empty main 47.859 ms 0.0860 ms 0.0763 ms 1.00
Ascending Job-GFIHSH DOTNET_TieredPGO=1,DOTNET_ReadyToRun=0 main-FullPGO 16.344 ms 0.0323 ms 0.0302 ms 0.34
Ascending Job-ZZPKHX DOTNET_TieredPGO=1,DOTNET_ReadyToRun=0 pr-FullPGO 16.357 ms 0.0441 ms 0.0391 ms 0.34
Descending Job-VJODMV Empty main 24.597 ms 0.0492 ms 0.0460 ms 1.00
Descending Job-GFIHSH DOTNET_TieredPGO=1,DOTNET_ReadyToRun=0 main-FullPGO 8.838 ms 0.0414 ms 0.0346 ms 0.36
Descending Job-ZZPKHX DOTNET_TieredPGO=1,DOTNET_ReadyToRun=0 pr-FullPGO 8.714 ms 0.0272 ms 0.0254 ms 0.35
Equal Job-VJODMV Empty main 24.191 ms 0.0317 ms 0.0265 ms 1.00
Equal Job-GFIHSH DOTNET_TieredPGO=1,DOTNET_ReadyToRun=0 main-FullPGO 11.398 ms 0.0191 ms 0.0179 ms 0.47
Equal Job-ZZPKHX DOTNET_TieredPGO=1,DOTNET_ReadyToRun=0 pr-FullPGO 11.581 ms 0.0259 ms 0.0230 ms 0.48

I do notice that we end up moving some of the hot basic blocks inside the loop to the end of the function, so might be we could do better with some smarter flow graph linearization. And of course, if we cloned the outer loop instead of the two inner loops it would probably be better too.

@jakobbotsch jakobbotsch merged commit 17d613e into dotnet:main Oct 20, 2022
@jakobbotsch jakobbotsch deleted the delegate-gdv-loop-cloning branch October 20, 2022 09:20
@ghost ghost locked as resolved and limited conversation to collaborators Nov 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants