-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Support delegate GDV guards in loop cloning #75140
JIT: Support delegate GDV guards in loop cloning #75140
Conversation
* Support cloning loops based on delegate GDV guards * Do flow-graph opts directly in loop cloning instead of relying on RBO to clean it up (for both type and delegate GDV)
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue Details
For: [MethodImpl(MethodImplOptions.NoInlining)]
private static long Foo(Func<long, long> f)
{
long result = 0;
for (long i = 0; i < 100000; i++)
result += f(i);
return result;
} Before: ; Assembly listing for method Program:Foo(System.Func`2[long,long]):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 3
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T03] ( 5, 66783 ) ref -> rsi class-hnd single-def
; V01 loc0 [V01,T01] ( 4,133563.67) long -> rdi
; V02 loc1 [V02,T00] ( 6,267125 ) long -> rbx
; V03 OutArgs [V03 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V04 tmp1 [V04,T02] ( 3,133562 ) long -> rbp "guarded devirt return temp"
;* V05 tmp2 [V05 ] ( 0, 0 ) ref -> zero-ref class-hnd "guarded devirt this exact temp"
;
; Lcl frame size = 40
G_M6926_IG01: ;; offset=0000H
57 push rdi
56 push rsi
55 push rbp
53 push rbx
4883EC28 sub rsp, 40
488BF1 mov rsi, rcx
;; size=11 bbWeight=1 PerfScore 4.50
G_M6926_IG02: ;; offset=000BH
33FF xor edi, edi
33DB xor ebx, ebx
;; size=4 bbWeight=1 PerfScore 0.50
G_M6926_IG03: ;; offset=000FH
48B8084A3BB2FD7F0000 mov rax, 0x7FFDB23B4A08 ; code for <>c:<Main>b__0_0
48394618 cmp qword ptr [rsi+18H], rax
751F jne SHORT G_M6926_IG07
486BEB2A imul rbp, rbx, 42
;; size=20 bbWeight=66781 PerfScore 417381.25
G_M6926_IG04: ;; offset=0023H
4803FD add rdi, rbp
48FFC3 inc rbx
4881FBA0860100 cmp rbx, 0x186A0
7CDD jl SHORT G_M6926_IG03
;; size=15 bbWeight=66781 PerfScore 116866.75
G_M6926_IG05: ;; offset=0032H
488BC7 mov rax, rdi
;; size=3 bbWeight=0.67 PerfScore 0.17
G_M6926_IG06: ;; offset=0035H
4883C428 add rsp, 40
5B pop rbx
5D pop rbp
5E pop rsi
5F pop rdi
C3 ret
;; size=9 bbWeight=0.67 PerfScore 2.17
G_M6926_IG07: ;; offset=003EH
488B4E08 mov rcx, gword ptr [rsi+08H]
488BD3 mov rdx, rbx
FF5618 call [rsi+18H]System.Func`2[long,long]:Invoke(long):long:this
488BE8 mov rbp, rax
EBD6 jmp SHORT G_M6926_IG04
;; size=15 bbWeight=0 PerfScore 0.00
; Total bytes of code 77, prolog size 11, PerfScore 534263.03, instruction count 28, allocated bytes for code 77 (MethodHash=f7dde4f1) for method Program:Foo(System.Func`2[long,long]):long
; ============================================================ After: ; Assembly listing for method Program:Foo(System.Func`2[long,long]):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 3
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T03] ( 7, 671.83) ref -> rsi class-hnd single-def
; V01 loc0 [V01,T01] ( 6,133567.00) long -> rdi
; V02 loc1 [V02,T00] ( 10,267131.67) long -> rbx
; V03 OutArgs [V03 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V04 tmp1 [V04,T02] ( 5,133565.33) long -> rbp "guarded devirt return temp"
;* V05 tmp2 [V05 ] ( 0, 0 ) ref -> zero-ref class-hnd "guarded devirt this exact temp"
;
; Lcl frame size = 40
G_M6926_IG01: ;; offset=0000H
57 push rdi
56 push rsi
55 push rbp
53 push rbx
4883EC28 sub rsp, 40
488BF1 mov rsi, rcx
;; size=11 bbWeight=1 PerfScore 4.50
G_M6926_IG02: ;; offset=000BH
33FF xor edi, edi
33DB xor ebx, ebx
4885F6 test rsi, rsi
7425 je SHORT G_M6926_IG05
48B8084A3CB2FD7F0000 mov rax, 0x7FFDB23C4A08 ; code for <>c:<Main>b__0_0
48394618 cmp qword ptr [rsi+18H], rax
7515 jne SHORT G_M6926_IG05
;; size=25 bbWeight=1 PerfScore 6.00
G_M6926_IG03: ;; offset=0024H
486BEB2A imul rbp, rbx, 42
4803FD add rdi, rbp
48FFC3 inc rbx
4881FBA0860100 cmp rbx, 0x186A0
7CED jl SHORT G_M6926_IG03
;; size=19 bbWeight=66114.84 PerfScore 247930.65
G_M6926_IG04: ;; offset=0037H
EB23 jmp SHORT G_M6926_IG07
;; size=2 bbWeight=1 PerfScore 2.00
G_M6926_IG05: ;; offset=0039H
48B8084A3CB2FD7F0000 mov rax, 0x7FFDB23C4A08 ; code for <>c:<Main>b__0_0
48394618 cmp qword ptr [rsi+18H], rax
751F jne SHORT G_M6926_IG09
486BEB2A imul rbp, rbx, 42
;; size=20 bbWeight=667.83 PerfScore 4173.92
G_M6926_IG06: ;; offset=004DH
4803FD add rdi, rbp
48FFC3 inc rbx
4881FBA0860100 cmp rbx, 0x186A0
7CDD jl SHORT G_M6926_IG05
;; size=15 bbWeight=667.83 PerfScore 1168.70
G_M6926_IG07: ;; offset=005CH
488BC7 mov rax, rdi
;; size=3 bbWeight=0.67 PerfScore 0.17
G_M6926_IG08: ;; offset=005FH
4883C428 add rsp, 40
5B pop rbx
5D pop rbp
5E pop rsi
5F pop rdi
C3 ret
;; size=9 bbWeight=0.67 PerfScore 2.17
G_M6926_IG09: ;; offset=0068H
488B4E08 mov rcx, gword ptr [rsi+08H]
488BD3 mov rdx, rbx
FF5618 call [rsi+18H]System.Func`2[long,long]:Invoke(long):long:this
488BE8 mov rbp, rax
EBD6 jmp SHORT G_M6926_IG06
;; size=15 bbWeight=0 PerfScore 0.00
; Total bytes of code 119, prolog size 11, PerfScore 253300.00, instruction count 39, allocated bytes for code 119 (MethodHash=f7dde4f1) for method Program:Foo(System.Func`2[long,long]):long
; ============================================================ Note that we are not eliminating the GDV check in the cold loop. In cases where we the only cloning condition we have is the GDV guard we should do that. But I'm not sure if I should fold that into this PR too. Micro benchmark: [Benchmark]
public long Bench()
{
long sum = 0;
for (int i = 0; i < 30; i++)
sum += SumVals(j => j * 42);
return sum;
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static long SumVals(Func<long, long> f)
{
long result = 0;
for (long i = 0; i < 10000000; i++)
result += f(i);
return result;
}
cc @dotnet/jit-contrib @davidfowl
|
5cefd6c
to
a5afa33
Compare
Rely on RBO like type test opt.
@@ -594,7 +594,7 @@ class IndirectCallTransformer | |||
// TODO-GDV: Consider duplicating the store at the end of the | |||
// cold case for the previous GDV. Then we can reuse the target | |||
// if the second check of a chained GDV fails. | |||
bool reuseTarget = (origCall->gtCallMoreFlags & GTF_CALL_M_GUARDED_DEVIRT_CHAIN) == 0; | |||
bool reuseTarget = false; //(origCall->gtCallMoreFlags & GTF_CALL_M_GUARDED_DEVIRT_CHAIN) == 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be disabled just for method GDV?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean for vtable GDV only, and not for delegate GDV? This code does not run for class-based GDV.
As you can probably guess it's necessary because otherwise we cannot recognize the test in loop cloning. However, it does introduce an extra load on the cold path. Actually this seems fine on xarch as we can contain the load in the hot compare, so in a simple case of:
private static Func<long, long> s_f;
[MethodImpl(MethodImplOptions.NoInlining)]
private static long Foo()
{
long result = 0;
for (long i = 0; i < 100000; i++)
result += s_f(i);
return result;
}
The diff looks like:
@@ -33,19 +32,18 @@ G_M54051_IG02:
mov rbx, 0xD1FFAB1E ; data for Program:s_f
;; size=14 bbWeight=1 PerfScore 0.75
G_M54051_IG03:
- mov rcx, gword ptr [rbx]
- mov r8, qword ptr [rcx+18H]
+ mov r8, gword ptr [rbx]
mov rax, 0xD1FFAB1E ; code for <>c:<Main>b__0_0
- cmp r8, rax
+ cmp qword ptr [r8+18H], rax
jne SHORT G_M54051_IG07
imul rbp, rdi, 42
G_M54051_IG04:
add rsi, rbp
inc rdi
cmp rdi, 0x186A0
jl SHORT G_M54051_IG03
G_M54051_IG05:
mov rax, rsi
;; size=3 bbWeight=0.67 PerfScore 0.17
@@ -58,12 +56,12 @@ G_M54051_IG06:
ret
;; size=9 bbWeight=0.67 PerfScore 2.17
G_M54051_IG07:
- mov rcx, gword ptr [rcx+08H]
+ mov rcx, gword ptr [r8+08H]
mov rdx, rdi
- call r8
+ call [r8+18H]System.Func`2[long,long]:Invoke(long):long:this
mov rbp, rax
jmp SHORT G_M54051_IG04
- ;; size=15 bbWeight=0 PerfScore 0.00
+ ;; size=16 bbWeight=0 PerfScore 0.00
On arm it's worse (depending on register pressure) since we have to load the delegate address into a register anyway, so we just stop reusing this register on the cold path:
@@ -43,19 +42,19 @@ G_M54051_IG02:
movk x23, #0xD1FFAB1E LSL #32
;; size=40 bbWeight=1 PerfScore 5.00
G_M54051_IG03:
- ldr x1, [x22]
- ldr x2, [x1, #0x18]
- cmp x2, x23
+ ldr x2, [x22]
+ ldr x0, [x2, #0x18]
+ cmp x0, x23
bne G_M54051_IG07
mov x0, #42
mul x24, x20, x0
G_M54051_IG04:
add x19, x19, x24
add x20, x20, #1
cmp x20, x21
blt G_M54051_IG03
G_M54051_IG05:
mov x0, x19
;; size=4 bbWeight=0.67 PerfScore 0.33
@@ -67,12 +66,13 @@ G_M54051_IG06:
ret lr
;; size=20 bbWeight=0.67 PerfScore 3.33
G_M54051_IG07:
- ldr x0, [x1, #0x08]
+ ldr x0, [x2, #0x08]
mov x1, x20
+ ldr x2, [x2, #0x18]
blr x2
mov x24, x0
b G_M54051_IG04
- ;; size=20 bbWeight=0 PerfScore 0.00
+ ;; size=24 bbWeight=0 PerfScore 0.00
My opinion here is that this should ultimately be left up to CSE to decide. We may want to look at whether there is some CSE heuristics we should change for ARM given that less containment is possible there, so creating a CSE def in a hot block for a CSE use in a cold block may be more attractive (edit: opened #75253 for this).
This also does put delegate GDV in line with virtual class GDV. We do not reuse the method table indir in the cold case for virtual class GDVs either, even though we could.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good.
/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo |
Azure Pipelines successfully started running 2 pipeline(s). |
Lots of libraries-pgo test failures, but I am not sure if they are related. Looks like we have a bunch of the same failures in main that need to be sorted through. |
/azp run runtime-coreclr libraries-pgo |
Azure Pipelines successfully started running 1 pipeline(s). |
cc @dotnet/jit-contrib PTAL @AndyAyersMS |
Are any of the SPMI diffs showing new cloning? I would have expected we have enough data there to inspire at least a few cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM.
@BruceForstall you might want to review as well.
This is the one example I found: runtime/src/libraries/System.Linq/src/System/Linq/ToCollection.cs Lines 161 to 164 in 57bfe47
Loop cloning triggers for both a type test and the two delegate calls, but RBO is not able to clean up any of the cases: Dominator BB24 of BB30 has relop with same liberal VN
N006 ( 8, 15) [000595] J--X---N--- ▌ NE int $14f
N004 ( 4, 4) [000593] #--X------- ├──▌ IND long $202
N003 ( 2, 2) [000592] -------N--- │ └──▌ ADD byref $341
N001 ( 1, 1) [000590] ----------- │ ├──▌ LCL_VAR ref V03 arg2 u:1 $c2
N002 ( 1, 1) [000591] ----------- │ └──▌ CNS_INT long 24 $300
N005 ( 3, 10) [000594] H---------- └──▌ CNS_INT(h) long 0x7fff74602f00 ftn $2c2
Redundant compare; current relop:
N006 ( 8, 15) [000197] J--X---N--- ▌ NE int $14f
N004 ( 4, 4) [000195] #--X------- ├──▌ IND long $202
N003 ( 2, 2) [000194] -------N--- │ └──▌ ADD byref $341
N001 ( 1, 1) [000192] ----------- │ ├──▌ LCL_VAR ref V03 arg2 u:1 $c2
N002 ( 1, 1) [000193] ----------- │ └──▌ CNS_INT long 24 $300
N005 ( 3, 10) [000196] H---------- └──▌ CNS_INT(h) long 0x7fff74602f00 ftn $2c2
Fall through successor BB25 of BB24 reaches, relop [000197] must be false
Current relop has exception side effect and is in a try, so we won't optimize
...
Dominator BB23 of BB27 has relop with same liberal VN
N006 ( 8, 15) [000588] J--X---N--- ▌ NE int $14d
N004 ( 4, 4) [000586] #--X------- ├──▌ IND long $201
N003 ( 2, 2) [000585] -------N--- │ └──▌ ADD byref $340
N001 ( 1, 1) [000583] ----------- │ ├──▌ LCL_VAR ref V02 arg1 u:1 $c1
N002 ( 1, 1) [000584] ----------- │ └──▌ CNS_INT long 24 $300
N005 ( 3, 10) [000587] H---------- └──▌ CNS_INT(h) long 0x7fff74602ee8 ftn $2c1
Redundant compare; current relop:
N006 ( 8, 15) [000171] J--X---N--- ▌ NE int $14d
N004 ( 4, 4) [000169] #--X------- ├──▌ IND long $201
N003 ( 2, 2) [000168] -------N--- │ └──▌ ADD byref $340
N001 ( 1, 1) [000166] ----------- │ ├──▌ LCL_VAR ref V02 arg1 u:1 $c1
N002 ( 1, 1) [000167] ----------- │ └──▌ CNS_INT long 24 $300
N005 ( 3, 10) [000170] H---------- └──▌ CNS_INT(h) long 0x7fff74602ee8 ftn $2c1
Fall through successor BB24 of BB23 reaches, relop [000171] must be false
Current relop has exception side effect and is in a try, so we won't optimize
...
Dominator BB25 of BB34 has relop with same liberal VN
N004 ( 7, 13) [000600] J--X---N--- ▌ NE int $151
N002 ( 3, 2) [000598] #--X------- ├──▌ IND long $203
N001 ( 1, 1) [000597] ----------- │ └──▌ LCL_VAR ref V10 loc5 u:1 $c6
N003 ( 3, 10) [000599] H---------- └──▌ CNS_INT(h) long 0x7fff7466ce98 class $2c3
Redundant compare; current relop:
N004 ( 7, 13) [000224] J--X---N--- ▌ NE int $151
N002 ( 3, 2) [000222] #--X------- ├──▌ IND long $203
N001 ( 1, 1) [000221] ----------- │ └──▌ LCL_VAR ref V10 loc5 u:1 $c6
N003 ( 3, 10) [000223] H---------- └──▌ CNS_INT(h) long 0x7fff7466ce98 class $2c3
Fall through successor BB26 of BB25 reaches, relop [000224] must be false
Current relop has exception side effect and is in a try, so we won't optimize I'm not sure if the intention is to rely on early prop to remove these flags, but I guess the simplest way is to just have loop cloning remove these flags on its own. |
This prevents RBO from optimizing them away if the loop is inside an EH handler (e.g. foreach loop).
@AndyAyersMS Can you please look at the latest commit, which makes the indirs nonfaulting in the "hot" loop for both type tests and delegate address tests? Also, I checked via an assert and there are 13 instances that trigger loop cloning with delegate address guards in the aspnet collection. |
An interesting case we get in the asp.net collection are the two inner loops in runtime/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/ArraySortHelper.cs Lines 197 to 198 in 31d4e3a
We hit missing information after the cloning, so I cannot see the exact diff. I'll try to measure how it affects performance of sorting. |
Fixing it here is good, but do we know why early prop can't do this too? |
Early prop only looks for |
If we get #71707 in and start running early prop on all blocks we might be able to expand it to look for the relops too. |
Effect on sorting seems pretty much within variance: private static readonly int[] s_nums = Enumerable.Range(0, 1000000).ToArray();
[Benchmark]
public int Random()
{
Random rand = new Random(5);
int[] nums = s_nums;
for (int i = 0; i < nums.Length; i++)
nums[i] = rand.Next(50);
Array.Sort(nums, (a, b) => b - a);
return nums.Length;
}
[Benchmark]
public int Ascending()
{
int[] nums = s_nums;
for (int i = 0; i < nums.Length; i++)
nums[i] = i;
Array.Sort(nums, (a, b) => b - a);
return nums.Length;
}
[Benchmark]
public int Descending()
{
int[] nums = s_nums;
for (int i = 0; i < nums.Length; i++)
nums[i] = nums.Length - i;
Array.Sort(nums, (a, b) => b - a);
return nums.Length;
}
[Benchmark]
public int Equal()
{
int[] nums = s_nums;
for (int i = 0; i < nums.Length; i++)
nums[i] = 0;
Array.Sort(nums, (a, b) => b - a);
return nums.Length;
}
I do notice that we end up moving some of the hot basic blocks inside the loop to the end of the function, so might be we could do better with some smarter flow graph linearization. And of course, if we cloned the outer loop instead of the two inner loops it would probably be better too. |
Add support for cloning loops based on delegate GDV guards. Mark delegate address loads as invariant to allow VN and CSE of them.
For:
Before (GDV guard for the delegate target on every iteration):
After (loop cloned with no GDV guard checks inside the loop):
Note that we are not eliminating the GDV check in the cold loop. In cases where we the only cloning condition we have is the GDV guard we should do that. But I'm not sure if I should fold that into this PR too.
Micro benchmark:
cc @dotnet/jit-contrib @davidfowl