-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stack allocate unescaped boxes #103361
Stack allocate unescaped boxes #103361
Conversation
…-lclmorph Minor fixes
For case from #9118: ; Assembly listing for method C:Main():int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 1 inlinees with PGO data; 3 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
; V00 OutArgs [V00 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V01 tmp1 [V01,T00] ( 0, 0 ) ref -> zero-ref class-hnd exact single-def "NewObj constructor temp" <ObjectEqualityComparer`1[int]>
;* V02 tmp2 [V02 ] ( 0, 0 ) ubyte -> zero-ref "Inline return value spill temp"
;* V03 tmp3 [V03 ] ( 0, 0 ) int -> zero-ref ld-addr-op "Inlining Arg"
;* V04 tmp4 [V04 ] ( 0, 0 ) long -> zero-ref class-hnd exact "Single-def Box Helper" <System.Int32>
;* V05 tmp5 [V05 ] ( 0, 0 ) ubyte -> zero-ref "Inline return value spill temp"
;* V06 tmp6 [V06 ] ( 0, 0 ) long -> zero-ref class-hnd exact "Inlining Arg" <System.Int32>
;* V07 tmp7 [V07 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] "stack allocated boxed value class temp" <System.Collections.Generic.KeyValuePair`2[long,int]>
;* V08 tmp8 [V08 ] ( 0, 0 ) long -> zero-ref "V07.[000..008)"
;* V09 tmp9 [V09 ] ( 0, 0 ) int -> zero-ref "V07.[008..012)"
;
; Lcl frame size = 40
G_M46144_IG01: ;; offset=0x0000
sub rsp, 40
;; size=4 bbWeight=1 PerfScore 0.25
G_M46144_IG02: ;; offset=0x0004
mov eax, 100
;; size=5 bbWeight=1 PerfScore 0.25
G_M46144_IG03: ;; offset=0x0009
add rsp, 40
ret
;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code 14, prolog size 4, PerfScore 1.75, instruction count 4, allocated bytes for code 14 (MethodHash=cd934bbf) for method C:Main():int (FullOpts)
|
Current code for #10195 ; Assembly listing for method Program:Pattern[Dog](Dog) (FullOpts)
; Emitting BLENDED_CODE for X86 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) struct ( 8) rcx single-def <Dog>
;* V01 loc0 [V01 ] ( 0, 0 ) long -> zero-ref class-hnd exact <Dog>
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) long -> zero-ref class-hnd exact "Single-def Box Helper" <Dog>
;* V04 tmp2 [V04 ] ( 0, 0 ) long -> zero-ref "Inlining Arg"
; V05 tmp3 [V05 ] ( 2, 2 ) struct (16) [rsp+0x08] do-not-enreg[XSF] must-init addr-exposed "stack allocated boxed value class temp" <System.Collections.Generic.KeyValuePair`2[long,Dog]>
;
; Lcl frame size = 24
G_M46299_IG01: ;; offset=0x0000
sub rsp, 24
xor eax, eax
mov qword ptr [rsp+0x08], rax
mov qword ptr [rsp+0x10], rax
;; size=16 bbWeight=1 PerfScore 2.50
G_M46299_IG02: ;; offset=0x0010
mov rax, 0x7FFEDABE4C08 ; Dog
mov qword ptr [rsp+0x08], rax
mov byte ptr [rsp+0x10], cl
;; size=19 bbWeight=1 PerfScore 2.25
G_M46299_IG03: ;; offset=0x0023
add rsp, 24
ret
;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code 40, prolog size 16, PerfScore 6.00, instruction count 9, allocated bytes for code 40 (MethodHash=a9204b24) for method Program:Pattern[Dog](Dog) (FullOpts)
; ============================================================ No more boxing, but room for improvement... |
Any idea why we end up with address exposure for |
I think it's that local morph needs to deal with a nullcheck... will know shortly |
src/coreclr/vm/jitinterface.cpp
Outdated
TypeHandle mt(CoreLibBinder::GetElementType(ELEMENT_TYPE_I)); | ||
TypeHandle boxedFields[2] = {mt, VMClsHnd}; | ||
Instantiation boxedFieldsInst((TypeHandle*)&boxedFields, 2); | ||
TypeHandle genericKvp = CoreLibBinder::GetClass(CLASS__KEYVALUEPAIRGENERIC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work well. KeyValuePair is auto-layout when the type contains references. auto-layout moves references to the front.
For example, KeyValuePair<IntPtr, string>
has string at offset 0 and IntPtr at offset 8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it probably works ok for valuetypes currently, but it is fragile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks... we were trying to find a suitable type without needing too much from the runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should introduce dedicated internal helper type for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also try and do this entirely within the jit (that's where I started), but it would be nice to have class and field handles and the like.
I also am still not sure how effective any of this will be, so I want to do some more analysis. Earlier studies suggest we can prove maybe 1% of boxes do not escape (static, not dynamic) but were based on SPMI which can't replay a bunch of possibly interesting cases. Given the cost of the analysis I would like to see if we can get that number a bit higher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That 1% is a static measure, that is, 1% of the box call sites, not necessarily 1% of boxes allocated.
Yep, that was the reason... ; Assembly listing for method Program:Pattern[Dog](Dog) (FullOpts)
; Emitting BLENDED_CODE for X86 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 arg0 [V00 ] ( 0, 0 ) struct ( 8) zero-ref single-def <Dog>
;* V01 loc0 [V01 ] ( 0, 0 ) long -> zero-ref class-hnd exact <Dog>
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) long -> zero-ref class-hnd exact "Single-def Box Helper" <Dog>
;* V04 tmp2 [V04 ] ( 0, 0 ) long -> zero-ref "Inlining Arg"
;* V05 tmp3 [V05 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] "stack allocated boxed value class temp" <System.Collections.Generic.KeyValuePair`2[long,Dog]>
;
; Lcl frame size = 0
G_M46299_IG01: ;; offset=0x0000
;; size=0 bbWeight=1 PerfScore 0.00
G_M46299_IG02: ;; offset=0x0000
ret
;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code 1, prolog size 0, PerfScore 1.00, instruction count 1, allocated bytes for code 1 (MethodHash=a9204b24) for method Program:Pattern[Dog](Dog) (FullOpts)
; ============================================================ |
For #85570 one box gets optimized, but the other box is passed to an unbox helper. This helper causes the box to escape if the type tests fail. There is some preliminary helper escape logic in #103148 but it is too aggressive for unboxing. We would need some kind of partial escape analysis and a different helper pattern for this. |
We stack allocate in #58554 but the local box is exposed, possibly because of a So asm is not that pretty yet ; Assembly listing for method Program:Test[System.ValueTuple`2[int,System.__Canon],System.ValueTuple`2[int,System.__Canon]](System.ValueTuple`2[int,System.__Canon]):ubyte (FullOpts)
; Emitting BLENDED_CODE for X86 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 TypeCtx [V00,T01] ( 4, 4 ) long -> rcx single-def
; V01 arg0 [V01,T00] ( 3, 6 ) byref -> rdx single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) long -> zero-ref class-hnd exact "Single-def Box Helper" <System.ValueTuple`2[int,System.__Canon]>
;* V04 tmp2 [V04 ] ( 0, 0 ) byref -> zero-ref "ISINST eval op1"
; V05 tmp3 [V05,T02] ( 2, 4 ) byref -> rdx class-hnd "spilling qmarkNull" <System.ValueTuple`2[int,System.__Canon]>
; V06 tmp4 [V06 ] ( 4, 4 ) struct (24) [rsp+0x08] do-not-enreg[XSF] must-init addr-exposed "stack allocated boxed value class temp" <System.Collections.Generic.KeyValuePair`2[long,System.ValueTuple`2[int,System.__Canon]]>
;* V07 tmp5 [V07 ] ( 0, 0 ) ref -> zero-ref "field V01.Item2 (fldOffset=0x0)" P-INDEP
;* V08 tmp6 [V08 ] ( 0, 0 ) int -> zero-ref "field V01.Item1 (fldOffset=0x8)" P-INDEP
;* V09 tmp7 [V09 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref" <System.ValueTuple`2[int,System.__Canon]>
; V10 cse0 [V10,T03] ( 3, 3 ) long -> rax "CSE #01: aggressive"
;
; Lcl frame size = 40
G_M33081_IG01: ;; offset=0x0000
sub rsp, 40
vxorps xmm4, xmm4, xmm4
vmovdqu xmmword ptr [rsp+0x08], xmm4
xor eax, eax
mov qword ptr [rsp+0x18], rax
mov qword ptr [rsp+0x20], rcx
;; size=26 bbWeight=1 PerfScore 4.83
G_M33081_IG02: ;; offset=0x001A
mov rax, qword ptr [rcx+0x40]
mov rcx, qword ptr [rax]
mov qword ptr [rsp+0x08], rcx
;; size=12 bbWeight=1 PerfScore 5.00
G_M33081_IG03: ;; offset=0x0026
vmovdqu xmm0, xmmword ptr [rdx]
vmovdqu xmmword ptr [rsp+0x10], xmm0
;; size=10 bbWeight=1 PerfScore 5.00
G_M33081_IG04: ;; offset=0x0030
mov rcx, qword ptr [rsp+0x08]
lea rdx, bword ptr [rsp+0x08]
xor r8, r8
cmp rcx, qword ptr [rax+0x08]
cmovne rdx, r8
test rdx, rdx
setne al
movzx rax, al
;; size=30 bbWeight=1 PerfScore 6.50
G_M33081_IG05: ;; offset=0x004E
add rsp, 40
ret
;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code 83, prolog size 26, PerfScore 22.58, instruction count 21, allocated bytes for code 83 (MethodHash=0b7f7ec6) for method Program:Test[System.ValueTuple`2[int,System.__Canon],System.ValueTuple`2[int,System.__Canon]](System.ValueTuple`2[int,System.__Canon]):ubyte (FullOpts)
; ============================================================ |
Will run mihu bot once we're a bit further along in testing, to get a sense of diffs. |
src/coreclr/jit/lclmorph.cpp
Outdated
} | ||
INDEBUG(TopValue(0).Consume()); | ||
PopValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has to call EscapeValue
in the non-transformed case (that might be the reason for the asserts)
With this PR + #103391 the disassembly for this function becomes ; Assembly listing for method C:Test[System.ValueTuple`2[int,System.__Canon],System.ValueTuple`2[int,System.__Canon]](System.ValueTuple`2[int,System.__Canon]):ubyte (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 TypeCtx [V00,T00] ( 4, 4 ) long -> rcx single-def
;* V01 arg0 [V01 ] ( 0, 0 ) byref -> zero-ref single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) long -> zero-ref class-hnd exact "Single-def Box Helper" <System.ValueTuple`2[int,System.__Canon]>
;* V04 tmp2 [V04 ] ( 0, 0 ) byref -> zero-ref "ISINST eval op1"
;* V05 tmp3 [V05 ] ( 0, 0 ) int -> zero-ref "spilling qmarkNull"
;* V06 tmp4 [V06 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] "stack allocated boxed value class temp" <System.Collections.Generic.KeyValuePair`2[long,System.ValueTuple`2[int,System.__Canon]]>
;* V07 tmp5 [V07 ] ( 0, 0 ) ref -> zero-ref "field V01.Item2 (fldOffset=0x0)" P-INDEP
;* V08 tmp6 [V08 ] ( 0, 0 ) int -> zero-ref "field V01.Item1 (fldOffset=0x8)" P-INDEP
; V09 tmp7 [V09,T02] ( 2, 2 ) long -> rcx "V06.[000..008)"
;* V10 tmp8 [V10 ] ( 0, 0 ) ref -> zero-ref "V06.[008..016)"
;* V11 tmp9 [V11 ] ( 0, 0 ) int -> zero-ref "V06.[016..020)"
;* V12 tmp10 [V12 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref" <System.ValueTuple`2[int,System.__Canon]>
; V13 cse0 [V13,T01] ( 3, 3 ) long -> rax "CSE #01: aggressive"
;
; Lcl frame size = 8
G_M43406_IG01: ;; offset=0x0000
push rax
mov qword ptr [rsp], rcx
;; size=5 bbWeight=1 PerfScore 2.00
G_M43406_IG02: ;; offset=0x0005
mov rax, qword ptr [rcx+0x40]
mov rcx, qword ptr [rax]
cmp rcx, qword ptr [rax+0x08]
setne al
movzx rax, al
;; size=17 bbWeight=1 PerfScore 8.25
G_M43406_IG03: ;; offset=0x0016
add rsp, 8
ret
;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code 27, prolog size 5, PerfScore 11.50, instruction count 9, allocated bytes for code 27 (MethodHash=e76e5671) for method C:Test[System.ValueTuple`2[int,System.__Canon],System.ValueTuple`2[int,System.__Canon]](System.ValueTuple`2[int,System.__Canon]):ubyte (FullOpts)
; ============================================================ |
src/coreclr/jit/objectalloc.cpp
Outdated
if (lhsType != newType) | ||
{ | ||
lhs->ChangeType(newType); | ||
} | ||
|
||
if (rhsType != newType) | ||
{ | ||
rhs->ChangeType(newType); | ||
} | ||
|
||
parent->ChangeType(newType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still seems odd to me to forcefully retype an arbitrary sibling node of an ancestor, but maybe all TYP_REF
JIT IR nodes really do have equivalently typed TYP_BYREF
nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, this will change TYP_REF LCL_VAR(local whose LclVarDsc is TYP_REF)
to TYP_BYREF LCL_VAR(local whose LclVarDsc is TYP_REF)
... That will hit this assert:
runtime/src/coreclr/jit/lclvars.cpp
Lines 4691 to 4694 in fcc2f58
// Check that the LCL_VAR node has the same type as the underlying variable, save a few mismatches we allow. | |
assert(tree->TypeIs(varDsc->TypeGet(), genActualType(varDsc)) || | |
(tree->TypeIs(TYP_BYREF) && (varDsc->TypeGet() == TYP_I_IMPL)) || // Created by inliner substitution. | |
(tree->TypeIs(TYP_INT) && (varDsc->TypeGet() == TYP_LONG))); // Created by "optNarrowTree". |
Not sure if it will actually cause issues downstream, but it feels like we should insert some form of cast to maintain IR invariants. Potentially in quite a few places in object stack allocation. Anyway, something to think about, we don't need to block on this given that QMARK
s have very restricted shapes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to reconsider ... I added this because of assertion failures but don't recall the specific case and now can't reproduce it. Not clear to me how the JIT could create a TYP_REF GT_COLON, at least not one that this code would see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the motivating example -- from the object stack allocation tests
[000035] DA-X------- * STORE_LCL_VAR ref V03 tmp3
[000034] ---X------- \--* QMARK ref
[000025] ----------- if +--* EQ int
[000024] ----------- | +--* LCL_VAR ref V02 tmp2
[000023] ----------- | \--* CNS_INT ref null
[000033] ---X------- if \--* COLON ref
[000031] ---X------- else +--* QMARK ref
[000027] ---X------- if | +--* NE int
[000026] #--X------- | | +--* IND long
[000022] ----------- | | | \--* LCL_VAR ref V02 tmp2
[000021] H---------- | | \--* CNS_INT(h) long 0x7ffe34d5e970 class ObjectStackAllocation.SimpleClassB
[000030] ----------- if | \--* COLON ref
[000028] ----------- else | +--* LCL_VAR ref V02 tmp2
[000029] ----------- then | \--* CNS_INT ref null
[000032] ----------- then \--* CNS_INT ref null
We stack allocate the ref class, and so retype tmp2
to byref. This propagates up and if we don't retype sibling [000029]
we hit asserts.
I think it may be the case that the mistyped sibling must be null... I can check for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, a comment on why the retyping feels wrong to me.
@MichalStrehovsky any feedback? Think you got added as a reviewer automatically. |
Turns out we only need to retype nulls, hopefully that's a bit more palatable. |
CI seems stuck... |
@dotnet/dnceng is CI stuck? No evident progress here in 2 hours. |
GitHub had an outage. |
btw I'm seeing this PR closes #11192 as well, but how about const sized arrays and strings? |
I made myself "owner" of files under tools\common\JitInterface because I don't want to miss PRs that change it. This looks good to me, there's not much to review in the areas I know about. Is crossgen2/nativeAOT implementation tracked somewhere? Do you have plans for it? |
I added a note to #11192. Jan says it should not be hard, so perhaps I can try and get it implemented soon, though it does not do much right now. For NAOT: once we have proper interprocedural support, it's possible that this will find a lot of unescaped objects. |
#103361 resurrected this long-deleted ifdef (crossgen1-specific code).
#103361 resurrected this long-deleted ifdef (crossgen1-specific code).
Enable object stack allocation for ref classes and extend the support to include boxed value classes. Use a specialized unbox helper for stack allocated boxes, both to avoid apparent escape of the box by the helper, and to ensure all box field accesses are visible to the JIT. Update the local address visitor to rewrite trees involving address of stack allocated boxes in some cases to avoid address exposure. Disable old promotion for stack allocated boxes (since we have no field handles) and allow physical promotion to enregister the box method table and/or payload as appropriate. In OSR methods handle the fact that the stack allocation may actually have been a heap allocation by the Tier0 method.
The analysis TP cost is around 0.4-0.7% (notes below). Boxes are much less likely to escape than ref classes (roughly ~90% of boxes escape, ~99.8% of ref classes escape). Codegen impact is diminished somewhat because many of the boxes are dead and were already getting optimized away.
Fixes #4584, #9118, #10195, #11192, #58535, #58554, #85570
Still todo: