-
Notifications
You must be signed in to change notification settings - Fork 2.7k
JIT: fix bug returning small structs on linux x64 #18563
Conversation
@CarolEidt PTAL No diffs for windows. For linux this causes diff in both crossgen and pmi:
PMI diffs on the new test case: ; Assembly listing for method GitHub_18522:Main():int
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;# V00 OutArgs [V00 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
; V01 tmp1 [V01,T00] ( 3, 6 ) byref -> rax
-; V02 tmp2 [V02 ] ( 3, 3 ) struct ( 8) [rsp+0x10] do-not-enreg[SF] ld-addr-op
-; V03 tmp3 [V03 ] ( 3, 3 ) ushort -> [rsp+0x10] do-not-enreg[] V02.F0(offs=0x00) P-DEP
-; V04 cse0 [V04,T01] ( 4, 4 ) long -> [rsp+0x08]
+;* V02 tmp2 [V02 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op
+;* V03 tmp3 [V03 ] ( 0, 0 ) ushort -> zero-ref V02.F0(offs=0x00) P-INDEP
+; V04 cse0 [V04,T01] ( 4, 4 ) long -> [rsp+0x00]
;
-; Lcl frame size = 24
+; Lcl frame size = 8
G_M49330_IG01:
- sub rsp, 24
+ push rax
G_M49330_IG02:
mov rdi, 0xD1FFAB1E
mov esi, 3
call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
mov rax, 0xD1FFAB1E
mov rax, gword ptr [rax]
mov word ptr [rax+10], 170
mov rax, 0xD1FFAB1E
mov rax, gword ptr [rax]
add rax, 8
- mov word ptr [rsp+10H], 0
- mov word ptr [rsp+10H], 0
- mov edi, dword ptr [rsp+10H]
- mov dword ptr [rax], edi // bug: wide store
+ mov word ptr [rax], 0 // fix: narrow store (+ const)
mov rax, 0xD1FFAB1E
mov rax, gword ptr [rax]
cmp word ptr [rax+10], 170
je SHORT G_M49330_IG04
xor eax, eax
G_M49330_IG03:
- add rsp, 24
+ add rsp, 8
ret
G_M49330_IG04:
mov eax, 100
G_M49330_IG05:
- add rsp, 24
+ add rsp, 8
ret
-; Total bytes of code 118, prolog size 4 for method GitHub_18522:Main():int
+; Total bytes of code 100, prolog size 1 for method GitHub_18522:Main():int
; Assembly listing for method GitHub_18522:M113():struct
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; partially interruptible
; Final local variable assignments
;
-; V00 loc0 [V00 ] ( 2, 2 ) struct ( 8) [rsp+0x00] do-not-enreg[SF] must-init ld-addr-op
+;* V00 loc0 [V00 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
-; V02 tmp1 [V02 ] ( 2, 2 ) ushort -> [rsp+0x00] do-not-enreg[] V00.F0(offs=0x00) P-DEP
+;* V02 tmp1 [V02 ] ( 0, 0 ) ushort -> zero-ref V00.F0(offs=0x00) P-INDEP
;
-; Lcl frame size = 8
+; Lcl frame size = 0
G_M32228_IG01:
- push rax
- xor rax, rax
- mov qword ptr [rsp], rax
G_M32228_IG02:
- mov word ptr [rsp], 0
- mov eax, dword ptr [rsp]
+ xor eax, eax
G_M32228_IG03:
- add rsp, 8
ret
-; Total bytes of code 21, prolog size 7 for method GitHub_18522:M113():struct
+; Total bytes of code 3, prolog size 0 for method GitHub_18522:M113():struct |
I have not yet figured out if this is a regression. I believe the bug requires that the small struct be assigned directly to a field which is perhaps not as common as one might imagine. |
This fixes most examples (jakobbotsch/Fuzzlyn@a603f0e), but some still remain unfixed:
I can recheck when you make updates. 😄 EDIT: Some of these examples needed rereducing. See below. |
@jakobbotsch thanks much! I'll look at the new examples. It's really nice to have Fuzzlyn double-checking things. Keep up the good work. Arm and Arm64 failures look like the masked load tests. I though these were disabled? |
@jashook indicates that they still need to be removed from the .lst files, and he'll do so shortly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm, could you add the test to the arm/arm64 lstFiles so the regression is run on windows arm(64) as well?
#18569 addresses arm(64) failures |
(edited: oops, was using old jits -- with the fix above things look to be correct...) First new example 11923495345812789064 looks ok from a diff standpoint: ;;; windows
48B96829AC14F8010000 mov rcx, 0x1F814AC2968
488B09 mov rcx, gword ptr [rcx]
4883C108 add rcx, 8
C60100 mov byte ptr [rcx], 0
;;; linux
48BF6829055807020000 mov rdi, 0x20758052968
488B3F mov rdi, gword ptr [rdi]
4883C708 add rdi, 8
C60700 mov byte ptr [rdi], 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Good
I see that the bug was indeed in the #ifdef UNIX section
Ooops. It appears the bug is still there before reduction, but the reduced example I have posted does not have the bug (i.e. the reducer needs to be rerun again to display the new variant required). |
In 11049252875418439527 there is a 6 byte struct ;;; windows
8B10 mov edx, dword ptr [rax]
8911 mov dword ptr [rcx], edx
668B5004 mov dx, word ptr [rax+4]
66895104 mov word ptr [rcx+4], dx
;;; linux
488B4016 mov rax, qword ptr [rax+22]
4889470E mov qword ptr [rdi+14], rax // writes an extra 2 bytes Implication is that if we are indeed going to return non-power of two sized structs in registers then we need a different strategy to unpack these on the caller side -- we can't just map them to the normal power of two int type sizes (unless it's "safe" to expand the destination size too, which it sometimes is). So my fix above likely only works if the struct being copied is 1, 2, or 4 bytes. We also need to properly handle sizes 3, 5, 6, and 7. |
Here are the rereduced examples:
It may also be informative to look at the commit diff: Sorry about the confusion earlier. |
No problem. I haven't checked them all, but I suspect all these new cases have One thought as to how to fix the size 3, 5, 6, 7 cases is to flag when the "primitive type" produced by Not sure if this is viable though. Also may be clunky if/when we inline as we may see what look like type mismatches that could block inlines, or reinterpret/copy chains (struct -> int -> int -> struct) that don't nicely optimize away. |
And just for the record:
|
I am probably thinking about this too simplistically, but I think the right thing to do is to not change the type in the non-power-of-2 case, and simply allow struct assignments to be created as needed. Then morph should decide (hopefully correctly) when a copy is required. |
Yeah, I had that idea too -- basically just use the windows rules. By doing this we would (I believe) diverge from the SysV ABI for some kinds of small structs. Not sure if that has a ripple effect anywhere. |
I was actually assuming that the RHS of the struct assignment could simply be the call (i.e. the register result), and we would handle any necessary temp creation in |
Ah, I see. I might need to disentangle things a bit to make sure we still think of it as something returned by value ... but it might work. |
What I'm leaning towards is roughly:
|
Still wondering about the returned via two register cases, though maybe the really odd ones require ARM64? @jakobbotsch if you get a chance to tweak Fuzzlyn, you might try playing around with cases where the assigned structs are 9-15 bytes in size. |
@AndyAyersMS - that sounds like the right direction.
It would be good to (eventually) consider whether we could get the same codegen without retyping there. IIRC |
I'm willing to give it a try now, since I end up having to special case power of 2 sizes, and I could just toss all that. |
Updated to try and handle 3, 5, 6, and 7 byte struct cases. May be a bit ragged, but is holding up on the handful of tests I have locally. Will post diff stats later and look at more examples. |
No new PMI or Crossgen FX diffs for Linux. Will run PMI diffs on tests overnight. |
No new test diffs other than on the newly added test case. Diffs there: - mov rdi, qword ptr [rdi+22]
- mov qword ptr [rax+14], rdi // store 8 bytes at rax + 14 -- bad
+ lea rdi, bword ptr [rdi+22]
+ add rax, 14
+ mov esi, dword ptr [rdi]
+ mov dword ptr [rax], esi // store 4 bytes at rax + 14 -- ok
+ mov si, word ptr [rdi+4]
+ mov word ptr [rax+4], si // store 2 bytes at rax + 18 -- ok
...
-; Total bytes of code 124, prolog size 10 for method GitHub_18522_1:Main():int
+; Total bytes of code 136, prolog size 10 for method GitHub_18522_1:Main():int |
I can confirm that fixes all the previous examples, and also a few others I thought were unrelated. |
Fuzzlyn already (probabilistically) generates structs of this size. It does not appear to have found anything in this area. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks!
@jashook also requested you add the tests to tests\arm\Tests.lst and tests\arm64\Tests.lst
src/jit/importer.cpp
Outdated
@@ -8465,7 +8465,14 @@ GenTree* Compiler::impFixupCallStructReturn(GenTreeCall* call, CORINFO_CLASS_HAN | |||
if (retRegCount == 1) | |||
{ | |||
// struct returned in a single register | |||
call->gtReturnType = retTypeDesc->GetReturnRegType(0); | |||
// retype iff struct size exactly matches integer type size. | |||
if (retTypeDesc->IsEnclosingType()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that you didn't just reverse the condition and avoid the empty stmt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some JITDUMP stuff there at one point. Will update.
@@ -141,7 +141,8 @@ void Compiler::lvaInitTypeRef() | |||
Compiler::structPassingKind howToReturnStruct; | |||
var_types returnType = getReturnTypeForStruct(retClsHnd, &howToReturnStruct); | |||
|
|||
if (howToReturnStruct == SPK_PrimitiveType) | |||
// We can safely widen the return type for enclosed structs. | |||
if ((howToReturnStruct == SPK_PrimitiveType) || (howToReturnStruct == SPK_EnclosingType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's interesting that the compRetNativeType
is widened. I assume, then, that morph is taking care of the necessary transformation to ensure that the SPK_EnclosingType
case is handled, so that codegen gets what it needs for this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not sure. I should look at this part more closely.
There is some code in morph that can be simplified now as SPK_Primitive type implies power of two sizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That code was for arg passing, not for return values. So nothing to do there.
Did a bit of testing over on OSX with older dotnet SDKs, and as far as I can tell this bug goes back to at least Core 1.1. |
@dotnet/jit-contrib two big changes since you all last reviewed:
Test cases to hopefully cover the spectrum of possibilities fairly well. Still thinking of ditching the "enclosing type" concept since everwhere we check it we have the class handle and so can refetch the size and compare it to the primitive size. Now see a few more diffs in normal PMI runs over FX.
The overall diff impact is still small and the regressions are mostly necessary diffs. Some of the new diffs happen in methods that return a byte-sized struct: ; ============================================================
; Assembly listing for method System.Reflection.Metadata.MethodSignature`1[__Canon][System.__Canon]:get_ReturnType():ref:this
; Emitting BLENDED_CODE for X64 CPU with AVX
G_M15744_IG01:
G_M15744_IG02:
- 8B4710 mov eax, dword ptr [rdi+16]
+ 480FBE4710 movsx rax, byte ptr [rdi+16]
G_M15744_IG03:
C3 ret
-; Total bytes of code 4, prolog size 0 for method System.Reflection.Metadata.MethodSignature`1[Int32][System.Int32]:get_Header():struct:this
+; Total bytes of code 6, prolog size 0 for method System.Reflection.Metadata.MethodSignature`1[Int32][System.Int32]:get_Header():struct:this Previously bytes 1-3 might contain unrelated stuff. |
Hmm, looks like a rebase is going to be needed. When I do that I'll also reorganize so all the code commits are first (and squashed) and the test commits second (likewise squashed). Let me see how this holds up in testing before doing any of that. |
The new _7 test failed for arm32/arm64. Investigating. bl GitHub_18522_7:M16():struct
str w0, [x19] |
Looks like ARM64 / Windows needs similar "return via temp" logic for the 3, 5, 6, 7 size cases... so more cases to handle. |
src/jit/compiler.cpp
Outdated
// we need to preserve small types. | ||
useType = GetEightByteType(structDesc, 0); | ||
howToReturnStruct = SPK_PrimitiveType; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks correct, you might want to add that for the other cases useType
will remain TYP_UNKNOWN instead of changing to TYP_INT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment.
src/jit/compiler.cpp
Outdated
else | ||
{ | ||
// Currently: 3, 5, 6, or 7 byte structs | ||
assert(structSize <= genTypeSize(useType)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assert should be less than:
assert(structSize < genTypeSize(useType));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
src/jit/compiler.cpp
Outdated
// Set 'useType' to the type of the first eightbyte item | ||
useType = GetEightByteType(structDesc, 0); | ||
assert(structDesc.passedInRegisters == true); | ||
if (structDesc.eightByteClassifications[0] == SystemVClassificationTypeSSE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also add:
assert(structSize <= sizeof(double)));
As the code at 1031 which finalized the value of useType depends on this property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/jit/compiler.h
Outdated
@@ -4223,6 +4227,8 @@ class Compiler | |||
{ | |||
SPK_Unknown, // Invalid value, never returned | |||
SPK_PrimitiveType, // The struct is passed/returned using a primitive type. | |||
SPK_EnclosingType, // Like SPK_Primitive type, but used for return types that | |||
// are not a power of two byte size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment should be more like:
// Like SPK_Primitive type, but used for return types that
// require a primative type temp that is larger that the struct size.
// Currently used for structs of sizes: 3,5,6 or 7 bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still not sure if I'm keeping this -- let me work through arm32 / arm64 and see how I feel once that's fixed.
Rebased to get past test list conflicts... |
BTW, I still expect arm32/arm64 to fail on some of the new tests -- in fact I need to understand why the 6 byte struct cases are not failing. Perhaps there is magic somewhere that I could extend to the 3 byte case. |
Result is as expected: Arm64 fails all the new tests; Arm32 only fails the _7 test since that tests a 3 byte struct, not a 6 byte struct. No magic. |
Ok, hopefully this is functionally complete. |
FMA test failed on Ubuntu -- no PMI diffs in this test, so am going to wager this was our old friend @dotnet-jit retest Ubuntu x64 Checked Innerloop Build and Test |
Re FMA failing: also #18467 |
tests/arm/Tests.lst
Outdated
Categories=EXPECTED_PASS | ||
HostStyle=0 | ||
|
||
[GitHub_18522_6.cmd_11903] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_11906
? Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually important?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure -- but now I remember there is a tool to regenerate these, so I suppose we should just use that.
tests\scripts\lst_creator.py
It would be nice if the tool output included the command line used to create that output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tool is a little funky because it does a recursive discovery of the tests. Hopefully with the work @sbomer is doing this legacy stuff just goes away. I think the easier approach here is to just keep it as it is, the numbering is not extremely important, it is just output by the tool to make sure the tags are unique. GitHub_18522_6.cmd_11903 should be unique enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the tool will strip the number anyways, so this is not going to break its input either.
The jit was retyping all calls with small struct returns as power of two sized ints when generating code for linux x64 and arm32/arm64. When the results of those calls were assigned to fields the jit would use overly wide stores which could corrupt neighboring fields. The fix is to keep better track of the smallest power of two enclosing int type size for the struct. If this exactly matches the struct size then the the result of the call can be used in an arbitrary context. If the enclosing type is larger, the call's result must be copied to a temp and then the temp can be reinterpreted as the struct for subsequent uses.
Defer retyping inline candidates and tail call candidates. Then handle deferred updates for cases where calls don't get inlined.
Unlike Windows x64, these ABIs return 3 (and 5,6,7) byte structs in a register. So make the necessary transformations for those cases too. Note the actual behavior change is triggered by the code in `getPrimitiveTypeForStruct` so there are no ifdefs at the transform points here.
jit must handle: callee is not an inline candidate, is an inline candidate and gets inlined, or inline candidate that does not get inlined. In the case where the callee gets inlined, handle the transitive cases where the callee's returne value itself comes from a call. Add a 3 byte test case to get coverage on arm32. Add new tests to the arm32/arm64 test lists.
Rebased and reordered so all the code commits (unsquashed, will squash on merge) come first and all the test commits (now squashed) come second. |
@briansull @CarolEidt would appreciate one more round of review (mainly those last two non-test commits) as this is a new part of the change that extends the handling to windows ABIs for arm. |
src/jit/flowgraph.cpp
Outdated
#if FEATURE_MULTIREG_RET | ||
|
||
// Did we record a struct return class handle above? | ||
#if FEATURE_MULTIREG_RET || defined(UNIX_AMD64_ABI) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You didn't need to add UNIX_AMD64_ABI to this ifdef
as line 556 of target.h defines this as 1 for UNIX_AMD64_ABI:
#define FEATURE_MULTIREG_RET 1 // Support for returning a single value in more than one register
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest version just has FEATURE_MULTIREG_RET.
The jit was retyping all calls with small struct returns as power of two sized ints when generating code for linux x64 and arm32/arm64. When the results of those calls were assigned to fields the jit would use overly wide stores which could corrupt neighboring fields. The fix is to keep better track of the smallest power of two enclosing int type size for the struct. If this exactly matches the struct size then the the result of the call can be used in an arbitrary context. If the enclosing type is larger, the call's result must be copied to a temp and then the temp can be reinterpreted as the struct for subsequent uses. Defer retyping inline candidates and tail call candidates. Then handle deferred updates for cases where calls don't get inlined. Add test cases for 6 byte structs showing the various situations the jit must handle: callee is not an inline candidate, is an inline candidate and gets inlined, or inline candidate that does not get inlined. Add a 3 byte test case to get coverage on arm32. Add new tests to the arm32/arm64 test lists. Commit migrated from dotnet/coreclr@2f2a9b1
The jit was retyping all calls with small struct returns as 32 bit ints
when generating code for linux x64. When the results of those calls were
assigned to fields the jit would use 32 bit stores which could corrupt
neighboring fields.
The fix is to keep the small int types for 8 and 16
bytebit structs so thatthe corresponding stores are the right size.
Fixes #18522.