Unsafe cast confuses JIT Inline Optimizer #47082
Comments
I can repro this on sharplab.io (too many variables (
G_M53653_IG01: ;; offset=0000H
55 push rbp
4881ECC0000000 sub rsp, 192
C5F877 vzeroupper
488DAC24C0000000 lea rbp, [rsp+C0H]
C4413857C0 vxorps xmm8, xmm8
48B840FFFFFFFFFFFFFF mov rax, -192
C5797F0428 vmovdqa xmmword ptr [rax+rbp], xmm8
C5797F440510 vmovdqa xmmword ptr [rbp+rax+10H], xmm8
C5797F440520 vmovdqa xmmword ptr [rbp+rax+20H], xmm8
4883C030 add rax, 48
75E9 jne SHORT -5 instr
;; bbWeight=1 PerfScore 7.58
G_M53653_IG02: ;; offset=0039H
48C745C001000000 mov qword ptr [rbp-40H], 1
C5F857C0 vxorps xmm0, xmm0
C5FA7F4580 vmovdqu xmmword ptr [rbp-80H], xmm0
C5FA7F4590 vmovdqu xmmword ptr [rbp-70H], xmm0
C5FA7F45A0 vmovdqu xmmword ptr [rbp-60H], xmm0
C5FA7F45B0 vmovdqu xmmword ptr [rbp-50H], xmm0
48C7458002000000 mov qword ptr [rbp-80H], 2
C5F857C0 vxorps xmm0, xmm0
C5FA7F8540FFFFFF vmovdqu xmmword ptr [rbp-C0H], xmm0
C5FA7F8550FFFFFF vmovdqu xmmword ptr [rbp-B0H], xmm0
C5FA7F8560FFFFFF vmovdqu xmmword ptr [rbp-A0H], xmm0
C5FA7F8570FFFFFF vmovdqu xmmword ptr [rbp-90H], xmm0
48C78540FFFFFF01000000 mov qword ptr [rbp-C0H], 1
488D7DC0 lea rdi, [rbp-40H]
E8075DFFFF call TestCase:SomeUsage(byref)
488D7D80 lea rdi, [rbp-80H]
E8FE5CFFFF call TestCase:SomeUsage(byref)
488DBD40FFFFFF lea rdi, [rbp-C0H]
E8F25CFFFF call TestCase:SomeUsage(byref)
90 nop
;; bbWeight=1 PerfScore 16.42
G_M53653_IG03: ;; offset=00AFH
488D6500 lea rsp, [rbp]
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 181, prolog size 57, PerfScore 45.40, instruction count 34, allocated bytes for code 194 (MethodHash=cd5c2e6a) for method TestCase:Test()
; ============================================================
I guess some recent PR helped (jump threading? cse for fp constants?)
Interesting, this reproduces for me on the latest master (c7fda3b).
G_M54881_IG01:
sub rsp, 232
vzeroupper
xor rax, rax
mov qword ptr [rsp+28H], rax
vxorps xmm4, xmm4
vmovdqa xmmword ptr [rsp+30H], xmm4
vmovdqa xmmword ptr [rsp+40H], xmm4
mov rax, -144
vmovdqa xmmword ptr [rsp+rax+E0H], xmm4
vmovdqa xmmword ptr [rsp+rax+F0H], xmm4
vmovdqa xmmword ptr [rsp+rax+100H], xmm4
add rax, 48
jne SHORT -5 instr
mov qword ptr [rsp+E0H], rax
G_M54881_IG02:
mov qword ptr [rsp+A8H], 1
vxorps xmm0, xmm0
vmovdqu xmmword ptr [rsp+68H], xmm0
vmovdqu xmmword ptr [rsp+78H], xmm0
vmovdqu xmmword ptr [rsp+88H], xmm0
vmovdqu xmmword ptr [rsp+98H], xmm0
mov qword ptr [rsp+68H], 2
vxorps xmm0, xmm0
vmovdqu xmmword ptr [rsp+28H], xmm0
vmovdqu xmmword ptr [rsp+38H], xmm0
vmovdqu xmmword ptr [rsp+48H], xmm0
vmovdqu xmmword ptr [rsp+58H], xmm0
mov qword ptr [rsp+28H], 1
lea rcx, [rsp+A8H]
call SomeUsage(byref)
lea rcx, [rsp+68H]
call SomeUsage(byref)
lea rcx, [rsp+28H]
call SomeUsage(byref)
nop
G_M54881_IG03:
add rsp, 232
ret
At this point I am not sure as I was initially testing against ef73cd9, which does include the jump threading commit. FWIW, here's my original analysis:
So, the elimination of redundant initialization is performed in
The first example has only one basic block when it gets to this phase, while the second has forked control flow right after the first basic block:
***** BB01
STMT00030 (IL 0x010... ???)
N003 ( 7, 9) [000129] -A------R--- * ASG double
N002 ( 3, 4) [000128] D------N---- +--* LCL_VAR double V09 tmp6
N001 ( 3, 4) [000006] ------------ \--* CNS_DBL double 2.0000000000000000
***** BB01
STMT00023 (IL 0x010... ???)
N004 ( 9, 11) [000100] ------------ * JTRUE void
N003 ( 7, 9) [000099] N------N-U-- \--* NE int
N001 ( 3, 4) [000097] ------------ +--* LCL_VAR double V09 tmp6
N002 ( 3, 4) [000098] ------------ \--* CNS_DBL double 2.0000000000000000
This is after all the simplification of control flow and compaction of basic blocks that happens before that. The fact that the first call is "good" is almost accidental. For example, if I were to swap the order of conditions to:
if (value == 2.0) { }
if (value == 1.0) { }
this results in all the structs being initialized, if
To the best of my understanding this happens because the fact that
This is from the first method:
Importing BB02 (PC=000) of 'ThrowawayTesting.Testing:LoadInline(byref,double)'
[ 0] 0 (0x000) ldarg.1
[ 1] 1 (0x001) ldc.r8 1.0000000000000000
[ 2] 10 (0x00a) bne.un.s
Folding operator with constant nodes into a constant:
[000024] N--------U-- * NE int
[000022] ------------ +--* CNS_DBL double 1.0000000000000000
[000023] ------------ \--* CNS_DBL double 1.0000000000000000
Bashed to int constant:
[000024] ------------ * CNS_INT int 0
And this is from the second:
Importing BB02 (PC=000) of 'ThrowawayTesting.Testing:LoadInline(byref,double)'
[ 0] 0 (0x000) ldarg.1
lvaGrabTemp returning 4 (V04 tmp1) called for Inlining Arg.
[ 1] 1 (0x001) ldc.r8 1.0000000000000000
[ 2] 10 (0x00a) bne.un.s
[000025] ------------ * JTRUE void
[000024] N--------U-- \--* NE int
[000022] ------------ +--* LCL_VAR double V04 tmp1
[000023] ------------ \--* CNS_DBL double 1.0000000000000000
Smuggling the parameter through a dummy local fixes the issue 😄:
LoadInline(out result, 0);
var local = value;
ulong ul = Unsafe.As<double, ulong>(ref local); // <-- LINE A
result.A1 = ul; // <-- LINE B
mov qword ptr [rsp+A8H], 1
mov qword ptr [rsp+68H], 2
mov qword ptr [rsp+28H], 1
Edit: the 3rd case seems very much like the 2nd, only now the locals replacing the argument are marked as address-taken due to this tree being created for
[000040] --C--------- * COMMA void
[000084] ------------ +--* ADDR byref
[000085] -------N---- | \--* LCL_VAR double V04/V08/V12 tmp1/tmp5/tmp9
[000039] ------------ \--* NOP void
Local V04/V08/V12 should not be enregistered because: it is address exposed
This leads to all the trees surviving constant propagation. This does not happen in the second case because the return from
***** BB16
STMT00026 (IL 0x010... ???)
[000112] -ACXG------- * ASG long
[000111] D------N---- +--* LCL_VAR long V10 tmp7
[000110] *-CXG------- \--* IND long
[000161] ------------ \--* ADDR byref
[000162] -------N---- \--* LCL_VAR double V09 tmp6
***** BB16
STMT00027 (IL 0x010... ???)
[000117] -A---------- * ASG long
[000116] -------N---- +--* FIELD long A1
[000113] ------------ | \--* ADDR byref
[000114] -------N---- | \--* LCL_VAR struct<MyStruct, 64> V01 loc1
[000115] ------------ \--* LCL_VAR long V10 tmp7

Assigning to Egor for now until further triage.

Ah sorry, I didn't realize I had to uncomment those two (one) lines to reproduce. To summarize, here is a minimal repro:
using System;
using System.Runtime.CompilerServices;

unsafe class Tests
{
    void InlineMe(int x)
    {
        if (x == 1)
            Console.WriteLine();
        void* unusedAddress = Unsafe.AsPointer(ref x);
        // even if it's used, the first condition in this method still has to be folded if `x` is a constant
    }

    void Test()
    {
        InlineMe(2);
    }
}
G_M174_IG01:
sub rsp, 40
G_M174_IG02:
mov dword ptr [rsp+24H], 2
cmp dword ptr [rsp+24H], 1
jne SHORT G_M174_IG04
G_M174_IG03:
call System.Console:WriteLine()
G_M174_IG04:
nop
G_M174_IG05:
add rsp, 40
ret
It happens (as @SingleAccretion pointed out) because
So I guess we can close this issue as a dup of "Implement forward substitution" #6973
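Applied to a shape like the minimal repro, the dummy-local workaround mentioned earlier in the thread might look like the sketch below. The class and method names are hypothetical, `Unsafe.As<int, uint>` stands in for `Unsafe.AsPointer` so that no `unsafe` context is needed, and the methods return a value instead of calling `Console.WriteLine()` so the result can be checked. Whether the branch actually folds depends on the JIT version.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical names; a sketch of the workaround, not code from the issue.
static class DummyLocalSketch
{
    // Problem shape: `ref x` takes the address of the parameter itself,
    // so after inlining the temp that replaces `x` is address-exposed.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int AddressOfParameter(int x)
    {
        int result = x == 1 ? 10 : 20;      // the branch we would like folded
        _ = Unsafe.As<int, uint>(ref x);    // address of the parameter is taken
        return result;
    }

    // Workaround shape: only the copy has its address taken, so `x` itself
    // stays a plain value that the inliner can replace with the constant.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int AddressOfCopy(int x)
    {
        int result = x == 1 ? 10 : 20;
        int copy = x;                        // dummy local
        _ = Unsafe.As<int, uint>(ref copy);  // address of the copy is taken
        return result;
    }
}
```

Both methods return the same values for every input; the difference, if any, is only in the generated code when a call such as `DummyLocalSketch.AddressOfCopy(2)` is inlined.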
Extend ref counting done by local morph so that we can determine single-def single-use locals.
Add a phase that runs just after local morph that will attempt to forward single-def single-use local defs to uses when they are in adjacent statements.
Fix or work around issues uncovered elsewhere:
* `gtFoldExprCompare` might fold "identical" volatile subtrees
* `fgGetStubAddrArg` cannot handle complex trees
* some simd/hw operations can lose struct handles
* some calls cannot handle struct local args
Addresses dotnet#6973 and related issues. Still sorting through exactly which ones are fixed, so list below may need revising.
Fixes dotnet#48605. Fixes dotnet#51599. Fixes dotnet#55472.
Improves some but not all cases in dotnet#12280 and dotnet#62064.
Does not fix dotnet#33002, dotnet#47082, or dotnet#63116; these require handling multiple uses or bypassing statements.
Extend ref counting done by local morph so that we can determine single-def single-use locals.
Add a phase that runs just after local morph that will attempt to forward single-def single-use local defs to uses when they are in adjacent statements.
Fix or work around issues uncovered elsewhere:
* `gtFoldExprCompare` might fold "identical" volatile subtrees
* `fgGetStubAddrArg` cannot handle complex trees
* some simd/hw operations can lose struct handles
* some calls cannot handle struct local args
* morph expects args not to interfere
* fix arm; don't forward sub no return calls
* update debuginfo test (we may want to revisit this)
* handle subbing past normalize on store assignment
* clean up nullcheck of new helper
Addresses #6973 and related issues. Still sorting through exactly which ones are fixed, so list below may need revising.
Fixes #48605. Fixes #51599. Fixes #55472.
Improves some but not all cases in #12280 and #62064.
Does not fix #33002, #47082, or #63116; these require handling multiple uses or bypassing statements.
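As a rough source-level analogy for the forward substitution described above, the sketch below shows the single-def single-use shape next to a multiple-use shape. The JIT performs this on its own IR statements, not on C# source, and every name here is hypothetical.

```csharp
// Hypothetical names; a source-level analogy only.
static class ForwardSubSketch
{
    // Candidate shape: `tmp` has one definition and one use,
    // and the use is in the statement right after the definition.
    public static int BeforeSubstitution(int a, int b)
    {
        int tmp = a * b;   // single definition
        return tmp + 1;    // single use, adjacent statement
    }

    // The same computation once the definition is forwarded into its use.
    public static int AfterSubstitution(int a, int b)
    {
        return a * b + 1;
    }

    // `tmp` is used twice here. Per the description above, shapes that
    // need multiple uses handled are not covered by this change.
    public static int TwoUses(int a, int b)
    {
        int tmp = a * b;
        if (tmp == 1)
            return 0;
        return tmp + 1;
    }
}
```

The first two methods always return the same value, which is the property the substitution has to preserve. The third method is the kind of shape the description lists as a reason this issue is not fixed by the change.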
Codegen is equivalent in all 3 cases today.

Version Used: .NET 5
Not a hard bug - the suboptimal code is still logically correct.
Steps to Reproduce:
Compiler settings
Assume the following example (stripped down from real source code), which should load a struct with a kind of literal value.
In the Test() method, three variables x, y, z are initialized with 1.0, 2.0 and 1.0 again.
Case 1: When you compile to an optimized release build, the following optimal code is generated for the three LoadInline() calls.
[SkipLocalsInit is NOT used, so at the beginning of the Test() method local memory is already initialized]
Case 2: When LINE A and LINE B are both activated, the following suboptimal (but working) code is produced.
The second and third calls perform a redundant zeroing of the struct. Interestingly, the first call is still optimal. Note that the method arguments in call 1 and call 3 are the same, but they produce different code.
Case 3: Even stranger - but not my real-world use case - the optimizer gets totally confused when only LINE A is active and LINE B is commented out. The code still works, but the JIT's constant branch removal has obviously failed completely.
Expected Behavior:
All three variants should produce the same optimal code.
Actual Behavior:
Case 1: The optimal code :-)
Case 2: For large enough structs the code and runtime behavior are very bad, and the result is asymmetric between identical method calls.
Case 3: This is not a real use case, because a pointless Unsafe cast is made and its result is never used. But no warning is raised when you accidentally compile such source code.
Off Topic:
It would be great - especially for my use case - if a method argument could be restricted to a compile-time constant.
category:cq
theme:inlining
skill-level:intermediate
cost:medium
impact:small