Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unsafe cast confuses JIT Inline Optimizer #47082

Closed
gerhard17 opened this issue Jan 16, 2021 · 5 comments
Closed

Unsafe cast confuses JIT Inline Optimizer #47082

gerhard17 opened this issue Jan 16, 2021 · 5 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@gerhard17
Copy link

gerhard17 commented Jan 16, 2021

Version Used: NET5

Not a hard bug - the suboptimal code is still logical correct.

Steps to Reproduce:

Compiler settings

  <OutputType>Exe</OutputType>
  <TargetFramework>net5</TargetFramework>
  <TieredCompilation>false</TieredCompilation>
  <Optimize>true</Optimize>

Assume following example (stripped down from real source code) which should load a struct with a kind of literal value.
In the Test() method, three variables x, y, z are initialized with 1.0, 2.0 and 1.0 again.

public struct MyStruct
{
	public ulong A0, A1, A2, A3;
	public ulong B0, B1, B2, B3;
}

public sealed class TestCase
{
	[MethodImpl(MethodImplOptions.AggressiveInlining)]
	public static void LoadInline(out MyStruct result, ulong value) {
		result = default;
		result.A0 = value;
	}

	[MethodImpl(MethodImplOptions.AggressiveInlining)]
	public static void LoadInline(out MyStruct result, double value) {
		// Branch is JIT inline removable, when value is a compile time constant
		if (value == 1.0) {
			LoadInline(out result, 1);
			return;
		}
		// Branch is JIT inline removable, when value is a compile time constant
		if (value == 2.0) {
			LoadInline(out result, 2);
			return;
		}
		// Use default Value
		LoadInline(out result, 0);
		// Strange behavior, when code is active
		// ulong ul = Unsafe.As<double, ulong>(ref value); // <-- LINE A
		// result.A1 = ul;                                 // <-- LINE B
	}

	[MethodImpl(MethodImplOptions.NoInlining)]
	public static void SomeUsage(in MyStruct result) {
		;
	}

	[MethodImpl(MethodImplOptions.NoInlining)]
	public static void Test() {
		LoadInline(out var x, 1.0);
		LoadInline(out var y, 2.0);
		LoadInline(out var z, 1.0);
		SomeUsage(x);
		SomeUsage(y);
		SomeUsage(z);
	}
}

Case 1: When you compile to a optimized release build following optimal code is generated for the three LoadInline() calls.
[SkipLocalsInit is NOT used, so at the begin of the Test() method local memory is already initialized]

			LoadInline(out var x, 1.0);
00007FFB079BAA0B  mov         dword ptr [rsp+0A8h],1  
			LoadInline(out var y, 2.0);
00007FFB079BAA16  mov         dword ptr [rsp+68h],2  
			LoadInline(out var z, 1.0);
00007FFB079BAA1E  mov         dword ptr [rsp+28h],1  

Case 2: When LINE A and LINE B are both activated, following sub optimal (but working) code is produced.
The second and third call perform a redundant zero out of the struct. Interestingly the first call is still optimal. Observe the fact that the method arguments in call 1 and call 3 are the same, but produce different code.

		LoadInline(out var x, 1.0);
00007FFB079CAA0E  mov         qword ptr [rsp+0A8h],1  
		LoadInline(out var y, 2.0);
00007FFB079CAA1A  vxorps      xmm0,xmm0,xmm0  
00007FFB079CAA1E  vmovdqu     xmmword ptr [rsp+68h],xmm0  
00007FFB079CAA24  vmovdqu     xmmword ptr [rsp+78h],xmm0  
00007FFB079CAA2A  vmovdqu     xmmword ptr [rsp+88h],xmm0  
00007FFB079CAA33  vmovdqu     xmmword ptr [rsp+98h],xmm0  
00007FFB079CAA3C  mov         qword ptr [rsp+68h],2  
		LoadInline(out var z, 1.0);
00007FFB079CAA45  vxorps      xmm0,xmm0,xmm0  
00007FFB079CAA49  vmovdqu     xmmword ptr [rsp+28h],xmm0  
00007FFB079CAA4F  vmovdqu     xmmword ptr [rsp+38h],xmm0  
00007FFB079CAA55  vmovdqu     xmmword ptr [rsp+48h],xmm0  
00007FFB079CAA5B  vmovdqu     xmmword ptr [rsp+58h],xmm0  
00007FFB079CAA61  mov         qword ptr [rsp+28h],1  

Case 3: Even more strange - but not my real world use case - the optimizer gets totaly confused, when only LINE A is active and LINE B is commented out. The code is still working, but obviously the JIT constant branch removal has totaly failed.

		LoadInline(out var x, 1.0);
00007FFB079EAA0E  vmovsd      xmm0,qword ptr [InlineStructInitializeBug.TestCase.Test()+0318h (07FFB079EACC8h)]  
00007FFB079EAA16  vmovsd      qword ptr [rsp+30h],xmm0  
00007FFB079EAA1C  vmovsd      xmm0,qword ptr [rsp+30h]  
00007FFB079EAA22  vucomisd    xmm0,mmword ptr [InlineStructInitializeBug.TestCase.Test()+0328h (07FFB079EACD8h)]  
00007FFB079EAA2A  jp          InlineStructInitializeBug.TestCase.Test()+0B7h (07FFB079EAA67h)  
00007FFB079EAA2C  jne         InlineStructInitializeBug.TestCase.Test()+0B7h (07FFB079EAA67h)  
00007FFB079EAA2E  vxorps      xmm0,xmm0,xmm0  
00007FFB079EAA32  vmovdqu     xmmword ptr [rsp+0B8h],xmm0  
00007FFB079EAA3B  vmovdqu     xmmword ptr [rsp+0C8h],xmm0  
00007FFB079EAA44  vmovdqu     xmmword ptr [rsp+0D8h],xmm0  
00007FFB079EAA4D  vmovdqu     xmmword ptr [rsp+0E8h],xmm0  
00007FFB079EAA56  mov         qword ptr [rsp+0B8h],1  
00007FFB079EAA62  jmp         InlineStructInitializeBug.TestCase.Test()+0131h (07FFB079EAAE1h)  
00007FFB079EAA67  vmovsd      xmm0,qword ptr [rsp+30h]  
00007FFB079EAA6D  vucomisd    xmm0,mmword ptr [InlineStructInitializeBug.TestCase.Test()+0338h (07FFB079EACE8h)]  
00007FFB079EAA75  jp          InlineStructInitializeBug.TestCase.Test()+0FFh (07FFB079EAAAFh)  
00007FFB079EAA77  jne         InlineStructInitializeBug.TestCase.Test()+0FFh (07FFB079EAAAFh)  
00007FFB079EAA79  vxorps      xmm0,xmm0,xmm0  
00007FFB079EAA7D  vmovdqu     xmmword ptr [rsp+0B8h],xmm0  
00007FFB079EAA86  vmovdqu     xmmword ptr [rsp+0C8h],xmm0  
00007FFB079EAA8F  vmovdqu     xmmword ptr [rsp+0D8h],xmm0  
00007FFB079EAA98  vmovdqu     xmmword ptr [rsp+0E8h],xmm0  
00007FFB079EAAA1  mov         qword ptr [rsp+0B8h],2  
00007FFB079EAAAD  jmp         InlineStructInitializeBug.TestCase.Test()+0131h (07FFB079EAAE1h)  
00007FFB079EAAAF  vxorps      xmm0,xmm0,xmm0  
00007FFB079EAAB3  vmovdqu     xmmword ptr [rsp+0B8h],xmm0  
00007FFB079EAABC  vmovdqu     xmmword ptr [rsp+0C8h],xmm0  
00007FFB079EAAC5  vmovdqu     xmmword ptr [rsp+0D8h],xmm0  
00007FFB079EAACE  vmovdqu     xmmword ptr [rsp+0E8h],xmm0  
00007FFB079EAAD7  xor         ecx,ecx  
00007FFB079EAAD9  mov         qword ptr [rsp+0B8h],rcx  
		LoadInline(out var y, 2.0);
00007FFB079EAAE1  vmovsd      xmm0,qword ptr [InlineStructInitializeBug.TestCase.Test()+0348h (07FFB079EACF8h)]  
00007FFB079EAAE9  vmovsd      qword ptr [rsp+28h],xmm0  
00007FFB079EAAEF  vmovsd      xmm0,qword ptr [rsp+28h]  
00007FFB079EAAF5  vucomisd    xmm0,mmword ptr [InlineStructInitializeBug.TestCase.Test()+0358h (07FFB079EAD08h)]  
00007FFB079EAAFD  jp          InlineStructInitializeBug.TestCase.Test()+0181h (07FFB079EAB31h)  
00007FFB079EAAFF  jne         InlineStructInitializeBug.TestCase.Test()+0181h (07FFB079EAB31h)  
00007FFB079EAB01  vxorps      xmm0,xmm0,xmm0  
00007FFB079EAB05  vmovdqu     xmmword ptr [rsp+78h],xmm0  
00007FFB079EAB0B  vmovdqu     xmmword ptr [rsp+88h],xmm0  
00007FFB079EAB14  vmovdqu     xmmword ptr [rsp+98h],xmm0  
00007FFB079EAB1D  vmovdqu     xmmword ptr [rsp+0A8h],xmm0  
00007FFB079EAB26  mov         qword ptr [rsp+78h],1  
00007FFB079EAB2F  jmp         InlineStructInitializeBug.TestCase.Test()+01EFh (07FFB079EAB9Fh)  
00007FFB079EAB31  vmovsd      xmm0,qword ptr [rsp+28h]  
00007FFB079EAB37  vucomisd    xmm0,mmword ptr [InlineStructInitializeBug.TestCase.Test()+0368h (07FFB079EAD18h)]  
00007FFB079EAB3F  jp          InlineStructInitializeBug.TestCase.Test()+01C3h (07FFB079EAB73h)  
00007FFB079EAB41  jne         InlineStructInitializeBug.TestCase.Test()+01C3h (07FFB079EAB73h)  
00007FFB079EAB43  vxorps      xmm0,xmm0,xmm0  
00007FFB079EAB47  vmovdqu     xmmword ptr [rsp+78h],xmm0  
00007FFB079EAB4D  vmovdqu     xmmword ptr [rsp+88h],xmm0  
00007FFB079EAB56  vmovdqu     xmmword ptr [rsp+98h],xmm0  
00007FFB079EAB5F  vmovdqu     xmmword ptr [rsp+0A8h],xmm0  
00007FFB079EAB68  mov         qword ptr [rsp+78h],2  
00007FFB079EAB71  jmp         InlineStructInitializeBug.TestCase.Test()+01EFh (07FFB079EAB9Fh)  
00007FFB079EAB73  vxorps      xmm0,xmm0,xmm0  
00007FFB079EAB77  vmovdqu     xmmword ptr [rsp+78h],xmm0  
00007FFB079EAB7D  vmovdqu     xmmword ptr [rsp+88h],xmm0  
00007FFB079EAB86  vmovdqu     xmmword ptr [rsp+98h],xmm0  
00007FFB079EAB8F  vmovdqu     xmmword ptr [rsp+0A8h],xmm0  
00007FFB079EAB98  xor         ecx,ecx  
00007FFB079EAB9A  mov         qword ptr [rsp+78h],rcx  
		LoadInline(out var z, 1.0);
00007FFB079EAB9F  vmovsd      xmm0,qword ptr [InlineStructInitializeBug.TestCase.Test()+0378h (07FFB079EAD28h)]  
00007FFB079EABA7  vmovsd      qword ptr [rsp+20h],xmm0  
00007FFB079EABAD  vmovsd      xmm0,qword ptr [rsp+20h]  
00007FFB079EABB3  vucomisd    xmm0,mmword ptr [InlineStructInitializeBug.TestCase.Test()+0388h (07FFB079EAD38h)]  
00007FFB079EABBB  jp          InlineStructInitializeBug.TestCase.Test()+0236h (07FFB079EABE6h)  
00007FFB079EABBD  jne         InlineStructInitializeBug.TestCase.Test()+0236h (07FFB079EABE6h)  
00007FFB079EABBF  vxorps      xmm0,xmm0,xmm0  
00007FFB079EABC3  vmovdqu     xmmword ptr [rsp+38h],xmm0  
00007FFB079EABC9  vmovdqu     xmmword ptr [rsp+48h],xmm0  
00007FFB079EABCF  vmovdqu     xmmword ptr [rsp+58h],xmm0  
00007FFB079EABD5  vmovdqu     xmmword ptr [rsp+68h],xmm0  
00007FFB079EABDB  mov         qword ptr [rsp+38h],1  
00007FFB079EABE4  jmp         InlineStructInitializeBug.TestCase.Test()+0292h (07FFB079EAC42h)  
00007FFB079EABE6  vmovsd      xmm0,qword ptr [rsp+20h]  
00007FFB079EABEC  vucomisd    xmm0,mmword ptr [InlineStructInitializeBug.TestCase.Test()+0398h (07FFB079EAD48h)]  
00007FFB079EABF4  jp          InlineStructInitializeBug.TestCase.Test()+026Fh (07FFB079EAC1Fh)  
00007FFB079EABF6  jne         InlineStructInitializeBug.TestCase.Test()+026Fh (07FFB079EAC1Fh)  
00007FFB079EABF8  vxorps      xmm0,xmm0,xmm0  
00007FFB079EABFC  vmovdqu     xmmword ptr [rsp+38h],xmm0  
00007FFB079EAC02  vmovdqu     xmmword ptr [rsp+48h],xmm0  
00007FFB079EAC08  vmovdqu     xmmword ptr [rsp+58h],xmm0  
00007FFB079EAC0E  vmovdqu     xmmword ptr [rsp+68h],xmm0  
00007FFB079EAC14  mov         qword ptr [rsp+38h],2  
00007FFB079EAC1D  jmp         InlineStructInitializeBug.TestCase.Test()+0292h (07FFB079EAC42h)  
00007FFB079EAC1F  vxorps      xmm0,xmm0,xmm0  
00007FFB079EAC23  vmovdqu     xmmword ptr [rsp+38h],xmm0  
00007FFB079EAC29  vmovdqu     xmmword ptr [rsp+48h],xmm0  
00007FFB079EAC2F  vmovdqu     xmmword ptr [rsp+58h],xmm0  
00007FFB079EAC35  vmovdqu     xmmword ptr [rsp+68h],xmm0  
00007FFB079EAC3B  xor         ecx,ecx  
00007FFB079EAC3D  mov         qword ptr [rsp+38h],rcx  

Expected Behavior:

All three variants should produce the same optimal code.

			LoadInline(out var x, 1.0);
00007FFB079BAA0B  mov         dword ptr [rsp+0A8h],1  
			LoadInline(out var y, 2.0);
00007FFB079BAA16  mov         dword ptr [rsp+68h],2  
			LoadInline(out var z, 1.0);
00007FFB079BAA1E  mov         dword ptr [rsp+28h],1  

Actual Behavior:

Case 1: The optimal code :-)
Case 2: For large enough structs the code and runtime behavior is very bad and seems unsymetric between the same method calls.
Case 3: This is not a real use case because a senseless Unsafe cast is made without any usage. But no warning is raised, when you accidentally compile such a source code.

Off Topic:
It would be great - especially for my use case - when a method argument could be restricted to a compile time constant.

category:cq
theme:inlining
skill-level:intermediate
cost:medium
impact:small

@gerhard17 gerhard17 added the tenet-performance Performance related issue label Jan 16, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Jan 16, 2021
@EgorBo
Copy link
Member

EgorBo commented Jan 16, 2021

I can repro this on sharplab.io (too many variables (lclMAX_TRACKED=512) jit can track lead to spills?) but I can't repro it on master (net6):

G_M53653_IG01:              ;; offset=0000H
       55                   push     rbp
       4881ECC0000000       sub      rsp, 192
       C5F877               vzeroupper
       488DAC24C0000000     lea      rbp, [rsp+C0H]
       C4413857C0           vxorps   xmm8, xmm8
       48B840FFFFFFFFFFFFFF mov      rax, -192
       C5797F0428           vmovdqa  xmmword ptr [rax+rbp], xmm8
       C5797F440510         vmovdqa  xmmword ptr [rbp+rax+10H], xmm8
       C5797F440520         vmovdqa  xmmword ptr [rbp+rax+20H], xmm8
       4883C030             add      rax, 48
       75E9                 jne      SHORT  -5 instr
						;; bbWeight=1    PerfScore 7.58
G_M53653_IG02:              ;; offset=0039H
       48C745C001000000     mov      qword ptr [rbp-40H], 1
       C5F857C0             vxorps   xmm0, xmm0
       C5FA7F4580           vmovdqu  xmmword ptr [rbp-80H], xmm0
       C5FA7F4590           vmovdqu  xmmword ptr [rbp-70H], xmm0
       C5FA7F45A0           vmovdqu  xmmword ptr [rbp-60H], xmm0
       C5FA7F45B0           vmovdqu  xmmword ptr [rbp-50H], xmm0
       48C7458002000000     mov      qword ptr [rbp-80H], 2
       C5F857C0             vxorps   xmm0, xmm0
       C5FA7F8540FFFFFF     vmovdqu  xmmword ptr [rbp-C0H], xmm0
       C5FA7F8550FFFFFF     vmovdqu  xmmword ptr [rbp-B0H], xmm0
       C5FA7F8560FFFFFF     vmovdqu  xmmword ptr [rbp-A0H], xmm0
       C5FA7F8570FFFFFF     vmovdqu  xmmword ptr [rbp-90H], xmm0
       48C78540FFFFFF01000000 mov      qword ptr [rbp-C0H], 1
       488D7DC0             lea      rdi, [rbp-40H]
       E8075DFFFF           call     TestCase:SomeUsage(byref)
       488D7D80             lea      rdi, [rbp-80H]
       E8FE5CFFFF           call     TestCase:SomeUsage(byref)
       488DBD40FFFFFF       lea      rdi, [rbp-C0H]
       E8F25CFFFF           call     TestCase:SomeUsage(byref)
       90                   nop
						;; bbWeight=1    PerfScore 16.42
G_M53653_IG03:              ;; offset=00AFH
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 181, prolog size 57, PerfScore 45.40, instruction count 34, allocated bytes for code 194 (MethodHash=cd5c2e6a) for method TestCase:Test()
; ============================================================

I guess some recent PR helped (jump threading? cse for fp constants?)

@SingleAccretion
Copy link
Contributor

SingleAccretion commented Jan 16, 2021

Interesting, this reproduces for me on the latest master (c7fda3b).

G_M54881_IG01:
       sub      rsp, 232
       vzeroupper 
       xor      rax, rax
       mov      qword ptr [rsp+28H], rax
       vxorps   xmm4, xmm4
       vmovdqa  xmmword ptr [rsp+30H], xmm4
       vmovdqa  xmmword ptr [rsp+40H], xmm4
       mov      rax, -144
       vmovdqa  xmmword ptr [rsp+rax+E0H], xmm4
       vmovdqa  xmmword ptr [rsp+rax+F0H], xmm4
       vmovdqa  xmmword ptr [rsp+rax+100H], xmm4
       add      rax, 48
       jne      SHORT  -5 instr
       mov      qword ptr [rsp+E0H], rax
G_M54881_IG02:
       mov      qword ptr [rsp+A8H], 1
       vxorps   xmm0, xmm0
       vmovdqu  xmmword ptr [rsp+68H], xmm0
       vmovdqu  xmmword ptr [rsp+78H], xmm0
       vmovdqu  xmmword ptr [rsp+88H], xmm0
       vmovdqu  xmmword ptr [rsp+98H], xmm0
       mov      qword ptr [rsp+68H], 2
       vxorps   xmm0, xmm0
       vmovdqu  xmmword ptr [rsp+28H], xmm0
       vmovdqu  xmmword ptr [rsp+38H], xmm0
       vmovdqu  xmmword ptr [rsp+48H], xmm0
       vmovdqu  xmmword ptr [rsp+58H], xmm0
       mov      qword ptr [rsp+28H], 1
       lea      rcx, [rsp+A8H]
       call     SomeUsage(byref)
       lea      rcx, [rsp+68H]
       call     SomeUsage(byref)
       lea      rcx, [rsp+28H]
       call     SomeUsage(byref)
       nop      
G_M54881_IG03:
       add      rsp, 232
       ret      

At this point I am not sure as I was initially testing against ef73cd9, which does include the jump threading commit.

FWIW, here's my original analysis:

So, the elimination of redundant initialization is performed in Compiler::optRemoveRedundantZeroInits(), which has the following behavior:

This phase iterates over basic blocks starting with the first basic block
until there is no unique basic block successor or until it detects a loop.

The first example has only one basic block when it gets to this phase, while the second has forked control flow right after the first basic block:
------------ BB01 [000..046) -> BB03 (cond), preds={} succs={BB02,BB03}
As you've correctly identified, this is due to the known not-taken branches not having been pruned by that point. This is the end of the first basic block:

***** BB01
STMT00030 (IL 0x010...  ???)
N003 (  7,  9) [000129] -A------R---              *  ASG       double
N002 (  3,  4) [000128] D------N----              +--*  LCL_VAR   double V09 tmp6         
N001 (  3,  4) [000006] ------------              \--*  CNS_DBL   double 2.0000000000000000

***** BB01
STMT00023 (IL 0x010...  ???)
N004 (  9, 11) [000100] ------------              *  JTRUE     void  
N003 (  7,  9) [000099] N------N-U--              \--*  NE        int   
N001 (  3,  4) [000097] ------------                 +--*  LCL_VAR   double V09 tmp6         
N002 (  3,  4) [000098] ------------                 \--*  CNS_DBL   double 2.0000000000000000

This is after all the simplification of control flow and compaction of basic blocks that happens before that. The fact that the first call is "good" is almost accidental.

For example, if I were to swap the order of conditions to:

if (value == 2.0) { }
if (value == 1.0) { }

this results in all the structs being initialized, if (x, y, z) is (1, 2, 1) or none if it is (2, 2, 2). This is the result of the conditional jump nodes surviving inlining and making their way into morph, where a lot is eventually pruned, but evidently not all.

To the best of my understanding this happens because the fact that value has its address taken via ldarga causes the inliner to request a temp and lose sight of it being a constant.

This is from the first method:

Importing BB02 (PC=000) of 'ThrowawayTesting.Testing:LoadInline(byref,double)'
    [ 0]   0 (0x000) ldarg.1
    [ 1]   1 (0x001) ldc.r8 1.0000000000000000
    [ 2]  10 (0x00a) bne.un.s
Folding operator with constant nodes into a constant:
               [000024] N--------U--              *  NE        int   
               [000022] ------------              +--*  CNS_DBL   double 1.0000000000000000
               [000023] ------------              \--*  CNS_DBL   double 1.0000000000000000
Bashed to int constant:
               [000024] ------------              *  CNS_INT   int    0

And this is from the second:

Importing BB02 (PC=000) of 'ThrowawayTesting.Testing:LoadInline(byref,double)'
    [ 0]   0 (0x000) ldarg.1
lvaGrabTemp returning 4 (V04 tmp1) called for Inlining Arg.
    [ 1]   1 (0x001) ldc.r8 1.0000000000000000
    [ 2]  10 (0x00a) bne.un.s

               [000025] ------------              *  JTRUE     void  
               [000024] N--------U--              \--*  NE        int   
               [000022] ------------                 +--*  LCL_VAR   double V04 tmp1         
               [000023] ------------                 \--*  CNS_DBL   double 1.0000000000000000

Smuggling the parameter through a dummy local fixes the issue 😄:

LoadInline(out result, 0);

var local = value;
ulong ul = Unsafe.As<double, ulong>(ref local); // <-- LINE A
result.A1 = ul;                                 // <-- LINE B
mov      qword ptr [rsp+A8H], 1
mov      qword ptr [rsp+68H], 2
mov      qword ptr [rsp+28H], 1

Edit: 3rd case seems very much like the 2d, only now the locals replacing the argument are marked as address-taken due to this tree being created for Unsafe.As:

               [000040] --C---------              *  COMMA     void  
               [000084] ------------              +--*  ADDR      byref 
               [000085] -------N----              |  \--*  LCL_VAR   double V04/V08/V12 tmp1/tmp5/tmp9         
               [000039] ------------              \--*  NOP       void  

Local V04/V08/V12 should not be enregistered because: it is address exposed

This leads to all the trees surviving constant propagation.

This does not happen in the second case because the return from Unsafe.As gets immediately injected into the assignment of A1 (albeit with a local in between), producing these two statements:

***** BB16
STMT00026 (IL 0x010...  ???)
               [000112] -ACXG-------              *  ASG       long  
               [000111] D------N----              +--*  LCL_VAR   long   V10 tmp7         
               [000110] *-CXG-------              \--*  IND       long  
               [000161] ------------                 \--*  ADDR      byref 
               [000162] -------N----                    \--*  LCL_VAR   double V09 tmp6         

***** BB16
STMT00027 (IL 0x010...  ???)
               [000117] -A----------              *  ASG       long  
               [000116] -------N----              +--*  FIELD     long   A1
               [000113] ------------              |  \--*  ADDR      byref 
               [000114] -------N----              |     \--*  LCL_VAR   struct<MyStruct, 64> V01 loc1         
               [000115] ------------              \--*  LCL_VAR   long   V10 tmp7         

LocalAddressVisitor recognizes the IND(ADDR(LCL_VAR)) pattern:

Otherwise it must be accessed through some kind of indirection. Usually this is
something like IND(ADDR(LCL_VAR)), global morph will change it to GT_LCL_VAR or
GT_LCL_FLD so the lclvar does not need to be address exposed.

@JulieLeeMSFT
Copy link
Member

Assigning to Egor for now until further triage.

@JulieLeeMSFT JulieLeeMSFT added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Jan 22, 2021
@EgorBo
Copy link
Member

EgorBo commented Feb 10, 2021

Ah sorry, I didn't realize I had to uncomment those two(one) lines to reproduce.
@SingleAccretion thanks for such a deep analysis! 👍

To summarize it here is a minimal repro:

unsafe class Tests
{
    void InlineMe(int x)
    {
        if (x == 1)
            Console.WriteLine();

        void* unusedAddress = Unsafe.AsPointer(ref x);
        // even if it's used the first condition in this method still has to be folded if `x` is a constant
    }

    void Test()
    {
        InlineMe(2);
    }
}

Test here is supposed to be optimized into just no-op after inlining InlineMe but it instead emits:

G_M174_IG01:
       sub      rsp, 40
G_M174_IG02:
       mov      dword ptr [rsp+24H], 2
       cmp      dword ptr [rsp+24H], 1
       jne      SHORT G_M174_IG04
G_M174_IG03:
       call     System.Console:WriteLine()
G_M174_IG04:
       nop      
G_M174_IG05:
       add      rsp, 40
       ret

It happens (as @SingleAccretion pointed out) because impInlineFetchArg/fgInlinePrependStatements don't do substitution of GT_CNS_INT(2) argument because InlineMe takes an address of it ==> argCanBeModified is true (see here).

So I guess we can close this issue as a dup of "Implement forward substitution" #6973
Or implement a quick loop in fgInlinePrependStatements over BB/statements to do the substitution until we hit a Ldarga and give up after? (or we need SSA for that?)
@AndyAyersMS what do you think?

@EgorBo EgorBo added this to the Future milestone Feb 12, 2021
@EgorBo EgorBo removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Feb 12, 2021
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jan 13, 2022
Extend ref counting done by local morph so that we can determine
single-def single-use locals.

Add a phase that runs just after local morph that will attempt to
forward single-def single-use local defs to uses when they are in
adjacent statements.

Fix or work around issues uncovered elsewhere:
* `gtFoldExprCompare` might fold "identical" volatile subtrees
* `fgGetStubAddrArg` cannot handle complex trees
* some simd/hw operations can lose struct handles
* some calls cannot handle struct local args

Addresses dotnet#6973 and related issues. Still sorting through exactly
which ones are fixed, so list below may need revising.

Fixes dotnet#48605.
Fixes dotnet#51599.
Fixes dotnet#55472.

Improves some but not all cases in dotnet#12280 and dotnet#62064.

Does not fix dotnet#33002, dotnet#47082, or dotnet#63116; these require handling multiple
uses or bypassing statements.
AndyAyersMS added a commit that referenced this issue Feb 4, 2022
Extend ref counting done by local morph so that we can determine
single-def single-use locals.

Add a phase that runs just after local morph that will attempt to
forward single-def single-use local defs to uses when they are in
adjacent statements.

Fix or work around issues uncovered elsewhere:
* `gtFoldExprCompare` might fold "identical" volatile subtrees
* `fgGetStubAddrArg` cannot handle complex trees
* some simd/hw operations can lose struct handles
* some calls cannot handle struct local args
* morph expects args not to interfere
* fix arm; don't forward sub no return calls
* update debuginfo test (we may want to revisit this)
* handle subbing past normalize on store assignment
* clean up nullcheck of new helper

Addresses #6973 and related issues. Still sorting through exactly
which ones are fixed, so list below may need revising.

Fixes #48605.
Fixes #51599.
Fixes #55472.

Improves some but not all cases in #12280 and #62064.

Does not fix #33002, #47082, or #63116; these require handling multiple
uses or bypassing statements.
@jakobbotsch
Copy link
Member

Codegen is equivalent in all 3 cases today.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

5 participants