Failures in checked/release asm diffs #76347

Closed
BruceForstall opened this issue Sep 29, 2022 · 21 comments · Fixed by #76460 or #76616

@BruceForstall
Member

Pipeline https://dev.azure.com/dnceng-public/public/_build/results?buildId=29308&view=results has been reporting failures that need to be investigated.

@BruceForstall BruceForstall added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 29, 2022
@BruceForstall BruceForstall added this to the 8.0.0 milestone Sep 29, 2022
@ghost

ghost commented Sep 29, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@BruceForstall BruceForstall self-assigned this Sep 29, 2022
@BruceForstall
Member Author

One example:

C:\bugs\spmicollect4>C:\gh\runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\superpmi.exe -a -c 155161 C:\gh\runtime\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\clrjit_win_x64_x64.dll C:\gh\runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\clrjit_win_x64_x64.dll c:\spmi\mch\eb8352bd-0a13-4b5b-badb-58f9ecc40c44.windows.x64\coreclr_tests.run.windows.x64.checked.mch
Using jit(C:\gh\runtime\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\clrjit_win_x64_x64.dll) with input (c:\spmi\mch\eb8352bd-0a13-4b5b-badb-58f9ecc40c44.windows.x64\coreclr_tests.run.windows.x64.checked.mch)
 indexCount=1 (155161)
Jit startup took 2.277200ms
Jit startup took 8.512700ms
Code Size mismatch: Left=45, Right=39

-----------------------------------------------
Block:   Left
Size:    45
Address: 264133383d0
CodePtr: 26413338754
-----------------------------------------------
264133383d0: 55                         push    rbp
264133383d1: 57                         push    rdi
264133383d2: 48 83 ec 28                sub     rsp, 40
264133383d6: 48 8d 6c 24 30             lea     rbp, [rsp + 48]
264133383db: 83 3d 60 e7 74 34 00       cmp     dword ptr [rip + 880076640], 0
264133383e2: 74 05                      je      5
264133383e4: e8 10 47 ad 93             call    -1817360624
264133383e9: ff 15 c0 96 75 34          call    qword ptr [rip + 880121536]
264133383ef: 89 45 f4                   mov     dword ptr [rbp - 12], eax
264133383f2: 8b 45 f4                   mov     eax, dword ptr [rbp - 12]
264133383f5: 98                         cwde
264133383f6: 48 83 c4 28                add     rsp, 40
264133383fa: 5f                         pop     rdi
264133383fb: 5d                         pop     rbp
264133383fc: c3                         ret
-----------------------------------------------
-----------------------------------------------
Block:   Right
Size:    39
Address: 26413338f30
CodePtr: 26413338d74
-----------------------------------------------
26413338f30: 55                         push    rbp
26413338f31: 57                         push    rdi
26413338f32: 48 83 ec 28                sub     rsp, 40
26413338f36: 48 8d 6c 24 30             lea     rbp, [rsp + 48]
26413338f3b: 83 3d 60 e7 74 34 00       cmp     dword ptr [rip + 880076640], 0
26413338f42: 74 05                      je      5
26413338f44: e8 10 47 ad 93             call    -1817360624
26413338f49: ff 15 c0 96 75 34          call    qword ptr [rip + 880121536]
26413338f4f: 90                         nop
26413338f50: 48 83 c4 28                add     rsp, 40
26413338f54: 5f                         pop     rdi
26413338f55: 5d                         pop     rbp
26413338f56: c3                         ret
-----------------------------------------------
ISSUE: <ASM_DIFF> main method 155161 of size 6 differs

@BruceForstall BruceForstall added the blocking-clean-ci-optional Blocking optional rolling runs label Sep 29, 2022
@BruceForstall
Member Author

In the above case, the issue is that the MC sets TailcallStress=1 because the test forces that during a run. The Checked compiler uses that and behaves differently (in this case, somewhat oddly differently?). Forcing TailcallStress=0 for the Checked compiler leads to no diffs.

This means we have a general problem with Checked/Release diffs of "run" collections that contain DEBUG-only configuration variables set that could affect behavior. We need to be able to force the DEBUG compiler to not use any such variables.

One very aggressive option would be to skip any MC that has any entry in the GetIntConfigValue or GetStringConfigValue tables.

We could explicitly override / clear variables using the -jitoption force ... and -jit2option force ... arguments, but that would require enumerating everything that might affect DEBUG only codegen, which might be fragile (and there might be command-line length issues).
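
As a rough illustration of the "very aggressive" option above, here is a minimal sketch that splits method contexts into those with no stored config values and those that would be skipped. The `MethodContext` class and its two dictionaries are hypothetical stand-ins for the GetIntConfigValue and GetStringConfigValue tables; this is not how superpmi represents an MC.

```python
from dataclasses import dataclass, field


@dataclass
class MethodContext:
    """Hypothetical stand-in for a recorded method context (MC)."""
    index: int
    int_config: dict = field(default_factory=dict)     # models the GetIntConfigValue table
    string_config: dict = field(default_factory=dict)  # models the GetStringConfigValue table


def partition_by_config(contexts):
    """Return (clean, skipped): an MC is skipped if either config table has an entry."""
    clean, skipped = [], []
    for mc in contexts:
        if mc.int_config or mc.string_config:
            skipped.append(mc)
        else:
            clean.append(mc)
    return clean, skipped


# Illustrative data only; the config entries here are assumptions for the demo.
contexts = [
    MethodContext(155161, int_config={"TailcallStress": 1}),
    MethodContext(63865),
    MethodContext(7, string_config={"JitStdOutFile": "out.txt"}),
]
clean, skipped = partition_by_config(contexts)
print([mc.index for mc in clean])    # [63865]
print([mc.index for mc in skipped])  # [155161, 7]
```

The cost of this approach is coverage: every MC with any stored config value is dropped from the diff, whether or not that value affects DEBUG-only codegen.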

@BruceForstall
Member Author

In the coreclr_tests collection, 1546 tests have GetIntConfigValue, 2383 have GetStringConfigValue (some will have both).

The current set of config values used in the tests is:

EnableAVX2
EnableHWIntrinsic
EnableSSE2
JitAggressiveInlining
JitConstCSE
JitDiffableDasm
JitDisasm
JitDoAssertionProp
JitDoRedundantBranchOpts
JitDoSsa
JitDoValueNumber
JitEnableFinallyCloning
JitFuncInfoLogFile
JITInlineDepth
JitNoCSE
JitNoForceFallback
JitNoStructPromotion
JitObjectStackAllocation
JitOptRepeat
JitProfileCasts
JitRandomGuardedDevirtualization
JitRandomOnStackReplacement
JitStdOutFile
JitStress
JitStressModeNames
JitStressModeNamesNot
JitStressModeNamesOnly
JitStressRegs
TailcallStress
TC_OnStackReplacement_InitialCounter

@jakobbotsch
Member

Do we know why this only started failing recently? The Sep 4th run has only x86 failures (that were fixed by #75338). The Sep 10th run has both x86 and x64 failures, so seems like something changed between Sep 4th and Sep 10th.

@BruceForstall
Member Author

I merged #74961 on Sept. 7, which is when the "run" collection appeared. (And removed the PMI collection the same day: #75211)

@BruceForstall
Member Author

After ignoring the Config values in the MCs, there are still 33 diffs. One example is JIT.HardwareIntrinsics.General.VectorAs__AsVectorUInt32:RunBasicScenario():this from (I believe) src\tests\JIT\HardwareIntrinsics\General\Vector128_1\AsVector.UInt32.cs, method context 63865 from coreclr_tests.run.windows.x64.checked.mch:

 55                  	push	rbp
 48 81 ec e0 00 00 00	sub	rsp, 224
 c5 f8 77            	vzeroupper
 48 8d ac 24 e0 00 00 00	lea	rbp, [rsp + 224]
 33 c0               	xor	eax, eax
 48 89 85 48 ff ff ff	mov	qword ptr [rbp - 184], rax
 c5 d8 57 e4         	vxorps	xmm4, xmm4, xmm4
 c5 f9 7f a5 50 ff ff ff	vmovdqa	xmmword ptr [rbp - 176], xmm4
 c5 f9 7f a5 60 ff ff ff	vmovdqa	xmmword ptr [rbp - 160], xmm4
 48 b8 70 ff ff ff ff ff ff ff	movabs	rax, -144
 c5 f9 7f 24 28      	vmovdqa	xmmword ptr [rax + rbp], xmm4
 c5 f9 7f 64 05 10   	vmovdqa	xmmword ptr [rbp + rax + 16], xmm4
 c5 f9 7f 64 05 20   	vmovdqa	xmmword ptr [rbp + rax + 32], xmm4
 48 83 c0 30         	add	rax, 48
 75 e9               	jne	-23
 48 89 4d 10         	mov	qword ptr [rbp + 16], rcx
 48 b9 78 86 b9 83 87 02 00 00	movabs	rcx, 2781053814392
 ff 15 e0 51 cf 07   	call	qword ptr [rip + 131027424]
 ff 15 08 f1 e7 07   	call	qword ptr [rip + 132641032]
 c5 f9 6e c0         	vmovd	xmm0, eax
 c4 e2 79 58 c0      	vpbroadcastd	xmm0, xmm0
 c5 f9 29 45 f0      	vmovapd	xmmword ptr [rbp - 16], xmm0
 c5 f9 28 45 f0      	vmovapd	xmm0, xmmword ptr [rbp - 16]
 c5 f8 28 c0         	vmovaps	xmm0, xmm0
 c5 fd 11 45 d0      	vmovupd	ymmword ptr [rbp - 48], ymm0
 48 8b 4d 10         	mov	rcx, qword ptr [rbp + 16]
 48 89 4d 98         	mov	qword ptr [rbp - 104], rcx
 c5 fd 10 45 d0      	vmovupd	ymm0, ymmword ptr [rbp - 48]
 c5 fd 11 45 b0      	vmovupd	ymmword ptr [rbp - 80], ymm0
 c5 f9 28 45 f0      	vmovapd	xmm0, xmmword ptr [rbp - 16]
 c5 f9 29 45 a0      	vmovapd	xmmword ptr [rbp - 96], xmm0
 48 8b 4d 98         	mov	rcx, qword ptr [rbp - 104]
 48 8d 55 b0         	lea	rdx, [rbp - 80]
 4c 8d 45 a0         	lea	r8, [rbp - 96]
 49 b9 78 86 b9 83 87 02 00 00	movabs	r9, 2781053814392
 ff 15 78 10 15 08   	call	qword ptr [rip + 135598200]
 c5 fe 6f 45 d0      	vmovdqu	ymm0, ymmword ptr [rbp - 48]        // Checked
 c5 f9 10 45 d0      	vmovupd	xmm0, xmmword ptr [rbp - 48]        // Release
 c5 f9 29 45 f0      	vmovapd	xmmword ptr [rbp - 16], xmm0
 48 8b 4d 10         	mov	rcx, qword ptr [rbp + 16]
 48 89 8d 48 ff ff ff	mov	qword ptr [rbp - 184], rcx
 c5 f9 28 45 f0      	vmovapd	xmm0, xmmword ptr [rbp - 16]
 c5 f9 29 45 80      	vmovapd	xmmword ptr [rbp - 128], xmm0
 c5 fd 10 45 d0      	vmovupd	ymm0, ymmword ptr [rbp - 48]
 c5 fd 11 85 50 ff ff ff	vmovupd	ymmword ptr [rbp - 176], ymm0
 48 8b 8d 48 ff ff ff	mov	rcx, qword ptr [rbp - 184]
 48 8d 55 80         	lea	rdx, [rbp - 128]
 4c 8d 85 50 ff ff ff	lea	r8, [rbp - 176]
 49 b9 78 86 b9 83 87 02 00 00	movabs	r9, 2781053814392
 ff 15 90 10 15 08   	call	qword ptr [rip + 135598224]
 90                  	nop
 c5 f8 77            	vzeroupper                    // Checked
 48 81 c4 e0 00 00 00	add	rsp, 224
 5d                  	pop	rbp
 c3                  	ret

@tannergooding Any idea what code is generating this? We need to find something that is causing DEBUG and non-DEBUG code to be different.

@BruceForstall
Member Author

In the Checked build dump, the diff code IR is:

Generating: N072 (???,???) [000061] -----------                            IL_OFFSET void   INLRT @ 0x029[E-] REG NA
Generating: N074 (  3,  2) [000021] -c---------                   t21 =    LCL_VAR   simd32<System.Numerics.Vector`1[System.UInt32]> V02 loc1          NA REG NA
                                                                        /--*  t21    simd32
Generating: N076 (  4,  3) [000022] -----------                   t22 = *  HWINTRINSIC simd16 uint GetLower REG mm0
IN0015:        vmovdqu  ymm0, ymmword ptr[V02 rbp-30H]

@tannergooding
Member

> In the Checked build dump, the diff code IR is:

Are you saying that part of the IR doesn't exist in the release build?

It's, in general, odd that we'd switch like this. I'm not aware of any logic that differs this way, especially not between debug/release.

We have a couple places that can differ between "minOpts" and "optimizationsEnabled", but those are normal/expected.

@BruceForstall
Member Author

> Are you saying that part of the IR doesn't exist in the release build?

No. It's very hard to see the IR in Release builds since we don't have the dumpers. So I presume some form of this IR does exist, but there's some kind of bug where we are inadvertently doing something in Checked that we shouldn't.

For this particular case, shouldn't the code generated for GetLower be what the Release version generates, namely vmovupd xmm0, xmmword ptr [rbp - 48]? It seems like the Checked version is loading the full 32 bytes into ymm0 instead of loading 16 bytes into xmm0. The code is implementing a Vector<UInt32>.AsVector128() call (with Vector<T> being SIMD32).

@tannergooding
Member

Is this using any COMPlus_EnableIsa=0 flags (like COMPlus_EnableAVX2=0)?

For the "default" scenario (AVX2 enabled) We import this as NI_Vector256_GetLower: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L731-L736

This will then hit here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L2356-L2366, which will generate the HWINTRINSIC simd16 <baseType> GetLower node we're seeing.

This node gets effectively no handling outside the common handling for HWIntrinsics (like VN/CSE) and isn't touched again until lowering (generalized handling) and lsra (special handling): https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lsraxarch.cpp#L2101-L2122

@BruceForstall
Member Author

There are no COMPlus variables set in this scenario.

One thing I noticed which looks dangerous (but doesn't seem to be the problem here) is that assertIsContainableHWIntrinsicOp(), inside a #ifdef DEBUG block, calls Lowering::TryGetContainableHWIntrinsicOp(). That function is documented as potentially having side-effects. It is generally illegal for code to have IR side-effects inside an #ifdef DEBUG block since those side-effects won't occur in a Release build. Can you see any way where this could be a problem?

@tannergooding
Member

> Can you see any way where this could be a problem?

Not off the top of my head. The handling here is about removing a NI_Vector128/256_CreateScalarUnsafe and consuming the underlying scalar op directly...

By the time we hit codegen we shouldn't have any such cases that should mutate. It would be good to fix this (likely move the logic to part of LowerCreateScalarUnsafe instead), but there's no way it could cause GetLower to suddenly read 32-bytes from memory instead of 16.

@tannergooding
Member

Something is flipping either the node type or the computed EA_ATTR to be SIMD32 and causing YMM to be used instead.

@BruceForstall
Member Author

Btw, this is a Tier-0 compilation, so OptimizationEnabled() is false

@BruceForstall
Member Author

Note that the Checked version is using YMM whereas the Release version is using XMM.

@tannergooding
Member

Right, which is "too large a read". GetLower returns a TYP_SIMD16 and so should just read 16-bytes.

I'll see if I can repro this locally and if I can determine what's breaking where...
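
The YMM/XMM difference is visible directly in the encoded bytes of the two differing instructions quoted earlier (`c5 fe 6f 45 d0` for Checked, `c5 f9 10 45 d0` for Release). A minimal sketch of decoding the 2-byte VEX prefix follows; `decode_vex2` is a hypothetical helper written for this illustration, not part of SuperPMI or the JIT, and it only looks at the vector-length bit (VEX.L) and the implied SIMD prefix (VEX.pp).

```python
def decode_vex2(encoding: bytes):
    """Decode VEX.L and VEX.pp from an instruction starting with a 2-byte VEX prefix (C5 xx)."""
    if len(encoding) < 2 or encoding[0] != 0xC5:
        raise ValueError("not a 2-byte VEX prefix")
    payload = encoding[1]
    # Bit 2 is VEX.L: 0 selects 128-bit (XMM) operands, 1 selects 256-bit (YMM).
    vector_bits = 256 if (payload >> 2) & 1 else 128
    # Bits 1..0 are VEX.pp: the legacy SIMD prefix the instruction implies.
    simd_prefix = {0: None, 1: 0x66, 2: 0xF3, 3: 0xF2}[payload & 0b11]
    return vector_bits, simd_prefix


checked = bytes.fromhex("c5fe6f45d0")  # vmovdqu ymm0, ymmword ptr [rbp - 48]
release = bytes.fromhex("c5f91045d0")  # vmovupd xmm0, xmmword ptr [rbp - 48]

print(decode_vex2(checked))  # (256, 243): a 32-byte read, F3 prefix
print(decode_vex2(release))  # (128, 102): a 16-byte read, 66 prefix
```

So the two builds differ both in which move instruction is chosen and in the operand width, which is consistent with something upstream changing the size the emitter sees.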

@tannergooding
Member

tannergooding commented Sep 30, 2022

Ok, so the Release code generation isn't a side effect of GetLower or other HWIntrinsic codegen AFAICT.

I changed the instruction being emitted for the relevant path and it still emits C5F91045D0 vmovupd xmm0, xmmword ptr [rbp-30H] in Release

Notably I did find two places where we were tracking the "wrong" simdSize for NI_Vector256_GetLower and where we could emit "better" codegen. I've put up a PR to fix those here: #76456

This has a side-effect of making checked consistent with release but it isn't actually the root cause. There still remains some actual issue causing Checked/Release to differ.

Maybe there is something special about LCL_VAR simd32 and how it's handled in morph/rationalize?

In checked we get:

    [ 0]  41 (0x029) ldloc.1
    [ 1]  42 (0x02a) call 2B000186
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
Named Intrinsic System.Runtime.Intrinsics.Vector128.AsVector128: Recognized
  Known type Vector128<uint>
  Known type SIMD Vector<uint>
  Known type SIMD Vector<uint>

    [ 1]  47 (0x02f) stloc.0

STMT00004 ( 0x029[E-] ... ??? )
               [000025] -A---------                         *  ASG       simd16 (copy)
               [000023] D------N---                         +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1[System.UInt32]> V01 loc0         
               [000022] -----------                         \--*  HWINTRINSIC simd16 uint GetLower
               [000021] -----------                            \--*  LCL_VAR   simd32<System.Numerics.Vector`1[System.UInt32]> V02 loc1         

The only real transform to this happens in rationalize where we rewrite asg(LCL_VAR, X) to STORE_LCL_VAR(X), thus getting:

N001 (  3,  2) [000021] -----------                   t21 =    LCL_VAR   simd32<System.Numerics.Vector`1[System.UInt32]> V02 loc1         
                                                            /--*  t21    simd32 
N002 (  4,  3) [000022] -----------                   t22 = *  HWINTRINSIC simd16 uint GetLower
                                                            /--*  t22    simd16 
N004 (  8,  6) [000025] DA---------                         *  STORE_LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1[System.UInt32]> V01 loc0

@tannergooding
Member

Found the bug... AsVector128 has an assert which mutates the simdSize
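
For readers unfamiliar with this class of bug, here is a minimal model of an assertion-only check that writes to state the rest of the compiler reads. Everything in it is hypothetical: `size_of_simd_type`, the dictionary standing in for `simdSize`, and the concrete values 16 and 32 are chosen for illustration and are not taken from the JIT source. The `debug_build` flag models code that exists only under DEBUG.

```python
def size_of_simd_type(type_name, out):
    """Hypothetical query helper with an out-parameter: it reports the size by
    writing into `out`, which is what makes it unsafe inside an assertion."""
    sizes = {"Vector128<uint>": 16, "Vector<uint>": 32}
    out["simd_size"] = sizes[type_name]
    return True


def import_call(debug_build):
    state = {"simd_size": 16}  # value that later phases rely on
    if debug_build:
        # Models an assert that passes the live state as the out-parameter:
        # the check only exists in DEBUG builds, yet it overwrites simd_size.
        ok = size_of_simd_type("Vector<uint>", state)
        if not ok:
            raise AssertionError("unknown SIMD type")
    return state["simd_size"]


def import_call_fixed(debug_build):
    state = {"simd_size": 16}
    if debug_build:
        scratch = {}  # the check writes to a throwaway, leaving live state alone
        ok = size_of_simd_type("Vector<uint>", scratch)
        if not ok:
            raise AssertionError("unknown SIMD type")
    return state["simd_size"]


print(import_call(True), import_call(False))              # 32 16: builds diverge
print(import_call_fixed(True), import_call_fixed(False))  # 16 16: builds agree
```

The general rule this illustrates is the one raised earlier in the thread: code that runs only under DEBUG must not have side effects on state that non-DEBUG code also depends on.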

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 30, 2022
BruceForstall added a commit to BruceForstall/runtime that referenced this issue Sep 30, 2022
Recorded SPMI method contexts include configuration environment variables
such as `COMPlus_JITMinOpts` that are replayed. However, when doing
asmdiffs replays to compare a Release to a Checked compiler (non-DEBUG
to DEBUG), there may be codegen-altering configuration variables
such as JitStress that are only read and interpreted by the DEBUG
compiler. This leads to asm diffs.

Introduce a `-ignoreStoredConfig` argument to superpmi.exe, and use it
in superpmi.py when doing Checked/Release asm diffs, that pretends there
are no stored config variables. This assumes that the stored config variables
only alter JIT behavior but that the JIT will succeed with or without them.
This is also slightly more than necessary: if there is a config variable
that the Release compiler knows about, we won't use that, either. However,
we have no easy way (currently) to distinguish which variables are DEBUG
and which are both DEBUG and non-DEBUG available.

Contributes to dotnet#76347
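
A minimal sketch of the idea in the commit message above, assuming a stored-config table modeled as a plain dictionary: when stored config is ignored, every lookup behaves as if nothing was recorded and the caller's default wins. The function name and signature are hypothetical and do not reflect the actual superpmi implementation.

```python
def get_int_config_value(stored, name, default, ignore_stored_config=False):
    """Hypothetical model of replaying a config lookup from a method context."""
    if ignore_stored_config:
        # Pretend the method context recorded no config values at all.
        return default
    return stored.get(name, default)


stored = {"TailcallStress": 1}  # recorded because the test forced it during the run

# Normal replay: a DEBUG compiler that asks for TailcallStress sees 1.
print(get_int_config_value(stored, "TailcallStress", 0))
# Replay with stored config ignored: the same query yields the default, 0.
print(get_int_config_value(stored, "TailcallStress", 0, ignore_stored_config=True))
```

As the commit message notes, this is broader than strictly necessary, since values that a Release compiler would also honor are ignored too.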
BruceForstall added a commit that referenced this issue Oct 1, 2022
Contributes to #76347
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Oct 2, 2022
@BruceForstall BruceForstall reopened this Oct 3, 2022
@BruceForstall
Member Author

BruceForstall commented Oct 3, 2022

In at least one case, the issue is that we have a tier-0 compilation with PGO instrumentation being replayed. The JIT calls allocPgoInstrumentationBySchema(), which returns a pointer on replay. Of course, this pointer can't be the same as during collection, meaning the pointer will be different between different replays. The NearDiffer doesn't know about this pointer, so it doesn't know to ignore its differences (or to map it back to what would be the original value if the original collection address were used).
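
One way to picture the "map it back" idea is to rewrite any immediate that falls inside the replay-time instrumentation buffer so that it is expressed relative to the original collection address, and only then compare. The sketch below works on plain lists of integer immediates; `normalize_immediates`, the buffer size, and all addresses are hypothetical and only illustrate the mapping, not the NearDiffer's real logic.

```python
def normalize_immediates(immediates, replay_base, buf_size, original_base):
    """Map immediates that point into the replay-time buffer back to the
    addresses they would have had at collection time."""
    normalized = []
    for imm in immediates:
        if replay_base <= imm < replay_base + buf_size:
            normalized.append(original_base + (imm - replay_base))
        else:
            normalized.append(imm)  # not a pointer into the buffer; leave as-is
    return normalized


BUF_SIZE = 0x100
ORIGINAL_BASE = 0x7000_0000  # address recorded at collection time (made up)

# Two replays of the same method: identical code, different buffer addresses.
replay_a = [40, 0x1000_0010]
replay_b = [40, 0x2000_0010]

norm_a = normalize_immediates(replay_a, 0x1000_0000, BUF_SIZE, ORIGINAL_BASE)
norm_b = normalize_immediates(replay_b, 0x2000_0000, BUF_SIZE, ORIGINAL_BASE)
print(replay_a == replay_b)  # False: the raw immediates differ
print(norm_a == norm_b)      # True: the difference was only the buffer address
```

A range check like this can misclassify an unrelated constant that happens to fall inside the buffer's address range, so a real implementation would need more context than the immediate value alone.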

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Oct 4, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Oct 5, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Nov 4, 2022