SPMI Replay failing sporadically #102773

Closed
tannergooding opened this issue May 28, 2024 · 11 comments · Fixed by #102914 or #103100
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'

Comments

@tannergooding
Member

tannergooding commented May 28, 2024

There appears to be a non-deterministic failure occurring in some SPMI replay runs across a range of PRs:

[11:07:25] ISSUE: <ASSERT> #288834 D:\a\_work\1\s\src\coreclr\jit\scopeinfo.cpp (864) - Assertion failed '!m_VariableLiveRanges->back().m_EndEmitLocation.Valid()' in 'System.Numerics.Tensors.TensorPrimitives+TanOperatorDouble:Invoke(System.Runtime.Intrinsics.Vector128`1[double]):System.Runtime.Intrinsics.Vector128`1[double]' during 'Generate code' (IL size 686; hash 0x5f971363; FullOpts)

It seems to trigger more reliably on x86 but also fails sometimes on x64; neither platform reproduces it every time: https://dev.azure.com/dnceng-public/public/_build?definitionId=150&_a=summary

The Invoke method in question isn't doing anything particularly special; it mostly performs basic arithmetic operations: https://source.dot.net/#System.Numerics.Tensors/System/Numerics/Tensors/netcore/TensorPrimitives.Tan.cs,b0e53fd55e442a32,references

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=689641
Build error leg or test failing: runtime-coreclr superpmi-replay (Build SuperPMI replay windows x86 checked)
Pull request: #102702

Known Issue Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "'!m_VariableLiveRanges->back().m_EndEmitLocation.Valid()' in 'System.Numerics.Tensors.TensorPrimitives+TanOperatorDouble",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=689641
Error message validated: ['!m_VariableLiveRanges->back().m_EndEmitLocation.Valid()' in 'System.Numerics.Tensors.TensorPrimitives+TanOperatorDouble]
Result validation: ❌ Known issue did not match with the provided build.
Validation performed at: 5/28/2024 5:44:27 PM UTC

Report

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 0 | 0 | 0 |
@tannergooding tannergooding added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' labels May 28, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label May 28, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@kunalspathak
Member

@jakobbotsch
Member

This looks like a bug in either LSRA or codegen in the order of uses of operands.
We have

N271 (  3,  2) [000192] -----------                  t192 =    LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V04 loc2         u:2 mm0 (last use) REG mm0 $2c3
N273 (  3,  2) [000181] ----------z                  t181 =    LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V11 loc9         u:2 mm6 (last use) REG mm6 $1ca
N275 (  3,  2) [000182] -c---------                  t182 =    LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V18 loc16        u:2 NA (last use) REG NA $2ca
N277 (  3,  2) [000527] -----------                  t527 =    LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V50 tmp26        u:2 mm0 (last use) REG mm0 $1cf
                                                            ┌──▌  t181   simd16 
                                                            ├──▌  t182   simd16 
                                                            ├──▌  t527   simd16 
N279 ( 10,  7) [000528] -----------                  t528 =   HWINTRINSIC simd16 double MultiplyAdd REG mm6 $2ce
N281 (  3,  2) [000194] ----------z                  t194 =    LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V04 loc2         u:2 mm0 REG mm0 $2c3
                                                            ┌──▌  t192   simd16 
                                                            ├──▌  t528   simd16 
                                                            ├──▌  t194   simd16 
N283 ( 17, 12) [000545] ----------Z                  t545 =   HWINTRINSIC simd16 double MultiplyAdd REG mm0 $2cf

Codegen consumes the operands of [000545] in order t192, t528, t194 while LSRA expects them to be used in a different order (t194 first, which unspills into xmm0 that gets used by t192).
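A toy model may help show why the consumption order matters here. This is only a sketch under stated assumptions, not how the JIT tracks liveness: `Operand` and `consume` are hypothetical names, the rule "a last use ends the local's live range" is a simplification, and only "t194 first" in the LSRA order comes from the comment above (the rest of that order is assumed).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operand:
    node: str       # tree node label from the dump, e.g. "t192"
    local: str      # local variable the node reads; "" if it is not a local
    last_use: bool  # True if the dump marks the node "(last use)"


def consume(operands, live):
    """Consume operands in order; return nodes that read an already-dead local."""
    live = set(live)
    stale = []
    for op in operands:
        if not op.local:
            continue  # intermediate values such as t528 are ignored in this model
        if op.local not in live:
            stale.append(op.node)  # the local's live range was already closed
        elif op.last_use:
            live.discard(op.local)  # a last use closes the live range
    return stale


t192 = Operand("t192", "V04", last_use=True)
t528 = Operand("t528", "", last_use=False)
t194 = Operand("t194", "V04", last_use=False)

lsra_order = [t194, t528, t192]     # assumed: only "t194 first" is from the thread
codegen_order = [t192, t528, t194]  # order reported for codegen

print(consume(lsra_order, {"V04"}))     # []
print(consume(codegen_order, {"V04"}))  # ['t194']
```

In this model the codegen order reaches t194 after the last use of V04 has already closed its live range, which is the kind of inconsistency a live-range assertion could plausibly catch.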

@jakobbotsch
Member

The LSRA handling for NI_FMA_MultiplyAdd seems to do a lot of swapping of operands before building their uses, which seems to be the source of the bug.
@tannergooding or @kunalspathak, can you please take a look?

jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue May 31, 2024
The operands of the FMA intrinsic are permuted in a non-standard way
during LSRA. Codegen already takes this into account, but the handling
was missing when consuming the operands.

Ideally we would permute these during lowering instead to avoid these
hacks.

Fix dotnet#102773
@jakobbotsch
Member

Actually, there are already provisions in codegen to take this swapping into account; they just don't get applied to the consumption of the operands. So the fix is probably straightforward... I opened #102914 with it.

jakobbotsch added a commit that referenced this issue May 31, 2024
The operands of the FMA intrinsic are permuted in a non-standard way
during LSRA. Codegen already takes this into account, but the handling
was missing when consuming the operands.

Fix #102773
@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label May 31, 2024
@kunalspathak
Member

Still seeing these failures in the latest main.


[09:22:35] ERROR: Method 527085 of size 1311 failed to load and compile correctly (C:\h\w\AFBB09D7\p\clrjit_unix_x64_x64.dll).

[09:22:35] ISSUE: <ASSERT> #515397 D:\a\_work\1\s\src\coreclr\jit\scopeinfo.cpp (864) - Assertion failed '!m_VariableLiveRanges->back().m_EndEmitLocation.Valid()' in 'System.Numerics.Tensors.TensorPrimitives:CosineSimilarityCore[float](System.ReadOnlySpan`1[float],System.ReadOnlySpan`1[float]):float' during 'Generate code' (IL size 1311; hash 0x9d7a95bf; Tier1)

[09:22:35] 

[09:22:35] ISSUE: <ASSERT> #527085 D:\a\_work\1\s\src\coreclr\jit\scopeinfo.cpp (864) - Assertion failed '!m_VariableLiveRanges->back().m_EndEmitLocation.Valid()' in 'System.Numerics.Tensors.TensorPrimitives:CosineSimilarityCore[float](System.ReadOnlySpan`1[float],System.ReadOnlySpan`1[float]):float' during 'Generate code' (IL size 1311; hash 0x9d7a95bf; Tier1)

@kunalspathak kunalspathak reopened this Jun 5, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jun 5, 2024
@jakobbotsch
Member

There are probably more issues like #102914 or #103024 that remain.

@jakobbotsch
Member

Although looking at the history of superpmi-replay it looks more like the problem was reintroduced by a recent change.

@kunalspathak
Member

> Although looking at the history of superpmi-replay it looks more like the problem was reintroduced by a recent change.

Yes, it seems to be from a recent change. I'm wondering whether some general handling is missing somewhere or whether we just need a point fix each time we make these changes. I'm not sure why these failures don't show up in the PR's CI run and manage to sneak in.

@jakobbotsch
Member

It could also be related to the recent superpmi-collect capturing the context with the problem. Anyway, I'll take a look.

@jakobbotsch
Member

Looks like LSRA and codegen are disagreeing on the emitOp order:

N001 (  3,  2) [000235] -----------                  t235 =    LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V32 loc30        u:2 <l:$644, c:$1d3>
N002 (  3,  2) [000236] -----------                  t236 =    LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V32 loc30        u:2 (last use) <l:$644, c:$1d3>
N003 (  3,  2) [000237] -----------                  t237 =    LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V29 loc27        u:3 (last use) $640
                                                            ┌──▌  t235   simd16 
                                                            ├──▌  t236   simd16 
                                                            ├──▌  t237   simd16 
N004 ( 10,  7) [001216] -----------                 t1216 =   HWINTRINSIC simd16 float MultiplyAdd <l:$60a, c:$60b>
...
LSRA: [000237] [000235] [000236]
Codegen: [000237] [000236] [000235]

I'll try to see if we can just remove this build/consumption order oddity.
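For reference, the disagreement in the two orders listed above can be located mechanically. This is a minimal sketch, assuming the orders are available as lists of node IDs; `first_divergence` is a hypothetical helper, not something in the JIT or SuperPMI.

```python
def first_divergence(lsra, codegen):
    """Return (index, lsra_node, codegen_node) at the first position that differs."""
    for i, (a, b) in enumerate(zip(lsra, codegen)):
        if a != b:
            return i, a, b
    return None  # both orders agree on their common prefix


lsra = ["000237", "000235", "000236"]
codegen = ["000237", "000236", "000235"]

# Both phases handle the same set of operands, just in a different order.
assert sorted(lsra) == sorted(codegen)

print(first_divergence(lsra, codegen))  # (1, '000235', '000236')
```

Both phases start with [000237] and then swap [000235] and [000236], the two uses of V32, only one of which is marked as the last use.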

jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue Jun 5, 2024
The building and consumption of these operands can happen in op1, op2,
op3 order regardless of whether the codegen uses the registers in a
different order.

Fix dotnet#102773
jakobbotsch added a commit that referenced this issue Jun 6, 2024
#103100)

The building and consumption of these operands can happen in op1, op2,
op3 order regardless of whether the codegen uses the registers in a
different order.

Fix #102773
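The idea in the commit message can be sketched as two independent steps. This is an illustration only: `consume_operands`, `emit_operands`, and the example permutations are hypothetical and are not taken from the actual change.

```python
def consume_operands(ops):
    """Consume operands in op1, op2, op3 order, as the commit message describes."""
    return list(ops)


def emit_operands(ops, permutation):
    """Choose the order in which the operands' registers appear in the instruction."""
    return [ops[i] for i in permutation]


ops = ["op1", "op2", "op3"]

# Hypothetical register permutations; the consumption order never changes.
for permutation in [(0, 1, 2), (2, 0, 1), (1, 2, 0)]:
    assert consume_operands(ops) == ["op1", "op2", "op3"]
    print(permutation, emit_operands(ops, permutation))
```

Keeping the two steps separate means the order in which uses are built and consumed no longer depends on which instruction form codegen ends up selecting.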
@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Jun 6, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jul 7, 2024