baseservices\exceptions\stackoverflow\stackoverflowtester fails under pgo stress on windows arm #81360

Closed
AndyAyersMS opened this issue Jan 30, 2023 · 16 comments · Fixed by #85272

Comments

@AndyAyersMS
Member

https://dev.azure.com/dnceng-public/public/_build/results?buildId=151537&view=ms.vss-test-web.build-test-results-tab

Fails for full PGO and random GDV, e.g.:

set DOTNET_TieredCompilation=1
set DOTNET_ReadyToRun=0
set DOTNET_TC_QuickJitForLoops=1
set DOTNET_TieredPGO=1

  Starting:    baseservices.exceptions.XUnitWrapper (parallel test collections = on, max threads = 2)
    baseservices\exceptions\stackoverflow\stackoverflowtester\stackoverflowtester.cmd [FAIL]
      
      Return code:      1
      Raw output file:      C:\h\w\A4B60915\w\AFE90948\uploads\Reports\baseservices.exceptions\stackoverflow\stackoverflowtester\stackoverflowtester.output.txt
      Raw output:
      BEGIN EXECUTION
       "C:\h\w\A4B60915\p\corerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false"  stackoverflowtester.dll 
      Running stackoverflow test(smallframe main)
      "Stack overflow."
      "Repeat 32317 times:"
      "--------------------------------"
      "   at TestStackOverflow.Program.InfiniteRecursionB()"
    
      ...

      Running stackoverflow3 test()
      Call number 50 to the Execute method
      Call number 100 to the Execute method
      Call number 150 to the Execute method
      "Stack overflow."
      ""
      "Assert failure(PID 2200 [0x00000898], Thread: 8624 [0x21b0]): Consistency check failed: FAILED: m_crawl.pFrame->IsTransitionToNativeFrame()"
      ""
      "<no module>! <no symbol> + 0x0 (0x00000000)"
      "    File: D:\a\_work\1\s\src\coreclr\vm\stackwalk.cpp Line: 2281"
      "    Image: C:\h\w\A4B60915\p\corerun.exe"
      ""
      ""
      Exit code: 0xC0000602, expected 0xC00000FD or 0x800703E9
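
To help read the last line of the log: 0xC00000FD is STATUS_STACK_OVERFLOW and 0x800703E9 is the HRESULT form of ERROR_STACK_OVERFLOW, the two codes the test accepts, while 0xC0000602 is STATUS_FAIL_FAST_EXCEPTION. That is consistent with the assert ending the process before the stack overflow could be reported normally. Below is a minimal sketch of that comparison. The `classifyExitCode` helper and the `Outcome` enum are hypothetical names used for illustration, not the test harness's actual code.

```cpp
#include <cstdint>

// Windows codes that appear in the log above.
constexpr std::uint32_t kStackOverflowStatus  = 0xC00000FD;  // STATUS_STACK_OVERFLOW
constexpr std::uint32_t kStackOverflowHresult = 0x800703E9;  // HRESULT_FROM_WIN32(ERROR_STACK_OVERFLOW)
constexpr std::uint32_t kFailFastStatus       = 0xC0000602;  // STATUS_FAIL_FAST_EXCEPTION

enum class Outcome { ExpectedStackOverflow, FailFast, Other };

// Hypothetical helper: maps a process exit code to the outcome it indicates.
constexpr Outcome classifyExitCode(std::uint32_t code) {
    return (code == kStackOverflowStatus || code == kStackOverflowHresult)
               ? Outcome::ExpectedStackOverflow
               : (code == kFailFastStatus ? Outcome::FailFast : Outcome::Other);
}

// The failing run above exited with the fail-fast code, not a stack overflow code.
static_assert(classifyExitCode(0xC0000602) == Outcome::FailFast,
              "0xC0000602 should classify as fail fast");
```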

@ghost added the untriaged label on Jan 30, 2023
@dotnet-issue-labeler (bot) added the area-CodeGen-coreclr label on Jan 30, 2023
@ghost

ghost commented Jan 30, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS added the arch-arm32 and os-windows labels and removed the untriaged label on Jan 30, 2023
@AndyAyersMS added this to the 8.0.0 milestone on Jan 30, 2023
@AndyAyersMS
Member Author

arm/windows is no longer supported.

We might want to consider disabling this test.

@AndyAyersMS
Member Author

Can't build locally: #81493.

Will download from CI instead.

@AndyAyersMS
Member Author

I can repro with downloaded bits; investigating.

@AndyAyersMS
Member Author

Looks like we catch the overflow and have a "partial context" and then assert after walking one frame:

 	KernelBase.dll!wil::details::DebugBreak(void)	Unknown
 	coreclr.dll!CHECK::Setup(const char * message, const char * condition, const char * file, int line) Line 198	C++
>	coreclr.dll!StackFrameIterator::NextRaw() Line 2281	C++
 	[Inline Frame] coreclr.dll!StackFrameIterator::Next() Line 1571	C++
 	coreclr.dll!Thread::StackWalkFramesEx(REGDISPLAY * pRD, StackWalkAction(*)(CrawlFrame *, void *) pCallback, void * pData, unsigned int flags, Frame * pStartFrame) Line 929	C++
 	coreclr.dll!Thread::StackWalkFrames(StackWalkAction(*)(CrawlFrame *, void *) pCallback, void * pData, unsigned int flags, Frame * pStartFrame) Line 1007	C++
 	coreclr.dll!LogCallstackForLogWorker(Thread * pThread) Line 307	C++
 	coreclr.dll!LogStackOverflowStackTraceThread(void * arg) Line 591	C++
 	[External Code]	

@janvorli I think this is something you should probably look at.

To repro on a Windows arm box, with runfo bits from CI:

set DOTNET_ReadyToRun=0
set DOTNET_TieredPGO=1
set DOTNET_JitRandomGuardedDevirtualization=1
set DOTNET_JitRandomEdgeCounts=1
set DOTNET_JitRandomlyCollect64BitCounts=1

set CORE_ROOT=...

corerun.exe stackoverflow3.dll

@janvorli
Member

janvorli commented Feb 3, 2023

@AndyAyersMS I'll take a look.

@AndyAyersMS
Member Author

@janvorli should I assign this over to you?

@janvorli assigned janvorli and unassigned AndyAyersMS on Feb 14, 2023
@janvorli
Member

@AndyAyersMS I have just assigned it to myself.

@JulieLeeMSFT
Member

@janvorli, did you have a chance to look into this? Our goal is to resolve CI test failures within a month. Thanks.

@janvorli
Member

janvorli commented Mar 6, 2023

@JulieLeeMSFT not yet, but taking a look now.

@janvorli
Member

@JulieLeeMSFT I have found the culprit. The problem occurs when the stack overflow happens in native code and there is an explicit frame between the managed and native code. Some time ago I added a fix for stack overflows that happen in native code, but I did not realize that it works only when there is no explicit frame between the managed code and the failing native code frame. That fix makes the FaultingExceptionFrame created for the stack overflow contain the context of the managed code frame, so the stack walker uses that context to move to the next frame. While doing so, it hits the other explicit frame, which it does not expect there, and fires an assert.

I am working on a fix and figuring out the cleanest way to do it.
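
The condition described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyFrame`, `FrameKind`, and `countSkippedExplicitFrames` are hypothetical names, the model assumes a downward-growing stack with the faulting frame at the head of the chain, and it does not mirror the CLR's real Frame classes, its stack walker, or the fix that later went in via #85272.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model only. These types are hypothetical and do not mirror the CLR's
// real Frame classes or its stack walker.
enum class FrameKind { FaultingException, Explicit };

struct ToyFrame {
    FrameKind kind;
    std::uintptr_t address;  // where the frame lives; the stack grows downward
};

// Counts explicit frames that sit below the stack pointer of the managed
// context stored in the faulting frame, i.e. frames pushed after managed code
// called into native code. A walker that resumes from the managed context has
// logically jumped over these frames, yet they are still on the chain.
inline int countSkippedExplicitFrames(const std::vector<ToyFrame>& chain,
                                      std::uintptr_t managedSp) {
    // Assumption of this sketch: the faulting frame heads the chain.
    assert(!chain.empty() && chain.front().kind == FrameKind::FaultingException);
    int skipped = 0;
    for (const ToyFrame& frame : chain) {
        if (frame.kind == FrameKind::Explicit && frame.address < managedSp) {
            ++skipped;
        }
    }
    return skipped;
}
```

In this model, a chain holding the faulting frame at 0x1000 and explicit frames at 0x3000 and 0x9000, with a managed stack pointer of 0x5000, yields one skipped frame (the one at 0x3000). A result of zero corresponds to the case the earlier fix already handled.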

@JulieLeeMSFT
Member

@janvorli, great that you root-caused this issue!!

@JulieLeeMSFT
Member

JulieLeeMSFT commented Mar 30, 2023

Failed in runtime-coreclr pgostress test: https://dev.azure.com/dnceng-public/public/_build/results?buildId=219898&view=ms.vss-test-web.build-test-results-tab&runId=4120128&resultId=121377&paneView=debug

Failed on Arm64: coreclr windows arm64 Checked fullpgo_random_gdv_edge @ Windows.11.Arm64.Open

Error message:

Return code:      1
Raw output file:      C:\h\w\ABE2098D\w\9B8E0845\uploads\Reports\baseservices.exceptions\stackoverflow\stackoverflowtester\stackoverflowtester.output.txt
Raw output:
BEGIN EXECUTION
"C:\h\w\ABE2098D\p\corerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false"  stackoverflowtester.dll
Running stackoverflow test(smallframe main)
"Stack overflow."
"Repeat 32326 times:"
"--------------------------------"
"   at TestStackOverflow.Program.InfiniteRecursionA()"
"   at TestStackOverflow.Program.InfiniteRecursionC()"
"   at TestStackOverflow.Program.InfiniteRecursionB()"
"--------------------------------"
"   at TestStackOverflow.Program.InfiniteRecursionA()"
"   at TestStackOverflow.Program.Test(Boolean)"
"   at TestStackOverflow.Program.Main(System.String[])"
""
Running stackoverflow test(largeframe main)
"Stack overflow."
"Repeat 8 times:"
"--------------------------------"
"   at TestStackOverflow.Program.InfiniteRecursionC2()"
"   at TestStackOverflow.Program.InfiniteRecursionB2()"
"   at TestStackOverflow.Program.InfiniteRecursionA2()"
"--------------------------------"
"   at TestStackOverflow.Program.Test(Boolean)"
"   at TestStackOverflow.Program.Main(System.String[])"
""
Running stackoverflow test(smallframe secondary)
"Stack overflow."
"Repeat 32379 times:"
"--------------------------------"
"   at TestStackOverflow.Program.InfiniteRecursionB()"
"   at TestStackOverflow.Program.InfiniteRecursionA()"
"   at TestStackOverflow.Program.InfiniteRecursionC()"
"--------------------------------"
"   at TestStackOverflow.Program.InfiniteRecursionB()"
"   at TestStackOverflow.Program.InfiniteRecursionA()"
"   at TestStackOverflow.Program.Test(Boolean)"
"   at TestStackOverflow.Program+<>c__DisplayClass7_0.<SecondaryThreadsTest>b__0()"
"   at System.Threading.Thread+StartHelper.RunWorker()"
"   at System.Threading.Thread+StartHelper.Run()"
"   at System.Threading.Thread.StartCallback()"
""
Running stackoverflow test(largeframe secondary)
"Stack overflow."
"Repeat 8 times:"
"--------------------------------"
"   at TestStackOverflow.Program.InfiniteRecursionC2()"
"   at TestStackOverflow.Program.InfiniteRecursionB2()"
"   at TestStackOverflow.Program.InfiniteRecursionA2()"
"--------------------------------"
"   at TestStackOverflow.Program.Test(Boolean)"
"   at TestStackOverflow.Program+<>c__DisplayClass7_0.<SecondaryThreadsTest>b__0()"
"   at System.Threading.Thread+StartHelper.RunWorker()"
"   at System.Threading.Thread+StartHelper.Run()"
"   at System.Threading.Thread.StartCallback()"
""
Running stackoverflow3 test()
Call number 50 to the Execute method
Call number 100 to the Execute method
Call number 150 to the Execute method
"Stack overflow."
""
"Assert failure(PID 8936 [0x000022e8], Thread: 6124 [0x17ec]): Consistency check failed: FAILED: m_crawl.pFrame->IsTransitionToNativeFrame()"
""
"<no module>! <no symbol> + 0x0 (0x00000000)"
"    File: D:\a_work\1\s\src\coreclr\vm\stackwalk.cpp Line: 2284"
"    Image: C:\h\w\ABE2098D\p\corerun.exe"
""
""
Exit code: 0xC0000602, expected 0xC00000FD or 0x800703E9
Expected: 100
Actual: 105
END EXECUTION - FAILED
FAILED
Test failed. Trying to see if dump file was created in C:\cores since 3/28/2023 4:54:28 PM
Processing C:\cores\corerun.exe.8936.dmp
Unable to find cdb.exe at C:\Program Files (x86)\Windows Kits\10\Debuggers\arm\cdb.exe
Processing C:\cores\corerun.exe.7800.dmp
Unable to find cdb.exe at C:\Program Files (x86)\Windows Kits\10\Debuggers\arm\cdb.exe
Test Harness Exitcode is : 1
To run the test:

set CORE_ROOT=C:\h\w\ABE2098D\p
C:\h\w\ABE2098D\w\9B8E0845\e\baseservices\exceptions\stackoverflow\stackoverflowtester\stackoverflowtester.cmd
Expected: True
Actual:   False

Stack trace

   at baseservices_exceptions._stackoverflow_stackoverflowtester_stackoverflowtester_._stackoverflow_stackoverflowtester_stackoverflowtester_cmd()
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)

@JulieLeeMSFT
Member

Failed again in runtime-coreclr pgostress:

baseservices\exceptions\stackoverflow\stackoverflowtester\stackoverflowtester.cmd

Running stackoverflow3 test()
Call number 50 to the Execute method
Call number 100 to the Execute method
Call number 150 to the Execute method
"Stack overflow."
""
"Assert failure(PID 5888 [0x00001700], Thread: 4160 [0x1040]): Consistency check failed: FAILED: m_crawl.pFrame->IsTransitionToNativeFrame()"
""
"<no module>! <no symbol> + 0x0 (0x00000000)"
"    File: D:\a_work\1\s\src\coreclr\vm\stackwalk.cpp Line: 2284"
"    Image: C:\h\w\A51808B0\p\corerun.exe"
""
""
Exit code: 0xC0000602, expected 0xC00000FD or 0x800703E9
Expected: 100
Actual: 105
END EXECUTION - FAILED
FAILED
Test failed. Trying to see if dump file was created in C:\cores since 4/1/2023 7:54:15 AM
Processing C:\cores\corerun.exe.5888.dmp
Unable to find cdb.exe at C:\Program Files (x86)\Windows Kits\10\Debuggers\arm\cdb.exe
Processing C:\cores\corerun.exe.9204.dmp
Unable to find cdb.exe at C:\Program Files (x86)\Windows Kits\10\Debuggers\arm\cdb.exe
Test Harness Exitcode is : 1
To run the test:
> set CORE_ROOT=C:\h\w\A51808B0\p
> C:\h\w\A51808B0\w\AE9408F0\e\baseservices\exceptions\stackoverflow\stackoverflowtester\stackoverflowtester.cmd
Expected: True
Actual:   False


Stack trace
   at baseservices_exceptions._stackoverflow_stackoverflowtester_stackoverflowtester_._stackoverflow_stackoverflowtester_stackoverflowtester_cmd()
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)

@JulieLeeMSFT
Member

Ping @janvorli. Checking back on the status.

@janvorli
Member

@JulieLeeMSFT thank you for reminding me. This fell off my radar because of my vacation and the sick leave that followed. I'll get back to it as soon as I can.

@JulieLeeMSFT removed the area-CodeGen-coreclr label on Apr 17, 2023
@ghost added the in-pr label on Apr 24, 2023
@ghost removed the in-pr label on Apr 25, 2023
@ghost locked as resolved and limited conversation to collaborators on May 25, 2023