Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing/eventpipe/eventsvalidation tests failing with AF: codeLength > 0 #69375

Closed
jkotas opened this issue May 15, 2022 · 13 comments · Fixed by #70609
Closed

tracing/eventpipe/eventsvalidation tests failing with AF: codeLength > 0 #69375

jkotas opened this issue May 15, 2022 · 13 comments · Fixed by #70609
Assignees
Milestone

Comments

@jkotas
Copy link
Member

jkotas commented May 15, 2022

Hit in #69360 (comment):

     Assert failure(PID 5146 [0x0000141a], Thread: 5148 [0x141c]): codeLength > 0
          File: /__w/1/s/src/coreclr/vm/eetwain.cpp Line: 5880
          Image: /root/helix/work/correlation/corerun
      
      /root/helix/work/workitem/e/tracing/eventpipe/eventsvalidation/ExceptionThrown_V1/ExceptionThrown_V1.sh: line 408:  5146 Aborted                 (core dumped)

Based on the investigation below, this assert is just one of the possible failure modes of this test due this bug. Any crash or AF with ETW::MethodLog::SendEventsForJitMethods on stack is potentially caused by this bug.

@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 15, 2022
@jkotas
Copy link
Member Author

jkotas commented May 15, 2022

More info from the dump.

  • The assert is hit during tracing run-down.
#4  0xee20ce22 in PROCAbort (signal=0) at /__w/1/s/src/coreclr/pal/src/thread/process.cpp:2613
#5  0xee20ccda in RaiseFailFastException (pExceptionRecord=<optimized out>, pContextRecord=<optimized out>, dwFlags=<optimized out>) at /__w/1/s/src/coreclr/pal/src/thread/process.cpp:1264
#6  0xee096522 in TerminateOnAssert () at /__w/1/s/src/coreclr/utilcode/debug.cpp:185
#7  0xee0969e2 in _DbgBreakCheck (szFile=0xedce383a "/__w/1/s/src/coreclr/vm/eetwain.cpp", iLine=-316672944, szExpr=0xedcc359b "codeLength > 0", fConstrained=<optimized out>)
    at /__w/1/s/src/coreclr/utilcode/debug.cpp:415
#8  0xee096cd6 in _DbgBreakCheckNoThrow (szFile=0xedce383a "/__w/1/s/src/coreclr/vm/eetwain.cpp", iLine=5880, szExpr=0xedcc359b "codeLength > 0", fConstrained=0)
    at /__w/1/s/src/coreclr/utilcode/debug.cpp:434
#9  0xee096fa0 in DbgAssertDialog (szFile=0xedce383a "/__w/1/s/src/coreclr/vm/eetwain.cpp", iLine=5880, szExpr=0x0) at /__w/1/s/src/coreclr/pal/inc/pal.h:3485
#10 0xeddcde68 in EECodeManager::GetFunctionSize (this=<optimized out>, gcInfoToken=...) at /__w/1/s/src/coreclr/vm/eetwain.cpp:5880
#11 0xee1a3162 in EEJitManager::JitTokenToMethodRegionInfo (this=0xa435b18, MethodToken=..., methodRegionInfo=0xed1fe940) at /__w/1/s/src/coreclr/vm/codeman.h:1477
#12 0xedfd8996 in EECodeInfo::GetMethodRegionInfo (this=0xed1fe950, methodRegionInfo=0xed1fe940) at /__w/1/s/src/coreclr/vm/codeman.h:1672
#13 ETW::MethodLog::SendMethodEvent (pMethodDesc=0xe301530c, dwEventOptions=<optimized out>, bIsJit=<optimized out>, namespaceOrClassName=0x0, methodName=0x0, methodSignature=0x0,
    pNativeCodeStartAddress=3808445049, pConfig=0xed1fe9f0) at /__w/1/s/src/coreclr/vm/eventtrace.cpp:6660
#14 0xedfdc652 in ETW::MethodLog::SendEventsForJitMethodsHelper (pLoaderAllocatorFilter=<optimized out>, dwEventOptions=133256, fLoadOrDCStart=<optimized out>, fUnloadOrDCEnd=<optimized out>,
    fSendMethodEvent=1, fSendILToNativeMapEvent=1, fGetCodeIds=1) at /__w/1/s/src/coreclr/vm/eventtrace.cpp:7064
#15 0xedfdc872 in ETW::MethodLog::SendEventsForJitMethods (pDomainFilter=0xa432ed8, pLoaderAllocatorFilter=0x0, dwEventOptions=133256) at /__w/1/s/src/coreclr/vm/eventtrace.cpp:7148
#16 0xedfdcc2c in ETW::EnumerationLog::IterateDomain (pDomain=0xa432ed8, enumerationOptions=133256) at /__w/1/s/src/coreclr/vm/eventtrace.cpp:7241
#17 0xedfd334c in ETW::EnumerationLog::IterateAppDomain (pAppDomain=<optimized out>, enumerationOptions=133256) at /__w/1/s/src/coreclr/vm/eventtrace.cpp:7198
#18 ETW::EnumerationLog::EnumerationHelper (moduleFilter=<optimized out>, domainFilter=<optimized out>, enumerationOptions=133256) at /__w/1/s/src/coreclr/vm/eventtrace.cpp:7501
#19 0xedfd481c in ETW::EnumerationLog::EndRundown () at /__w/1/s/src/coreclr/vm/eventtrace.cpp:4224
#20 0xee182396 in ep_rt_execute_rundown (execution_checkpoints=<optimized out>) at /__w/1/s/src/coreclr/vm/eventing/eventpipe/ep-rt-coreclr.h:1904
#21 ep_session_execute_rundown (session=<optimized out>, execution_checkpoints=<optimized out>) at /__w/1/s/src/native/eventpipe/ep-session.c:321
#22 disable_holding_lock (id=<optimized out>, provider_callback_data_queue=<optimized out>) at /__w/1/s/src/native/eventpipe/ep.c:591
#23 disable_helper (id=3967830600) at /__w/1/s/src/native/eventpipe/ep.c:651
#24 0xee182040 in ep_disable (id=3967830600) at /__w/1/s/src/native/eventpipe/ep.c:1134
#25 0xee19788c in eventpipe_protocol_helper_stop_tracing (message=0xed1fed68, stream=0xec86e028) at /__w/1/s/src/native/eventpipe/ds-eventpipe-protocol.c:397
#26 ds_eventpipe_protocol_helper_handle_ipc_message (message=0xed1fed68, stream=0xec86e028) at /__w/1/s/src/native/eventpipe/ds-eventpipe-protocol.c:550
#27 0xee196c92 in server_thread (data=<optimized out>) at /__w/1/s/src/native/eventpipe/ds-server.c:158
#28 0xee211d78 in CorUnix::CPalThread::ThreadEntry (pvParam=0xa426b58) at /__w/1/s/src/coreclr/pal/src/thread/thread.cpp:1829
#29 0xee57a614 in start_thread (arg=0x8b84b381) at pthread_create.c:463

@jkotas
Copy link
Member Author

jkotas commented May 15, 2022

The rundown is trying to access GCInfo of a dynamic method. The dynamic method that it is trying to access appears to be destroyed already and its memory was overwritten that leads to the assert.

The finalizer thread is actively running and destroying other dynamic methods.

@jkotas
Copy link
Member Author

jkotas commented May 15, 2022

This appears to be a race condition in the rundown code.

@jkotas jkotas added blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' area-Tracing-coreclr labels May 15, 2022
@jkotas
Copy link
Member Author

jkotas commented May 15, 2022

You should be able to hit this crash more reliably by handcrafting a test that creates and destroys a lot of dynamic methods.

@tommcdon
Copy link
Member

@josalem

@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label May 16, 2022
@tommcdon tommcdon added this to the 7.0.0 milestone May 16, 2022
@tommcdon
Copy link
Member

Related to 6f563b3?

@josalem
Copy link
Contributor

josalem commented May 19, 2022

Has there only been one instance of this failing? If this is indeed a race condition in the tracing code, I would have assumed we would have seen this issue for a long time. Rundown logic hasn't changed significantly in a long time. I would assert that something else changed around the tracing code. I'll take a look at the PR Tom mentioned and see if I spot anything.

@jkotas
Copy link
Member Author

jkotas commented May 19, 2022

If this is indeed a race condition in the tracing code,

Yes, it is.

The heap enumeration and code memory allocation are all protected by m_CodeHeapCritSec. The problem that this lock does not protect the initialization of this memory. Here is the sequence of events that leads to this assert:

  1. EEJitManager::allocGCInfo allocates a piece of memory for GCInfo and sets up codeheader to point to this memory (under m_CodeHeapCritSec lock)
  2. The tracing code kicks in and starts enumerating the codeheap. It will see a codeheader that points to GCInfo block that was not initialized yet. It is what triggers the assert above.
  3. The JIT initializes GCInfo. It is too late. The tracing code crashed on the uninitialized GCInfo already.

@josalem
Copy link
Contributor

josalem commented May 19, 2022

Thanks for the clarification! Presumably this race has been around for a long time then. We'll take a look at adjusting the tracing code to handle this case.

@VincentBu
Copy link
Contributor

Failed again in: runtime-coreclr outerloop 20220523.2

Failed test:

coreclr Linux arm Checked no_tiered_compilation @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440

- tracing/eventpipe/eventsvalidation/ExceptionThrown_V1/ExceptionThrown_V1.sh

Error message:

Assert failure(PID 377 [0x00000179], Thread: 379 [0x017b]): codeLength > 0
File: /__w/1/s/src/coreclr/vm/eetwain.cpp Line: 5880
Image: /root/helix/work/correlation/corerun

/root/helix/work/workitem/e/tracing/eventpipe/eventsvalidation/ExceptionThrown_V1/ExceptionThrown_V1.sh: line 408:   377 Aborted                 (core dumped) $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"

Return code:      1
Raw output file:      /root/helix/work/workitem/uploads/Reports/tracing.eventpipe/eventsvalidation/ExceptionThrown_V1/ExceptionThrown_V1.output.txt
Raw output:
BEGIN EXECUTION
/root/helix/work/correlation/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false ExceptionThrown_V1.dll ''
0.0s: ==TEST STARTING==
3.3s: Attempting to delete the zombied pipe: /tmp/dotnet-diagnostic-310-7465873-socket
3.3s: Deleted
3.4s: Started sending sentinel events...
3.5s: Connecting to EventPipe...
5.5s: Creating EventPipeEventSource...
5.6s: EventPipeEventSource created
5.7s: Dynamic.All callback registered
5.7s: Starting stream processing...
6.5s: Saw new provider 'Microsoft-DotNETCore-SampleProfiler'
6.6s: Saw sentinel event
6.6s: Stopped sending sentinel events
6.7s: Starting event generating action...
6.7s: Thrown an excpetion 0 times...
6.7s: Thrown an excpetion 100 times...
6.8s: Thrown an excpetion 200 times...
6.8s: Saw new provider 'Microsoft-Windows-DotNETRuntime'
6.8s: Thrown an excpetion 300 times...
6.9s: Thrown an excpetion 400 times...
6.9s: Thrown an excpetion 500 times...
7.0s: Thrown an excpetion 600 times...
7.0s: Thrown an excpetion 700 times...
7.1s: Thrown an excpetion 800 times...
7.1s: Thrown an excpetion 900 times...
7.1s: Stopping event generating action
7.2s: Sending StopTracing command...
[createdump] Gathering state for process 377 corerun
[createdump] Crashing thread 0000017b signal 00000006
[createdump] Writing minidump with heap to file /home/helixbot/dotnetbuild/dumps/coredump.377.dmp
[createdump] Written 146321408 bytes (35723 pages) to core file
[createdump] Dump successfully written
Expected: 100
Actual: 134
END EXECUTION - FAILED
Test Harness Exitcode is : 1
To run the test:

set CORE_ROOT=/root/helix/work/correlation
/root/helix/work/workitem/e/tracing/eventpipe/eventsvalidation/ExceptionThrown_V1/ExceptionThrown_V1.sh
Expected: True
Actual:   False


Stack trace
   at tracing_eventpipe._eventsvalidation_ExceptionThrown_V1_ExceptionThrown_V1_._eventsvalidation_ExceptionThrown_V1_ExceptionThrown_V1_sh()

@tommcdon
Copy link
Member

tommcdon commented Jun 7, 2022

@davmason ptal

@tommcdon tommcdon assigned davmason and unassigned tommcdon Jun 7, 2022
@tommcdon
Copy link
Member

This test has 1 failure in the last 2 weeks, removing the blocking-ci tag

@tommcdon tommcdon removed the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Jun 10, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 11, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 30, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants