Fix Missing Metadata for ThreadPool Event #67150

josalem · 2022-03-25T17:22:05Z

Will require backport

(copy/paste of root cause explanation from issue)
In .net6 we added a managed implementation of the thread pool. To send the same events under the same provider we had to do some duplication of event definitions so that managed code had the ability to send the same events. Every event needs to be represented in managed code for EventListener to know how to parse the events. To that end, there is a python script that generates managed function stubs with the correct event IDs and parameters. For components that transition from native to managed, these stubs would be filled in by hand. For example, the thread pool had all the events sent from managed code filled out.

After that conversion, it looks like any event with ThreadPool in its name gets left out of the code generation phase:

runtime/src/coreclr/scripts/genRuntimeEventSources.py

Lines 88 to 91 in 0d59046

    
           # ThreadPool events are defined manually in NativeRuntimeEventSource.PortableThreadPool.cs 
        
           symbol = eventNode.getAttribute("symbol") 
        
           if "ThreadPool" in symbol: 
        
               return

This specific event, however, is only sent inside an FCALL related to overlapped IO and wasn't implemented in managed code:

<event value="65" version="0" level="win:Verbose"  template="ThreadPoolIOWork"
                           keywords="ThreadingKeyword"
                           task="ThreadPool"
                           opcode="IOPack"
                           symbol="ThreadPoolIOPack"
                           message="$(string.RuntimePublisher.ThreadPoolIOPackEventMessage)"/>

sent from:

runtime/src/coreclr/vm/nativeoverlapped.cpp

Lines 91 to 149 in 3fe2959

    
           FCIMPL1(LPOVERLAPPED, AllocateNativeOverlapped, OverlappedDataObject* overlappedUNSAFE) 
        
           { 
        
               FCALL_CONTRACT; 
        
               LPOVERLAPPED lpOverlapped       = NULL; 
        
               OVERLAPPEDDATAREF   overlapped  = ObjectToOVERLAPPEDDATAREF(overlappedUNSAFE); 
        
               OBJECTREF           userObject  = overlapped->m_userObject; 
        
               HELPER_METHOD_FRAME_BEGIN_RET_ATTRIB_2(Frame::FRAME_ATTR_NONE, overlapped, userObject); 
        
               if (g_pOverlappedDataClass == NULL) 
        
               { 
        
                   g_pOverlappedDataClass = CoreLibBinder::GetClass(CLASS__OVERLAPPEDDATA); 
        
                   // We have optimization to avoid creating event if IO is in default domain.  This depends on default domain 
        
                   // can not be unloaded. 
        
               } 
        
               CONSISTENCY_CHECK(overlapped->GetMethodTable() == g_pOverlappedDataClass); 
        
               if (userObject != NULL) 
        
               { 
        
                   if (userObject->GetMethodTable() == g_pPredefinedArrayTypes[ELEMENT_TYPE_OBJECT].AsMethodTable()) 
        
                   { 
        
                       BASEARRAYREF asArray = (BASEARRAYREF) userObject; 
        
                       OBJECTREF *pObj = (OBJECTREF*)(asArray->GetDataPtr()); 
        
                       SIZE_T num = asArray->GetNumComponents(); 
        
                       SIZE_T i; 
        
                       for (i = 0; i < num; i ++) 
        
                       { 
        
                           ValidatePinnedObject(pObj[i]); 
        
                       } 
        
                   } 
        
                   else 
        
                   { 
        
                       ValidatePinnedObject(userObject); 
        
                   } 
        
               } 
        
               NewHolder<NATIVEOVERLAPPED_AND_HANDLE> overlappedHolder(new NATIVEOVERLAPPED_AND_HANDLE()); 
        
               overlappedHolder->m_handle = GetAppDomain()->CreateTypedHandle(overlapped, HNDTYPE_ASYNCPINNED); 
        
               lpOverlapped = &(overlappedHolder.Extract()->m_overlapped); 
        
               lpOverlapped->Internal = 0; 
        
               lpOverlapped->InternalHigh = 0; 
        
               lpOverlapped->Offset = overlapped->m_offsetLow; 
        
               lpOverlapped->OffsetHigh = overlapped->m_offsetHigh; 
        
               lpOverlapped->hEvent = (HANDLE)overlapped->m_eventHandle; 
        
               overlapped->m_pNativeOverlapped = lpOverlapped; 
        
               HELPER_METHOD_FRAME_END(); 
        
               LOG((LF_INTEROP, LL_INFO10000, "In AllocNativeOperlapped thread 0x%x\n", GetThread())); 
        
               if (ETW_EVENT_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context, ThreadPoolIODequeue)) 
        
                   FireEtwThreadPoolIOPack(lpOverlapped, overlappedUNSAFE, GetClrInstanceId()); 
        
               return lpOverlapped; 
        
           } 
        
           FCIMPLEND

As a result, managed code doesn't have any metadata for this event. When the event is dispatched to managed code, EventSource attempts to read the struct in m_eventData which was zero initialized, so it NREs.

The reason this was only repro-ing for a web app is that ASP.NET Core on Windows uses a lot of overlapped IO so it sends these events while a basic console app does not.

This issue only presents for an in-process EventListener. Out-of-process listeners will not be causing NREs in the target runtime.

CC @dotnet/dotnet-diag @tommcdon

fixes #12141

ghost · 2022-03-25T17:22:10Z

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

Issue Details

fixes #65630

Will require backport

(copy/paste of root cause explanation from issue)
In .net6 we added a managed implementation of the thread pool. To send the same events under the same provider we had to do some duplication of event definitions so that managed code had the ability to send the same events. Every event needs to be represented in managed code for EventListener to know how to parse the events. To that end, there is a python script that generates managed function stubs with the correct event IDs and parameters. For components that transition from native to managed, these stubs would be filled in by hand. For example, the thread pool had all the events sent from managed code filled out.

After that conversion, it looks like any event with ThreadPool in its name gets left out of the code generation phase:

runtime/src/coreclr/scripts/genRuntimeEventSources.py

Lines 88 to 91 in 0d59046

    
           # ThreadPool events are defined manually in NativeRuntimeEventSource.PortableThreadPool.cs 
        
           symbol = eventNode.getAttribute("symbol") 
        
           if "ThreadPool" in symbol: 
        
               return

This specific event, however, is only sent inside an FCALL related to overlapped IO and wasn't implemented in managed code:

<event value="65" version="0" level="win:Verbose"  template="ThreadPoolIOWork"
                           keywords="ThreadingKeyword"
                           task="ThreadPool"
                           opcode="IOPack"
                           symbol="ThreadPoolIOPack"
                           message="$(string.RuntimePublisher.ThreadPoolIOPackEventMessage)"/>

sent from:

runtime/src/coreclr/vm/nativeoverlapped.cpp

Lines 91 to 149 in 3fe2959

    
           FCIMPL1(LPOVERLAPPED, AllocateNativeOverlapped, OverlappedDataObject* overlappedUNSAFE) 
        
           { 
        
               FCALL_CONTRACT; 
        
               LPOVERLAPPED lpOverlapped       = NULL; 
        
               OVERLAPPEDDATAREF   overlapped  = ObjectToOVERLAPPEDDATAREF(overlappedUNSAFE); 
        
               OBJECTREF           userObject  = overlapped->m_userObject; 
        
               HELPER_METHOD_FRAME_BEGIN_RET_ATTRIB_2(Frame::FRAME_ATTR_NONE, overlapped, userObject); 
        
               if (g_pOverlappedDataClass == NULL) 
        
               { 
        
                   g_pOverlappedDataClass = CoreLibBinder::GetClass(CLASS__OVERLAPPEDDATA); 
        
                   // We have optimization to avoid creating event if IO is in default domain.  This depends on default domain 
        
                   // can not be unloaded. 
        
               } 
        
               CONSISTENCY_CHECK(overlapped->GetMethodTable() == g_pOverlappedDataClass); 
        
               if (userObject != NULL) 
        
               { 
        
                   if (userObject->GetMethodTable() == g_pPredefinedArrayTypes[ELEMENT_TYPE_OBJECT].AsMethodTable()) 
        
                   { 
        
                       BASEARRAYREF asArray = (BASEARRAYREF) userObject; 
        
                       OBJECTREF *pObj = (OBJECTREF*)(asArray->GetDataPtr()); 
        
                       SIZE_T num = asArray->GetNumComponents(); 
        
                       SIZE_T i; 
        
                       for (i = 0; i < num; i ++) 
        
                       { 
        
                           ValidatePinnedObject(pObj[i]); 
        
                       } 
        
                   } 
        
                   else 
        
                   { 
        
                       ValidatePinnedObject(userObject); 
        
                   } 
        
               } 
        
               NewHolder<NATIVEOVERLAPPED_AND_HANDLE> overlappedHolder(new NATIVEOVERLAPPED_AND_HANDLE()); 
        
               overlappedHolder->m_handle = GetAppDomain()->CreateTypedHandle(overlapped, HNDTYPE_ASYNCPINNED); 
        
               lpOverlapped = &(overlappedHolder.Extract()->m_overlapped); 
        
               lpOverlapped->Internal = 0; 
        
               lpOverlapped->InternalHigh = 0; 
        
               lpOverlapped->Offset = overlapped->m_offsetLow; 
        
               lpOverlapped->OffsetHigh = overlapped->m_offsetHigh; 
        
               lpOverlapped->hEvent = (HANDLE)overlapped->m_eventHandle; 
        
               overlapped->m_pNativeOverlapped = lpOverlapped; 
        
               HELPER_METHOD_FRAME_END(); 
        
               LOG((LF_INTEROP, LL_INFO10000, "In AllocNativeOperlapped thread 0x%x\n", GetThread())); 
        
               if (ETW_EVENT_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context, ThreadPoolIODequeue)) 
        
                   FireEtwThreadPoolIOPack(lpOverlapped, overlappedUNSAFE, GetClrInstanceId()); 
        
               return lpOverlapped; 
        
           } 
        
           FCIMPLEND

As a result, managed code doesn't have any metadata for this event. When the event is dispatched to managed code, EventSource attempts to read the struct in m_eventData which was zero initialized, so it NREs.

The reason this was only repro-ing for a web app is that ASP.NET Core on Windows uses a lot of overlapped IO so it sends these events while a basic console app does not.

I have a patch that fixes this and will hopefully have a PR out very soon. We will need to backport this to .net6.

This issue only presents for an in-process EventListener. Out-of-process listeners will not be causing NREs in the target runtime.

CC @dotnet/dotnet-diag @tommcdon

Author:	josalem
Assignees:	josalem
Labels:	`area-System.Diagnostics.Tracing`
Milestone:	7.0.0

kouvel

Couple of suggestions, otherwise LGTM, thanks!

...vate.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.cs

...s/System.Private.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.cs

noahfalk

LGTM modulo comments inline

...vate.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.cs

...s/System.Private.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.cs

josalem · 2022-03-30T17:24:35Z

src/tests/tracing/runtimeeventsource/nativeruntimeeventsource.csproj

-    <!--
-      Test is blocking official build.
-      Related failures: #18907, #19340, #22441, #22729.
-      Issue tracking to re-enable: #22898.
-    -->
-    <DisableProjectBuild>true</DisableProjectBuild>


None of these issue numbers seem related to this code. I'm going to poke at the old coreclr repo issues to see if these refer to those numbers (the transfer did some funny business with mapping the numbers).

#12141

This test has been off for >3 years now. I want this test back on, but perhaps not as part of a changeset that is going to be backported. Thoughts?

josalem · 2022-04-13T18:39:16Z

~~I can't seem to repro the Mono failure locally~~ edit: 🤦 I was running the wrong test. I can repro the Mono failure locally. It looks like there's a SIGABRT when trying to get a wait handle from the GC:

=================================================================
Managed Stacktrace:
=================================================================
  at <unknown> <0xffffffff>
  at System.Runtime.InteropServices.GCHandle:InternalGet <0x00079>
  at System.Runtime.InteropServices.GCHandle:get_Target <0x00028>
  at HandleManager:FromHandle <0x00077>
  at System.Threading.WaitSubsystem:SetEvent <0x0001f>
  at System.Threading.EventWaitHandle:Set <0x0005a>
  at System.Diagnostics.Tracing.EventPipeEventDispatcher:StopDispatchTask <0x00042>
  at System.Diagnostics.Tracing.EventPipeEventDispatcher:CommitDispatchConfiguration <0x0008b>
  at System.Diagnostics.Tracing.EventPipeEventDispatcher:SendCommand <0x000dc>
  at System.Diagnostics.Tracing.EventListener:EnableEvents <0x000c2>
  at System.Diagnostics.Tracing.EventListener:EnableEvents <0x00023>
  at Tracing.Tests.SimpleEventListener:OnEventSourceCreated <0x00061>
  at System.Diagnostics.Tracing.EventSource:AddListener <0x00194>
  at <>c:<.ctor>b__10_0 <0x00044>
  at System.Diagnostics.Tracing.EventListener:CallBackForExistingEventSources <0x00268>
  at System.Diagnostics.Tracing.EventListener:.ctor <0x000b3>
  at Tracing.Tests.SimpleEventListener:.ctor <0x0001e>
  at Tracing.Tests.NativeRuntimeEventSourceTest:Main <0x000ab>
  at <Module>:runtime_invoke_int_object <0x00069>
=================================================================

Details


=================================================================
External Debugger Dump:
=================================================================
[New LWP 20477]
[New LWP 20478]
[New LWP 20479]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
warning: the debug information found in "/datadisks/disk1/work/B9570A9B/p/libcoreclr.so.dbg" does not match "/datadisks/disk1/work/B9570A9B/p/libcoreclr.so" (CRC mismatch).

0x00007f27c9e3a32a in __waitpid (pid=20595, stat_loc=0x7fff6d150ec0, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
30../sysdeps/unix/sysv/linux/waitpid.c: No such file or directory.
  Id   Target Id         Frame 
* 1    Thread 0x7f27ca461740 (LWP 20476) "corerun" 0x00007f27c9e3a32a in __waitpid (pid=20595, stat_loc=0x7fff6d150ec0, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
  2    Thread 0x7f27c63ff700 (LWP 20477) "SGen worker" 0x00007f27c9e35ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f27c90c16b8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  3    Thread 0x7f27c4871700 (LWP 20478) "corerun" 0x00007f27c920cbb9 in __GI___poll (fds=0x7f27c0002d50, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  4    Thread 0x7f27c4670700 (LWP 20479) "Finalizer" 0x00007f27c9e387c6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f27c90b2560) at ../sysdeps/unix/sysv/linux/futex-internal.h:205

Thread 4 (Thread 0x7f27c4670700 (LWP 20479)):
#0  0x00007f27c9e387c6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f27c90b2560) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x7f27c90b2560, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007f27c9e388b8 in __new_sem_wait_slow (sem=0x7f27c90b2560, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007f27c78e64b6 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#4  0x00007f27c78bf6ea in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#5  0x00007f27c78bf579 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#6  0x00007f27c9e2f6db in start_thread (arg=0x7f27c4670700) at pthread_create.c:463
#7  0x00007f27c921961f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f27c4871700 (LWP 20478)):
#0  0x00007f27c920cbb9 in __GI___poll (fds=0x7f27c0002d50, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f27c7b4811a in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#2  0x00007f27c7b45605 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#3  0x00007f27c7b43e69 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#4  0x00007f27c7b47531 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#5  0x00007f27c9e2f6db in start_thread (arg=0x7f27c4871700) at pthread_create.c:463
#6  0x00007f27c921961f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f27c63ff700 (LWP 20477)):
#0  0x00007f27c9e35ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f27c90c16b8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7f27c90c1668, cond=0x7f27c90c1690) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7f27c90c1690, mutex=0x7f27c90c1668) at pthread_cond_wait.c:655
#3  0x00007f27c7969143 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#4  0x00007f27c9e2f6db in start_thread (arg=0x7f27c63ff700) at pthread_create.c:463
#5  0x00007f27c921961f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f27ca461740 (LWP 20476)):
#0  0x00007f27c9e3a32a in __waitpid (pid=20595, stat_loc=0x7fff6d150ec0, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0x00007f27c7a818b7 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#2  0x00007f27c7a2372e in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#3  0x00007f27c7a81048 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#4  <signal handler called>
#5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#6  0x00007f27c91387f1 in __GI_abort () at abort.c:79
#7  0x00007f27c7afb045 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#8  0x00007f27c79115a3 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#9  0x00007f27c7afb405 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#10 0x00007f27c7afb53e in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#11 0x00007f27c7938c86 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#12 0x00007f27c78c2fda in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#13 0x00007f27c78756d1 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#14 0x00000000418eacea in ?? ()
#15 0x000056390eb8c0c8 in ?? ()
#16 0x00007f27c53ff000 in ?? ()
#17 0x0000000000000000 in ?? ()
/datadisks/disk1/work/B9570A9B/w/AA38092D/e/tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.sh: line 395: 20476 Aborted                 (core dumped) $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"

Return code:      1
Raw output file:      /datadisks/disk1/work/B9570A9B/w/AA38092D/uploads/Reports/tracing.runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.output.txt
Raw output:
BEGIN EXECUTION
/datadisks/disk1/work/B9570A9B/p/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false nativeruntimeeventsource.dll ''
Why are we accessing an entry that is not allocated

=================================================================
Native Crash Reporting
=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

=================================================================
Native stacktrace:
=================================================================
0x7f27c7a81772 - Unknown
0x7f27c7a2372e - Unknown
0x7f27c7a81048 - Unknown
0x7f27c9e3a980 - Unknown
0x7f27c9136e87 - Unknown
0x7f27c91387f1 - Unknown
0x7f27c7afb045 - Unknown
0x7f27c79115a3 - Unknown
0x7f27c7afb405 - Unknown
0x7f27c7afb53e - Unknown
0x7f27c7938c86 - Unknown
0x7f27c78c2fda - Unknown
0x7f27c78756d1 - Unknown
0x418eacea - Unknown

=================================================================
Basic Fault Address Reporting
=================================================================
Memory around native instruction pointer (0x7f27c9136e87):0x7f27c9136e77  d2 4c 89 ce bf 02 00 00 00 b8 0e 00 00 00 0f 05  .L..............
0x7f27c9136e87  48 8b 8c 24 08 01 00 00 64 48 33 0c 25 28 00 00  H..$....dH3.%(..
0x7f27c9136e97  00 44 89 c0 75 1f 48 81 c4 18 01 00 00 c3 0f 1f  .D..u.H.........
0x7f27c9136ea7  00 48 8b 15 b9 bf 3a 00 f7 d8 41 b8 ff ff ff ff  .H....:...A.....

=================================================================
Managed Stacktrace:
=================================================================
  at <unknown> <0xffffffff>
  at System.Runtime.InteropServices.GCHandle:InternalGet <0x00079>
  at System.Runtime.InteropServices.GCHandle:get_Target <0x00028>
  at HandleManager:FromHandle <0x00077>
  at System.Threading.WaitSubsystem:SetEvent <0x0001f>
  at System.Threading.EventWaitHandle:Set <0x0005a>
  at System.Diagnostics.Tracing.EventPipeEventDispatcher:StopDispatchTask <0x00042>
  at System.Diagnostics.Tracing.EventPipeEventDispatcher:CommitDispatchConfiguration <0x0008b>
  at System.Diagnostics.Tracing.EventPipeEventDispatcher:SendCommand <0x000dc>
  at System.Diagnostics.Tracing.EventListener:EnableEvents <0x000c2>
  at System.Diagnostics.Tracing.EventListener:EnableEvents <0x00023>
  at Tracing.Tests.SimpleEventListener:OnEventSourceCreated <0x00061>
  at System.Diagnostics.Tracing.EventSource:AddListener <0x00194>
  at <>c:<.ctor>b__10_0 <0x00044>
  at System.Diagnostics.Tracing.EventListener:CallBackForExistingEventSources <0x00268>
  at System.Diagnostics.Tracing.EventListener:.ctor <0x000b3>
  at Tracing.Tests.SimpleEventListener:.ctor <0x0001e>
  at Tracing.Tests.NativeRuntimeEventSourceTest:Main <0x000ab>
  at <Module>:runtime_invoke_int_object <0x00069>
=================================================================
Expected: 100
Actual: 134
END EXECUTION - FAILED
Test Harness Exitcode is : 1
To run the test:
> set CORE_ROOT=/datadisks/disk1/work/B9570A9B/p
> /datadisks/disk1/work/B9570A9B/w/AA38092D/e/tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.sh

josalem · 2022-04-13T19:09:18Z

Having some difficulties running the test under a debugger, but I was able to glean that we are hitting this assert in sgen-array-list.h:

static inline volatile gpointer *
sgen_array_list_get_slot (SgenArrayList *array, guint32 index)
{
	guint32 bucket, offset;

	SGEN_ASSERT (0, index < array->capacity, "Why are we accessing an entry that is not allocated"); // <----

	sgen_array_list_bucketize (index, &bucket, &offset);
	return &(array->entries [bucket] [offset]);
}

Which we seem to be getting to from this chain:

sgen_gchandle_get_target
mono_gchandle_get_target_handle
ves_icall_System_GCHandle_InternalGet_raw
...

Looking at sgen_gchandle_get_target:

GCObject*
sgen_gchandle_get_target (guint32 gchandle)
{
	guint index = MONO_GC_HANDLE_SLOT (gchandle);
	GCHandleType type = MONO_GC_HANDLE_TYPE (gchandle);
	HandleData *handles = gc_handles_for_type (type);
	/* Invalid handles are possible; accessing one should produce NULL. (#34276) */
	if (!handles)
		return NULL;
	return link_get (sgen_array_list_get_slot (&handles->entries_array, index), MONO_GC_HANDLE_TYPE_IS_WEAK (type));
}

It looks like we are getting to the return statement and the value of index is greater than the size of &handles->entries_array.

Hopefully this becomes painfully obvious once I can get it running under LLDB.

josalem · 2022-04-13T23:14:06Z

Got it under a debugger finally. Turns out you can't use LLDB in Codespaces -_-.

Anyway, I managed to peak at the frames leading up to the assert on Mono. Looking at the data shows that we're getting an invalid index into the HandleData->entries_array.entries array when the gchandle is masked:

(lldb) f 11
frame #11: 0x000000010312c0f0 libcoreclr.dylib`sgen_gchandle_get_target(gchandle=2151731204) at sgen-gchandles.c:305:19
   302 		/* Invalid handles are possible; accessing one should produce NULL. (#34276) */
   303 		if (!handles)
   304 			return NULL;
-> 305 		return link_get (sgen_array_list_get_slot (&handles->entries_array, index), MONO_GC_HANDLE_TYPE_IS_WEAK (type));
   306 	}
   307
   308 	void
(lldb) fr v
(guint32) gchandle = 2151731204
(guint) index = 268966400 // <---------
(GCHandleType) type = HANDLE_PINNED
(HandleData *) handles = 0x0000000103532c48
(lldb) p handles
(HandleData *) $1 = 0x0000000103532c48
(lldb) p *handles
(HandleData) $2 = {
  entries_array = {
    entries = {
      [0] = 0x0000600003e04300
      [1] = 0x0000000000000000
      [2] = 0x0000000000000000
      [3] = 0x0000000000000000
      ...
    }
    capacity = 32
    slot_hint = 0
    next_slot = 1
    bucket_alloc_callback = 0x000000010312d2a0 (libcoreclr.dylib`bucket_alloc_callback at sgen-gchandles.c:106)
    is_slot_set_func = 0x000000010312d190 (libcoreclr.dylib`is_slot_set at sgen-gchandles.c:88)
    set_slot_func = 0x000000010312d1d0 (libcoreclr.dylib`try_occupy_slot at sgen-gchandles.c:98)
    mem_type = -1
  }
  type = '\x03'
}

The code is supposed to be looking up a SafeWaitHandle for the wait handle on the buffer manager of the session. Poking at the session shows that the underlying handle matches (0x000000010081a008):

(lldb) p *_ep_sessions[0]->buffer_manager
(EventPipeBufferManager) $10 = {
  thread_session_state_list = {
    list = 0x0000000000000000
  }
  sequence_points = {
    list = 0x0000000000000000
  }
  rt_wait_event = (event = 0x000000010081a008) // <---------------
  rt_lock = {
    lock = 0x0000600001718d80
    owning_thread_id = 0x0000000000000000
  }
  ...
}

    frame #10: 0x000000010312c256 libcoreclr.dylib`sgen_array_list_get_slot(array=0x0000000103532c48, index=268966400) at sgen-array-list.h:112:2
  * frame #11: 0x000000010312c0f0 libcoreclr.dylib`sgen_gchandle_get_target(gchandle=2151731204) at sgen-gchandles.c:305:19
    frame #12: 0x00000001030c723d libcoreclr.dylib`mono_gchandle_get_target_internal(gchandle=0x000000010081a008) at sgen-mono.c:2782:9
    frame #13: 0x000000010306dad5 libcoreclr.dylib`mono_gchandle_get_target_handle(gchandle=0x000000010081a008) at handle.c:414:9
    frame #14: 0x00000001030acea9 libcoreclr.dylib`ves_icall_System_GCHandle_InternalGet(handle=0x000000010081a008, error=0x00007ff7bfefde78) at gc.c:676:9
    frame #15: 0x0000000102ffcada libcoreclr.dylib`ves_icall_System_GCHandle_InternalGet_raw(a0=0x000000010081a008) at icall-def.h:430:1  <------

The faulting managed code from EventPipeEventDispatcher.cs:

        private void StopDispatchTask()
        {
            Debug.Assert(Monitor.IsEntered(m_dispatchControlLock));

            if (m_dispatchTask != null)
            {
                m_stopDispatchTask = true;
                Debug.Assert(!m_dispatchTaskWaitHandle.SafeWaitHandle.IsInvalid);
                EventWaitHandle.Set(m_dispatchTaskWaitHandle.SafeWaitHandle); // <-------
                m_dispatchTask.Wait();
                m_dispatchTask = null;
            }
        }

We don't hit the Debug.Assert before the call to EventWaitHandle.Set().

@lambdageek or @lateralusX do you have any intuition about what might be going on here? I went digging into the wait handle alloc logic in Mono, but it might take me a while to pinpoint what's going on.

My proposal is that we turn this test off for Mono and file an issue to identify the root cause of this failure.

josalem · 2022-04-15T17:46:17Z

Are we content merging this with the test turned off for Mono? I don't believe the test failure is due to the code changes in this PR.

lateralusX · 2022-04-19T09:40:19Z

Are we content merging this with the test turned off for Mono? I don't believe the test failure is due to the code changes in this PR.

@josalem, so this test has never been run in the past on CoreCLR or Mono? If so, I guess the issue would just reproduce by re-enabling the test on Mono, without the other changes in this PR?

lateralusX · 2022-04-19T11:30:58Z

Hmm, so this line

runtime/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventDispatcher.cs

Line 160 in 57bfe47

EventWaitHandle.Set(m_dispatchTaskWaitHandle.SafeWaitHandle);

will go through the above callstack on Mono Unix, but what is returned from get GetWaitHandle is a wait handle and will not map to an underlying managed object, but a MonoW32Handle * created by mono_w32event_create, don't think this correctly be converted into a GCHandle, since it is not a managed object, unless these native handles are handled in any specific ways, @lambdageek, do you know if that is the case? I couldn't find that it actual register these as GC handles.

On Windows System.Threading.WaitSubsystem:SetEvent won't go through the handle manager, but directly to OS API's, meaning that handle is the OS primitive, but on Unix, it will try to convert it into a GC handle and then get back the target WaitableObject,

runtime/src/libraries/System.Private.CoreLib/src/System/Threading/WaitSubsystem.Unix.cs

Line 223 in 91da4ac

SetEvent(HandleManager.FromHandle(handle));

, and looking back this call didn't go through EventWaitHandle but did a direct call through Interop.Kernel32.SetEvent (changed here 70bfd2a) so only assumed SafeHandle and not a WaitableObject, not sure how that worked on CoreCLR Unix in the past since its a Interop.Kernel32.SetEvent call.

We already have comments in this code to handle the event in native code, so maybe it is safer to replace the GetWaitHandle with a set and wait icalls/qcalls calls instead making sure we get to the underlying implementation when doing this calls?

@josalem something you could look at as part of this PR or I can make that change and push it into this or create another PR and re-enable the test on Mono, what ever suites you best.

lateralusX · 2022-04-19T12:44:07Z

Looks like there was a rather big change in the WaitHandle, EventWaitHandle ~15 months ago, 33c7c4d, that probably affected the underlying assumptions around what GetWaitHandle could return and used in EventPipeEventDispatcher.cs. Moving away from handling the event altogether in managed code is probably the right way to go.

lateralusX · 2022-04-19T13:10:13Z

@josalem, @kouvel:

On Mono,

runtime/src/mono/System.Private.CoreLib/src/System/Threading/Overlapped.cs

Line 116 in c2f287c

private unsafe NativeOverlapped* AllocateNativeOverlapped()

is done in managed code, question is if that should raise the same event there as well, but from managed code? Right now I don't believe Mono will generate ThreadPoolIOPack event at all.

lateralusX · 2022-04-19T14:39:55Z

Guess this has not been seen on Mono in the past since the runtime tests was not run in the past and using EventListener to track inproc native runtime events has probably not been used on Mono yet, most of the EventPipe work so far is normally consumed using diagnostic server sessions.

josalem · 2022-04-19T16:42:53Z

Distilling a conversation @lateralusX and I had offline:

There is potentially a little more work on the metadata to make sure Mono fires the missing event as well.
Fixing the issue handle issue will come in a followup PR that will enable the test again on Mono, since the metadata fix is orthogonal to the handle issue.

lateralusX · 2022-04-20T09:14:36Z

Fixing the issue handle issue will come in a followup PR that will enable the test again on Mono, since the metadata fix is orthogonal to the handle issue.

I can work on that PR in parallel, but it will need to go after this one has been merged.

kouvel · 2022-04-20T17:42:12Z

...s/System.Private.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.cs

+            IntPtr Overlapped,
+            ushort ClrInstanceID = DefaultClrInstanceId)
+        {
+            if (!IsEnabled(EventLevel.Verbose, Keywords.ThreadingKeyword))


Just a nit, it may be unnecessary to check this here. When used, I figure this method would look similar to the private ThreadPoolIODequeue event method, where it is supported by a public helper method that would take the managed Overlapped object and pass in the hash code to this event method. Then it would be better to check IsEnabled in the public helper method before getting the hash code. Same comment for the other file.

I guess this check can be removed now since it's checked in the public method

...vate.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.cs

...s/System.Private.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.cs

kouvel · 2022-04-20T17:49:53Z

...vate.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.cs

+            if (!IsEnabled(EventLevel.Verbose, Keywords.ThreadingKeyword))
+                return;
+
+            EventData* data = stackalloc EventData[3]


Probably doesn't matter currently, but just curious. It looks like the other event methods in this file, such as ThreadPoolIODequeue, are forwarding to the native side to fire the event. This particular event is currently only fired from the native side in coreclr, but in the future if this method would be called in coreclr from the managed side, would the event need to be fired from the native side like for the other events?

This code needs to be moved to the runtime specific implementations of this class, e.g., NativeRuntimeEventSource.NativeSinks.{Mono|CoreCLR|NativeAOT}.cs. In the Mono case, this is only fired from managed code, so it only needs the managed implementation. For CoreCLR, it's only fired from native code, so we can use a NotImplementedException stub. I'll need to look at what NativeAOT does to see what's needed there.

I think we should follow the pattern used by other threadpool events when implementing this new event on Mono, instead of doing formatting in managed it should call a new icall, LogThreadPoolIOPack that works the same as all the other LogThreadPool*. Since it is not implemented from managed in CoreCLR I believe its fine to do the not implemented, but maybe it just as simple to just do the qcall and open up the ability to log this event from CoreCLR as well, even if its currently not used, to keep consistency with other thread pool events.

...c/System/Diagnostics/Tracing/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.Mono.cs

josalem · 2022-04-21T19:30:25Z

I don't think we need the full extent of these changes for a backport. A subset of the changes should remove the NRE that folks will see in EventLIstener. Specifically, we only need to add

        [Event(65, Level = EventLevel.Verbose, Message = Messages.IO, Task = Tasks.ThreadPool, Opcode = Opcodes.IOPack, Version = 0, Keywords = Keywords.ThreadingKeyword)]
        private unsafe void ThreadPoolIOPack(
            IntPtr NativeOverlapped,
            IntPtr Overlapped,
            ushort ClrInstanceID = DefaultClrInstanceId)
        {
            throw new NotImplementedException();
        }

to NativeRuntimeEventSource.PortableThreadPool.CoreCLR.cs. This event is not fired in .net6 from Mono and in CoreCLR it is only fired from a QCall, so we don't need an implementation. We only need the function stub for the metadata to get generated on CoreCLR.

This patch contains a more complete fix, but has grown beyond something worth taking to tactics when this can be fixed with ~10 LOC instead of the full ~200 LOC. Thoughts @noahfalk @kouvel?

kouvel · 2022-04-21T19:36:59Z

Yea just the minimum for backport makes sense

noahfalk · 2022-04-21T20:12:10Z

Yep, I'd do the minimal change that fixes the issue for the backport

josalem · 2022-04-27T15:39:07Z

@lateralusX, I'm merging, so let's rebase #68320 on top of this.

josalem added the area-System.Diagnostics.Tracing label Mar 25, 2022

josalem added this to the 7.0.0 milestone Mar 25, 2022

josalem requested review from noahfalk, davmason and kouvel March 25, 2022 17:22

josalem self-assigned this Mar 25, 2022

kouvel approved these changes Mar 25, 2022

View reviewed changes

...vate.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.cs Outdated Show resolved Hide resolved

...s/System.Private.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.cs Outdated Show resolved Hide resolved

noahfalk approved these changes Mar 25, 2022

View reviewed changes

...vate.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.cs Outdated Show resolved Hide resolved

...s/System.Private.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.cs Outdated Show resolved Hide resolved

John Salem added 2 commits March 30, 2022 10:15

Add method stubs for eventid 65

360a4c6

Turn on and update test

18e7adc

josalem force-pushed the dev/josalem/fix-missing-event-metadata branch from b80bf77 to 18e7adc Compare March 30, 2022 17:16

josalem commented Mar 30, 2022

View reviewed changes

josalem mentioned this pull request Apr 14, 2022

[Mono] test failure: SIGABRT in nativeruntimeeventsource.cs #68032

Closed

John Salem added 2 commits April 14, 2022 17:18

Disable test for Mono

6f14663

fix test exclusion

7122aec

Add implementation of event for mono and use it

2f09eea

josalem requested a review from marek-safar as a code owner April 20, 2022 03:27

Fix typo

fcf6d79

kouvel reviewed Apr 20, 2022

View reviewed changes

John Salem added 2 commits April 20, 2022 21:04

refactor to match pattern and feedback

4acfe2e

Feedback and fix Mono corelib build

b087cc3

lateralusX reviewed Apr 21, 2022

View reviewed changes

...c/System/Diagnostics/Tracing/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.Mono.cs Show resolved Hide resolved

lateralusX mentioned this pull request Apr 21, 2022

Move away from exposing internal EventPipe wait handle to EventPipeEventDispatcher. #68320

Merged

Move event firing to {Q|I}Call and ifdef for wasm

8cb8c4a

josalem requested review from lambdageek, vargaz and thaystg as code owners April 21, 2022 19:22

John Salem added 2 commits April 21, 2022 21:05

suppress CA1822

da14ec5

Fix ordering of ICall defs for Mono

c58fa15

lateralusX approved these changes Apr 22, 2022

View reviewed changes

josalem merged commit 276d642 into dotnet:main Apr 27, 2022

josalem deleted the dev/josalem/fix-missing-event-metadata branch April 27, 2022 15:39

josalem pushed a commit to josalem/runtime that referenced this pull request May 4, 2022

Minimal version of dotnet#67150

c889859

josalem mentioned this pull request May 4, 2022

[release/6.0] Fix missing metadata for ThreadPool Event #68876

Merged

carlossanlop pushed a commit that referenced this pull request May 14, 2022

Minimal version of #67150 (#68876)

21d3425

ghost locked as resolved and limited conversation to collaborators May 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Missing Metadata for ThreadPool Event #67150

Fix Missing Metadata for ThreadPool Event #67150

josalem commented Mar 25, 2022 •

edited

Loading

ghost commented Mar 25, 2022

Will require backport

kouvel left a comment

noahfalk left a comment

josalem Mar 30, 2022

josalem Mar 30, 2022

josalem commented Apr 13, 2022 •

edited

Loading

josalem commented Apr 13, 2022

josalem commented Apr 13, 2022

josalem commented Apr 15, 2022

lateralusX commented Apr 19, 2022

lateralusX commented Apr 19, 2022 •

edited

Loading

lateralusX commented Apr 19, 2022 •

edited

Loading

lateralusX commented Apr 19, 2022 •

edited

Loading

lateralusX commented Apr 19, 2022 •

edited

Loading

josalem commented Apr 19, 2022

lateralusX commented Apr 20, 2022

kouvel Apr 20, 2022

kouvel Apr 20, 2022

kouvel Apr 20, 2022 •

edited

Loading

josalem Apr 20, 2022

lateralusX Apr 21, 2022

josalem commented Apr 21, 2022

kouvel commented Apr 21, 2022

noahfalk commented Apr 21, 2022

josalem commented Apr 27, 2022

	# ThreadPool events are defined manually in NativeRuntimeEventSource.PortableThreadPool.cs
	symbol = eventNode.getAttribute("symbol")
	if "ThreadPool" in symbol:
	return

	FCIMPL1(LPOVERLAPPED, AllocateNativeOverlapped, OverlappedDataObject* overlappedUNSAFE)
	{
	FCALL_CONTRACT;

	LPOVERLAPPED lpOverlapped = NULL;
	OVERLAPPEDDATAREF overlapped = ObjectToOVERLAPPEDDATAREF(overlappedUNSAFE);
	OBJECTREF userObject = overlapped->m_userObject;

	HELPER_METHOD_FRAME_BEGIN_RET_ATTRIB_2(Frame::FRAME_ATTR_NONE, overlapped, userObject);

	if (g_pOverlappedDataClass == NULL)
	{
	g_pOverlappedDataClass = CoreLibBinder::GetClass(CLASS__OVERLAPPEDDATA);
	// We have optimization to avoid creating event if IO is in default domain. This depends on default domain
	// can not be unloaded.
	}

	CONSISTENCY_CHECK(overlapped->GetMethodTable() == g_pOverlappedDataClass);

	if (userObject != NULL)
	{
	if (userObject->GetMethodTable() == g_pPredefinedArrayTypes[ELEMENT_TYPE_OBJECT].AsMethodTable())
	{
	BASEARRAYREF asArray = (BASEARRAYREF) userObject;
	OBJECTREF pObj = (OBJECTREF)(asArray->GetDataPtr());
	SIZE_T num = asArray->GetNumComponents();
	SIZE_T i;
	for (i = 0; i < num; i ++)
	{
	ValidatePinnedObject(pObj[i]);
	}
	}
	else
	{
	ValidatePinnedObject(userObject);
	}
	}

	NewHolder<NATIVEOVERLAPPED_AND_HANDLE> overlappedHolder(new NATIVEOVERLAPPED_AND_HANDLE());
	overlappedHolder->m_handle = GetAppDomain()->CreateTypedHandle(overlapped, HNDTYPE_ASYNCPINNED);
	lpOverlapped = &(overlappedHolder.Extract()->m_overlapped);

	lpOverlapped->Internal = 0;
	lpOverlapped->InternalHigh = 0;
	lpOverlapped->Offset = overlapped->m_offsetLow;
	lpOverlapped->OffsetHigh = overlapped->m_offsetHigh;
	lpOverlapped->hEvent = (HANDLE)overlapped->m_eventHandle;

	overlapped->m_pNativeOverlapped = lpOverlapped;

	HELPER_METHOD_FRAME_END();
	LOG((LF_INTEROP, LL_INFO10000, "In AllocNativeOperlapped thread 0x%x\n", GetThread()));

	if (ETW_EVENT_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context, ThreadPoolIODequeue))
	FireEtwThreadPoolIOPack(lpOverlapped, overlappedUNSAFE, GetClrInstanceId());

	return lpOverlapped;
	}
	FCIMPLEND

Fix Missing Metadata for ThreadPool Event #67150

Fix Missing Metadata for ThreadPool Event #67150

Conversation

josalem commented Mar 25, 2022 • edited Loading

Will require backport

ghost commented Mar 25, 2022

Will require backport

kouvel left a comment

Choose a reason for hiding this comment

noahfalk left a comment

Choose a reason for hiding this comment

josalem Mar 30, 2022

Choose a reason for hiding this comment

josalem Mar 30, 2022

Choose a reason for hiding this comment

josalem commented Apr 13, 2022 • edited Loading

josalem commented Apr 13, 2022

josalem commented Apr 13, 2022

josalem commented Apr 15, 2022

lateralusX commented Apr 19, 2022

lateralusX commented Apr 19, 2022 • edited Loading

lateralusX commented Apr 19, 2022 • edited Loading

lateralusX commented Apr 19, 2022 • edited Loading

lateralusX commented Apr 19, 2022 • edited Loading

josalem commented Apr 19, 2022

lateralusX commented Apr 20, 2022

kouvel Apr 20, 2022

Choose a reason for hiding this comment

kouvel Apr 20, 2022

Choose a reason for hiding this comment

kouvel Apr 20, 2022 • edited Loading

Choose a reason for hiding this comment

josalem Apr 20, 2022

Choose a reason for hiding this comment

lateralusX Apr 21, 2022

Choose a reason for hiding this comment

josalem commented Apr 21, 2022

kouvel commented Apr 21, 2022

noahfalk commented Apr 21, 2022

josalem commented Apr 27, 2022

josalem commented Mar 25, 2022 •

edited

Loading

josalem commented Apr 13, 2022 •

edited

Loading

lateralusX commented Apr 19, 2022 •

edited

Loading

lateralusX commented Apr 19, 2022 •

edited

Loading

lateralusX commented Apr 19, 2022 •

edited

Loading

lateralusX commented Apr 19, 2022 •

edited

Loading

kouvel Apr 20, 2022 •

edited

Loading