Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Missing Metadata for ThreadPool Event #67150

Merged
merged 11 commits into from
Apr 27, 2022

Conversation

josalem
Copy link
Contributor

@josalem josalem commented Mar 25, 2022

fixes #65630

Will require backport

(copy/paste of root cause explanation from issue)
In .net6 we added a managed implementation of the thread pool. To send the same events under the same provider we had to do some duplication of event definitions so that managed code had the ability to send the same events. Every event needs to be represented in managed code for EventListener to know how to parse the events. To that end, there is a python script that generates managed function stubs with the correct event IDs and parameters. For components that transition from native to managed, these stubs would be filled in by hand. For example, the thread pool had all the events sent from managed code filled out.

After that conversion, it looks like any event with ThreadPool in its name gets left out of the code generation phase:

# ThreadPool events are defined manually in NativeRuntimeEventSource.PortableThreadPool.cs
symbol = eventNode.getAttribute("symbol")
if "ThreadPool" in symbol:
return

This specific event, however, is only sent inside an FCALL related to overlapped IO and wasn't implemented in managed code:

<event value="65" version="0" level="win:Verbose"  template="ThreadPoolIOWork"
                           keywords="ThreadingKeyword"
                           task="ThreadPool"
                           opcode="IOPack"
                           symbol="ThreadPoolIOPack"
                           message="$(string.RuntimePublisher.ThreadPoolIOPackEventMessage)"/>

sent from:

FCIMPL1(LPOVERLAPPED, AllocateNativeOverlapped, OverlappedDataObject* overlappedUNSAFE)
{
FCALL_CONTRACT;
LPOVERLAPPED lpOverlapped = NULL;
OVERLAPPEDDATAREF overlapped = ObjectToOVERLAPPEDDATAREF(overlappedUNSAFE);
OBJECTREF userObject = overlapped->m_userObject;
HELPER_METHOD_FRAME_BEGIN_RET_ATTRIB_2(Frame::FRAME_ATTR_NONE, overlapped, userObject);
if (g_pOverlappedDataClass == NULL)
{
g_pOverlappedDataClass = CoreLibBinder::GetClass(CLASS__OVERLAPPEDDATA);
// We have optimization to avoid creating event if IO is in default domain. This depends on default domain
// can not be unloaded.
}
CONSISTENCY_CHECK(overlapped->GetMethodTable() == g_pOverlappedDataClass);
if (userObject != NULL)
{
if (userObject->GetMethodTable() == g_pPredefinedArrayTypes[ELEMENT_TYPE_OBJECT].AsMethodTable())
{
BASEARRAYREF asArray = (BASEARRAYREF) userObject;
OBJECTREF *pObj = (OBJECTREF*)(asArray->GetDataPtr());
SIZE_T num = asArray->GetNumComponents();
SIZE_T i;
for (i = 0; i < num; i ++)
{
ValidatePinnedObject(pObj[i]);
}
}
else
{
ValidatePinnedObject(userObject);
}
}
NewHolder<NATIVEOVERLAPPED_AND_HANDLE> overlappedHolder(new NATIVEOVERLAPPED_AND_HANDLE());
overlappedHolder->m_handle = GetAppDomain()->CreateTypedHandle(overlapped, HNDTYPE_ASYNCPINNED);
lpOverlapped = &(overlappedHolder.Extract()->m_overlapped);
lpOverlapped->Internal = 0;
lpOverlapped->InternalHigh = 0;
lpOverlapped->Offset = overlapped->m_offsetLow;
lpOverlapped->OffsetHigh = overlapped->m_offsetHigh;
lpOverlapped->hEvent = (HANDLE)overlapped->m_eventHandle;
overlapped->m_pNativeOverlapped = lpOverlapped;
HELPER_METHOD_FRAME_END();
LOG((LF_INTEROP, LL_INFO10000, "In AllocNativeOperlapped thread 0x%x\n", GetThread()));
if (ETW_EVENT_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context, ThreadPoolIODequeue))
FireEtwThreadPoolIOPack(lpOverlapped, overlappedUNSAFE, GetClrInstanceId());
return lpOverlapped;
}
FCIMPLEND

As a result, managed code doesn't have any metadata for this event. When the event is dispatched to managed code, EventSource attempts to read the struct in m_eventData which was zero initialized, so it NREs.

The reason this was only repro-ing for a web app is that ASP.NET Core on Windows uses a lot of overlapped IO so it sends these events while a basic console app does not.

This issue only presents for an in-process EventListener. Out-of-process listeners will not be causing NREs in the target runtime.

CC @dotnet/dotnet-diag @tommcdon

fixes #12141

@josalem josalem added this to the 7.0.0 milestone Mar 25, 2022
@josalem josalem requested review from noahfalk, davmason and kouvel March 25, 2022 17:22
@josalem josalem self-assigned this Mar 25, 2022
@ghost
Copy link

ghost commented Mar 25, 2022

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

Issue Details

fixes #65630

Will require backport

(copy/paste of root cause explanation from issue)
In .net6 we added a managed implementation of the thread pool. To send the same events under the same provider we had to do some duplication of event definitions so that managed code had the ability to send the same events. Every event needs to be represented in managed code for EventListener to know how to parse the events. To that end, there is a python script that generates managed function stubs with the correct event IDs and parameters. For components that transition from native to managed, these stubs would be filled in by hand. For example, the thread pool had all the events sent from managed code filled out.

After that conversion, it looks like any event with ThreadPool in its name gets left out of the code generation phase:

# ThreadPool events are defined manually in NativeRuntimeEventSource.PortableThreadPool.cs
symbol = eventNode.getAttribute("symbol")
if "ThreadPool" in symbol:
return

This specific event, however, is only sent inside an FCALL related to overlapped IO and wasn't implemented in managed code:

<event value="65" version="0" level="win:Verbose"  template="ThreadPoolIOWork"
                           keywords="ThreadingKeyword"
                           task="ThreadPool"
                           opcode="IOPack"
                           symbol="ThreadPoolIOPack"
                           message="$(string.RuntimePublisher.ThreadPoolIOPackEventMessage)"/>

sent from:

FCIMPL1(LPOVERLAPPED, AllocateNativeOverlapped, OverlappedDataObject* overlappedUNSAFE)
{
FCALL_CONTRACT;
LPOVERLAPPED lpOverlapped = NULL;
OVERLAPPEDDATAREF overlapped = ObjectToOVERLAPPEDDATAREF(overlappedUNSAFE);
OBJECTREF userObject = overlapped->m_userObject;
HELPER_METHOD_FRAME_BEGIN_RET_ATTRIB_2(Frame::FRAME_ATTR_NONE, overlapped, userObject);
if (g_pOverlappedDataClass == NULL)
{
g_pOverlappedDataClass = CoreLibBinder::GetClass(CLASS__OVERLAPPEDDATA);
// We have optimization to avoid creating event if IO is in default domain. This depends on default domain
// can not be unloaded.
}
CONSISTENCY_CHECK(overlapped->GetMethodTable() == g_pOverlappedDataClass);
if (userObject != NULL)
{
if (userObject->GetMethodTable() == g_pPredefinedArrayTypes[ELEMENT_TYPE_OBJECT].AsMethodTable())
{
BASEARRAYREF asArray = (BASEARRAYREF) userObject;
OBJECTREF *pObj = (OBJECTREF*)(asArray->GetDataPtr());
SIZE_T num = asArray->GetNumComponents();
SIZE_T i;
for (i = 0; i < num; i ++)
{
ValidatePinnedObject(pObj[i]);
}
}
else
{
ValidatePinnedObject(userObject);
}
}
NewHolder<NATIVEOVERLAPPED_AND_HANDLE> overlappedHolder(new NATIVEOVERLAPPED_AND_HANDLE());
overlappedHolder->m_handle = GetAppDomain()->CreateTypedHandle(overlapped, HNDTYPE_ASYNCPINNED);
lpOverlapped = &(overlappedHolder.Extract()->m_overlapped);
lpOverlapped->Internal = 0;
lpOverlapped->InternalHigh = 0;
lpOverlapped->Offset = overlapped->m_offsetLow;
lpOverlapped->OffsetHigh = overlapped->m_offsetHigh;
lpOverlapped->hEvent = (HANDLE)overlapped->m_eventHandle;
overlapped->m_pNativeOverlapped = lpOverlapped;
HELPER_METHOD_FRAME_END();
LOG((LF_INTEROP, LL_INFO10000, "In AllocNativeOperlapped thread 0x%x\n", GetThread()));
if (ETW_EVENT_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context, ThreadPoolIODequeue))
FireEtwThreadPoolIOPack(lpOverlapped, overlappedUNSAFE, GetClrInstanceId());
return lpOverlapped;
}
FCIMPLEND

As a result, managed code doesn't have any metadata for this event. When the event is dispatched to managed code, EventSource attempts to read the struct in m_eventData which was zero initialized, so it NREs.

The reason this was only repro-ing for a web app is that ASP.NET Core on Windows uses a lot of overlapped IO so it sends these events while a basic console app does not.

I have a patch that fixes this and will hopefully have a PR out very soon. We will need to backport this to .net6.

This issue only presents for an in-process EventListener. Out-of-process listeners will not be causing NREs in the target runtime.

CC @dotnet/dotnet-diag @tommcdon

Author: josalem
Assignees: josalem
Labels:

area-System.Diagnostics.Tracing

Milestone: 7.0.0

Copy link
Member

@kouvel kouvel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of suggestions, otherwise LGTM, thanks!

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo comments inline

@josalem josalem force-pushed the dev/josalem/fix-missing-event-metadata branch from b80bf77 to 18e7adc Compare March 30, 2022 17:16
Comment on lines -8 to -13
<!--
Test is blocking official build.
Related failures: #18907, #19340, #22441, #22729.
Issue tracking to re-enable: #22898.
-->
<DisableProjectBuild>true</DisableProjectBuild>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these issue numbers seem related to this code. I'm going to poke at the old coreclr repo issues to see if these refer to those numbers (the transfer did some funny business with mapping the numbers).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#12141

This test has been off for >3 years now. I want this test back on, but perhaps not as part of a changeset that is going to be backported. Thoughts?

@josalem
Copy link
Contributor Author

josalem commented Apr 13, 2022

I can't seem to repro the Mono failure locally edit: 🤦 I was running the wrong test. I can repro the Mono failure locally. It looks like there's a SIGABRT when trying to get a wait handle from the GC:

=================================================================
Managed Stacktrace:
=================================================================
  at <unknown> <0xffffffff>
  at System.Runtime.InteropServices.GCHandle:InternalGet <0x00079>
  at System.Runtime.InteropServices.GCHandle:get_Target <0x00028>
  at HandleManager:FromHandle <0x00077>
  at System.Threading.WaitSubsystem:SetEvent <0x0001f>
  at System.Threading.EventWaitHandle:Set <0x0005a>
  at System.Diagnostics.Tracing.EventPipeEventDispatcher:StopDispatchTask <0x00042>
  at System.Diagnostics.Tracing.EventPipeEventDispatcher:CommitDispatchConfiguration <0x0008b>
  at System.Diagnostics.Tracing.EventPipeEventDispatcher:SendCommand <0x000dc>
  at System.Diagnostics.Tracing.EventListener:EnableEvents <0x000c2>
  at System.Diagnostics.Tracing.EventListener:EnableEvents <0x00023>
  at Tracing.Tests.SimpleEventListener:OnEventSourceCreated <0x00061>
  at System.Diagnostics.Tracing.EventSource:AddListener <0x00194>
  at <>c:<.ctor>b__10_0 <0x00044>
  at System.Diagnostics.Tracing.EventListener:CallBackForExistingEventSources <0x00268>
  at System.Diagnostics.Tracing.EventListener:.ctor <0x000b3>
  at Tracing.Tests.SimpleEventListener:.ctor <0x0001e>
  at Tracing.Tests.NativeRuntimeEventSourceTest:Main <0x000ab>
  at <Module>:runtime_invoke_int_object <0x00069>
=================================================================
Details

=================================================================
External Debugger Dump:
=================================================================
[New LWP 20477]
[New LWP 20478]
[New LWP 20479]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
warning: the debug information found in "/datadisks/disk1/work/B9570A9B/p/libcoreclr.so.dbg" does not match "/datadisks/disk1/work/B9570A9B/p/libcoreclr.so" (CRC mismatch).

0x00007f27c9e3a32a in __waitpid (pid=20595, stat_loc=0x7fff6d150ec0, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
30../sysdeps/unix/sysv/linux/waitpid.c: No such file or directory.
  Id   Target Id         Frame 
* 1    Thread 0x7f27ca461740 (LWP 20476) "corerun" 0x00007f27c9e3a32a in __waitpid (pid=20595, stat_loc=0x7fff6d150ec0, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
  2    Thread 0x7f27c63ff700 (LWP 20477) "SGen worker" 0x00007f27c9e35ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f27c90c16b8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  3    Thread 0x7f27c4871700 (LWP 20478) "corerun" 0x00007f27c920cbb9 in __GI___poll (fds=0x7f27c0002d50, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  4    Thread 0x7f27c4670700 (LWP 20479) "Finalizer" 0x00007f27c9e387c6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f27c90b2560) at ../sysdeps/unix/sysv/linux/futex-internal.h:205

Thread 4 (Thread 0x7f27c4670700 (LWP 20479)):
#0  0x00007f27c9e387c6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f27c90b2560) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x7f27c90b2560, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007f27c9e388b8 in __new_sem_wait_slow (sem=0x7f27c90b2560, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007f27c78e64b6 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#4  0x00007f27c78bf6ea in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#5  0x00007f27c78bf579 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#6  0x00007f27c9e2f6db in start_thread (arg=0x7f27c4670700) at pthread_create.c:463
#7  0x00007f27c921961f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f27c4871700 (LWP 20478)):
#0  0x00007f27c920cbb9 in __GI___poll (fds=0x7f27c0002d50, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f27c7b4811a in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#2  0x00007f27c7b45605 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#3  0x00007f27c7b43e69 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#4  0x00007f27c7b47531 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#5  0x00007f27c9e2f6db in start_thread (arg=0x7f27c4871700) at pthread_create.c:463
#6  0x00007f27c921961f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f27c63ff700 (LWP 20477)):
#0  0x00007f27c9e35ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f27c90c16b8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7f27c90c1668, cond=0x7f27c90c1690) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7f27c90c1690, mutex=0x7f27c90c1668) at pthread_cond_wait.c:655
#3  0x00007f27c7969143 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#4  0x00007f27c9e2f6db in start_thread (arg=0x7f27c63ff700) at pthread_create.c:463
#5  0x00007f27c921961f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f27ca461740 (LWP 20476)):
#0  0x00007f27c9e3a32a in __waitpid (pid=20595, stat_loc=0x7fff6d150ec0, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0x00007f27c7a818b7 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#2  0x00007f27c7a2372e in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#3  0x00007f27c7a81048 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#4  <signal handler called>
#5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#6  0x00007f27c91387f1 in __GI_abort () at abort.c:79
#7  0x00007f27c7afb045 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#8  0x00007f27c79115a3 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#9  0x00007f27c7afb405 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#10 0x00007f27c7afb53e in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#11 0x00007f27c7938c86 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#12 0x00007f27c78c2fda in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#13 0x00007f27c78756d1 in ?? () from /datadisks/disk1/work/B9570A9B/p/libcoreclr.so
#14 0x00000000418eacea in ?? ()
#15 0x000056390eb8c0c8 in ?? ()
#16 0x00007f27c53ff000 in ?? ()
#17 0x0000000000000000 in ?? ()
/datadisks/disk1/work/B9570A9B/w/AA38092D/e/tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.sh: line 395: 20476 Aborted                 (core dumped) $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"

Return code:      1
Raw output file:      /datadisks/disk1/work/B9570A9B/w/AA38092D/uploads/Reports/tracing.runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.output.txt
Raw output:
BEGIN EXECUTION
/datadisks/disk1/work/B9570A9B/p/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false nativeruntimeeventsource.dll ''
Why are we accessing an entry that is not allocated

=================================================================
Native Crash Reporting
=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

=================================================================
Native stacktrace:
=================================================================
0x7f27c7a81772 - Unknown
0x7f27c7a2372e - Unknown
0x7f27c7a81048 - Unknown
0x7f27c9e3a980 - Unknown
0x7f27c9136e87 - Unknown
0x7f27c91387f1 - Unknown
0x7f27c7afb045 - Unknown
0x7f27c79115a3 - Unknown
0x7f27c7afb405 - Unknown
0x7f27c7afb53e - Unknown
0x7f27c7938c86 - Unknown
0x7f27c78c2fda - Unknown
0x7f27c78756d1 - Unknown
0x418eacea - Unknown

=================================================================
Basic Fault Address Reporting
=================================================================
Memory around native instruction pointer (0x7f27c9136e87):0x7f27c9136e77  d2 4c 89 ce bf 02 00 00 00 b8 0e 00 00 00 0f 05  .L..............
0x7f27c9136e87  48 8b 8c 24 08 01 00 00 64 48 33 0c 25 28 00 00  H..$....dH3.%(..
0x7f27c9136e97  00 44 89 c0 75 1f 48 81 c4 18 01 00 00 c3 0f 1f  .D..u.H.........
0x7f27c9136ea7  00 48 8b 15 b9 bf 3a 00 f7 d8 41 b8 ff ff ff ff  .H....:...A.....

=================================================================
Managed Stacktrace:
=================================================================
  at <unknown> <0xffffffff>
  at System.Runtime.InteropServices.GCHandle:InternalGet <0x00079>
  at System.Runtime.InteropServices.GCHandle:get_Target <0x00028>
  at HandleManager:FromHandle <0x00077>
  at System.Threading.WaitSubsystem:SetEvent <0x0001f>
  at System.Threading.EventWaitHandle:Set <0x0005a>
  at System.Diagnostics.Tracing.EventPipeEventDispatcher:StopDispatchTask <0x00042>
  at System.Diagnostics.Tracing.EventPipeEventDispatcher:CommitDispatchConfiguration <0x0008b>
  at System.Diagnostics.Tracing.EventPipeEventDispatcher:SendCommand <0x000dc>
  at System.Diagnostics.Tracing.EventListener:EnableEvents <0x000c2>
  at System.Diagnostics.Tracing.EventListener:EnableEvents <0x00023>
  at Tracing.Tests.SimpleEventListener:OnEventSourceCreated <0x00061>
  at System.Diagnostics.Tracing.EventSource:AddListener <0x00194>
  at <>c:<.ctor>b__10_0 <0x00044>
  at System.Diagnostics.Tracing.EventListener:CallBackForExistingEventSources <0x00268>
  at System.Diagnostics.Tracing.EventListener:.ctor <0x000b3>
  at Tracing.Tests.SimpleEventListener:.ctor <0x0001e>
  at Tracing.Tests.NativeRuntimeEventSourceTest:Main <0x000ab>
  at <Module>:runtime_invoke_int_object <0x00069>
=================================================================
Expected: 100
Actual: 134
END EXECUTION - FAILED
Test Harness Exitcode is : 1
To run the test:
> set CORE_ROOT=/datadisks/disk1/work/B9570A9B/p
> /datadisks/disk1/work/B9570A9B/w/AA38092D/e/tracing/runtimeeventsource/nativeruntimeeventsource/nativeruntimeeventsource.sh

@josalem
Copy link
Contributor Author

josalem commented Apr 13, 2022

Having some difficulties running the test under a debugger, but I was able to glean that we are hitting this assert in sgen-array-list.h:

static inline volatile gpointer *
sgen_array_list_get_slot (SgenArrayList *array, guint32 index)
{
	guint32 bucket, offset;

	SGEN_ASSERT (0, index < array->capacity, "Why are we accessing an entry that is not allocated"); // <----

	sgen_array_list_bucketize (index, &bucket, &offset);
	return &(array->entries [bucket] [offset]);
}

Which we seem to be getting to from this chain:

sgen_gchandle_get_target
mono_gchandle_get_target_handle
ves_icall_System_GCHandle_InternalGet_raw
...

Looking at sgen_gchandle_get_target:

GCObject*
sgen_gchandle_get_target (guint32 gchandle)
{
	guint index = MONO_GC_HANDLE_SLOT (gchandle);
	GCHandleType type = MONO_GC_HANDLE_TYPE (gchandle);
	HandleData *handles = gc_handles_for_type (type);
	/* Invalid handles are possible; accessing one should produce NULL. (#34276) */
	if (!handles)
		return NULL;
	return link_get (sgen_array_list_get_slot (&handles->entries_array, index), MONO_GC_HANDLE_TYPE_IS_WEAK (type));
}

It looks like we are getting to the return statement and the value of index is greater than the size of &handles->entries_array.

Hopefully this becomes painfully obvious once I can get it running under LLDB.

@josalem
Copy link
Contributor Author

josalem commented Apr 13, 2022

Got it under a debugger finally. Turns out you can't use LLDB in Codespaces -_-.

Anyway, I managed to peak at the frames leading up to the assert on Mono. Looking at the data shows that we're getting an invalid index into the HandleData->entries_array.entries array when the gchandle is masked:

(lldb) f 11
frame #11: 0x000000010312c0f0 libcoreclr.dylib`sgen_gchandle_get_target(gchandle=2151731204) at sgen-gchandles.c:305:19
   302 		/* Invalid handles are possible; accessing one should produce NULL. (#34276) */
   303 		if (!handles)
   304 			return NULL;
-> 305 		return link_get (sgen_array_list_get_slot (&handles->entries_array, index), MONO_GC_HANDLE_TYPE_IS_WEAK (type));
   306 	}
   307
   308 	void
(lldb) fr v
(guint32) gchandle = 2151731204
(guint) index = 268966400 // <---------
(GCHandleType) type = HANDLE_PINNED
(HandleData *) handles = 0x0000000103532c48
(lldb) p handles
(HandleData *) $1 = 0x0000000103532c48
(lldb) p *handles
(HandleData) $2 = {
  entries_array = {
    entries = {
      [0] = 0x0000600003e04300
      [1] = 0x0000000000000000
      [2] = 0x0000000000000000
      [3] = 0x0000000000000000
      ...
    }
    capacity = 32
    slot_hint = 0
    next_slot = 1
    bucket_alloc_callback = 0x000000010312d2a0 (libcoreclr.dylib`bucket_alloc_callback at sgen-gchandles.c:106)
    is_slot_set_func = 0x000000010312d190 (libcoreclr.dylib`is_slot_set at sgen-gchandles.c:88)
    set_slot_func = 0x000000010312d1d0 (libcoreclr.dylib`try_occupy_slot at sgen-gchandles.c:98)
    mem_type = -1
  }
  type = '\x03'
}

The code is supposed to be looking up a SafeWaitHandle for the wait handle on the buffer manager of the session. Poking at the session shows that the underlying handle matches (0x000000010081a008):

(lldb) p *_ep_sessions[0]->buffer_manager
(EventPipeBufferManager) $10 = {
  thread_session_state_list = {
    list = 0x0000000000000000
  }
  sequence_points = {
    list = 0x0000000000000000
  }
  rt_wait_event = (event = 0x000000010081a008) // <---------------
  rt_lock = {
    lock = 0x0000600001718d80
    owning_thread_id = 0x0000000000000000
  }
  ...
}
    frame #10: 0x000000010312c256 libcoreclr.dylib`sgen_array_list_get_slot(array=0x0000000103532c48, index=268966400) at sgen-array-list.h:112:2
  * frame #11: 0x000000010312c0f0 libcoreclr.dylib`sgen_gchandle_get_target(gchandle=2151731204) at sgen-gchandles.c:305:19
    frame #12: 0x00000001030c723d libcoreclr.dylib`mono_gchandle_get_target_internal(gchandle=0x000000010081a008) at sgen-mono.c:2782:9
    frame #13: 0x000000010306dad5 libcoreclr.dylib`mono_gchandle_get_target_handle(gchandle=0x000000010081a008) at handle.c:414:9
    frame #14: 0x00000001030acea9 libcoreclr.dylib`ves_icall_System_GCHandle_InternalGet(handle=0x000000010081a008, error=0x00007ff7bfefde78) at gc.c:676:9
    frame #15: 0x0000000102ffcada libcoreclr.dylib`ves_icall_System_GCHandle_InternalGet_raw(a0=0x000000010081a008) at icall-def.h:430:1  <------

The faulting managed code from EventPipeEventDispatcher.cs:

        private void StopDispatchTask()
        {
            Debug.Assert(Monitor.IsEntered(m_dispatchControlLock));

            if (m_dispatchTask != null)
            {
                m_stopDispatchTask = true;
                Debug.Assert(!m_dispatchTaskWaitHandle.SafeWaitHandle.IsInvalid);
                EventWaitHandle.Set(m_dispatchTaskWaitHandle.SafeWaitHandle); // <-------
                m_dispatchTask.Wait();
                m_dispatchTask = null;
            }
        }

We don't hit the Debug.Assert before the call to EventWaitHandle.Set().

@lambdageek or @lateralusX do you have any intuition about what might be going on here? I went digging into the wait handle alloc logic in Mono, but it might take me a while to pinpoint what's going on.

My proposal is that we turn this test off for Mono and file an issue to identify the root cause of this failure.

@josalem
Copy link
Contributor Author

josalem commented Apr 15, 2022

Are we content merging this with the test turned off for Mono? I don't believe the test failure is due to the code changes in this PR.

@lateralusX
Copy link
Member

Are we content merging this with the test turned off for Mono? I don't believe the test failure is due to the code changes in this PR.

@josalem, so this test has never been run in the past on CoreCLR or Mono? If so, I guess the issue would just reproduce by re-enabling the test on Mono, without the other changes in this PR?

@lateralusX
Copy link
Member

lateralusX commented Apr 19, 2022

Hmm, so this line

EventWaitHandle.Set(m_dispatchTaskWaitHandle.SafeWaitHandle);

will go through the above callstack on Mono Unix, but what is returned from get GetWaitHandle is a wait handle and will not map to an underlying managed object, but a MonoW32Handle * created by mono_w32event_create, don't think this correctly be converted into a GCHandle, since it is not a managed object, unless these native handles are handled in any specific ways, @lambdageek, do you know if that is the case? I couldn't find that it actual register these as GC handles.

On Windows System.Threading.WaitSubsystem:SetEvent won't go through the handle manager, but directly to OS API's, meaning that handle is the OS primitive, but on Unix, it will try to convert it into a GC handle and then get back the target WaitableObject,

, and looking back this call didn't go through EventWaitHandle but did a direct call through Interop.Kernel32.SetEvent (changed here 70bfd2a) so only assumed SafeHandle and not a WaitableObject, not sure how that worked on CoreCLR Unix in the past since its a Interop.Kernel32.SetEvent call.

We already have comments in this code to handle the event in native code, so maybe it is safer to replace the GetWaitHandle with a set and wait icalls/qcalls calls instead making sure we get to the underlying implementation when doing this calls?

@josalem something you could look at as part of this PR or I can make that change and push it into this or create another PR and re-enable the test on Mono, what ever suites you best.

@lateralusX
Copy link
Member

lateralusX commented Apr 19, 2022

Looks like there was a rather big change in the WaitHandle, EventWaitHandle ~15 months ago, 33c7c4d, that probably affected the underlying assumptions around what GetWaitHandle could return and used in EventPipeEventDispatcher.cs. Moving away from handling the event altogether in managed code is probably the right way to go.

@lateralusX
Copy link
Member

lateralusX commented Apr 19, 2022

@josalem, @kouvel:

On Mono,

private unsafe NativeOverlapped* AllocateNativeOverlapped()
is done in managed code, question is if that should raise the same event there as well, but from managed code? Right now I don't believe Mono will generate ThreadPoolIOPack event at all.

@lateralusX
Copy link
Member

lateralusX commented Apr 19, 2022

Guess this has not been seen on Mono in the past since the runtime tests was not run in the past and using EventListener to track inproc native runtime events has probably not been used on Mono yet, most of the EventPipe work so far is normally consumed using diagnostic server sessions.

@josalem
Copy link
Contributor Author

josalem commented Apr 19, 2022

Distilling a conversation @lateralusX and I had offline:

  • There is potentially a little more work on the metadata to make sure Mono fires the missing event as well.
  • Fixing the issue handle issue will come in a followup PR that will enable the test again on Mono, since the metadata fix is orthogonal to the handle issue.

@josalem josalem requested a review from marek-safar as a code owner April 20, 2022 03:27
@lateralusX
Copy link
Member

  • Fixing the issue handle issue will come in a followup PR that will enable the test again on Mono, since the metadata fix is orthogonal to the handle issue.

I can work on that PR in parallel, but it will need to go after this one has been merged.

IntPtr Overlapped,
ushort ClrInstanceID = DefaultClrInstanceId)
{
if (!IsEnabled(EventLevel.Verbose, Keywords.ThreadingKeyword))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit, it may be unnecessary to check this here. When used, I figure this method would look similar to the private ThreadPoolIODequeue event method, where it is supported by a public helper method that would take the managed Overlapped object and pass in the hash code to this event method. Then it would be better to check IsEnabled in the public helper method before getting the hash code. Same comment for the other file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this check can be removed now since it's checked in the public method

if (!IsEnabled(EventLevel.Verbose, Keywords.ThreadingKeyword))
return;

EventData* data = stackalloc EventData[3]
Copy link
Member

@kouvel kouvel Apr 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably doesn't matter currently, but just curious. It looks like the other event methods in this file, such as ThreadPoolIODequeue, are forwarding to the native side to fire the event. This particular event is currently only fired from the native side in coreclr, but in the future if this method would be called in coreclr from the managed side, would the event need to be fired from the native side like for the other events?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code needs to be moved to the runtime specific implementations of this class, e.g., NativeRuntimeEventSource.NativeSinks.{Mono|CoreCLR|NativeAOT}.cs. In the Mono case, this is only fired from managed code, so it only needs the managed implementation. For CoreCLR, it's only fired from native code, so we can use a NotImplementedException stub. I'll need to look at what NativeAOT does to see what's needed there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should follow the pattern used by other threadpool events when implementing this new event on Mono, instead of doing formatting in managed it should call a new icall, LogThreadPoolIOPack that works the same as all the other LogThreadPool*. Since it is not implemented from managed in CoreCLR I believe its fine to do the not implemented, but maybe it just as simple to just do the qcall and open up the ability to log this event from CoreCLR as well, even if its currently not used, to keep consistency with other thread pool events.

@josalem
Copy link
Contributor Author

josalem commented Apr 21, 2022

I don't think we need the full extent of these changes for a backport. A subset of the changes should remove the NRE that folks will see in EventLIstener. Specifically, we only need to add

        [Event(65, Level = EventLevel.Verbose, Message = Messages.IO, Task = Tasks.ThreadPool, Opcode = Opcodes.IOPack, Version = 0, Keywords = Keywords.ThreadingKeyword)]
        private unsafe void ThreadPoolIOPack(
            IntPtr NativeOverlapped,
            IntPtr Overlapped,
            ushort ClrInstanceID = DefaultClrInstanceId)
        {
            throw new NotImplementedException();
        }

to NativeRuntimeEventSource.PortableThreadPool.CoreCLR.cs. This event is not fired in .net6 from Mono and in CoreCLR it is only fired from a QCall, so we don't need an implementation. We only need the function stub for the metadata to get generated on CoreCLR.

This patch contains a more complete fix, but has grown beyond something worth taking to tactics when this can be fixed with ~10 LOC instead of the full ~200 LOC. Thoughts @noahfalk @kouvel?

@kouvel
Copy link
Member

kouvel commented Apr 21, 2022

Yea just the minimum for backport makes sense

@noahfalk
Copy link
Member

Yep, I'd do the minimal change that fixes the issue for the backport

@josalem
Copy link
Contributor Author

josalem commented Apr 27, 2022

@lateralusX, I'm merging, so let's rebase #68320 on top of this.

@josalem josalem merged commit 276d642 into dotnet:main Apr 27, 2022
@josalem josalem deleted the dev/josalem/fix-missing-event-metadata branch April 27, 2022 15:39
josalem pushed a commit to josalem/runtime that referenced this pull request May 4, 2022
carlossanlop pushed a commit that referenced this pull request May 14, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
4 participants