-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Missing Metadata for ThreadPool Event #67150
Fix Missing Metadata for ThreadPool Event #67150
Conversation
Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti Issue Detailsfixes #65630 Will require backport(copy/paste of root cause explanation from issue) After that conversion, it looks like any event with runtime/src/coreclr/scripts/genRuntimeEventSources.py Lines 88 to 91 in 0d59046
This specific event, however, is only sent inside an FCALL related to overlapped IO and wasn't implemented in managed code:
sent from: runtime/src/coreclr/vm/nativeoverlapped.cpp Lines 91 to 149 in 3fe2959
As a result, managed code doesn't have any metadata for this event. When the event is dispatched to managed code, The reason this was only repro-ing for a web app is that ASP.NET Core on Windows uses a lot of overlapped IO so it sends these events while a basic console app does not. I have a patch that fixes this and will hopefully have a PR out very soon. We will need to backport this to .net6. This issue only presents for an in-process CC @dotnet/dotnet-diag @tommcdon
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of suggestions, otherwise LGTM, thanks!
...vate.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.cs
Outdated
Show resolved
Hide resolved
...s/System.Private.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo comments inline
...vate.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.cs
Outdated
Show resolved
Hide resolved
...s/System.Private.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.cs
Outdated
Show resolved
Hide resolved
b80bf77
to
18e7adc
Compare
<!-- | ||
Test is blocking official build. | ||
Related failures: #18907, #19340, #22441, #22729. | ||
Issue tracking to re-enable: #22898. | ||
--> | ||
<DisableProjectBuild>true</DisableProjectBuild> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of these issue numbers seem related to this code. I'm going to poke at the old coreclr repo issues to see if these refer to those numbers (the transfer did some funny business with mapping the numbers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test has been off for >3 years now. I want this test back on, but perhaps not as part of a changeset that is going to be backported. Thoughts?
Details
|
Having some difficulties running the test under a debugger, but I was able to glean that we are hitting this assert in static inline volatile gpointer *
sgen_array_list_get_slot (SgenArrayList *array, guint32 index)
{
guint32 bucket, offset;
SGEN_ASSERT (0, index < array->capacity, "Why are we accessing an entry that is not allocated"); // <----
sgen_array_list_bucketize (index, &bucket, &offset);
return &(array->entries [bucket] [offset]);
} Which we seem to be getting to from this chain:
Looking at GCObject*
sgen_gchandle_get_target (guint32 gchandle)
{
guint index = MONO_GC_HANDLE_SLOT (gchandle);
GCHandleType type = MONO_GC_HANDLE_TYPE (gchandle);
HandleData *handles = gc_handles_for_type (type);
/* Invalid handles are possible; accessing one should produce NULL. (#34276) */
if (!handles)
return NULL;
return link_get (sgen_array_list_get_slot (&handles->entries_array, index), MONO_GC_HANDLE_TYPE_IS_WEAK (type));
} It looks like we are getting to the return statement and the value of Hopefully this becomes painfully obvious once I can get it running under LLDB. |
Got it under a debugger finally. Turns out you can't use LLDB in Codespaces -_-. Anyway, I managed to peak at the frames leading up to the assert on Mono. Looking at the data shows that we're getting an invalid index into the
The code is supposed to be looking up a
The faulting managed code from private void StopDispatchTask()
{
Debug.Assert(Monitor.IsEntered(m_dispatchControlLock));
if (m_dispatchTask != null)
{
m_stopDispatchTask = true;
Debug.Assert(!m_dispatchTaskWaitHandle.SafeWaitHandle.IsInvalid);
EventWaitHandle.Set(m_dispatchTaskWaitHandle.SafeWaitHandle); // <-------
m_dispatchTask.Wait();
m_dispatchTask = null;
}
} We don't hit the @lambdageek or @lateralusX do you have any intuition about what might be going on here? I went digging into the wait handle alloc logic in Mono, but it might take me a while to pinpoint what's going on. My proposal is that we turn this test off for Mono and file an issue to identify the root cause of this failure. |
Are we content merging this with the test turned off for Mono? I don't believe the test failure is due to the code changes in this PR. |
@josalem, so this test has never been run in the past on CoreCLR or Mono? If so, I guess the issue would just reproduce by re-enabling the test on Mono, without the other changes in this PR? |
Hmm, so this line Line 160 in 57bfe47
will go through the above callstack on Mono Unix, but what is returned from get GetWaitHandle is a wait handle and will not map to an underlying managed object, but a MonoW32Handle * created by On Windows System.Threading.WaitSubsystem:SetEvent won't go through the handle manager, but directly to OS API's, meaning that handle is the OS primitive, but on Unix, it will try to convert it into a GC handle and then get back the target WaitableObject, runtime/src/libraries/System.Private.CoreLib/src/System/Threading/WaitSubsystem.Unix.cs Line 223 in 91da4ac
We already have comments in this code to handle the event in native code, so maybe it is safer to replace the GetWaitHandle with a set and wait icalls/qcalls calls instead making sure we get to the underlying implementation when doing this calls? @josalem something you could look at as part of this PR or I can make that change and push it into this or create another PR and re-enable the test on Mono, what ever suites you best. |
Looks like there was a rather big change in the WaitHandle, EventWaitHandle ~15 months ago, 33c7c4d, that probably affected the underlying assumptions around what GetWaitHandle could return and used in EventPipeEventDispatcher.cs. Moving away from handling the event altogether in managed code is probably the right way to go. |
On Mono,
|
Guess this has not been seen on Mono in the past since the runtime tests was not run in the past and using EventListener to track inproc native runtime events has probably not been used on Mono yet, most of the EventPipe work so far is normally consumed using diagnostic server sessions. |
Distilling a conversation @lateralusX and I had offline:
|
I can work on that PR in parallel, but it will need to go after this one has been merged. |
IntPtr Overlapped, | ||
ushort ClrInstanceID = DefaultClrInstanceId) | ||
{ | ||
if (!IsEnabled(EventLevel.Verbose, Keywords.ThreadingKeyword)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit, it may be unnecessary to check this here. When used, I figure this method would look similar to the private ThreadPoolIODequeue
event method, where it is supported by a public helper method that would take the managed Overlapped
object and pass in the hash code to this event method. Then it would be better to check IsEnabled
in the public helper method before getting the hash code. Same comment for the other file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this check can be removed now since it's checked in the public method
...vate.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.cs
Outdated
Show resolved
Hide resolved
...vate.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.cs
Outdated
Show resolved
Hide resolved
...s/System.Private.CoreLib/src/System/Threading/NativeRuntimeEventSource.PortableThreadPool.cs
Outdated
Show resolved
Hide resolved
if (!IsEnabled(EventLevel.Verbose, Keywords.ThreadingKeyword)) | ||
return; | ||
|
||
EventData* data = stackalloc EventData[3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably doesn't matter currently, but just curious. It looks like the other event methods in this file, such as ThreadPoolIODequeue
, are forwarding to the native side to fire the event. This particular event is currently only fired from the native side in coreclr, but in the future if this method would be called in coreclr from the managed side, would the event need to be fired from the native side like for the other events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code needs to be moved to the runtime specific implementations of this class, e.g., NativeRuntimeEventSource.NativeSinks.{Mono|CoreCLR|NativeAOT}.cs
. In the Mono case, this is only fired from managed code, so it only needs the managed implementation. For CoreCLR, it's only fired from native code, so we can use a NotImplementedException
stub. I'll need to look at what NativeAOT does to see what's needed there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should follow the pattern used by other threadpool events when implementing this new event on Mono, instead of doing formatting in managed it should call a new icall, LogThreadPoolIOPack that works the same as all the other LogThreadPool*. Since it is not implemented from managed in CoreCLR I believe its fine to do the not implemented, but maybe it just as simple to just do the qcall and open up the ability to log this event from CoreCLR as well, even if its currently not used, to keep consistency with other thread pool events.
...c/System/Diagnostics/Tracing/NativeRuntimeEventSource.PortableThreadPool.NativeSinks.Mono.cs
Show resolved
Hide resolved
I don't think we need the full extent of these changes for a backport. A subset of the changes should remove the NRE that folks will see in [Event(65, Level = EventLevel.Verbose, Message = Messages.IO, Task = Tasks.ThreadPool, Opcode = Opcodes.IOPack, Version = 0, Keywords = Keywords.ThreadingKeyword)]
private unsafe void ThreadPoolIOPack(
IntPtr NativeOverlapped,
IntPtr Overlapped,
ushort ClrInstanceID = DefaultClrInstanceId)
{
throw new NotImplementedException();
} to This patch contains a more complete fix, but has grown beyond something worth taking to tactics when this can be fixed with ~10 LOC instead of the full ~200 LOC. Thoughts @noahfalk @kouvel? |
Yea just the minimum for backport makes sense |
Yep, I'd do the minimal change that fixes the issue for the backport |
@lateralusX, I'm merging, so let's rebase #68320 on top of this. |
fixes #65630
Will require backport
(copy/paste of root cause explanation from issue)
In .net6 we added a managed implementation of the thread pool. To send the same events under the same provider we had to do some duplication of event definitions so that managed code had the ability to send the same events. Every event needs to be represented in managed code for
EventListener
to know how to parse the events. To that end, there is a python script that generates managed function stubs with the correct event IDs and parameters. For components that transition from native to managed, these stubs would be filled in by hand. For example, the thread pool had all the events sent from managed code filled out.After that conversion, it looks like any event with
ThreadPool
in its name gets left out of the code generation phase:runtime/src/coreclr/scripts/genRuntimeEventSources.py
Lines 88 to 91 in 0d59046
This specific event, however, is only sent inside an FCALL related to overlapped IO and wasn't implemented in managed code:
sent from:
runtime/src/coreclr/vm/nativeoverlapped.cpp
Lines 91 to 149 in 3fe2959
As a result, managed code doesn't have any metadata for this event. When the event is dispatched to managed code,
EventSource
attempts to read the struct inm_eventData
which was zero initialized, so it NREs.The reason this was only repro-ing for a web app is that ASP.NET Core on Windows uses a lot of overlapped IO so it sends these events while a basic console app does not.
This issue only presents for an in-process
EventListener
. Out-of-process listeners will not be causing NREs in the target runtime.CC @dotnet/dotnet-diag @tommcdon
fixes #12141