tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror failure #80666
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Test output indicates that the test is crashing with a segfault and createdump fails, either because the process is gone or because there is a signing/entitlement issue. @hoyosjs @mikem8361
It's an entitlement issue. The regex is looking for a string that can hit on any OSX coreclr crash. Adjusting the string since the issue is also not OSX specific.
Can we disable the test until it's fixed? UPD: #81094
Stacktrace of the above crash (read of invalid GC handle or GC handle table corruption):
culprit.txt The dump grabbed from the SIGSEGV crash wasn't useful.
From the crashreport, the crashing thread 0x16c corresponds to
which corresponds to waiting for the createdump child process to finish, which, according to the logs, returned successfully. From the first native thread, it looks like this crash is occurring during
WaitAllCore and WaitAllCoreBlocking below
There's still a discrepancy between IpcTraceTest.cs in dotnet/diagnostics and the version in dotnet/runtime. Is the version in dotnet/diagnostics more resilient, or are there any other obvious low-hanging improvements we could make to IpcTraceTest.cs in dotnet/runtime to catch or prevent these crashes? @noahfalk @davmason
Re-enabled eventsourceerror on non-arm32 unix platforms in #105033
This crash reproduced in #105127 on osx-x64.
It is the same problem as in #80666 (comment) (read of invalid GC handle or GC handle table corruption).
Failed in outerloop: runtime-coreclr outerloop/20240721.3 Failed configs:
Based solely on the console logs from the recent failures on OSX, it looks like the readerTask does complete, so it's just stopTask that is causing the crash.
@davmason, do you happen to recall any recent modifications to the DiagnosticsClient EventPipeSession that could be causing crashes specifically for the eventsourceerror test?
This issue was opened 1.5 years ago. Based on the symptoms, it is an intermittent crash that can come and go due to unrelated changes in timing, etc.
The crash happens at this line: Line 65 in 3ec6286
We are crashing because the GCHandle was already freed. We are freeing the GCHandle here: Line 97 in 3ec6286
I do not see anything in the implementation of EventPipeInternal.DeleteProvider that guarantees there are no callbacks in flight when it returns. In fact, runtime/src/native/eventpipe/ep.c Lines 1317 to 1322 in 3ec6286
@mdh1418 Does this make sense? Could you please take it from here?
Thanks for pinpointing the crash, @jkotas! After some offline discussions with @noahfalk, I'm seeing that ETW blocks during EventProvider.Dispose() until all in-flight callbacks have finished. So if any callbacks happen to be initiated after the EventListenersLock in EventProvider.Dispose(), this p/invoke will block at Interop.Advapi32.EventUnregister until the ETW command releases the ETW lock, as described in the comment. Would parity between ETW and EventPipe in that respect be better than deferring deletion of the GC handle, as in having
I do not have an opinion. If you think that synchronous blocking would be better, it is fine with me.
I think that the synchronous blocking would have to be in the unmanaged event pipe code. I do not see how you can use locks on the managed side to fix this.
Yeah, I want to do it in unmanaged code because I want to match behavior between EtwEventProvider.Unregister() and EventPipeProvider.Unregister(). Both of them invoke a p/invoke to do the unregister work, and I am recommending we give those p/invokes identical blocking behavior.
The attempt to introduce a new lock to coordinate in-flight callbacks with provider disposal broke concurrent callbacks. Still looking to block provider disposal with a weaker blocking behavior.
Signal wait/set blocking implementation: #106040
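For readers following the discussion, here is a minimal sketch of the general idea: unregistration blocks until in-flight callbacks have drained, while callbacks themselves stay concurrent. It is not the code from #106040 or from ep.c. Every name in it (`provider_t`, `provider_invoke_callback`, `provider_unregister`, `count_callback`) is hypothetical, and it assumes POSIX threads. The lock is held only while the in-flight counter is updated, never across the callback, so callbacks can still overlap.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical provider state; none of these names come from ep.c. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  drained;    /* signaled when in_flight drops to 0 */
    int             in_flight;  /* callbacks currently executing */
    bool            deleting;   /* set once unregister has started */
    void          (*callback)(void *context);
    void           *context;    /* stands in for the GC handle */
} provider_t;

void provider_init(provider_t *p, void (*cb)(void *), void *ctx) {
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->drained, NULL);
    p->in_flight = 0;
    p->deleting = false;
    p->callback = cb;
    p->context = ctx;
}

/* Runs the callback unless unregistration has begun.
 * Returns true if the callback ran, false if it was skipped. */
bool provider_invoke_callback(provider_t *p) {
    pthread_mutex_lock(&p->lock);
    if (p->deleting) {
        pthread_mutex_unlock(&p->lock);
        return false;
    }
    p->in_flight++;
    pthread_mutex_unlock(&p->lock);

    /* Lock is not held here, so several callbacks may run at once. */
    p->callback(p->context);

    pthread_mutex_lock(&p->lock);
    p->in_flight--;
    assert(p->in_flight >= 0);
    if (p->in_flight == 0)
        pthread_cond_broadcast(&p->drained);
    pthread_mutex_unlock(&p->lock);
    return true;
}

/* Blocks until no callback is running. Once this returns, the caller
 * may free whatever `context` refers to. */
void provider_unregister(provider_t *p) {
    pthread_mutex_lock(&p->lock);
    p->deleting = true;
    while (p->in_flight > 0)
        pthread_cond_wait(&p->drained, &p->lock);
    pthread_mutex_unlock(&p->lock);
}

/* Example callback: increments the int that `context` points to. */
void count_callback(void *context) {
    (*(int *)context)++;
}
```

One caveat with this shape: a callback that calls `provider_unregister` on its own provider would wait on itself, so a real implementation would need to rule that out or handle it.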
Build Information
Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=137937
Build error leg or test failing: Build / coreclr Pri1 Runtime Tests Run osx x64 checked / Send tests to Helix (Unix)
Pull request: #80626
Error Message
Fill the error message using known issues guidance.
Known issue validation
Build: 🔎⚠️ Validation could not be done without an Azure DevOps build URL on the issue. Please add it to the "Build: 🔎" line.
Result validation:
Validation performed at: 7/21/2024 2:00:18 AM UTC
Report
Summary