-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test failure tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh #54801
Comments
@davidwrighton @josalem related to #51477? |
Run runtime-coreclr gcstress0x3-gcstress0xc 20210711.1
Error message
|
It's possible. This isn't runtime code; this is Microsoft/PerfView code on the stack. I'd be surprised if there was an AV in TraceEvent that we haven't hit before now given the number of tests that use it. Especially since we haven't changed the version in recent memory. This failure has a dump associated with it, so I should be able to dig into it a little more. I've got a few things on my plate, so I'll put this on the stack to investigate. (CC @brianrob) To elaborate on the issue text, the two mentioned tests have different failures. The error message in the issue text is for the rundown validation test. This is the error message for the ContextualReflection test:
I'm not sure if these are related. |
Took a look at the history for the rundown validation test. It's been intermittently failing since last Friday with varying errors. There's the above AV, but there is also the following error:
It's not immediately obvious to me what assert failed, though: runtime/src/libraries/System.Private.CoreLib/src/System/Collections/Hashtable.cs Lines 395 to 408 in 738b09b
I'm guessing something happened during the QCall for runtime/src/coreclr/vm/comutilnative.cpp Line 1984 in 57bfe47
|
I investigated the
I'm not convinced the pointer arithmetic/bookkeeping is wrong. It checks bounds and whatnot. When the logic goes to dereference the pointer into the pinned buffer, however, it seg faults. The address is not included in the dump created by createdump, but is in the system created dump. That being said, the system created dump seems to have the values at the address all zeroed out, which may indicate that it is junk data and that is why it isn't in the createdump dump. The long and short of it is: this call stack is segfaulting on what should be an address into a pinned GC handle for a |
I investigated a little further and this looks like this might be caused by the GC collecting data while it is still in use. The code is in the middle of
One of those entries is supposed to be where the pinned buffer pointed to by the faulting address lived. None of the above logic should be pointing to arbitrary data not allocated by itself, nor is it code that has changed recently. I don't believe that this is being caused by TraceEvent itself. @Maoni0 or @cshung is this an instance where something marked as pinned got collected, or is it possible that the pinning isn't working because the pinning APIs are misused? I've got this dump open if there is any info you would need to determine yes or no. |
if you have a pinned handle, it's a strong handle so GC will treat it as live till you call Free on it. it would be helpful to show the code where you are pinning the object of interest and how it's related to this EventMarker array. |
Here is the relevant code I found digging through TraceEvent: internal class EventCacheThread
{
public Queue<EventMarker> Events = new Queue<EventMarker>();
public int SequenceNumber;
public long LastCachedEventTimestamp;
}
internal class EventMarker
{
public EventMarker(PinnedBuffer buffer)
{
Buffer = buffer;
}
public EventPipeEventHeader Header;
public PinnedBuffer Buffer;
}
internal class PinnedBuffer
{
public PinnedBuffer(byte[] data)
{
Data = data;
PinningHandle = GCHandle.Alloc(data, GCHandleType.Pinned);
}
~PinnedBuffer()
{
PinningHandle.Free();
}
public byte[] Data { get; private set; }
public GCHandle PinningHandle { get; private set; }
} Those are stored in this dictionary on Dictionary<long, EventCacheThread> _threads = new Dictionary<long, EventCacheThread>(); |
@josalem From what you're reporting here, this doesn't look like a bug in the GC proper. This looks like a bug in the PerfView repo. In particular, the EventMarker object isn't kept alive through the call to the OnEvent call in SortAndDispatch, and the implementation of the OnEvent method ( I expect that the fix is to add a call to The big problem with using finalizers is that the runtime is free to execute them much, MUCH earlier than a developer might expect. I've event seen cases where a finalizer can execute before a method has finished its constructor. |
Thanks for explanation @davidwrighton! So it sounds like this is all because CC @noahfalk |
@josalem, that's what it looks like to me. In general, its very tricky to use types with finalizers correctly, if the type exposes details which are supposed to be protected by the finalizer (such as the PinningHandle property). For instance, this is why we moved to using SafeHandles internally for FileStreams, and why use of the Handle property on a FileStream is so strongly discouraged. Audits for correctness tend to fail to find issues, failures tend to be extremely hard to isolate, and the behavior of code will change unpredictably from version of .NET to version of .NET. |
Thanks @davidwrighton! I feel confident now, that the To that end, I'll open an issue to track changing this logic in Microsoft/PerfView, and disable this test for gcstress runs. If we notice instability in other tracing tests under gcstress due to the same issue, we can disable them as well until an updated |
Thanks @davidwrighton ! Btw if you've got questions about that code feel free to ask, I am the evil dev who authored it ; ) I dislike the degree to which TraceEvent resorts to raw pointer manipulations but for better or worse they are baked into the interfaces between components somewhat pervasively :/ If we continue to find bugs similar to this one it will increase the justification to try addressing the problem more broadly. |
I don't really have any significant questions, it was more of an educational note I was writing for John about how when he sees a finalizer in code behaving with this kind of problem he should focus in on that as the most likely cause of the problem. It didn't even seem like the code was really terrible, its just that that sort of code is extremely easy to mess up, and a better approach would involve use of a fixed statement instead of a pinned gchandle if a pointer is really the right solution. (Mostly because with a fixed statement lifetime is clearly handled.) Obviously actually doing so would require significant refactoring in the trace event logic, and I really don't blame you for not wanting to take that task on. |
Failed again in runtime-coreclr gcstress0x3-gcstress0xc 20210727.1 Failed test:
Error message:
|
@VincentBu I don't think the ContextualReflection test and the TraceValidation test failures are related. I think it would be worth it to open a separate issue for the reflection failure. |
I'll be moving this issue to 7 in order to track completion of microsoft/perfview/issues/1474 and updating dotnet/runtime to the subsequent TraceEvent version. |
Failed again in: runtime 20210928.21 Failed test:
Error message:
|
According to runfo this is failing multiple times per week (not the same stack as in the initial post but the same as the post above): https://runfo.azurewebsites.net/search/tests/?q=started%3A%7E7+definition%3Aruntime+name%3Arundownvalidation |
Looking at the linked failures in the runfo page, I think there are two different failures and neither is the one tracked in this issue. If I recall, there's a different issue tracking the mono timeout failure that makes up the majority of them, but the two coreclr Windows x86 failures looks new. When I get back to my desk I'll see about rearranging things so the right issue is tracked. |
@ViktorHofer I think the problem that runfo is tracking in this Issue is actually #59296. CC @lateralusX since this appears to be affecting Mono |
Can this be disabled while investigation is pending? similarly to #59994 |
Looks like this issue has diverged from its original report (that was a CoreCLR crash) and is now most likely the same underlying issue as #59296. |
I add it to the exclude list as well. |
The underlying issue has been fixed. The hits this month were for a PR. |
Run: runtime-coreclr gcstress0x3-gcstress0xc 20210627.1
Failed test:
Error message:
Runfo Tracking Issue: tracing/eventpipe/rundownvalidation/rundownvalidation/rundownvalidation.sh
Build Result Summary
The text was updated successfully, but these errors were encountered: