
Deadlock on EventListeners being enabled/ re-enabled #49804

Closed
djluck opened this issue Mar 18, 2021 · 3 comments
djluck commented Mar 18, 2021

Description

When enabling or re-enabling event listeners under high-throughput scenarios, a deadlock can occur which breaks event listener functionality.

The following code reliably reproduces the issue; the deadlock appears anywhere from immediately to around five minutes after starting:

using System;
using System.Diagnostics.Tracing;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

namespace RecyleRootCause2
{
    class Program
    {
        static async Task Main(string[] args)
        {
            // Spawn a number of tasks that will rapidly throw exceptions (and therefore generate runtime events)
            var tasks = Enumerable.Range(1, Environment.ProcessorCount - 1)
                .Select(_ => Task.Run(() =>
                {
                    while (true)
                    {
                        try
                        {
                            var divide = 0;
                            var result = 1 / divide;
                        }
                        catch
                        {
                        }

                        Thread.Sleep(1);
                    }
                }))
                .ToArray();

            // Create our event listener (keyword 32768 = 0x8000, the Exception keyword of Microsoft-Windows-DotNETRuntime)
            // At this point, it is possible for enabling events to hang (just not as common as when re-enabling events)
            var r = new Random();
            var listener = new EventListener("Microsoft-Windows-DotNETRuntime", EventLevel.Error, (EventKeywords) 32768);

            // Repeatedly re-enable event capturing after a short delay. 
            while (true)
            {
                var sleepFor = (int)2000 + r.Next(1, 1000);
                Log($"Listening for events! Sleeping for {sleepFor}");

                await Task.Delay(sleepFor);
                Log($"Captured {listener.EventsCaptured} events!");
		
                listener.ReenableEvents();
            }
        }

        private static void Log(string msg)
        {
            Console.WriteLine($"{DateTime.UtcNow:O}: {msg}");
        }
        
        public class EventListener : System.Diagnostics.Tracing.EventListener
        {
            private readonly string _sourceName;
            private readonly EventLevel _level;
            private readonly EventKeywords _keywords;
            private EventSource _source;
	
            public EventListener(string sourceName, EventLevel level, EventKeywords keywords)
            {
                _sourceName = sourceName;
                _level = level;
                _keywords = keywords;
                EventSourceCreated += OnEventSourceCreated;
            }

            public int EventsCaptured { get; private set; }

	
            private void OnEventSourceCreated(object? sender, EventSourceCreatedEventArgs e)
            {
                if (e.EventSource.Name == _sourceName)
                {
                    _source = e.EventSource;
                    ReenableEvents();
                }
            }

            public void ReenableEvents()
            {
                EventsCaptured = 0;
                Log($"Enabling events for {_source.Name} @ {_level} ({_keywords})");
                EnableEvents(_source, _level, _keywords);
                Log($"Enabled events successfully for {_source.Name} @ {_level} ({_keywords})");
            }

            protected override void OnEventWritten(EventWrittenEventArgs eventData)
            {
                // Note: this callback can fire concurrently from multiple threads,
                // so the counter is only approximate in this repro.
                EventsCaptured++;
            }
        }
    }
}

I've varied factors including:

  • Delays between re-enabling events
  • Volume of exceptions being thrown
  • Events being listened to

None have helped: if the event throughput for a listener is high enough, re-enabling events will eventually deadlock.

This is a problem because EventListeners suffer from a bug that causes CPU consumption to rise over time, and the recommended workaround is to re-enable event listeners.

This issue only affects .NET Core 3.1; tested on .NET 5.0 without issue.
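
For reference, the dispose-and-recreate flavour of that workaround ("recycling" the listener) looks roughly like the sketch below. It reuses the custom EventListener type and usings from the repro above and is illustrative only; on .NET Core 3.1 it runs into the same hang, because creating a replacement listener still ends up calling EnableEvents while events are flowing.

// Illustrative only: recycles the listener by disposing it and creating a fresh one.
// Assumes the same usings and the custom EventListener type from the repro above.
static async Task RecycleListenerLoop(CancellationToken token)
{
    var listener = new EventListener("Microsoft-Windows-DotNETRuntime", EventLevel.Error, (EventKeywords)32768);
    while (!token.IsCancellationRequested)
    {
        await Task.Delay(TimeSpan.FromMinutes(5), token);

        // Drop the old listener (this unsubscribes it from its event sources)
        // and start a new one in its place.
        listener.Dispose();
        listener = new EventListener("Microsoft-Windows-DotNETRuntime", EventLevel.Error, (EventKeywords)32768);
    }
}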

Configuration

  • Which version of .NET is the code running on? .NET Core 3.1.11 runtime
  • What OS and version, and what distro if applicable? Windows 10 + Ubuntu
  • What is the architecture (x64, x86, ARM, ARM64)? x64

Other information

Stack trace of the deadlock:

Thread  12
Current frame: ntdll!NtDelayExecution + 0x14
Child-SP         RetAddr          Caller, Callee
0000005A89C7DF60 00007ff9462ad84e KERNELBASE!SleepEx + 0x9e, calling ntdll!NtDelayExecution
0000005A89C7DF90 00007ff9488bb86b ntdll!RtlAllocateHeap + 0xecb, calling ntdll!RtlSetUserValueHeap + 0x2470
0000005A89C7E000 00007ff8cf0ed763 coreclr!GetCLRRuntimeHost + 0x161333, calling KERNEL32!SleepEx
0000005A89C7E0B0 00007ff9488bb3c7 ntdll!RtlAllocateHeap + 0xa27, calling ntdll!RtlAllocateHeap + 0xcb0
0000005A89C7E100 00007ff8cee41d4c coreclr!coreclr_initialize + 0xaf86c, calling ntdll!RtlAllocateHeap
0000005A89C7E130 00007ff9488b47b1 ntdll!RtlFreeHeap + 0x51, calling ntdll!RtlGetCurrentServiceSessionId + 0xec0
0000005A89C7E170 00007ff8cf0ebf08 coreclr!GetCLRRuntimeHost + 0x15fad8, calling KERNEL32!HeapFree
0000005A89C7E1C0 00007ff8cedca7f7 coreclr!coreclr_initialize + 0x38317, calling ntdll!RtlAllocateHeap
0000005A89C7E1F0 00007ff8cf095539 coreclr!GetCLRRuntimeHost + 0x109109, calling coreclr!coreclr_shutdown_2 + 0xf550
0000005A89C7E200 00007ff8cf095284 coreclr!GetCLRRuntimeHost + 0x108e54, calling coreclr!GetCLRRuntimeHost + 0x109004
0000005A89C7E230 00007ff8cf0936a6 coreclr!GetCLRRuntimeHost + 0x107276, calling coreclr!GetCLRRuntimeHost + 0x1090c0
0000005A89C7E2B0 00007ff8cefdf7b6 coreclr!GetCLRRuntimeHost + 0x53386, calling coreclr!GetCLRRuntimeHost + 0x1610a0
0000005A89C7E2E0 00007ff8cefdf38d coreclr!GetCLRRuntimeHost + 0x52f5d, calling coreclr!GetCLRRuntimeHost + 0x532a8
0000005A89C7E330 00007ff8ca7d328f (MethodDesc 00007ff86f482ed8 + 0x7f System.Threading.Tasks.Task.InternalWaitCore(Int32, System.Threading.CancellationToken)), calling System_Private_CoreLib + 0x28822
0000005A89C7E390 00007ff8cefdf678 coreclr!GetCLRRuntimeHost + 0x53248, calling coreclr!GetCLRRuntimeHost + 0x52efc
0000005A89C7E3B0 00007ff8ca7dda95 (MethodDesc 00007ff86f482e90 + 0x35 System.Threading.Tasks.Task.Wait(Int32, System.Threading.CancellationToken)), calling System_Private_CoreLib + 0x28822
0000005A89C7E3E0 00007ff8cf08d4d8 coreclr!GetCLRRuntimeHost + 0x1010a8, calling coreclr!GetCLRRuntimeHost + 0x531cc
0000005A89C7E420 00007ff8ca9c36e7 (MethodDesc 00007ff86f545170 + 0x57 System.Diagnostics.Tracing.EventPipeEventDispatcher.CommitDispatchConfiguration())
0000005A89C7E470 00007ff8ceed65d5 coreclr!coreclr_shutdown_2 + 0xb495, calling coreclr!coreclr_initialize + 0x63d30
0000005A89C7E480 00007ff8ca9c36da (MethodDesc 00007ff86f545170 + 0x4a System.Diagnostics.Tracing.EventPipeEventDispatcher.CommitDispatchConfiguration()), calling coreclr!coreclr_shutdown_2 + 0xc410
0000005A89C7E520 00007ff8ca9c3567 (MethodDesc 00007ff86f545140 + 0x87 System.Diagnostics.Tracing.EventPipeEventDispatcher.SendCommand(System.Diagnostics.Tracing.EventListener, System.Diagnostics.Tracing.EventCommand, Boolean, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords)), calling System_Private_CoreLib + 0x28822
0000005A89C7E580 00007ff8ca78334e (MethodDesc 00007ff86f46f128 + 0xbe System.Diagnostics.Tracing.EventListener.EnableEvents(System.Diagnostics.Tracing.EventSource, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords, System.Collections.Generic.IDictionary`2<System.String,System.String>)), calling System_Private_CoreLib + 0x28822
0000005A89C7E600 00007ff8ca783282 (MethodDesc 00007ff86f46f110 + 0x12 System.Diagnostics.Tracing.EventListener.EnableEvents(System.Diagnostics.Tracing.EventSource, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords)), calling System_Private_CoreLib + 0x28822
0000005A89C7E630 00007ff86f3c9d9d (MethodDesc 00007ff86f4858d0 + 0xed RecyleRootCause2.Program+EventListener.ReenableEvents()), calling 00007ff86f3bfcc8 (stub for System.Diagnostics.Tracing.EventListener.EnableEvents(System.Diagnostics.Tracing.EventSource, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords))
0000005A89C7E6C0 00007ff86f3c2303 (MethodDesc 00007ff86f483f10 + 0x433 RecyleRootCause2.Program+<Main>d__0.MoveNext()), calling (MethodDesc 00007ff86f4858d0 + 0 RecyleRootCause2.Program+EventListener.ReenableEvents())
0000005A89C7E800 00007ff8caa03098 (MethodDesc 00007ff86f54a708 + 0x18 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].ExecutionContextCallback(System.Object)), calling 00007ff86f2a3200
0000005A89C7E830 00007ff86f3e2eac (MethodDesc 00007ff86f4b9ba8 + 0x8c System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
0000005A89C7E8A0 00007ff8caa02f12 (MethodDesc 00007ff86f54a748 + 0x82 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].MoveNext(System.Threading.Thread)), calling (MethodDesc 00007ff86f4b9ba8 + 0 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
0000005A89C7E910 00007ff8caa02fed (MethodDesc 00007ff86f54a738 + 0xd System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].MoveNext()), calling System_Private_CoreLib + 0x28822
0000005A89C7E940 00007ff8ca7f7276 (MethodDesc 00007ff86f54ade8 + 0x56 System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean)), calling System_Private_CoreLib + 0x28832
0000005A89C7E9A0 00007ff86f3e4f66 (MethodDesc 00007ff86f483028 + 0xd6 System.Threading.Tasks.Task.RunContinuations(System.Object)), calling 00007ff86f3da4b0 (stub for System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean))
0000005A89C7E9F0 00007ff8ceed65d5 coreclr!coreclr_shutdown_2 + 0xb495, calling coreclr!coreclr_initialize + 0x63d30
0000005A89C7EA60 00007ff8ca8d2298 (MethodDesc 00007ff86f482fb0 + 0x78 System.Threading.Tasks.Task.TrySetResult()), calling (MethodDesc 00007ff86f483010 + 0 System.Threading.Tasks.Task.FinishContinuations())
0000005A89C7EAA0 00007ff8ca8d5011 (MethodDesc 00007ff86f548528 + 0x11 System.Threading.Tasks.Task+DelayPromise.CompleteTimedOut()), calling System_Private_CoreLib + 0x28822
0000005A89C7EAD0 00007ff8ca8d50b9 (MethodDesc 00007ff86f549040 + 0x19 System.Threading.Tasks.Task+DelayPromise+<>c.<.ctor>b__1_0(System.Object)), calling System_Private_CoreLib + 0x28822
0000005A89C7EB00 00007ff8ca8c93c8 (MethodDesc 00007ff86f548db0 + 0xa8 System.Threading.TimerQueueTimer.CallCallback(Boolean))
0000005A89C7EB60 00007ff8ca8c91f5 (MethodDesc 00007ff86f548d80 + 0x75 System.Threading.TimerQueueTimer.Fire(Boolean)), calling System_Private_CoreLib + 0x28822
0000005A89C7EBD0 00007ff8ca8bead1 (MethodDesc 00007ff86f5492f8 + 0x231 System.Threading.TimerQueue.FireNextTimers()), calling System_Private_CoreLib + 0x28822
0000005A89C7EC60 00007ff8ca8be869 (MethodDesc 00007ff86f5491c0 + 0x29 System.Threading.TimerQueue.AppDomainTimerCallback(Int32)), calling System_Private_CoreLib + 0x28822
0000005A89C7EC90 00007ff8ceed6ce3 coreclr!coreclr_shutdown_2 + 0xbba3
0000005A89C7ECD0 00007ff8cee6cc42 coreclr!coreclr_initialize + 0xda762, calling coreclr!coreclr_shutdown_2 + 0xbb20
0000005A89C7EE10 00007ff8ceecc3c9 coreclr!coreclr_shutdown_2 + 0x1289, calling coreclr!coreclr_initialize + 0xda5cc
0000005A89C7EE30 00007ff8ced94b72 coreclr!coreclr_initialize + 0x2692, calling coreclr!coreclr_shutdown_2 + 0xcf80
0000005A89C7EF00 00007ff8cee793a2 coreclr!coreclr_initialize + 0xe6ec2, calling ntdll!RtlRetrieveNtUserPfn + 0x500
0000005A89C7F020 00007ff8cee7928f coreclr!coreclr_initialize + 0xe6daf, calling coreclr!coreclr_initialize + 0xe6e44
0000005A89C7F040 00007ff8ced94cae coreclr!coreclr_initialize + 0x27ce, calling coreclr!coreclr_initialize + 0x27e4
0000005A89C7F0C0 00007ff8cee785af coreclr!coreclr_initialize + 0xe60cf, calling coreclr!coreclr_initialize + 0xe6d00
0000005A89C7F120 00007ff8cee77908 coreclr!coreclr_initialize + 0xe5428, calling ntdll!RtlRetrieveNtUserPfn + 0x500
0000005A89C7F150 00007ff8cee1a72d coreclr!coreclr_initialize + 0x8824d, calling ntdll!RtlRetrieveNtUserPfn + 0x500
0000005A89C7F200 00007ff8cee1b140 coreclr!coreclr_initialize + 0x88c60, calling ntdll!RtlRetrieveNtUserPfn + 0x500
0000005A89C7F250 00007ff9488a791d ntdll!LdrShutdownThread + 0x4cd, calling ntdll!RtlActivateActivationContextUnsafeFast
0000005A89C7F258 00007ff9488a793d ntdll!LdrShutdownThread + 0x4ed, calling ntdll!RtlDeactivateActivationContextUnsafeFast
0000005A89C7F270 00007ff948904fad ntdll!LdrInitializeThunk + 0x49d, calling ntdll!NtTestAlert
0000005A89C7F2E0 00007ff8cee8c8a8 coreclr!MetaDataGetDispenser + 0x6088, calling KERNEL32!HeapFree
0000005A89C7F320 00007ff8ceec34aa coreclr!MetaDataGetDispenser + 0x3cc8a, calling ntdll!RtlRetrieveNtUserPfn + 0x500
0000005A89C7F340 00007ff948904b28 ntdll!LdrInitializeThunk + 0x18, calling ntdll!NtContinue
0000005A89C7F7A0 00007ff8ceec3487 coreclr!MetaDataGetDispenser + 0x3cc67, calling coreclr!coreclr_shutdown_2 + 0x20b70
0000005A89C7F7E0 00007ff9486f7034 KERNEL32!BaseThreadInitThunk + 0x14, calling ntdll!RtlRetrieveNtUserPfn + 0x500
0000005A89C7F810 00007ff9488e2651 ntdll!RtlUserThreadStart + 0x21, calling ntdll!RtlRetrieveNtUserPfn + 0x500

dotnet-issue-labeler bot added the area-Tracing-coreclr and untriaged labels Mar 18, 2021
tommcdon removed the untriaged label Mar 19, 2021
tommcdon added this to the 6.0.0 milestone Mar 19, 2021
djluck added a commit to djluck/prometheus-net.DotNetRuntime that referenced this issue Mar 28, 2021
## Summary
A large refactor that aims to make this library far more stable and performant by default. Event counters are now the default source of metrics while more detailed events can be enabled manually when required (see `CaptureLevel`).
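
For illustration, a rough usage sketch of the event-counter-driven defaults versus opting into detailed metrics (not part of the release notes; anything beyond `Default()`, `Customize()` and `StartCollecting()` is indicative and depends on the library version):

// Rough usage sketch — assumes the v4 builder API of prometheus-net.DotNetRuntime.
using System;
using Prometheus.DotNetRuntime;

class MetricsBootstrap
{
    private static IDisposable _collector;

    public static void Start()
    {
        // Default(): all collectors are driven purely by event counters.
        _collector = DotNetRuntimeStatsBuilder.Default().StartCollecting();

        // Alternatively, DotNetRuntimeStatsBuilder.Customize() lets you pick
        // collectors and pass a CaptureLevel to opt into more detailed,
        // event-listener-based metrics.
    }
}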

## Changes
### Breaking changes
- Dropped support for `prometheus-net` v2.
- Dropped support for `netcoreapp2.2`
- `WithThreadPoolSchedulingStats` has been removed; it was both a performance hog and incorrect (the IDs of the start/stop events were not stable). It may be added back in a later release, as .NET 5.0 should have fixed the unstable IDs issue.
- `DotNetRuntimeStatsBuilder.Default()` now only uses event counters to generate metrics. JIT metrics will not be collected (there are no JIT-related event counters in .NET Core 3.1); support for .NET 5.0 is planned for a later release. You can restore more detailed metrics by using `DotNetRuntimeStatsBuilder.Customize()` and passing a custom `CaptureLevel`.
- Renamed `dotnet_gc_collection_reasons_total` -> `dotnet_gc_collection_count_total`

## Additions/ enhancements
- Added new threadpool metrics: `dotnet_threadpool_throughput_total`, `dotnet_threadpool_queue_length` and `dotnet_threadpool_timer_count`
- Added `dotnet_gc_memory_total_available_bytes` to track the total amount of memory .NET can allocate to (this takes into account docker memory limits)
- Added the ability to configure the source of the majority of collectors: they can be driven solely by event counters (`CaptureLevel.Counters`) or by event listeners for more detailed metrics.
- Added support for recycling `EventListener`s periodically (`net5.0` only, as `netcoreapp3.1` is impacted by dotnet/runtime#49804).
- Improved the debugging metrics available
- Added documentation around metrics exposed
- Added an example `docker-compose` stack that can be used for testing and experimentation

## Fixes
- #9
- #10
- #20
- #33
- #35
- #39
djluck added a commit to djluck/prometheus-net.DotNetRuntime that referenced this issue Mar 28, 2021
* prometheus-net.DotNetRuntime v4.0.0


* Adding dedicated `netcoreapp3.1` and `net5.0` test jobs

* Fixing test job names

* Specifying framework to run tests for

* Hopefully this gets things working...

* Excluding mysteriously failing test

* Fixing filter flag

vangarp commented May 28, 2021

@josalem have you had a chance to look into this? Unless we have a workaround for this, we can't really consume the runtime metrics without running into #43985.


josalem commented Jun 2, 2021

It looks like you may be hitting: #53564.


josalem commented Jul 27, 2021

The fix for what I believe you were hitting went out in the July servicing update of .NET Core 3.1 and .NET 5.0. I'll close this issue for now, but feel free to reopen it if you encounter this again.
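
For anyone who wants to gate the listener-recycling workaround on a patched runtime, a minimal version-check sketch follows. It assumes the July 2021 servicing releases correspond to runtime versions 3.1.17 and 5.0.8 (verify against the official .NET release notes); Environment.Version reports the actual runtime version on .NET Core 3.0 and later.

using System;

static class RuntimePatchCheck
{
    // Assumed minimum servicing versions carrying the fix (July 2021 updates);
    // confirm the exact version numbers against the official .NET release notes.
    private static readonly Version MinNetCore31 = new Version(3, 1, 17);
    private static readonly Version MinNet5 = new Version(5, 0, 8);

    public static bool ListenerRecyclingIsSafe()
    {
        // Environment.Version reports the real runtime version on .NET Core 3.0+.
        var v = Environment.Version;
        if (v.Major == 3)
            return v >= MinNetCore31;
        return v >= MinNet5; // .NET 5+ servicing (later major versions include the fix)
    }
}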

josalem closed this as completed Jul 27, 2021
ghost locked as resolved and limited conversation to collaborators Aug 26, 2021