Test hang in System.Diagnostics.Tracing.EventProvider.EventUnregister #35451

jkotas · 2020-04-25T12:56:31Z

      Half-way through waiting for remote process.
      Memory load: 14
      
      Image Name                     PID Services                                    
      ========================= ======== ============================================
      svchost.exe                    620 BrokerInfrastructure, DcomLaunch, LSM,      
                                         PlugPlay, Power, SystemEventsBroker         
      svchost.exe                    652 RpcEptMapper, RpcSs                         
      svchost.exe                    776 Dhcp, EventLog, lmhosts, vmictimesync,      
                                         Wcmsvc                                      
      svchost.exe                    808 AeLookupSvc, Appinfo, BITS, CertPropSvc,    
                                         gpsvc, iphlpsvc, LanmanServer, ProfSvc,     
                                         sacsvr, Schedule, SENS, SessionEnv,         
                                         ShellHWDetection, Themes, Winmgmt           
      svchost.exe                    888 EventSystem, FontCache, netprofm, nsi,      
                                         W32Time, WinHttpAutoProxySvc                
      svchost.exe                    976 CryptSvc, Dnscache, LanmanWorkstation,      
                                         NlaSvc, WinRM                               
      svchost.exe                    592 BFE, DPS, MpsSvc, pla                       
      svchost.exe                   1136 AppHostSvc                                  
      svchost.exe                   1164 DiagTrack                                   
      svchost.exe                   1328 TrkWks, UALSVC, UmRdpService,               
                                         vmickvpexchange, vmicshutdown, vmicvss      
      svchost.exe                   1352 W3SVC, WAS                                  
      svchost.exe                   2632 TermService                                 
      svchost.exe                   2672 vmicheartbeat, vmicrdv                      
      Timed out at 4/25/2020 1:55:40 AM after 60000ms waiting for remote process.
      Wrote mini dump to: C:\h\w\AE3609A5\w\B05309CA\uploads\3476.zuzob0j0.l20.dmp
      	Process ID: 3476
      	Handle: 1004
      	Name: dotnet
      	MainModule: C:\h\w\AE3609A5\p\dotnet.exe
      	StartTime: 4/25/2020 1:54:40 AM
      	TotalProcessorTime: 00:00:00.2343750
      	Threads:
      		Thread #1 (OS 0x560)   
      			[InlinedCallFrame] (System.Environment._Exit)
      			[InlinedCallFrame] (System.Environment._Exit)
      			System.Environment.Exit(Int32)
      			Microsoft.DotNet.RemoteExecutor.Program.Main(System.String[])
      		Thread #2 (OS 0xA38) [Finalizer] [Background] [MTA]
      			[InlinedCallFrame] (Interop+Advapi32.EventUnregister)
      			[InlinedCallFrame] (Interop+Advapi32.EventUnregister)
      			System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister(Int64)
      			System.Diagnostics.Tracing.EventProvider.EventUnregister(Int64)
      			System.Diagnostics.Tracing.EventProvider.Dispose(Boolean)
      			System.Diagnostics.Tracing.EventProvider.Dispose()
      			System.Diagnostics.Tracing.EventSource.Dispose(Boolean)
      			System.Diagnostics.Tracing.EventListener.DisposeOnShutdown(System.Object, System.EventArgs)
      			System.AppContext.OnProcessExit()
      		Thread #3 (OS 0x700) [Thread pool worker] [Background] [MTA]
      		Thread #4 (OS 0xE50) [Thread pool worker] [Background] [MTA]

The text was updated successfully, but these errors were encountered:

Dotnet-GitSync-Bot · 2020-04-25T12:56:41Z

I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.

jkotas · 2020-04-25T12:57:03Z

Hit by #35427 in https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-35427-merge-6fdb541e69394f00ba/System.IO.FileSystem.Tests/console.e8f28856.log?sv=2019-02-02&se=2020-05-05T01%3A21%3A24Z&sr=c&sp=rl&sig=Aalm%2BlDqE2HM88uUCuqEEJtCb%2F6EW2vB%2BQH9i9G1F6Y%3D

ghost · 2020-04-25T12:57:16Z

Tagging subscribers to this area: @tarekgh, @tommcdon
Notify danmosemsft if you want to be subscribed.

jkotas · 2020-04-25T12:58:06Z

Hit by #35427 in https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-35427-merge-6fdb541e69394f00ba/System.Threading.Thread.Tests/console.72689c94.log?sv=2019-02-02&se=2020-05-05T01%3A21%3A24Z&sr=c&sp=rl&sig=Aalm%2BlDqE2HM88uUCuqEEJtCb%2F6EW2vB%2BQH9i9G1F6Y%3D (test System.Threading.Threads.Tests.ThreadTests.WindowsPrincipalPolicyTest_Windows)

jkotas · 2020-04-25T13:19:57Z

Saved dump at \jkotas9\drops\issue-35451

We are stuck inside GetProcAddress for no apparent reason:

 # Child-SP          RetAddr           Call Site
00 0000000f`f392ef48 00007ffa`a429de39 ntdll!ZwWaitForSingleObject+0xa [d:\blue.obj.amd64fre\minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 198] 
01 0000000f`f392ef50 00007ffa`a429b7e4 ntdll!RtlpWaitOnCriticalSection+0xe1 [d:\blue\minkernel\ntdll\resource.c @ 771] 
02 0000000f`f392f020 00007ffa`a429b6dc ntdll!RtlpEnterCriticalSectionContended+0xa4 [d:\blue\minkernel\ntdll\resource.c @ 1220] 
03 0000000f`f392f060 00007ffa`a429f6be ntdll!LdrpAcquireLoaderLock+0x2c [d:\blue\minkernel\ntdll\ldr.c @ 250] 
04 0000000f`f392f0a0 00007ffa`a17d58f0 ntdll!LdrGetProcedureAddressForCaller+0x19e [d:\blue\minkernel\ntdll\ldrapi.c @ 1901] 
05 0000000f`f392f290 00007ffa`803ee358 KERNELBASE!GetProcAddressForCaller+0x70 [d:\blue\minkernel\kernelbase\module.c @ 1891] 
06 0000000f`f392f2e0 00007ffa`803ee17d coreclr!NDirectMethodDesc::FindEntryPoint+0x4c [F:\workspace\_work\1\s\src\coreclr\src\vm\method.cpp @ 5339] 
07 0000000f`f392f340 00007ffa`803edb8b coreclr!NDirect::NDirectLink+0x119 [F:\workspace\_work\1\s\src\coreclr\src\vm\dllimport.cpp @ 7121] 
08 0000000f`f392f670 00007ffa`804a2ae3 coreclr!NDirectImportWorker+0x7b [F:\workspace\_work\1\s\src\coreclr\src\vm\dllimport.cpp @ 7271] 
09 0000000f`f392f6f0 00007ffa`2093d471 coreclr!NDirectImportThunk+0x43 [F:\workspace\_work\1\s\src\coreclr\src\vm\amd64\AsmHelpers.asm @ 198] 
0a 0000000f`f392f760 000e000f`d1613df0 System_Private_CoreLib!System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister(Int64)+0x81 [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventProvider.cs @ 1286]

This may be duplicate of #13610

tommcdon · 2020-04-25T22:29:10Z

@noahfalk for thoughts on what might be causing this

noahfalk · 2020-04-28T10:55:32Z

I'll see if I can take a closer look tomorrow (I'll need to get back on corpnet to access the dump). Off the cuff that callstack looked like the thread was blocked on the loader lock so probably some other thread is holding it. If so we'll spot the culprit in the dump.

noahfalk · 2020-04-29T01:45:54Z

@jkotas - I wasn't able to access the dump on your share. I can list directories in \\jkotas9 and \\jkotas9\public, but I get access denied trying to open \\jkotas9\drops. For comparison I am able to access shares on \\cpvsbuild\ so I suspect this is a security configuration issue on your share rather than a broader issue with my credentials or VPN connection.

noahfalk · 2020-04-29T05:50:13Z

Attaching windbg showed the loader lock was unowned:

0:002> dx -r1 ((ntdll!_RTL_CRITICAL_SECTION *)0x7ffaa438e8a8)
((ntdll!_RTL_CRITICAL_SECTION *)0x7ffaa438e8a8)                 : 0x7ffaa438e8a8 [Type: _RTL_CRITICAL_SECTION *]
    [+0x000] DebugInfo        : 0x7ffaa438ebe0 [Type: _RTL_CRITICAL_SECTION_DEBUG *]
    [+0x008] LockCount        : 0 [Type: long]
    [+0x00c] RecursionCount   : 0 [Type: long]
    [+0x010] OwningThread     : 0x0 [Type: void *]
    [+0x018] LockSemaphore    : 0x168 [Type: void *]
    [+0x020] SpinCount        : 0x4000000 [Type: unsigned __int64]

Best guess the loader lock was owned earlier causing the OS to switch the thread out, later the lock was released and for whatever reason the OS hasn't yet switched the thread back in.

Jan's suggestion of #13610 sounds very plausible to me. If we wanted more data on this failure (or failures of this class) we'd probably need to upgrade the telemetry to determine if the thread is making progress and what behavior is happening machine wide. A better set of telemetry for a suspected hung test to capture would be:

process dump
ETW trace including stack sampling and context switches machine wide for ~30 sec
another process dump
This would clearly show if progress was being made, as well as what else the machine is doing.

Closing this as duplicate of #13610

jkotas · 2020-04-30T00:23:03Z

This was most likely introduced by #33307 according the CI telemetry. See #13610 (comment) for more context.

safern · 2020-04-30T02:07:12Z

Another data point for this, based on the Kusto data I see, this only happens in Windows7 and Windows8.1 -- I tried to repro locally on Windows 10 and after 150 runs it didn't repro.

noahfalk · 2020-04-30T03:19:45Z

Thanks @safern! That's very likely to save @josalem from hitting some deadends.

josalem · 2020-04-30T16:14:56Z

this only happens in Windows7 and Windows8.1 -- I tried to repro locally on Windows 10 and after 150 runs it didn't repro.

Interesting!

I'll take a look at this today. Thanks for triaging things @safern! I'm not sure what would be causing this issue, but based on the versions being hit, I'm guessing it might be something with Overlapped IO in old versions of Windows. I'll have to do some investigating.

josalem · 2020-05-01T01:44:50Z

I spun up a win7 VM and I'm trying to catch this, but I can't seem to reproduce the failure. I am running the System.Collections.Concurrent.Tests suite in a loop and having it ping me if the test takes longer than 60 seconds. So far I've noticed that the tests either take almost exactly 30 seconds or they take almost exactly 65 seconds. My VM is 2 vCPU 4GB, so it's a little bigger than the ones in CI unfortunately.

I've narrowed the test cycle down to just the System.Collections.Concurrent.Tests.EtwTests test that I saw a failure on in the other thread. That test usually completes in <1 second, but every once in a while takes >10 seconds (1 in 750+ loops so far). I'll leave it running overnight and see if it gets hit more often. I might try artificially bogging down the machine as well and see if that changes things.

@safern, were you ever able to locally reproduce this?

While I try to get a repro set up, I'll take another look at the dump that Jan shared. Based on what I've already seen and what Noah found, I'm not sure there is enough in the dump to root cause this. The thing I found interesting, is that the dump shows that the DiagnosticsServer is inactive, i.e., it's in the middle of a Poll call that is using WaitForMultipleEvents to wait on a Named Pipe handle. That PR doesn't include any changes to eventing, and since the monitor configuration isn't enabled, it shouldn't be doing anything new. I'm pretty sure this also means that we haven't reached the VM's shutdown code, so we wouldn't have called DiagnosticServer::Shutdown yet, so it should be fine that the server is waiting. I'll can try bisecting my changes with ifdefs to see if something is exacerbating the problem, but I can't do too much until I can reliably reproduce the error. I'll keep digging.

josalem · 2020-05-01T02:04:45Z

By "ping" I meant that it literally made a ping sound from the console. I had stepped away from my computer for a while, and it started beeping. It seems like after 1000 loops most of the runs are taking >10 seconds! The xunit output claims the tests ran in 5 seconds (still a big increase over the 0.6 second runtime from the first 900 loops), but the RuntTests.cmd script didn't finish before 10 seconds. The logs also say that the tests passed. I'll dig into this more tomorrow with some better logging and see if it actually means anything (and mute my computer before I call it a day 😝 ).

josalem · 2020-05-01T17:55:03Z

I left my script running overnight. I'm seeing some interesting behavior that I'm going to try and reproduce the rest of today.

I wrote up a powershell script does the following in an infinite loop:

$Now = Get-Date
$job = Start-Job -ScriptBlock {
    # invoke RuntTests.cmd scoped to just System.Collections.Concurrent.Tests.EtwTests
}

While ($job.State -eq 'Running') {
    if ((Get-Date).AddSeconds(-10) -gt $Now) {
		# beep and log
    }
}

$Output = $job | Receive-Job
Write-Host $Output

The time check is wall clock time, so it isn't timing how long xunit claims the tests executed for. For the first ~1000 iterations, the job would compelete in <10s and xunit would claim ~0.6s of test time. After that, almost every job took >10s and xunit claimed the tests tool 3-6s, which is 10x time increase at the upper end. After ~3000 iterations, it looks like something went haywire and the powershell job died, but was still in the running state. The time check loop never exited, but the state was still in 'Running' and there wasn't a dotnet.exe process running on the system.

I want to build a more robust script that does some more interesting logging, especially better timing. Right now, my thinking is there might be some kind of OS scheduling issue that we are exacerbating with high volumes of RemoteExecutor tests which causes performance of the tests to degrade over time. I'll try running this same experiment with another test suite and see if I'm seeing similar behavior. I think it might be possible that the tests are hitting a timeout cumulatively as this degradation happens. When we then take a dump we keep seeing the EventUnregister stack because that method is specifically waiting on the OS to wake it up, but the scheduling degradation is causing a consistent delay so that our dump logic is invoked before the OS "gets around to it". I need more evidence to back this up though since it does seem to happen more often after #33307 despite that code not touching this logic and being inert through the test I used.

stephentoub · 2020-05-03T01:28:52Z

Can we revert the offending PR? This is causing at least one leg in the majority of PRs to fail. It's going to make it much more likely that other regressions slip in, as we're having to rerun lots of legs and merge with failures that might be masking other failures, and it's going to make it more likely devs assume that failures aren't theirs even if they are in fact caused by their PR. Thanks!

safern · 2020-05-03T02:48:53Z

I agree this is being hit in 50+ runs a day and we’re having a 0% pass rate now in rolling builds as well.

jkotas · 2020-05-03T04:20:46Z

Reverting in #35767

jkotas · 2020-05-03T04:31:54Z

I have stepped through the Event Pipe IPC shutdown sequence. I believe that there is a clear HANDLE use-after-close bug:

Handle that is being closed here:

00 CoreCLR!IpcStream::DiagnosticsIpc::Close+0xbb [D:\runtime\src\coreclr\src\debug\debug-pal\win\diagnosticsipc.cpp @ 213] 
01 CoreCLR!IpcStreamFactory::ConnectionState::Close+0x42 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.h @ 38] 
02 CoreCLR!IpcStreamFactory::CloseConnections+0x71 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.cpp @ 115] 
03 CoreCLR!DiagnosticServer::Shutdown+0x1e8 [D:\runtime\src\coreclr\src\vm\diagnosticserver.cpp @ 229] 
04 CoreCLR!EEShutDownHelper+0x1b6 [D:\runtime\src\coreclr\src\vm\ceemain.cpp @ 1212] 
05 CoreCLR!EEShutDown+0x346 [D:\runtime\src\coreclr\src\vm\ceemain.cpp @ 1677] 
06 CoreCLR!CorHost2::UnloadAppDomain2+0xfb [D:\runtime\src\coreclr\src\vm\corhost.cpp @ 942]

is being waited on - or about to be waited on - here:

00 KERNELBASE!WaitForMultipleObjects [minkernel\kernelbase\synch.c @ 1401] 
01 CoreCLR!IpcStream::DiagnosticsIpc::Poll+0x31f [D:\runtime\src\coreclr\src\debug\debug-pal\win\diagnosticsipc.cpp @ 311] 
02 CoreCLR!IpcStreamFactory::GetNextAvailableStream+0x1c2 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.cpp @ 162] 
03 CoreCLR!DiagnosticServer::DiagnosticsServerThread+0x379 [D:\runtime\src\coreclr\src\vm\diagnosticserver.cpp @ 50] 
04 KERNEL32!BaseThreadInitThunk+0x14 [base\win32\client\thread.c @ 64] 
05 ntdll!RtlUserThreadStart+0x21 [minkernel\ntdll\rtlstrt.c @ 1153]

How does this explain the hang:

The handle gets reused for auto-reset event allocated by OS loader code
The Wait operation in IpcStream will reset the auto-reset event
The OS loader code will hang waiting for the event forever becase of it was reset already

Why is it more likely to repro on Win7/8:

Differences in the OS scheduler and differences in the OS handle allocation algorithm

The shutdown sequence of the event pipe need careful review. Also, it would be a good idea to minimize work done during shutdown to only what's absolutely necessary, e.g. these event handles do not really need to be closed since the OS will close them anyway as part of the process shutdown.

jkotas · 2020-05-03T04:38:40Z

The dump at \\jkotas9\drops\issue-35451 confirms this: pHandles[0] on thread 3 is 0x168, CriticalSection.LockSemaphore on thread 2 is 0x168

stephentoub · 2020-05-03T10:29:21Z

Thanks, Jan!

josalem · 2020-05-04T23:08:36Z

Oof! That's a tricky one! Thanks for pinpointing the issue @jkotas! I'm writing a fix for it now. Earlier iterations of the PR specifically didn't close all the handles and didn't attempt to cancel any waits, but I recalled #12991 which was "fixed" by closing handles to break out of a blocking. I'll come up with a more elegant approach that cancels the blocking Poll call and doesn't attempt subsequent cleanup.

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 25, 2020

jkotas added the area-System.Diagnostics.Tracing label Apr 25, 2020

This was referenced Apr 25, 2020

Fix coreclr Test570 due to ICU on Windows feature #35435

Merged

Emit shorter opcodes in ILGenerator.Emit(OpCode, int) #35427

Merged

tommcdon added this to the 5.0 milestone Apr 25, 2020

tommcdon removed the untriaged New issue has not been triaged by the area owner label Apr 25, 2020

jkotas mentioned this issue Apr 27, 2020

Test failed: System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout #13610

Closed

GrabYourPitchforks mentioned this issue Apr 27, 2020

Test failure: System.Reflection.Tests.IsCollectibleTests.MemberInfo_IsCollectibleFalse_WhenUsingAssemblyLoad(memberName: \"MyGenericMethod\") #35502

Closed

noahfalk closed this as completed Apr 29, 2020

jkotas reopened this Apr 30, 2020

jkotas added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Apr 30, 2020

jkotas mentioned this issue May 2, 2020

CI test failure: System.IO.FileSystem.Tests #35743

Closed

safern mentioned this issue May 2, 2020

Use GDI+1.1 API set in Windows 8 and later #35169

Merged

jkotas mentioned this issue May 3, 2020

Tests timing out with incomplete log #35752

Open

jkotas mentioned this issue May 3, 2020

Revert "Add Reverse Diagnostics Server" #35767

Merged

jkotas closed this as completed in #35767 May 3, 2020

josalem mentioned this issue May 5, 2020

Modified PR for reverse diagnostics server #35850

Merged

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020

noahfalk added area-System.Diagnostics.EventLog and removed area-System.Diagnostics.Tracing labels Jan 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test hang in System.Diagnostics.Tracing.EventProvider.EventUnregister #35451

Test hang in System.Diagnostics.Tracing.EventProvider.EventUnregister #35451

jkotas commented Apr 25, 2020

Dotnet-GitSync-Bot commented Apr 25, 2020

jkotas commented Apr 25, 2020

ghost commented Apr 25, 2020

jkotas commented Apr 25, 2020

jkotas commented Apr 25, 2020

tommcdon commented Apr 25, 2020

noahfalk commented Apr 28, 2020

noahfalk commented Apr 29, 2020 •

edited

Loading

noahfalk commented Apr 29, 2020

jkotas commented Apr 30, 2020

safern commented Apr 30, 2020

noahfalk commented Apr 30, 2020

josalem commented Apr 30, 2020

josalem commented May 1, 2020

josalem commented May 1, 2020

josalem commented May 1, 2020

stephentoub commented May 3, 2020

safern commented May 3, 2020

jkotas commented May 3, 2020

jkotas commented May 3, 2020 •

edited

Loading

jkotas commented May 3, 2020 •

edited

Loading

stephentoub commented May 3, 2020

josalem commented May 4, 2020 •

edited

Loading

Test hang in System.Diagnostics.Tracing.EventProvider.EventUnregister #35451

Test hang in System.Diagnostics.Tracing.EventProvider.EventUnregister #35451

Comments

jkotas commented Apr 25, 2020

Dotnet-GitSync-Bot commented Apr 25, 2020

jkotas commented Apr 25, 2020

ghost commented Apr 25, 2020

jkotas commented Apr 25, 2020

jkotas commented Apr 25, 2020

tommcdon commented Apr 25, 2020

noahfalk commented Apr 28, 2020

noahfalk commented Apr 29, 2020 • edited Loading

noahfalk commented Apr 29, 2020

jkotas commented Apr 30, 2020

safern commented Apr 30, 2020

noahfalk commented Apr 30, 2020

josalem commented Apr 30, 2020

josalem commented May 1, 2020

josalem commented May 1, 2020

josalem commented May 1, 2020

stephentoub commented May 3, 2020

safern commented May 3, 2020

jkotas commented May 3, 2020

jkotas commented May 3, 2020 • edited Loading

jkotas commented May 3, 2020 • edited Loading

stephentoub commented May 3, 2020

josalem commented May 4, 2020 • edited Loading

noahfalk commented Apr 29, 2020 •

edited

Loading

jkotas commented May 3, 2020 •

edited

Loading

jkotas commented May 3, 2020 •

edited

Loading

josalem commented May 4, 2020 •

edited

Loading