Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SafeHandle use-after-dispose in FileSystemWatcher on OSX #30056

Closed
stephentoub opened this issue Jun 26, 2019 · 97 comments · Fixed by #34589, #52019 or #52275
Closed

SafeHandle use-after-dispose in FileSystemWatcher on OSX #30056

stephentoub opened this issue Jun 26, 2019 · 97 comments · Fixed by #34589, #52019 or #52275
Assignees
Labels
area-System.IO blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' bug os-mac-os-x macOS aka OSX
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Jun 26, 2019

OSX.1013.Amd64.Open-x64:Debug
https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F38953~2Fmerge/test~2Ffunctional~2Fcli~2Finnerloop~2F/20190626.70/workItem/System.IO.FileSystem.Watcher.Tests/wilogs

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at Interop.RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(Object args) in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 247
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs:line 201
Fatal error. Internal CLR error. (0x80131506)
   at Interop+RunLoop.CFRunLoopRun()
   at Interop+RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher+RunningInstance+StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)

cc: @JeremyKuhne, @carlossanlop

Runfo Tracking Issue: system.io.filesystem.watcher.tests work item

Build Definition Kind Run Name Console Core Dump Test Results Run Client
1082967 runtime PR 51078 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1082899 runtime PR 51099 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump
1082624 runtime PR 51090 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1080937 runtime Rolling net5.0-Linux-Release-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-16.04-helix-arm64v8-bfcd90a-20200127194925 console.log runclient.py
1072066 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump runclient.py
1072066 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump
1071508 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump
1071508 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump
1069555 runtime Rolling net6.0-windows-Release-x64-CoreCLR_checked-Windows.10.Amd64.Open console.log runclient.py
1069328 runtime Rolling net6.0-windows-Release-x86-CoreCLR_release-Windows.10.Amd64.ServerRS5.Open console.log runclient.py
1068131 runtime PR 50257 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1066426 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump
1066426 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump
1066285 runtime PR 50507 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log runclient.py
1065570 runtime PR 50483 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1065167 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1015.Amd64.Open console.log core dump runclient.py
1063683 runtime PR 48527 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1059736 runtime PR 50232 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1059736 runtime PR 50232 net6.0-windows-Debug-x64-CoreCLR_checked-Windows.10.Amd64.Open console.log runclient.py
1059736 runtime PR 50232 net6.0-Linux-Debug-x64-CoreCLR_checked-Ubuntu.1804.Amd64.Open console.log core dump runclient.py
1059736 runtime PR 50232 net6.0-Linux-Debug-x64-CoreCLR_checked-(Alpine.312.Amd64.Open)ubuntu.1604.amd64.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-20200602002622-e06dc59 console.log core dump runclient.py
1058168 runtime PR 50280 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1057844 runtime PR 50275 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1055598 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1050603 runtime PR 49990 net6.0-windows-Debug-x64-CoreCLR_checked-Windows.10.Amd64.Open console.log
1050314 runtime PR 48601 net6.0-windows-Debug-x64-CoreCLR_checked-Windows.10.Amd64.Open console.log runclient.py
1049560 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1047468 runtime PR 49545 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1047249 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1013.Amd64.Open console.log core dump runclient.py
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-mono_interpreter_release-Debian.9.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-(Centos.8.Amd64.Open)Ubuntu.1604.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:centos-8-helix-20201229003624-c1bf759 console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-RedHat.7.Amd64.Open console.log core dump runclient.py
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-(Debian.10.Amd64.Open)ubuntu.1604.amd64.open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-10-helix-amd64-bfcd90a-20200121150006 console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-Ubuntu.1604.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-Ubuntu.1804.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-SLES.15.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-(Fedora.30.Amd64.Open)ubuntu.1604.amd64.open@mcr.microsoft.com/dotnet-buildtools/prereqs:fedora-30-helix-20200512010621-4f8cef7 console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-16.04-helix-arm64v8-20210106155927-56c6673 console.log core dump
1046026 runtime PR 49740 net6.0-OSX-Debug-x64-Mono_release-OSX.1014.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-OSX-Debug-x64-Mono_release-OSX.1015.Amd64.Open console.log core dump
1045010 runtime PR 48369 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log runclient.py
1044737 runtime PR 49373 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1015.Amd64.Open console.log core dump runclient.py
1044726 runtime PR 49532 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1043361 runtime PR 49747 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1042998 runtime PR 49741 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1041823 runtime PR 49251 net6.0-windows-Release-x86-CoreCLR_checked-Windows.10.Amd64.Open console.log runclient.py
1039858 runtime PR 43706 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump
1039858 runtime PR 43706 net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump
1039185 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1038808 runtime PR 49613 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1037652 runtime PR 49529 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1037058 runtime PR 41097 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1015.Amd64.Open console.log core dump runclient.py

Build Result Summary

Day Hit Count Week Hit Count Month Hit Count
3 4 33
@stephentoub
Copy link
Member Author

stephentoub commented Jun 26, 2019

cc: @MaximLipnin
Maybe related to dotnet/corefx#33865

@danmoseley
Copy link
Member

@stephentoub do you know what could cause a NRE in a native call into macOS? Is the interop layer translating some kind of errno returned from the OS here?

Then I see we get a COR_E_EXECUTIONENGINE while trying to deal with the exception.

@stephentoub
Copy link
Member Author

Not sure; @jkoritzinsky, @AaronRobinsonMSFT, ideas?

@AaronRobinsonMSFT
Copy link
Member

@danmosemsft and @stephentoub The COR_E_EXECUTIONENGINE can be caused by an invalid P/Invoke stub or invalid translation as well. I did recently fix a few issues in macOS regarding race conditions in shutdown - dotnet/coreclr#24847. How often is this happening? dotnet/corefx#33865 does look promising too. Can someone point to the P/Invoke that is causing this issue?

@danmoseley
Copy link
Member

For future reference, it seems console output cannot directly be queried in Kusto. The best we can easily do is look for 134

WorkItems 
| where Started > now(-10d)  
| where FriendlyName == "System.IO.FileSystem.Watcher.Tests"
| where ExitCode == 134
| join kind= inner (
   Jobs  | where Started > now(-10d) | project  QueueName , JobId, Build, Type, Source 
) on JobId
| summarize count() by ExitCode, QueueName
| order by count_ desc

ExitCode QueueName count_
134 Ubuntu.1604.Arm32.Open 28
134 Ubuntu.1604.Arm64.Docker.Open 25
134 Ubuntu.1804.Amd64.Open 6
134 OSX.1013.Amd64.Open 1

compared to ~5000 with 0 exit code.

@AaronRobinsonMSFT
Copy link
Member

@danmosemsft Do we know what version of coreclr is being used to run the tests themselves? Is it possible the coreclr being used in this instance is older than the date at which dotnet/coreclr#24847 was checked in?

@stephentoub
Copy link
Member Author

We'd have to dig to see which exact version, but this was 5 days ago, and corefx ingested a new coreclr 5 days ago, 6 days ago, and 7 days ago, so it definitely has that change from end of May.

@danmoseley
Copy link
Member

@mjanecke can you tell from the output at https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F38953~2Fmerge/test~2Ffunctional~2Fcli~2Finnerloop~2F/20190626.70/workItem/System.IO.FileSystem.Watcher.Tests/wilogs whether a dump got uploaded? I don't see the usual dumpling song and dance.

@missymessa
Copy link
Member

@danmosemsft, for that test, it looks like the failure happened on a Mac, which I don't think @epananth has finished implementing for dump collection yet.

@danmoseley
Copy link
Member

@mjanecke good point, and last I heard we did not have a way to create a dump on Mac (that could be debugged with managed debuggers). @mikem8361 any update on that? Is our only option to catch this under an attached debugger?

(It's quite possible this isn't Mac specific actually but I'm interested anyway)

@mikem8361
Copy link
Member

mikem8361 commented Jul 2, 2019 via email

@danmoseley
Copy link
Member

Since this is clearly mac specific we can confidently say it has only happened 1 time in 10 days.

@danmoseley
Copy link
Member

No more hits in the last week. Query:

WorkItems 
| where Started > now(-10d)  
| where FriendlyName == "System.IO.FileSystem.Watcher.Tests"
| where ExitCode == 134
| join kind= inner (
   Jobs  | where Started > now(-10d) | project  QueueName , JobId, Build, Type, Source 
) on JobId
| summarize count() by ExitCode, QueueName
| order by count_ desc
       
  ExitCode QueueName count_
  134 Ubuntu.1604.Arm32.Open 3

@carlossanlop
Copy link
Member

Seems to be infrastructure related. Moving to 5.0.

@jkotas
Copy link
Member

jkotas commented Nov 9, 2019

Failed in dotnet/corefx#42512:

  Starting:    System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
   at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/System.Private.CoreLib/shared/System/Runtime/InteropServices/SafeHandle.cs:line 156
   at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1744
   at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 215
   at System.IO.FileSystemWatcher.RunningInstance.CancellationCallback() in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 276
   at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 370
   at Interop.RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(Object args) in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 249
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state) in /_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs:line 50
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs:line 201
Fatal error. Internal CLR error. (0x80131506)
   at Interop+RunLoop.CFRunLoopRun()
   at Interop+RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher+RunningInstance+StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(System.Object)
   at System.Threading.ThreadHelper.ThreadStart_Context(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Threading.ThreadHelper.ThreadStart(System.Object)

@danmoseley
Copy link
Member

@wfurt

@wfurt
Copy link
Member

wfurt commented Nov 10, 2019

I will take a look. We still have few flaky watcher tests on OSX and my plan was to run whole set in loop and chase them down.

@wfurt wfurt self-assigned this Nov 10, 2019
@ericstj
Copy link
Member

ericstj commented Nov 20, 2019

Failed again in #136. Same stack. Here's a dump: https://helix.dot.net/api/2019-06-17/jobs/b40ed2a0-5a9f-4d81-9bb5-62900426af75/workitems/System.IO.FileSystem.Watcher.Tests/files/core.18074 (it's 3.4 GB! get it while its hot / still available)

@wfurt
Copy link
Member

wfurt commented Nov 20, 2019

The core file is incomplete or corrupted:

(lldb) target create --core "/Users/furt/Downloads/core.18074"
warning: (x86_64) /Users/furt/Downloads/core.18074 load command 3055 LC_SEGMENT_64 has a fileoff + filesize (0xda14f000) that extends beyond the end of the file (0xda14e000), the segment will be truncated to match
warning: (x86_64) /Users/furt/Downloads/core.18074 load command 3056 LC_SEGMENT_64 has a fileoff (0xda14f000) that extends beyond the end of the file (0xda14e000), ignoring this section
Core file '/Users/furt/Downloads/core.18074' (x86_64) was loaded.
(lldb) plugin load /Users/furt/github/diagnostics/artifacts/bin/OSX.x64.Debug/libsosplugin.dylib
(lldb) clrstack
Failed to load data access module, 0x80004005
Can not load or initialize libmscordaccore.dylib. The target runtime may not be initialized.
ClrStack  failed

We should never limit core size as that makes them unusable. Also it is full of zeros and compressed size is only ~ 400M. We should not dump them raw to storage if we care about size. Should we open new issue to make core dumps on OSX usable @ericstj ? And if so I'm not what is correct repo and team to work on this.

@wfurt
Copy link
Member

wfurt commented Nov 20, 2019

cc: @epananth for core file part.

@ericstj
Copy link
Member

ericstj commented Nov 20, 2019

I'm guessing https://github.com/dotnet/arcade and tag @mjanecke @alexperovich @ChadNedzlek

@mikem8361
Copy link
Member

mikem8361 commented Nov 20, 2019 via email

@xieofxie
Copy link

xieofxie commented Dec 5, 2019

@jkotas, hi, I met the same issue as you posted on macOS (on Windows it works fine..). Is it being fixed? Thanks~

@wfurt
Copy link
Member

wfurt commented Dec 5, 2019

no. If you have repro, please share it @xieofxie .

@xieofxie
Copy link

xieofxie commented Dec 6, 2019

@wfurt It is generated by a validation in azure devops for the pr above.. I don't know how to reproduce it steadily..

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 22, 2021
@ViktorHofer
Copy link
Member

Unsure if this is the same issue. I see a NullReferenceException in FileSystemWatcher.RunningInstance.FileSystemEventCallback:

Unhandled Exception:
System.NullReferenceException: Object reference not set to an instance of an object
   at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 354
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(ManualResetEventSlim runLoopStarted, SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 255
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.<>c.<ScheduleEventStream>b__3_0(Object args) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 205
   at System.Threading.Thread.StartCallback() in /_/src/mono/System.Private.CoreLib/src/System/Threading/Thread.Mono.cs:line 236

core.28810:
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-51891-merge-4df95a776f8845c08e/Microsoft.Extensions.Hosting.Unit.Tests/core.28810?sv=2019-07-07&se=2021-05-16T21%3A33%3A56Z&sr=c&sp=rl&sig=H2ZL3fnJrWeUiUgDH0X%2F%2Fph9du9TWqcIHWXCt0ta%2B2c%3D

console.fe66311c.log:
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-51891-merge-4df95a776f8845c08e/Microsoft.Extensions.Hosting.Unit.Tests/console.fe66311c.log?sv=2019-07-07&se=2021-05-16T21%3A33%3A56Z&sr=c&sp=rl&sig=H2ZL3fnJrWeUiUgDH0X%2F%2Fph9du9TWqcIHWXCt0ta%2B2c%3D

Configuration: net6.0-OSX-Debug-x64-Mono_release-OSX.1015.Amd64.Open

@danmoseley
Copy link
Member

@ViktorHofer not sure that's the same thing, but I don't know why it could be either -- there is no obvious way that code can ever NRE:

if (!_weakWatcher.TryGetTarget(out FileSystemWatcher? watcher))

I see it segfaulted though -- is that what Mono test runs do when there's an unhandled exceptoin that brings down the process (like in this case) or a separate segfault?

@danmoseley
Copy link
Member

Oh, I see @jkotas you believe this is a dupe?

@jkotas
Copy link
Member

jkotas commented Apr 27, 2021

I think so.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 29, 2021
@eerhardt
Copy link
Member

eerhardt commented May 4, 2021

@jkotas - I see the FileWatcher tests failing on OSX in this PR:

https://github.com/dotnet/runtime/pull/52252/checks?check_run_id=2502547808

runtime (Libraries Test Run release coreclr OSX x64 Debug)
.packages/microsoft.dotnet.helix.sdk/6.0.0-beta.21222.1/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets#L78
.packages/microsoft.dotnet.helix.sdk/6.0.0-beta.21222.1/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(78,5): error : (NETCORE_ENGINEERING_TELEMETRY=Test) Work item System.IO.FileSystem.Watcher.Tests in job d239ae94-91a2-4a9c-b9e1-2191863d7385 has failed.
Failure log: https://helix.dot.net/api/2019-06-17/jobs/d239ae94-91a2-4a9c-b9e1-2191863d7385/workitems/System.IO.FileSystem.Watcher.Tests/console

/private/tmp/helix/working/AB6B08F2/w/A87108BA/e /private/tmp/helix/working/AB6B08F2/w/A87108BA/e
  Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.IO.FileSystem.Watcher.Tests (found 149 of 183 test cases)
  Starting:    System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Process terminated. Assertion failed.
   at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 386
   at Interop.RunLoop.CFRunLoopRun()
   at Interop.RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(ManualResetEventSlim runLoopStarted, SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 254
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.<>c.<ScheduleEventStream>b__3_0(Object args) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 204
   at System.Threading.Thread.StartCallback() in /_/src/coreclr/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs:line 105
./RunTests.sh: line 162:  4690 Abort trap: 6           (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.FileSystem.Watcher.Tests.runtimeconfig.json --depsfile System.IO.FileSystem.Watcher.Tests.deps.json xunit.console.dll System.IO.FileSystem.Watcher.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/private/tmp/helix/working/AB6B08F2/w/A87108BA/e
----- end Tue May 4 09:55:46 PDT 2021 ----- exit code 134 ----------------------------------------------------------
exit code 134 means SIGABRT Abort. Managed or native assert, or runtime check such as heap corruption, caused call to abort(). Core dumped.
ulimit -c value: unlimited

Is that this same issue? I hope we got a dump of the process so we can figure the issue out.

@jkotas
Copy link
Member

jkotas commented May 4, 2021

Yes, it is the same issue.

This assert says that the file watcher callback can be called even after the EventStream was stopped using FSEventStreamStop. So we have to be robust against that situation and use a different scheme to perform the cleanup.

@jkotas jkotas reopened this May 4, 2021
@eerhardt
Copy link
Member

eerhardt commented May 4, 2021

@jaredpar - for my education, how do you find that link?

@jaredpar
Copy link
Member

jaredpar commented May 4, 2021

Clicked on the checks tab and went to the AzDO build

https://dev.azure.com/dnceng/public/_build/results?buildId=1120772&view=logs&jobId=4724cd84-0a26-56bc-9b1d-ca1983f340b0

They key part there is the build id is 1120772. From there just ask runfo to dump the data

https://runfo.azurewebsites.net/view/build/?number=1120772

It will list all of the failed tests and for the FileWatcherTests there was a heap dump noted and linked to

image

The link I posted is that one.

@jaredpar
Copy link
Member

jaredpar commented May 4, 2021

Truthfully I was a bit lazier than that. I didn't click on the checks tab. Really I just went here

https://runfo.azurewebsites.net/view/pr-builds/?Repository=runtime&Number=52252

Basically that dumps all of the builds that happened for the PR #52252. I happened to know that test ran on the runtime definition so I click here

image

That is the link I posted above .

https://runfo.azurewebsites.net/view/build/?number=1120772

@danmoseley
Copy link
Member

Even easier than that, in the dotnet/runtime repo we generate a file containing commands that you can copy/paste to get from zero to the debugger:

image

Try it out :)

@jaredpar
Copy link
Member

jaredpar commented May 4, 2021

Haha, can't debug though cause Windows does not like the core dump here

image

@jaredpar
Copy link
Member

jaredpar commented May 4, 2021

Even easier than that, in the dotnet/runtime repo we generate a file containing commands that you can copy/paste to get from zero to the debugger:

From PR to that display is seven clicks, runfo is two. And one of those clicks is a spinning wheel click waiting for that display to render :)

Joking aside I'm glad that display is getting better but it does take a lot of clicks to get to. It also takes time to render and is not 100% reliable. Found several times over the years where it won't display anything and errors out.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 4, 2021
@jkotas
Copy link
Member

jkotas commented May 4, 2021

Another attempt to fix this #52275

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 5, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jun 4, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.