Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for the overlapped operation to be signaled as canceled #326

Merged
merged 2 commits into from
May 31, 2023
Merged

Wait for the overlapped operation to be signaled as canceled #326

merged 2 commits into from
May 31, 2023

Conversation

razvan-pricope
Copy link
Contributor

let the OS set the internal flag to STATUS_CANCELED). If the wil::unique_folder_change_reader_nothrow handle is destroyed before doing that, a random memory corruption or crash might occur.

… OS set the `internal` flag to STATUS_CANCELED). If the wil::unique_folder_change_reader_nothrow handle is destroyed before that, a random memory corruption or crash might occur.
@jonwis
Copy link
Member

jonwis commented May 22, 2023

Neat, thanks - MSDN says

The cancel operation for all pending I/O operations issued by the calling process for the specified file handle was successfully requested. The application must not free or reuse the OVERLAPPED structure associated with the canceled I/O operations until they have completed. The thread can use the GetOverlappedResult function to determine when the I/O operations themselves have been completed.

And passing TRUE there means "wait for callbacks to complete."

Good catch!

One common problem with these callback types is if they are destroyed on the same thread as the callback. So like if the watcher is invoked on a thread, then as part of its operation it initiates teardown, that can cause these "block for completion" parts of destructors to hang. A common workaround is to track the thread ID in the "handle callback" and if the destroying thread is the same as the callback thread, don't block.

I haven't done enough overlapped IO to know if GetOverlappedResult has that problem.

include/wil/filesystem.h Outdated Show resolved Hide resolved
@razvan-pricope
Copy link
Contributor Author

One common problem with these callback types is if they are destroyed on the same thread as the callback. So like if the watcher is invoked on a thread, then as part of its operation it initiates teardown, that can cause these "block for completion" parts of destructors to hang. A common workaround is to track the thread ID in the "handle callback" and if the destroying thread is the same as the callback thread, don't block.

I haven't done enough overlapped IO to know if GetOverlappedResult has that problem.

I don't think that GetOverlappedResult is affected by this issue because you are not creating any threads when calling ReadDirectoryChangesW. You are esentially "stealing" whatever thread you want to wait for the operation completion and are responsible for the lifetime of the thread.
However, WaitForThreadpoolIoCallbacks might deadlock in the above scenario and there is a comment just before calling this function that describes the problem.

@jonwis jonwis merged commit 56b2de5 into microsoft:master May 31, 2023
@DHowett
Copy link
Member

DHowett commented Nov 1, 2023

This might seem wild, but I think we (Windows Terminal) are hitting deadlocks on shutdown after upgrading to a version of WIL that has this change.

The callstack points to...

03 00000055`5c4ff900 00007ffe`906cadc6     KERNELBASE!GetOverlappedResult+0xcf [minkernel\kernelbase\error.c @ 678] 
04 00000055`5c4ff940 00007ffe`906cb3a6     TerminalApp!wil::details::folder_change_reader_state::~folder_change_reader_state+0x8e [C:\__w\1\s\packages\Microsoft.Windows.ImplementationLibrary.1.0.230824.2\include\wil\filesystem.h @ 636] 
05 00000055`5c4ff980 00007ffe`906cab18     TerminalApp!wil::details::folder_change_reader_state::`scalar deleting destructor'+0xe

How are we holding it wrong? I'd love to hold it right :)

@razvan-pricope
Copy link
Contributor Author

razvan-pricope commented Nov 2, 2023

We are seeing the same behavior during our internal tests and I analyzed a memory dump. What I found is that GetLastError() is ERROR_NOT_FOUND. I don't know if GetOverlappedResults (or the WaitForSingleObject inside it) sets the last error value, but peeking through the assembly seems to suggest that it doesn't. So, most probably, CancelIoEx returned FALSE && GetLastError() == ERROR_NOT_FOUND
The documentation for CancelIoEx suggests that this error code can be triggered when there is no pending operation to be canceled.

Based on the above, I reproduced synthetically a scenario when the above hang can take place. There is a race condition where if you receive a monitor notification and you delete the directory before issuing a new call to StartIo, the ReadDirectoryChanges will return ACCESS_DENIED. In this case, no pending asynchronous operation is in progress, so GetOverlappedResult hangs.
If seems that the fix would be to check the return value of CancelIoEx and only in case of success to call GetOverlappedResult.
I will create a pull request with a fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants