Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for pending IO in the portable thread pool's worker threads #82245

Merged
merged 2 commits into from
Mar 1, 2023

Conversation

kouvel
Copy link
Member

@kouvel kouvel commented Feb 16, 2023

  • When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
  • Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Fixes #82207

- When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
- Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Fixes dotnet#82207
@kouvel kouvel added this to the 8.0.0 milestone Feb 16, 2023
@kouvel kouvel requested review from eduardo-vp and mangod9 February 16, 2023 17:54
@kouvel kouvel self-assigned this Feb 16, 2023
@ghost
Copy link

ghost commented Feb 16, 2023

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details
  • When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
  • Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Fixes #82207

Author: kouvel
Assignees: kouvel
Labels:

area-System.Threading

Milestone: 8.0.0

@stephentoub
Copy link
Member

When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.

Just to build confidence, do we know why this happens / can we find out why this happens? Is it documented somewhere?

@kouvel
Copy link
Member Author

kouvel commented Feb 16, 2023

Just to build confidence, do we know why this happens / can we find out why this happens? Is it documented somewhere?

I don't know yet, still looking into it.

@kouvel
Copy link
Member Author

kouvel commented Feb 23, 2023

Looks like the failures are unrelated.

@kouvel
Copy link
Member Author

kouvel commented Mar 1, 2023

Just to build confidence, do we know why this happens / can we find out why this happens? Is it documented somewhere?

There is a brief note in AcceptEx docs:

Note: All I/O initiated by a given thread is canceled when that thread exits. For overlapped sockets, pending asynchronous operations can fail if the thread is closed before the operations complete. See ExitThread for more information.

I'm not sure about the details though, I have a question out to Windows folks.

I'll go ahead and merge this for now, as it seems like something that should be fixed anyway.

@kouvel kouvel merged commit 0a15c3b into dotnet:main Mar 1, 2023
@kouvel kouvel deleted the IoPendingFix branch March 1, 2023 17:53
@davidfowl
Copy link
Member

We're patching this in 7.0 right?

@kouvel
Copy link
Member Author

kouvel commented Mar 2, 2023

We're patching this in 7.0 right?

I have PRs open for 7.0 and 6.0, they're currently under consideration for servicing.

@davidfowl
Copy link
Member

cc @mconnew

@mconnew
Copy link
Member

mconnew commented Mar 6, 2023

This looks like it has a pathological case where multiple IO operations can be pending for long periods of time with no real work to complete. This will cause a tight loop spinning CPU.

The known scenario I'm concerned about is CoreWCF (or WCF client) which both tend to make IO calls on the IO thread pool. This is because after completing a pending receive, they send back an ACK and make the next pending receive call on the same thread that the completion happened on. The NetTcp transport has a default receive timeout of 10 minutes. For each active connection (which can be thousands in the server case, and in a 3-tier app the middle tier can have thousands of WCF clients) we have a pending receive.

We had a customer issue with WCF where their employees would all log in to the client app within a short period of time (around 200,000 new client connections within about 15 minutes) which would have a burst of threads handling the connection handshakes. That case ended up being an lsass bottleneck, but their application would hit a problem with this implementation. In that scenario, you would have 200,000 pending IO connections across 400+ IO threads, so every IO thread has many pending IO operations. Once things settled down, the request rate was very low and the IO thread pool would shrink over time.

The IO thread pool was able to shrink as new work was only getting executed via a call to GetQueuedCompletionStatus. The special OS behavior of threads bound to an IO completion port means a thread won't be preemptively context switched away from if it's not in an alertable state (doing actual work) which means all the extra IO threads calling GetQueuedCompletionStatus wouldn't end up running any new work. With this new implementation, that constraint doesn't exist as the work is dispatched to a non IOCP thread, which means the "extra" threads are still going to pick up "new" work, and that new work will be able to initiate new IO, and keeping the thread around. You could end up with hundreds of threads spinning consuming CPU, with any IO completions being evenly distributed between them. With enough concurrent connections, the threads will be self maintaining and never clean up, spending most of their time consuming CPU.

kouvel added a commit to kouvel/runtime that referenced this pull request Mar 8, 2023
- Port of dotnet#82245
- When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
- Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Port of fix for dotnet#82207
kouvel added a commit to kouvel/runtime that referenced this pull request Mar 8, 2023
- Port of dotnet#82245
- When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
- Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Port of fix for dotnet#82207
carlossanlop pushed a commit that referenced this pull request Mar 9, 2023
…ds (#82248)

* [6.0] Check for pending IO in the portable thread pool's worker threads

- Port of #82245
- When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
- Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Port of fix for #82207

* Refactor Windows-specific code
carlossanlop pushed a commit that referenced this pull request Mar 9, 2023
…ds (#82246)

* [7.0] Check for pending IO in the portable thread pool's worker threads

- Port of #82245
- When Resource Monitor is attached, some async IO operations are bound to the thread that issued it even though the IO handle is bound to an IOCP. If the thread exits, the async IO operation is aborted. This can lead to hangs or unexpected exceptions.
- Added a check that was missing in the portable thread pool implementation to prevent exiting a worker thread when it has pending IO

Port of fix for #82207

* Refactor Windows-specific code
@ghost ghost locked as resolved and limited conversation to collaborators Apr 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Net 7 Kestrel windows service hangs after a period of time
4 participants