
mitigation for quic tests hangs #55985

Merged
merged 1 commit into from
Jul 20, 2021

Conversation

wfurt
Member

@wfurt wfurt commented Jul 20, 2021

It seems the hang occurs when we run the synchronous conformance tests.
We receive the notification and reset the completion source, but the WriteAsync task never wakes up and finishes.

> threadpool -wi
logStart: 191
logSize: 200
CPU utilization: 0 %
Worker Thread: Total: 2 Running: 2 Idle: 0 MaxLimit: 32767 MinLimit: 2
Work Request in Queue: 0

Queued work items:
Queue Address Work Item
[Global] 00007f4468a32b20 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[System.Net.Quic.Implementations.MsQuic.MsQuicStream+<WriteAsync>d__21, System.Net.Quic]]
[Global] 00007f4468a352e8 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[System.Net.Quic.Implementations.MsQuic.MsQuicStream+<WriteAsync>d__21, System.Net.Quic]]
[Global] 00007f44688ab968 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[System.Net.Quic.Tests.QuicStreamTests`1+<>c__DisplayClass14_0+<<ReadWrite_Random_Success>b__1>d[[System.Net.Quic.Tests.MsQuicProviderFactory, System.Net.Quic.Functional.Tests]], System.Net.Quic.Functional.Tests]]
[Global] 00007f44689e8d68 System.Threading.QueueUserWorkItemCallbackDefaultContext`1[[System.Object, System.Private.CoreLib]] => System.Threading.Tasks.ValueTask`1+ValueTaskSourceAsTask+<>c[[System.Int32, System.Private.CoreLib]].<.cctor>b__4_0(System.Object)
Statistics:
MT Count TotalSize Class Name
00007f4497eb4a00 1 32 System.Threading.QueueUserWorkItemCallbackDefaultContext`1[[System.Object, System.Private.CoreLib]]
00007f4497f4c580 1 96 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[System.Net.Quic.Tests.QuicStreamTests`1+<>c__DisplayClass14_0+<<ReadWrite_Random_Success>b__1>d[[System.Net.Quic.Tests.MsQuicProviderFactory, System.Net.Quic.Functional.Tests]], System.Net.Quic.Functional.Tests]]
00007f4497b02620 2 192 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[System.Net.Quic.Implementations.MsQuic.MsQuicStream+<WriteAsync>d__21, System.Net.Quic]]
Total 4 objects

Big thanks to @stephentoub, who pointed to #53471 and suggested this mitigation.
This is not the final fix, but I did a few hundred runs without a problem, where before it would generally hang within 10 runs on my dev box. This should at least improve the PR experience and the CI failure rate.
I will follow up with details to help find a better fix, e.g. in the thread pool.

contributes to #55642
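The failing shape can be reduced to a small sync-over-async sketch. This is illustrative only — the names and structure below are assumptions, not the actual conformance-test code: a synchronous caller blocks on the WriteAsync task, and that task can only complete once the thread pool dispatches the queued continuation, i.e. exactly the work items visible in the dump above.

```csharp
using System;
using System.Threading.Tasks;

// Illustrative sketch of the sync-over-async pattern that hung; an
// assumption about the shape of the code, not the actual test code.
class SyncOverAsyncSketch
{
    // Stands in for MsQuicStream.WriteAsync: the method resumes on a
    // thread-pool thread via a queued continuation.
    static async Task WriteAsync(byte[] buffer)
    {
        await Task.Yield(); // continuation is queued to the thread pool
    }

    static void Main()
    {
        Task pending = WriteAsync(new byte[16]);

        // Task.Wait() parks this thread until a pool thread runs the
        // continuation. If the pool stops dispatching queued work items
        // (the state captured by `threadpool -wi` above), this call
        // never returns and the test hangs.
        pending.Wait();
        Console.WriteLine("write completed");
    }
}
```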

@wfurt wfurt requested review from stephentoub, kouvel and a team July 20, 2021 05:41
@wfurt wfurt self-assigned this Jul 20, 2021
@ghost

ghost commented Jul 20, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: wfurt
Assignees: wfurt
Labels:

area-System.Net.Quic

Milestone: -

@geoffkizer
Contributor

Can you fill me in on why this is an improvement? And can we add a comment with that info too?

Is this what we should be doing in general when we have to do a sync wait on an async Task? There are other places we do this.

@wfurt
Member Author

wfurt commented Jul 20, 2021

I'm putting details on #55642 as a follow-up. The old code should work just fine, but it does not. The improvement is that the tests no longer hang.

@stephentoub
Member

Can you fill me in on why this is an improvement? And can we add a comment with that info too?
Is this what we should be doing in general when we have to do a sync wait on an async Task? There are other places we do this.

As I told @wfurt, we shouldn't ship this way, but it's a fine temporary stopgap to unblock tests if it helps. This was an experiment I suggested @wfurt try based on what I saw in the dump: the thread pool had a few tasks in its global queue waiting to be run, there were very few threads in the system, nothing was actively executing, and yet those queued work items were not being processed. I suspected it might have to do with @kouvel's recent changes to (ironically) more aggressively introduce threads to better handle sync-over-async, which hook synchronous waits on Task. So to see if that might be the cause, I suggested changing the code to still block, but via a WaitHandle rather than Task.Wait, and that apparently made the issue go away.
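The mitigation can be sketched roughly as follows. This is a minimal illustration of the WaitHandle-based blocking described above, not the exact diff from this PR: Task implements IAsyncResult, and its AsyncWaitHandle is signaled when the task completes, so waiting on it does not go through the Task.Wait path that the thread pool's blocking-wait heuristics hook into.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class WaitHandleWaitSketch
{
    // Block on a task via its IAsyncResult wait handle instead of
    // Task.Wait(). The handle is signaled as the task completes, and
    // waiting on it bypasses the Task.Wait code path that the thread
    // pool's sync-over-async heuristics hook into.
    static void WaitWithHandle(Task task)
    {
        ((IAsyncResult)task).AsyncWaitHandle.WaitOne();
        task.GetAwaiter().GetResult(); // observe/propagate any exception
    }

    static void Main()
    {
        Task write = Task.Delay(100); // stands in for the WriteAsync task
        WaitWithHandle(write);
        Console.WriteLine(write.IsCompletedSuccessfully); // prints True
    }
}
```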

@karelz karelz requested a review from geoffkizer July 20, 2021 16:47
@geoffkizer
Contributor

Ok.

@wfurt Can you add a comment in the code with a link to the original issue, something like // ACTIVE ISSUE: http://github.com....

I want to make sure we don't lose track of this.

@wfurt wfurt merged commit 0a5e93b into dotnet:main Jul 20, 2021
@wfurt wfurt deleted the syncQuic_55642 branch July 20, 2021 17:07
@ManickaP
Member

Should we close #55642?
If not, what's missing to get it closed? We should probably update the description at least.

@wfurt
Member Author

wfurt commented Jul 20, 2021

I would keep it open to 1) verify the stability and 2) have a proper fix

@ManickaP
Member

  1. have a proper fix
  • what would that be? I thought the problem is in the thread pool, not in our code...

@wfurt
Member Author

wfurt commented Jul 20, 2021

  1. have a proper fix
  • what would that be? I thought the problem is in the thread pool, not in our code...

Likely, yes; and we should revert this change before we ship.

@ManickaP
Member

I think this discussion should be moved to the original issue then 😄

@karelz karelz added this to the 6.0.0 milestone Jul 22, 2021
kouvel added a commit to kouvel/runtime that referenced this pull request Jul 27, 2021
- In dotnet#53471 the thread count goal was moved out of `ThreadCounts`; it turns out there are a few subtle races that it was avoiding. There are other ways to fix it, but I've added the goal back into `ThreadCounts` for now.
- Reverted PR dotnet#55985, which worked around the issue in the CI

Fixes dotnet#55642
kouvel added a commit to kouvel/runtime that referenced this pull request Jul 27, 2021
@kouvel kouvel mentioned this pull request Jul 27, 2021
kouvel added a commit that referenced this pull request Jul 27, 2021
* Fix thread pool hang

- In #53471 the thread count goal was moved out of `ThreadCounts`; it turns out there are a few subtle races that it was avoiding. There are other ways to fix it, but I've added the goal back into `ThreadCounts` for now.
- Reverted PR #55985, which worked around the issue in the CI

Fixes #55642

* Revert "mitigation for quic tests hangs (#55985)"

This reverts commit 0a5e93b.
@ghost ghost locked as resolved and limited conversation to collaborators Aug 21, 2021