Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a race condition in the thread pool #68171

Merged
merged 3 commits into from
Apr 19, 2022
Merged

Conversation

kouvel
Copy link
Member

@kouvel kouvel commented Apr 18, 2022

There is a case where on a work-stealing queue, both LocalPop() and TrySteal() fail when running concurrently, and lead to a case where there is a work item but no threads are released to process it. Fixed to always ensure that there's a thread request when there was a missed steal. Also when LocalPop() fails, the thread does not attempt to pop anymore and that can be an issue if that thread is the last thread to look for work items. Fixed to always check the local queue. The issues were introduced by #64834.

Fixes #67545

@kouvel kouvel added this to the 7.0.0 milestone Apr 18, 2022
@kouvel kouvel requested review from janvorli and mangod9 April 18, 2022 19:29
@kouvel kouvel self-assigned this Apr 18, 2022
@ghost
Copy link

ghost commented Apr 18, 2022

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

There is a case where on a work-stealing queue, both LocalPop() and TrySteal() may fail when running concurrently, and lead to a case where there is a work item but no threads are released to process it. Fixed to always ensure that there's a thread request when there was a missed steal. Also when LocalPop() fails, the thread does not attempt to pop anymore and that can be an issue if that thread is the last thread to look for work items. Fixed to always check the local queue. The issues were introduced by #64834.

Fixes #67545

Author: kouvel
Assignees: kouvel
Labels:

area-System.Threading

Milestone: 7.0.0

@danmoseley
Copy link
Member

Is a unit test feasible here? If nothing else, something derived from the perf test, even if it only would catch this bug sporadically.

@adamsitnik
Copy link
Member

@kouvel @mangod9 is there any chance we could include this fix in Preview4 (snaps today at 2 PM?)? This would allow to cover it in Perf Run for Preview4

Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

There is a case where on a work-stealing queue, both `LocalPop()` and `TrySteal()` may fail when running concurrently, and lead to a case where there is a work item but no threads are released to process it. Fixed to always ensure that there's a thread request when there was a missed steal. Also when `LocalPop()` fails, the thread does not attempt to pop anymore and that can be an issue if that thread is the last thread to look for work items. Fixed to always check the local queue.

Fixes dotnet#67545
@kouvel
Copy link
Member Author

kouvel commented Apr 19, 2022

is there any chance we could include this fix in Preview4 (snaps today at 2 PM?)? This would allow to cover it in Perf Run for Preview4

The test appears to occasionally be timing out on Linux_musl x64. I'd like to get this fix into preview 4 if possible. I think I'll disable it for this PR with an active issue and investigate that separately, if it's a product issue it would likely be a different one.

@danmoseley
Copy link
Member

You can add PlatformDetection.IsNotAlpine and [ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.IsNotAlpine))] of course.

@kouvel
Copy link
Member Author

kouvel commented Apr 19, 2022

It didn't fail in the first set of runs and failed in the second set. I'm not sure where else it may fail occasionally.

@kouvel
Copy link
Member Author

kouvel commented Apr 19, 2022

It might just be a few too many iterations on some machines, especially with a checked runtime. I'll try reducing the iterations.

@kouvel
Copy link
Member Author

kouvel commented Apr 19, 2022

Combined with the previous CI runs and the latest harmless commit I'll go ahead and merge to make it into preview 4

@kouvel kouvel merged commit 765ecde into dotnet:main Apr 19, 2022
@kouvel kouvel deleted the DeadlockFix branch April 19, 2022 20:58

static string CreateFileWithRandomContent(int fileSize)
{
string filePath = Path.GetTempFileName();
Copy link
Member

@danmoseley danmoseley Apr 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside, better to use the TempFile class, which allows the using pattern instead of writing out try/finally. It also avoids the problematic Path.GetTempFileName() which fails if someone else has leaked temp files.
It has a ctor that takes an array, too.

public TempFile(string path, byte[] data)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ok, I'll fix it up in a follow-up PR

adamsitnik pushed a commit to dotnet/performance that referenced this pull request Apr 20, 2022
directhex pushed a commit to directhex/runtime that referenced this pull request Apr 21, 2022
* Fix a race condition in the thread pool

There is a case where on a work-stealing queue, both `LocalPop()` and `TrySteal()` may fail when running concurrently, and lead to a case where there is a work item but no threads are released to process it. Fixed to always ensure that there's a thread request when there was a missed steal. Also when `LocalPop()` fails, the thread does not attempt to pop anymore and that can be an issue if that thread is the last thread to look for work items. Fixed to always check the local queue.

Fixes dotnet#67545
kouvel added a commit to kouvel/runtime that referenced this pull request Apr 26, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 20, 2022
@AndyAyersMS
Copy link
Member

@kouvel there seem to be some persistent regressions in ping-pong tests that may be a result of this change: dotnet/perf-autofiling-issues#4817

Is this expected?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[arm64] Perf_FileStream.FlushAsync benchmark hangs on Debian 11
6 participants