Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DATAS BGC thread synchronization fix #109804

Merged
merged 5 commits into from
Nov 25, 2024
Merged

Conversation

Maoni0
Copy link
Member

@Maoni0 Maoni0 commented Nov 13, 2024

this problem with BGC thread synchronization is similar to the SVR GC one I fixed but has its own complications due to the fact BGC threads are only created on demands. this was work I wanted to do in .NET9 but didn't get time to.

it can cause the following symptoms -

  • deadlock because a BGC thread is erroneously terminated so then the next BGC join cannot finish
  • AV because a BGC thread could be seeing settings.concurrent as true and actually running blocking GC code!

I moved setting the idle event code to be inside a GC instead of outside because we don't necessarily trigger a BGC in-between HC changes. and if we don't, we'd need to compensate for the idle count that we deducted and this is difficult to track. moving this to the place where we already decided to do a BGC makes it much simpler. I am setting all the required idle events sequentially instead of per heap (there isn't a convenient way to set it per heap without introducing yet more sync cost) which does incur some perf cost - however this cost is small and we only need to pay for it when the following conditions are true - 1) we actually changed HC; 2) we did do a BGC that observed this HC change; 3) we need to wake up threads that participated in the last BGC.

also added some stress code to help with reproing the problem and stressing the new code.

Copy link
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

src/coreclr/gc/gc.cpp Outdated Show resolved Hide resolved
src/coreclr/gc/gc.cpp Outdated Show resolved Hide resolved
Copy link
Member

@mangod9 mangod9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Thanks for fixing.

@Maoni0
Copy link
Member Author

Maoni0 commented Nov 22, 2024

last commit I pushed will be cleaned up - I left some code there to help with stress.

Copy link
Member

@mrsharm mrsharm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Extensively stress tested with the latest commit and found no issue.

@Maoni0 Maoni0 merged commit 8f68b59 into dotnet:main Nov 25, 2024
90 checks passed
@Maoni0
Copy link
Member Author

Maoni0 commented Nov 25, 2024

/backport to release/9.0-staging

Copy link
Contributor

Started backporting to release/9.0-staging: https://github.com/dotnet/runtime/actions/runs/12020222869

Copy link
Contributor

@Maoni0 backporting to release/9.0-staging failed, the patch most likely resulted in conflicts:

$ git am --3way --empty=keep --ignore-whitespace --keep-non-patch changes.patch

Applying: fix
Applying: with stress to help with repro
error: sha1 information is lacking or useless (src/coreclr/gc/gc.cpp).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0002 with stress to help with repro
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

Copy link
Contributor

@Maoni0 an error occurred while backporting to release/9.0-staging, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.

Maoni0 added a commit that referenced this pull request Nov 27, 2024
Backport of #109804 to release/9.0-staging

/cc @mangod9 @mrsharm

Customer Impact
 Customer reported
 Found internally
Original issue: #109804. Customers discovered a hang because one of the BGC threads had disappeared which shouldn't have happened - this is due to a BGC thread that was not needed during a previous BGC (because DATAS only needed a subset of BGC threads for that BGC to run) ran at a time when settings.concurrent was FALSE so it exited.

Regression
 Yes
 No
Testing
I have added stress code in GC (under STRESS_DYNAMIC_HEAP_COUNT) to make the repro quite easy.

Risk
Low. This has gone through extensive testing on both Windows and Linux.
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this pull request Dec 10, 2024
this problem with BGC thread synchronization is similar to the SVR GC one I fixed but has its own complications due to the fact BGC threads are only created on demands. 

it can cause the following symptoms -

+ deadlock because a BGC thread is erroneously terminated so then the next BGC join cannot finish
+ AV because a BGC thread could be seeing settings.concurrent as true and actually running blocking GC code!

I moved setting the idle event code to be inside a GC instead of outside because we don't necessarily trigger a BGC in-between HC changes. and if we don't, we'd need to compensate for the idle count that we deducted and this is difficult to track. moving this to the place where we already decided to do a BGC makes it much simpler. 

I am setting all the required idle events sequentially instead of per heap (there isn't a convenient way to set it per heap without introducing yet more sync cost) which does incur some perf cost - however this cost is small and we only need to pay for it when all the following conditions are true - 1) we actually changed HC; 2) we did do a BGC that observed this HC change; 3) we need to wake up threads that participated in the last BGC.

also added some stress code to help with reproing the problem and stressing the new code; otherwise it's very difficult to repro this kind of problems.
@github-actions github-actions bot locked and limited conversation to collaborators Dec 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants