DATAS BGC thread synchronization fix #109804
Tagging subscribers to this area: @dotnet/gc
lgtm. Thanks for fixing.
The last commit I pushed will be cleaned up. I left some code there to help with stress testing.
LGTM! Extensively stress tested with the latest commit and found no issue.
/backport to release/9.0-staging
Started backporting to release/9.0-staging: https://github.com/dotnet/runtime/actions/runs/12020222869
@Maoni0 backporting to release/9.0-staging failed, the patch most likely resulted in conflicts:
$ git am --3way --empty=keep --ignore-whitespace --keep-non-patch changes.patch
Applying: fix
Applying: with stress to help with repro
error: sha1 information is lacking or useless (src/coreclr/gc/gc.cpp).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0002 with stress to help with repro
Error: The process '/usr/bin/git' failed with exit code 128
Please backport manually!
@Maoni0 an error occurred while backporting to release/9.0-staging, please check the run log for details! Error: git am failed, most likely due to a merge conflict.
Backport of #109804 to release/9.0-staging

/cc @mangod9 @mrsharm

Customer Impact
Customer reported / Found internally
Original issue: #109804. Customers discovered a hang because one of the BGC threads had disappeared, which shouldn't have happened. The cause is a BGC thread that was not needed during a previous BGC (because DATAS only needed a subset of the BGC threads for that BGC to run) and that then ran at a time when settings.concurrent was FALSE, so it exited.

Regression
Yes / No

Testing
I have added stress code in the GC (under STRESS_DYNAMIC_HEAP_COUNT) to make the repro quite easy.

Risk
Low. This has gone through extensive testing on both Windows and Linux.
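To make the hang described under Customer Impact concrete, here is a minimal single-threaded toy model of the failure mode. It is only a sketch under stated assumptions: `ToyBgcThread`, `on_wake`, `join_can_complete`, and `late_wake_scenario` are hypothetical names invented for illustration and are not the runtime's types or functions. The model assumes a worker that wakes while no concurrent GC is in progress decides it is not needed and exits, and that a later join waits for every expected participant.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: hypothetical names, not the types used in gc.cpp.
struct ToyBgcThread {
    bool alive = true;
};

// What a background worker does on wake-up in this model: if no concurrent
// GC is in progress, it concludes it is not needed and terminates itself.
inline void on_wake(ToyBgcThread& t, bool concurrent) {
    if (!concurrent) {
        t.alive = false;
    }
}

// A join completes only if every expected participant is still alive.
inline bool join_can_complete(const std::vector<ToyBgcThread>& threads,
                              std::size_t expected) {
    std::size_t arrived = 0;
    for (std::size_t i = 0; i < expected && i < threads.size(); ++i) {
        if (threads[i].alive) {
            ++arrived;
        }
    }
    return arrived == expected;
}

// Scenario: four workers exist, the previous BGC needed only a subset of
// them, and worker 3 wakes up late. The next BGC expects all four workers.
inline bool late_wake_scenario(bool concurrent_at_wake) {
    std::vector<ToyBgcThread> threads(4);
    on_wake(threads[3], concurrent_at_wake);
    return join_can_complete(threads, 4);
}
```

In this model, `late_wake_scenario(false)` returns false: the late worker exits, so the next join can never gather all four participants, which corresponds to the reported hang.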
This problem with BGC thread synchronization is similar to the SVR GC one I fixed, but it has its own complications because BGC threads are only created on demand. This was work I wanted to do in .NET 9 but didn't get time to.

It can cause the following symptoms:
+ deadlock, because a BGC thread is erroneously terminated, so the next BGC join cannot finish
+ AV, because a BGC thread could be seeing settings.concurrent as true while actually running blocking GC code!

I moved the code that sets the idle event to be inside a GC instead of outside, because we don't necessarily trigger a BGC in between HC (heap count) changes. If we don't, we'd need to compensate for the idle count that we deducted, and this is difficult to track. Moving this to the place where we have already decided to do a BGC makes it much simpler.

I am setting all the required idle events sequentially instead of per heap (there isn't a convenient way to set them per heap without introducing yet more sync cost). This does incur some perf cost, but the cost is small and we only pay it when all of the following conditions are true:
1) we actually changed HC;
2) we did do a BGC that observed this HC change;
3) we need to wake up threads that participated in the last BGC.

I also added some stress code to help with reproing the problem and stressing the new code; otherwise it's very difficult to repro this kind of problem.
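The gating described above (pay the sequential wake-up cost only when all three conditions hold) can be sketched as follows. This is a toy illustration, not the code in gc.cpp: `ToyEvent`, `ToyBgcStart`, and `set_idle_events` are hypothetical names, the event is modeled as a plain flag, and how the real change decides which threads need waking is not reproduced here, so the caller simply passes a count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an idle event; modeled as a plain flag here.
struct ToyEvent {
    bool signaled = false;
    void Set() { signaled = true; }
};

// Inputs assumed to be known at the point where a BGC has been decided on.
struct ToyBgcStart {
    bool heap_count_changed;   // condition 1: HC actually changed
    bool bgc_observes_change;  // condition 2: this BGC observes the change
    int threads_to_wake;       // condition 3: > 0 if idle threads must wake
};

// Sets the required idle events one after another from a single caller,
// rather than having each heap set its own. Returns how many were set.
inline int set_idle_events(const ToyBgcStart& s,
                           std::vector<ToyEvent>& idle_events) {
    if (!s.heap_count_changed || !s.bgc_observes_change ||
        s.threads_to_wake <= 0) {
        return 0;  // common case: nothing to pay
    }
    int set_count = 0;
    for (std::size_t i = 0;
         i < idle_events.size() && set_count < s.threads_to_wake; ++i) {
        if (!idle_events[i].signaled) {
            idle_events[i].Set();
            ++set_count;
        }
    }
    return set_count;
}
```

The point of the sketch is only the shape of the trade-off: the loop is sequential, but it is skipped entirely unless the heap count changed, a BGC observed that change, and there are threads to wake.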