[release/7.0] Fix pthread_cond_wait race on macOS #82893
The native runtime event implementations for NativeAOT and the GC use pthread_cond_wait to wait for an event and pthread_cond_broadcast to signal that the event was set. Although this usage of pthread_cond_broadcast conforms to the documentation, glibc before 2.25 had a race in its implementation that could cause a pthread_cond_broadcast to go unnoticed, leaving the waiter blocked forever. The macOS implementation turns out to have the same issue. The fix is to call pthread_cond_broadcast while the related mutex is held. This change fixes the intermittent hangs of the NativeAOT build of crossgen2 reported in #81570. I was able to reproduce the hang locally within tens of thousands of iterations of running crossgen2 without any arguments (the hang occurs when server GC creates threads). With this fix, it ran without problems over the weekend, passing 5.5 million iterations.
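To make the ordering concrete, here is a minimal sketch of a manual-reset event built on a pthread mutex and condition variable, with the broadcast issued while the mutex is still held. This is not the runtime's actual code: the `sketch_event` type and every `sketch_*` function name are hypothetical, and the sketch assumes the pre-fix code released the mutex before calling pthread_cond_broadcast.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical manual-reset event; all names here are illustrative only. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            is_set;
} sketch_event;

static void sketch_event_init(sketch_event *e) {
    int rc = pthread_mutex_init(&e->mutex, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&e->cond, NULL);
    assert(rc == 0);
    (void)rc; /* silence unused warning when NDEBUG is defined */
    e->is_set = false;
}

static void sketch_event_destroy(sketch_event *e) {
    pthread_cond_destroy(&e->cond);
    pthread_mutex_destroy(&e->mutex);
}

/* Block until the event is set. The loop guards against spurious wake-ups. */
static void sketch_event_wait(sketch_event *e) {
    pthread_mutex_lock(&e->mutex);
    while (!e->is_set) {
        pthread_cond_wait(&e->cond, &e->mutex);
    }
    pthread_mutex_unlock(&e->mutex);
}

/* Set the event and wake all waiters. */
static void sketch_event_set(sketch_event *e) {
    pthread_mutex_lock(&e->mutex);
    e->is_set = true;
    /* Broadcast while the mutex is still held. The assumed pre-fix order
       was unlock first, then broadcast, which leaves the window in which
       the wake-up could be lost on the affected implementations. */
    pthread_cond_broadcast(&e->cond);
    pthread_mutex_unlock(&e->mutex);
}

static void *sketch_waiter(void *arg) {
    sketch_event_wait((sketch_event *)arg);
    return NULL;
}

/* Run n independent set/wait handoffs; return how many completed. */
static int sketch_run_handoffs(int n) {
    int completed = 0;
    for (int i = 0; i < n; i++) {
        sketch_event e;
        pthread_t waiter;
        sketch_event_init(&e);
        if (pthread_create(&waiter, NULL, sketch_waiter, &e) != 0) {
            sketch_event_destroy(&e);
            break;
        }
        sketch_event_set(&e);
        pthread_join(waiter, NULL); /* would hang here on a lost wake-up */
        sketch_event_destroy(&e);
        completed++;
    }
    return completed;
}
```

Calling `sketch_run_handoffs` with a large count is a rough in-process analogue of the stress loop described above, although the original hang was only observed by launching crossgen2 repeatedly, so a loop like this is not guaranteed to reproduce it.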
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
Issue Details
Backport of #82709 to release/7.0
/cc @janvorli
Customer Impact
Testing
Risk
IMPORTANT: Is this backport for a servicing release? If so and this change touches code that ships in a NuGet package, please make certain that you have added any necessary package authoring and gotten it explicitly reviewed.
approved. we will take for consideration in 7.0.x
Approved by Tactics.
Backport of #82709 to release/7.0
/cc @janvorli
Customer Impact
Applications compiled with NativeAOT can hang intermittently at startup on macOS. This was occurring with our own crossgen2 in the CI.
The problem is caused by the implementation of pthread_cond_broadcast not adhering to the documentation in a race condition case. There is a tiny window of opportunity within which the related pthread_cond_wait isn't woken by the pthread_cond_broadcast when the latter is not invoked with the related mutex taken.
Testing
Stress testing by repeatedly running crossgen2 compiled with NativeAOT on macOS without any arguments. Without the fix, it hung within tens or hundreds of thousands of iterations. With the fix, it ran without problems for 5.5 million iterations.
Risk
Very low. The change just moves pthread_cond_broadcast inside the mutex, and the documentation for that function says it should not matter whether it is called with the mutex held or not.