-
-
Notifications
You must be signed in to change notification settings - Fork 418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix two race conditions where an ASIO wakeup notifications can be lost #2561
Fix two race conditions where an ASIO wakeup notifications can be lost #2561
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me. Thanks for cleaning things up and making the code sane again!
@slfritchie says he's running some microbenchmarks to test the performance impact of this change before merging. So, let's put the "don't merge" tag on this for now until his investigation is complete. |
I'm pretty sure there's a race condition with a single active thread, something I've seen 3x now.
|
fc036a0
to
243298d
Compare
These are important runtime bugfixes. The performance observations that I've been able to make over the last week are:
The
|
Those perf numbers look good to me and I wouldn't oppose merging on the basis of them. |
If y'all are confident, feel free to remove the "DO NOT MERGE" label and merge. |
@slfritchie Thanks for putting in all the effort to get this working and the code cleaned up. I appreciate all your help in helping fix the swiss cheese that has been this dynamic scheduler scaling in relation to scheduler thread 0 suspending. The performance difference is acceptable to me given that the logic changes are necessary for improving stability of scheduler thread 0 being able to suspend which I believe is overall a useful feature to have. |
…lang#2561) Prior to this commit, there were two race conditions where an ASIO event could be received and not result in a wake up notification to scheduler 0. One was because scheduler 0 had not actually suspended yet so the ASIO thread didn't send a notification at all. The other was because scheduler thread 0 had not released the mutex so the ASIO thread wasn't able to send a notification because it wasn't able to acquire the mutex and didn't retry. Both would result in hangs due to the lost wake ups. This commit fixes things by ensuring that even after scheduler 0 decides to suspend, it does one final check of the inject queue to see if any new work has been queued by the ASIO thread. If yes, it doesn't suspend to prevent lost wake ups and the associated hangs. It also changes the logic so that the ASIO thread will retry waking up scheduler thread 0 if it is unable to acquire the mutex until it is successful. Additional refactorings include: * make pop_global() more flexible to adapt to 3 callers in steal() * refactor 2 pieces of identical code into new suspend_scheduler() function * refactor 2 pieces of identical code into new maybe_suspend_scheduler() function Notable runtime behavior change: * Change default value of ponyminthreads from 0 -> 1 There remains at least one bug in thread wakeup when only scheduler 0 is awake. To circumvent this bug(s), Dipin suggested changing the default value of ponyminthreads to 1. The command line option `--ponyminthreads=0` may be used, but it must be considered experimental and may result in misbehavior by the scheduler. Squashed from commit 6bd6e05 backward to dipinhora/fix_dynamix_scheduling_race_condition commit 01ab145
6bd6e05
to
5a11aff
Compare
@dipinhora Would you mind taking a look at the new squashed & rebased commit message? If it's alright and if Appveyor is happy, then please merge. |
Dipin's last review found a path where DTrace |
ponylang#2561) Fix race conditions with scheduler 0 suspending and ASIO events (ponylang#2561) Prior to this commit, there were two race conditions where an ASIO event could be received and not result in a wake up notification to scheduler 0. One was because scheduler 0 had not actually suspended yet so the ASIO thread didn't send a notification at all. The other was because scheduler thread 0 had not released the mutex so the ASIO thread wasn't able to send a notification because it wasn't able to acquire the mutex and didn't retry. Both would result in hangs due to the lost wake ups. This commit fixes things by ensuring that even after scheduler 0 decides to suspend, it does one final check of the inject queue to see if any new work has been queued by the ASIO thread. If yes, it doesn't suspend to prevent lost wake ups and the associated hangs. It also changes the logic so that the ASIO thread will retry waking up scheduler thread 0 if it is unable to acquire the mutex until it is successful. Additional refactorings include: * make pop_global() more flexible to adapt to 3 callers in steal() * refactor 2 pieces of identical code into new suspend_scheduler() function * refactor 2 pieces of identical code into new maybe_suspend_scheduler() function Notable runtime behavior change: * Change default value of ponyminthreads from 0 -> 1 There remains at least one bug in thread wakeup when only scheduler 0 is awake. To circumvent this bug(s), Dipin suggested changing the default value of ponyminthreads to 1. The command line option `--ponyminthreads=0` may be used, but it must be considered experimental and may result in misbehavior by the scheduler.
Thanks to @slfritchie's hard work in ponylang#2926 and ponylang#2561 to find and fix the bugs in the dynamic scheduler scaling logic, scheduler thread 0 now also suspends and resumes correctly without issues (in my testing). This commit changes the default for `--ponyminthreads` from `1` to `0`. The last time @slfritchie ran into issues was in March and left a comment with the commands he used at: ponylang#2561 (comment) I ran the commands from @slfritchie's comment (with minor changes to ensure `--ponyminthreads=0`) using Pony 0.26.0 for a few hours without any issues.
Thanks to @slfritchie's hard work in ponylang#2926 and ponylang#2561 to find and fix the bugs in the dynamic scheduler scaling logic, scheduler thread 0 now also suspends and resumes correctly without issues (in my testing). This commit changes the default for `--ponyminthreads` from `1` to `0`. The last time @slfritchie ran into issues was in March and left a comment with the commands he used at: ponylang#2561 (comment) I ran the commands from @slfritchie's comment (with minor changes to ensure `--ponyminthreads=0`) using Pony 0.26.0 for a few hours without any issues.
Thanks to @slfritchie's hard work in #2926 and #2561 to find and fix the bugs in the dynamic scheduler scaling logic, scheduler thread 0 now also suspends and resumes correctly without issues (in my testing). This commit changes the default for `--ponyminthreads` from `1` to `0`. The last time @slfritchie ran into issues was in March and left a comment with the commands he used at: #2561 (comment) I ran the commands from @slfritchie's comment (with minor changes to ensure `--ponyminthreads=0`) using Pony 0.26.0 for a few hours without any issues.
@dipinhora's commit (described immediately below) is the minimum required to fix the two race conditions that I'd found in the Pony 0.21.3 runtime that would cause occasional failures of the Wallaroo integration test suite.
After Dipin's work above, I had suggested and implemented three changes:
steal()
steal()
into a new functionsuspend_scheduler()
steal()
into a new functionmaybe_suspend_scheduler()
Dipin gave an informal "there might be a performance regression" to my 3 review suggestions, which are also included in this branch, informal approval to number 2, and hasn't yet looked at number 3. Reviewers are welcome to critique any/all of them. Taken all together, they make
steal()
much easier to read, IMHO.Regarding performance regression, I've seen on:
examples/message-ubench
with 14 different combinations of pinger actors, messages sent, and Pony runtime threads used. OR ELSE this branch is about 5% faster, which is both nice and also not well understood.examples/message-ubench
on a 36 core AWS c4.8xlarge instance now.