Fix pthread condition variable usage for dynamic scheduler scaling #2472
Conversation
Restarting linux travis builds that failed due to llvm download timeout.
Restarting linux travis builds that failed due to llvm download timeout again.
Force-pushed from ad1d46a to b99f954 (Compare)
@dipinhora, I compiled ponyc with release using scheduler_scaling_pthreads:
And I compiled pony-ring using that compiler and then ran it. It failed after about 2hrs:
And below is the gdb session, the weirdest thing is that "p *this_scheduler" is bad:
I've started it again and we'll see how it goes when I wake up:
So it ran 7.5hrs+ with no hangs; here is the log :(
@winksaville i'm confused. "no hangs" and ":(". i would think no hangs means :). am i missing something? are you saying this fixed the problem with pthreads or no?
Sad, because yesterday (see above) I "may" have caught a hang. If @dipinhora confirms it was a hang, then we've got a heisenbug, hence the :(
@winksaville Two things:
I tested stdlib and it doesn't look good. Unless I've made a mistake, which is certainly possible, I don't think this is ready to release. Here's what I did for stdlib: first I built a release version of stdlib using a ponyc release build I built yesterday. I ran it and it failed after about a minute. To double-check, I rebuilt the compiler and stdlib and ran it a second time. It failed after a couple minutes. Here is the branch I'm working with. Here is the gdb output after the first hang:
I removed build/ and here is the output from rebuilding the compiler and stdlib:
Here is the gdb output from the second run of stdlib in a loop where it hung after 2 minutes:
The results look the "same" to me. The first thread, main, is waiting to join. The second thread is "running". And the third thread is waiting on a condition. Also, in both cases I had System Monitor running, and when it hung, one thread was consuming 100% of one CPU, probably thread 2.
@winksaville Oh well. Thanks for confirming it's not ready. I don't think you made any mistakes in your testing. Yes, it's thread 2. It's likely in an optimized/inlined version of the
I just pushed a commit that changes the atomics memory order to be the strictest possible (to try and rule out any multithreading data race related issues). Can you run your test with
Adding the atomics change didn't help; stdlib still hung. Here are the steps I did: merged the latest 2472 branch with Atomics into master plus support-openssl_1.1.0, and pushed it to a branch on my fork. Below is the short log:
Removed build and rebuilt compiler and stdlib
Below is the output from "ponyc --version" and "stdlib --ponyversion" showing stdlib was compiled with the above compiler:
I then ran stdlib in a loop; here is the log. It hung in less than a minute, and below is the gdb info:
@winksaville Thanks for testing and confirming that didn't help any. Frankly, I'm running out of ideas and don't really understand what is going on. One idea that just came to me is that maybe the fact that some scheduler threads could be terminating prior to other scheduler threads waking is somehow screwing things up. I've pushed a commit to try and rule that out by changing it so that the
It would be great if you are able to test running
@winksaville Don't bother testing this latest commit. It's got issues (as the CI results show).
Maybe its time to try to simplify the code and add debug logging
information so we have some history. I suggest the logging be just to a
circular buffer in memory of a fixed size with a fixed sized record that we
can dump from the debugger. I'm not sure what can be simplified, but what's
the simplest scheduler that can be made?
So we could start with maybe 3 threads; the first thread is main, the second is
asio, and all actors run on the third thread. In other words, the absolute
simplest thing that is "correct", and to hell with efficiency. Then we make
small steps toward something that's efficient and fast while building up a
"scheduler test-suite" pounding on every nook and cranny we can think of.
So what is the simplest possible scheduler that you can think of?
…On Mon, Jan 8, 2018 at 10:12 PM Dipin Hora ***@***.***> wrote:
@winksaville <https://github.com/winksaville> Don't bother testing this
latest commit. It's got issues (as the CI results show).
@winksaville We already have a perfectly functional scheduling mechanism in place (i.e. the normal scheduling without the dynamic scheduler scaling logic) as a starting point. You can try out the equivalent of your suggestion by running with
I have another thought/idea on how to fix this dynamic scheduler scaling hanging issue, but it's going to require some re-architecting of how it's implemented. Maybe we should change the default for
@ponylang/core what do you think about effectively disabling the dynamic scheduler scaling?
SG, I'm going to retest --ponyminthreads, just to make sure :)
So I've run pony-ring for an hour and ./stdlib for 45+ minutes based on sha1 1c0e24fbc.
@dipinhora, my suggestion is to just include the first two changes and default ponyminthreads=9999 in this change. The Atomics change doesn't seem to make any difference, so I'd skip that one. Let me know if there is anything else you'd like me to do for this round.
Force-pushed from 89048a3 to 47e8d14 (Compare)
Prior to this commit, the pthreads variant of the dynamic scheduler scaling functionality didn't use pthread condition variables correctly for the logic required. It relied on a single pthread condition variable for all threads to sleep against. Unfortunately, there's no way to wake up a specific thread when relying on a shared pthread condition variable, and the operating system is free to wake up any thread that is sleeping against it. This resulted in out-of-order waking of sleeping threads, which broke one of the invariants for how dynamic scheduler scaling is supposed to work. This commit fixes the code to create a separate pthread condition variable for each scheduler thread to ensure that we can wake up a specific thread correctly. This commit also adds some error handling and assertions to help catch any other logic errors that might exist. Thanks to @winksaville for all his help in testing and debugging this. It likely wouldn't have been found/fixed so soon if it weren't for him. NOTE: While this fixes at least one of the causes of ponylang#2451, I don't believe it resolves the entire issue. Hopefully @winksaville is able to test and confirm whether the issue still exists or not.
Prior to this commit, there existed a race condition between the SCHED_SUSPEND message getting processed and the decrement of the active_scheduler_count that could result in the runtime not reaching quiescence properly. There were also a couple of edge cases around the ack_count that is part of the CNF/ACK protocol for quiescence, where the runtime could potentially think it had enough acks when it really didn't, due to a scheduler thread dynamically waking or sleeping at the wrong time. This commit resolves all of the above. It resolves the ack_count issue by always resetting the ack_count to 0 prior to starting a new CNF/ACK cycle, since every CNF/ACK cycle asks all active scheduler threads to respond with an ACK. It resolves the SCHED_SUSPEND/active_scheduler_count decrement race condition by making sure that the active_scheduler_count is decremented prior to the SCHED_SUSPEND message being sent.
This effectively disables dynamic scheduler scaling by default. The main reason for this is issue ponylang#2451 which only surfaces when dynamic scheduler scaling is used. We can revert this and re-enable dynamic scheduler scaling by default once the issue is resolved.
Force-pushed from 47e8d14 to c6f9387 (Compare)
@winksaville Thanks for testing again and confirming that the hang issue doesn't exist when the dynamic scheduler scaling is effectively disabled. I've rebased and force pushed an update removing the atomics change and my last experimental commit. I've also added a commit that defaults
@ponylang/core Please let us know how you would like to proceed regarding this. The situation is the following:
My and @winksaville's recommendation is to disable the dynamic scheduler scaling logic (as the last commit in this PR does) until issue #2451 is resolved.
Restarting linux travis builds that failed due to llvm download timeout.
Restarting linux travis builds again due to llvm download timeout.
Progress.. went from 2 failing jobs to 1 failing job.. Restarting the last linux travis build again due to llvm download timeout.
And again........................................