Fix pthread condition variable usage for dynamic scheduler scaling #2472

Merged
merged 3 commits into from
Jan 10, 2018


dipinhora
Contributor

Prior to this commit, the pthreads variant of the dynamic scheduler
scaling functionality didn't use pthread condition variables
correctly for the logic required. It relied on a single pthread
condition variable for all threads to sleep against. Unfortunately,
when several threads wait on the same pthread condition variable
there's no way to wake up a specific one of them; the operating
system is free to wake up any thread that is sleeping against it.
This resulted in out-of-order waking of sleeping threads, which broke
one of the invariants for how dynamic scheduler scaling is supposed
to work.

This commit fixes the code to create a separate pthread condition
variable for each scheduler thread to ensure that we can wake up a
specific single thread at a time correctly.
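
As a rough illustration of the per-thread approach (a minimal sketch, not the code in this PR), each thread can own its own mutex, condition variable, and wake flag, so that signalling one condition variable can only release the thread waiting on it. The `worker_t` type and the `worker_main`, `wake_worker`, and `run_demo` names are hypothetical and chosen only for this example; the wake order used here (highest index first) is arbitrary.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_WORKERS 3

/* Hypothetical per-thread state; these names are illustrative only. */
typedef struct worker_t
{
  pthread_t tid;
  pthread_mutex_t mut;
  pthread_cond_t cond; /* one condition variable per thread */
  bool wake;           /* predicate, guarded by mut */
  int wake_order;      /* position in which this worker was woken */
} worker_t;

static worker_t workers[NUM_WORKERS];
static pthread_mutex_t order_mut = PTHREAD_MUTEX_INITIALIZER;
static int next_order;

static void* worker_main(void* arg)
{
  worker_t* w = (worker_t*)arg;

  pthread_mutex_lock(&w->mut);
  while(!w->wake) /* re-check: the wait may return spuriously */
    pthread_cond_wait(&w->cond, &w->mut);
  pthread_mutex_unlock(&w->mut);

  /* Record the order in which this worker resumed. */
  pthread_mutex_lock(&order_mut);
  w->wake_order = next_order++;
  pthread_mutex_unlock(&order_mut);
  return NULL;
}

/* Wake exactly one chosen worker through its own condition variable. */
static void wake_worker(worker_t* w)
{
  pthread_mutex_lock(&w->mut);
  w->wake = true;
  pthread_cond_signal(&w->cond);
  pthread_mutex_unlock(&w->mut);
}

/* Starts the workers, then wakes them one at a time from the highest
 * index down, joining each before waking the next. Returns 0 on success. */
static int run_demo(void)
{
  next_order = 0;

  for(int i = 0; i < NUM_WORKERS; i++)
  {
    worker_t* w = &workers[i];
    w->wake = false;
    w->wake_order = -1;

    if(pthread_mutex_init(&w->mut, NULL) != 0) return -1;
    if(pthread_cond_init(&w->cond, NULL) != 0) return -1;
    if(pthread_create(&w->tid, NULL, worker_main, w) != 0) return -1;
  }

  for(int i = NUM_WORKERS - 1; i >= 0; i--)
  {
    wake_worker(&workers[i]);
    if(pthread_join(workers[i].tid, NULL) != 0) return -1;
    assert(workers[i].wake_order == (NUM_WORKERS - 1) - i);
    pthread_cond_destroy(&workers[i].cond);
    pthread_mutex_destroy(&workers[i].mut);
  }

  return 0;
}
```

Calling `run_demo()` from a `main` (linked with `-pthread`) wakes the workers in exactly the order the caller picks. With one shared condition variable and `pthread_cond_signal`, the caller would have no such control over which waiter resumes.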

This commit also adds some error handling and assertions to help
catch any other logic errors that might exist.
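
To give a feel for the kind of checks meant here (a sketch under assumptions, not the checks actually added by this PR), the pthread return codes can be tested instead of ignored, and invariants on the thread counts can be asserted. `checked_signal` and `suspend_one` are hypothetical helpers invented for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wrapper: signal a condition variable while holding its
 * mutex, returning false if any pthread call reports an error. */
static bool checked_signal(pthread_mutex_t* mut, pthread_cond_t* cond)
{
  if(pthread_mutex_lock(mut) != 0)
    return false;

  int signal_rc = pthread_cond_signal(cond);
  int unlock_rc = pthread_mutex_unlock(mut);
  return (signal_rc == 0) && (unlock_rc == 0);
}

/* Hypothetical invariant check: the number of active threads must stay
 * between the configured minimum and the total number of threads. */
static uint32_t suspend_one(uint32_t active, uint32_t min_count,
  uint32_t total)
{
  assert(min_count <= active && active <= total);
  assert(active > min_count); /* never drop below the minimum */
  return active - 1;
}
```

The idea is that a violated assumption fails loudly at the point where it happens rather than showing up later as a hang.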

Thanks to @winksaville for all his help in testing and debugging
this. It likely wouldn't have been found/fixed so soon if it
weren't for him.

NOTE: While this fixes at least one of the causes of #2451, I don't
believe it resolves the entire issue. Hopefully @winksaville is
able to test and confirm if the issue still exists or not.

@dipinhora mentioned this pull request Jan 6, 2018
@dipinhora
Contributor Author

Restarting linux travis builds that failed due to llvm download timeout.

@dipinhora
Contributor Author

Restarting linux travis builds that failed due to llvm download timeout again.

@dipinhora force-pushed the fix_pthread_scheduler_scaling branch from ad1d46a to b99f954 on January 7, 2018 00:55
@winksaville
Contributor

@dipinhora, I compiled ponyc with release using scheduler_scaling_pthreads:

wink@wink-desktop:~/prgs/pony/ponyc (debug-2451-fix_pthread_condition_variable_usage_for_dynamic_scheduler_scaling-pr_2472)
$ make default_pic=true config=release use=scheduler_scaling_pthreads
Makefile:281: WARNING: LLVM 5 support is experimental and may result in decreased performance or crashes
mpmcq.c
cpu.c
mutemap.c
start.c
scheduler.c
delta.c
cycle.c
gc.c
serialise.c
actormap.c
trace.c
objectmap.c
asio.c
epoll.c
event.c
messageq.c
actor.c
socket.c
io.c
ssl.c
except_try_catch.ll
posix_except.c
directory.c
time.c
stat.c
paths.c
lsda.c
stdfd.c
ponyassert.c
threads.c
fun.c
stack.c
list.c
hash.c
alloc.c
pool.c
heap.c
pagemap.c
options.c
Linking libponyrt
blake2b-ref.c
Linking libblake2
call.c
type.c
control.c
fun.c
lookup.c
sanitise.c
alias.c
compattype.c
assemble.c
cap.c
reify.c
safeto.c
typeparam.c
subtype.c
matchtype.c
viewpoint.c
postfix.c
array.c
match.c
call.c
reference.c
control.c
ffi.c
operator.c
literal.c
lambda.c
docgen.c
flatten.c
pass.c
serialisers.c
syntax.c
finalisers.c
expr.c
traits.c
scope.c
casemethod.c
sugar.c
names.c
refer.c
import.c
verify.c
genexe.c
genoperator.c
genmatch.c
gencontrol.c
gendebug.cc
gendesc.c
genlib.c
codegen.c
genbox.c
genobj.c
genident.c
genserialise.c
genfun.c
genexpr.c
genname.c
host.cc
genopt.cc
gentrace.c
gentype.c
genreference.c
genprim.c
gencall.c
genheader.c
genjit.c
paint.c
reach.c
subtype.c
paths.c
ifdef.c
package.c
use.c
platformfuns.c
buildflagset.c
program.c
ponyc.c
source.c
frame.c
treecheck.c
ast.c
lexint.c
stringtab.c
parser.c
printbuf.c
symtab.c
bnfprint.c
error.c
lexer.c
token.c
parserapi.c
id.c
Linking libponyc
mpmcq.c
cpu.c
mutemap.c
start.c
scheduler.c
delta.c
cycle.c
gc.c
serialise.c
actormap.c
trace.c
objectmap.c
asio.c
epoll.c
event.c
messageq.c
actor.c
socket.c
io.c
ssl.c
except_try_catch.ll
posix_except.c
directory.c
time.c
stat.c
paths.c
lsda.c
stdfd.c
ponyassert.c
threads.c
fun.c
stack.c
list.c
hash.c
alloc.c
pool.c
heap.c
pagemap.c
options.c
Linking libponyrt-pic
gtest-all.cc
Linking libgtest
gbenchmark_main.cc
gbenchmark-all.cc
Linking libgbenchmark
main.c
Linking ponyc
codegen_optimisation.cc
local_inference.cc
annotations.cc
ffi.cc
codegen_final.cc
lambda.cc
token.cc
codegen_identity.cc
lexer.cc
id.cc
matchtype.cc
scope.cc
finalisers.cc
parse_entity.cc
iftype.cc
codegen_trace.cc
symtab.cc
traits.cc
type_check_bind.cc
option_parser.cc
lexint.cc
literal_inference.cc
codegen.cc
util.cc
recover.cc
program.cc
type_check_subtype.cc
verify.cc
literal_limits.cc
use.cc
badpony.cc
flatten.cc
bare.cc
reach.cc
sugar_expr.cc
dontcare.cc
array.cc
codegen_ffi.cc
signature.cc
buildflagset.cc
sugar.cc
parse_expr.cc
chain.cc
sugar_traits.cc
paint.cc
compiler_serialisation.cc
Linking libponyc.tests
error.cc
util.cc
fun.cc
hash.cc
list.cc
heap.cc
pagemap.cc
pool.cc
Linking libponyrt.tests
common.cc
Linking libponyc.benchmarks
common.cc
hash.cc
heap.cc
pool.cc
Linking libponyrt.benchmarks

And I compiled pony-ring using that compiler and then ran it. It failed after about 2hrs:

wink@wink-desktop:~/prgs/pony/ring (master)
$ make comment=release ponyc=../ponyc/build/release-scheduler_scaling_pthreads/ponyc test
../ponyc/build/release-scheduler_scaling_pthreads/ponyc .
Building builtin -> /home/wink/prgs/pony/ponyc/packages/builtin
Building . -> /home/wink/prgs/pony/ring
Generating
 Reachability
 Selector painting
 Data prototypes
 Data types
 Function prototypes
 Functions
 Descriptors
Optimising
Writing ./ring.o
Linking ./ring
while true; do echo "Date: `date +%y%m%d-%H%M%S.%N`"; time ./ring --ponynoblock --size 1000 --count 1000 --pass 10; done 2>&1 | tee pony-ring-release-ponynoblock-size1000-count1000-pass10-`date +%y%m%d-%H%M%S.%N`.txt
Date: 180107-195429.238758674
266239: setup_ring:+
495426311: setup_ring:-
855695365: start:+
855856114: start:-
857435434: complete: DONE

real	0m0.875s
user	0m1.677s
sys	0m0.147s
Date: 180107-195430.115542377
40818: setup_ring:+
445600130: setup_ring:-
807894113: start:+
808040881: start:-
809711854: complete: DONE

real	0m0.816s
user	0m1.596s
sys	0m0.126s
Date: 180107-195430.933733528
163537: setup_ring:+
387907817: setup_ring:-
1138812178: start:+
1138925035: start:-
1286276139: complete: DONE
...

Here is the complete log

And below is the gdb session, the weirdest thing is that "p *this_scheduler" is bad:

wink@wink-desktop:~/prgs/pony/ring (master)
$ pid=$(ps ax | awk '{if ($5 == "./ring") {print $1}}'); sudo gdb ./ring $pid
[sudo] password for wink: 
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./ring...(no debugging symbols found)...done.
Attaching to program: /home/wink/prgs/pony/ring/ring, process 32344
[New LWP 32346]
[New LWP 32350]
[New LWP 32351]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f84b412043d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f84b4725980 (LWP 32344) "ring" 0x00007f84b412043d in pthread_join () from /usr/lib/libpthread.so.0
  2    Thread 0x7f84aabec700 (LWP 32346) "ring" 0x00005633ecc2f3c9 in run_thread ()
  3    Thread 0x7f8490bec700 (LWP 32350) "ring" 0x00007f84b412538d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
  4    Thread 0x7f8498bec700 (LWP 32351) "ring" 0x00007f84b412538d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f84b4725980 (LWP 32344))]
#0  0x00007f84b412043d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f84b412043d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x00005633ecc36eeb in ponyint_thread_join ()
#2  0x00005633ecc2e493 in ponyint_sched_shutdown ()
#3  0x00005633ecc2e757 in ponyint_sched_start ()
#4  0x00005633ecc2e3d9 in pony_start ()
#5  0x00005633ecc2d9d9 in main ()
(gdb) p *this_scheduler 
Cannot access memory at address 0xd5b8a1349720
(gdb) p this_scheduler 
Cannot access memory at address 0xd5b8a1349720
(gdb) p *this_scheduler 
Cannot access memory at address 0xd5b8a1349720
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f84aabec700 (LWP 32346))]
#0  0x00005633ecc2f3c9 in run_thread ()
(gdb) bt
#0  0x00005633ecc2f3c9 in run_thread ()
#1  0x00007f84b411f08a in start_thread () from /usr/lib/libpthread.so.0
#2  0x00007f84b36ef42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
Cannot access memory at address 0xd5b8978104a0
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f8490bec700 (LWP 32350))]
#0  0x00007f84b412538d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f84b412538d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x00005633ecc36f3f in ponyint_thread_suspend ()
#2  0x00005633ecc2f6ee in run_thread ()
#3  0x00007f84b411f08a in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007f84b36ef42f in clone () from /usr/lib/libc.so.6
(gdb) thread 4
[Switching to thread 4 (Thread 0x7f8498bec700 (LWP 32351))]
#0  0x00007f84b412538d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f84b412538d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x00005633ecc36f3f in ponyint_thread_suspend ()
#2  0x00005633ecc2f6ee in run_thread ()
#3  0x00007f84b411f08a in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007f84b36ef42f in clone () from /usr/lib/libc.so.6
(gdb) thread 5
Unknown thread 5.
(gdb) p *this_scheduler 
Cannot access memory at address 0xd5b8858104a0
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=1 active_scheduler_count=4 scheduler_count_changing=1 detect_quiescence=1 use_yield=257

I've started it again and we'll see how it goes when I wake up:

wink@wink-desktop:~/prgs/pony/ring (master)
$ make comment=release ponyc=../ponyc/build/release-scheduler_scaling_pthreads/ponyc test
while true; do echo "Date: `date +%y%m%d-%H%M%S.%N`"; time ./ring --ponynoblock --size 1000 --count 1000 --pass 10; done 2>&1 | tee pony-ring-release-ponynoblock-size1000-count1000-pass10-`date +%y%m%d-%H%M%S.%N`.txt
Date: 180108-002433.764219239
390936: setup_ring:+
414904672: setup_ring:-
1164910213: start:+
1165015088: start:-
1312095775: complete: DONE
...

@winksaville
Contributor

So it ran 7.5hrs+ no hangs, here is the log :(

@SeanTAllen added the "changelog - fixed" label (automatically adds a "Fixed" CHANGELOG entry on merge) Jan 8, 2018
@SeanTAllen
Member

@winksaville i'm confused. "no hangs" and ":(". i would think no hangs means :). am i missing something? are you saying this fixed the problem with pthreads or no?

@winksaville
Contributor

Sad, because yesterday (see above) I "may" have caught a hang. If @dipinhora confirms it was a hang, then we've got a heisenbug, hence the :(

@dipinhora
Contributor Author

@winksaville Two things:

  1. Can you try running stdlib again to confirm that doesn't hang either since that's what you originally discovered the hang with?

  2. Regardless of whether there is still a heisenbug or not, does this PR improve things? If yes, we can merge it while we keep working to find and fix the heisenbug?

@winksaville
Contributor

I tested stdlib and it doesn't look good. Unless I've made a mistake, which is certainly possible, I don't think this is ready to release.

Here's what I did for stdlib: first I built a release version of stdlib using a ponyc release build I built yesterday. I ran it and it failed after about a minute. To double check, I rebuilt the compiler and stdlib and ran it a second time. It failed after a couple of minutes.

Here is the branch I'm working with.

Here is the gdb output after the first hang:

wink@wink-desktop:~/prgs/pony/ponyc (debug-2451-fix_pthread_condition_variable_usage_for_dynamic_scheduler_scaling-pr_2472)
$ pid=$(ps ax | awk '{if ($5 == "./stdlib") {print $1}}'); sudo gdb ./stdlib $pid
[sudo] password for wink: 
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./stdlib...(no debugging symbols found)...done.
Attaching to program: /home/wink/prgs/pony/ponyc/stdlib, process 12353
[New LWP 12355]
[New LWP 12360]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007fbfa3b6544d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7fbfa4cdf6c0 (LWP 12353) "stdlib" 0x00007fbfa3b6544d in pthread_join ()
   from /usr/lib/libpthread.so.0
  2    Thread 0x7fbf9a632700 (LWP 12355) "stdlib" 0x000056393b833129 in run_thread ()
  3    Thread 0x7fbf6fe32700 (LWP 12360) "stdlib" 0x00007fbfa3b6a39d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fbfa4cdf6c0 (LWP 12353))]
#0  0x00007fbfa3b6544d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fbfa3b6544d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x000056393b83b41b in ponyint_thread_join ()
#2  0x000056393b8321f3 in ponyint_sched_shutdown ()
#3  0x000056393b8324b7 in ponyint_sched_start ()
#4  0x000056393b832139 in pony_start ()
#5  0x000056393b823d97 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fbf9a632700 (LWP 12355))]
#0  0x000056393b833129 in run_thread ()
(gdb) bt
#0  0x000056393b833129 in run_thread ()
#1  0x00007fbfa3b6408c in start_thread () from /usr/lib/libpthread.so.0
#2  0x00007fbfa3134e1f in clone () from /usr/lib/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fbf6fe32700 (LWP 12360))]
#0  0x00007fbfa3b6a39d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fbfa3b6a39d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x000056393b83b46f in ponyint_thread_suspend ()
#2  0x000056393b83344e in run_thread ()
#3  0x00007fbfa3b6408c in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007fbfa3134e1f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
Cannot access memory at address 0xd5f8ab4054a0
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=1 active_scheduler_count=5 scheduler_count_changing=1 detect_quiescence=1 use_yield=257

I removed build/ and here is the output from rebuilding the compiler and stdlib:

$ make default_pic=true config=release use=scheduler_scaling_pthreads OPENSSL=-Dopenssl_1.1.0 stdlib
Makefile:281: WARNING: LLVM 5 support is experimental and may result in decreased performance or crashes
Makefile:296: targeting openssl 1.1.0
mpmcq.c
cpu.c
mutemap.c
start.c
scheduler.c
delta.c
cycle.c
gc.c
serialise.c
actormap.c
trace.c
objectmap.c
asio.c
epoll.c
event.c
messageq.c
actor.c
socket.c
io.c
ssl.c
except_try_catch.ll
posix_except.c
directory.c
time.c
stat.c
paths.c
lsda.c
stdfd.c
ponyassert.c
threads.c
fun.c
stack.c
list.c
hash.c
alloc.c
pool.c
heap.c
pagemap.c
options.c
Linking libponyrt
blake2b-ref.c
Linking libblake2
call.c
type.c
control.c
fun.c
lookup.c
sanitise.c
alias.c
compattype.c
assemble.c
cap.c
reify.c
safeto.c
typeparam.c
subtype.c
matchtype.c
viewpoint.c
postfix.c
array.c
match.c
call.c
reference.c
control.c
ffi.c
operator.c
literal.c
lambda.c
docgen.c
flatten.c
pass.c
serialisers.c
syntax.c
finalisers.c
expr.c
traits.c
scope.c
casemethod.c
sugar.c
names.c
refer.c
import.c
verify.c
genexe.c
genoperator.c
genmatch.c
gencontrol.c
gendebug.cc
gendesc.c
genlib.c
codegen.c
genbox.c
genobj.c
genident.c
genserialise.c
genfun.c
genexpr.c
genname.c
host.cc
genopt.cc
gentrace.c
gentype.c
genreference.c
genprim.c
gencall.c
genheader.c
genjit.c
paint.c
reach.c
subtype.c
paths.c
ifdef.c
package.c
use.c
platformfuns.c
buildflagset.c
program.c
ponyc.c
source.c
frame.c
treecheck.c
ast.c
lexint.c
stringtab.c
parser.c
printbuf.c
symtab.c
bnfprint.c
error.c
lexer.c
token.c
parserapi.c
id.c
Linking libponyc
mpmcq.c
cpu.c
mutemap.c
start.c
scheduler.c
delta.c
cycle.c
gc.c
serialise.c
actormap.c
trace.c
objectmap.c
asio.c
epoll.c
event.c
messageq.c
actor.c
socket.c
io.c
ssl.c
except_try_catch.ll
posix_except.c
directory.c
time.c
stat.c
paths.c
lsda.c
stdfd.c
ponyassert.c
threads.c
fun.c
stack.c
list.c
hash.c
alloc.c
pool.c
heap.c
pagemap.c
options.c
Linking libponyrt-pic
gtest-all.cc
Linking libgtest
gbenchmark_main.cc
gbenchmark-all.cc
Linking libgbenchmark
main.c
Linking ponyc
codegen_optimisation.cc
local_inference.cc
annotations.cc
ffi.cc
codegen_final.cc
lambda.cc
token.cc
codegen_identity.cc
lexer.cc
id.cc
matchtype.cc
scope.cc
finalisers.cc
parse_entity.cc
iftype.cc
codegen_trace.cc
symtab.cc
traits.cc
type_check_bind.cc
option_parser.cc
lexint.cc
literal_inference.cc
codegen.cc
util.cc
recover.cc
program.cc
type_check_subtype.cc
verify.cc
literal_limits.cc
use.cc
badpony.cc
flatten.cc
bare.cc
reach.cc
sugar_expr.cc
dontcare.cc
array.cc
codegen_ffi.cc
signature.cc
buildflagset.cc
sugar.cc
parse_expr.cc
chain.cc
sugar_traits.cc
paint.cc
compiler_serialisation.cc
Linking libponyc.tests
error.cc
util.cc
fun.cc
hash.cc
list.cc
heap.cc
pagemap.cc
pool.cc
Linking libponyrt.tests
common.cc
Linking libponyc.benchmarks
common.cc
hash.cc
heap.cc
pool.cc
Linking libponyrt.benchmarks
build/release-scheduler_scaling_pthreads/ponyc -Dopenssl_1.1.0 --checktree --verify packages/stdlib
Building builtin -> /home/wink/prgs/pony/ponyc/packages/builtin
Building packages/stdlib -> /home/wink/prgs/pony/ponyc/packages/stdlib
Building ponytest -> /home/wink/prgs/pony/ponyc/packages/ponytest
Building time -> /home/wink/prgs/pony/ponyc/packages/time
Building collections -> /home/wink/prgs/pony/ponyc/packages/collections
Building assert -> /home/wink/prgs/pony/ponyc/packages/assert
Building encode/base64 -> /home/wink/prgs/pony/ponyc/packages/encode/base64
Building buffered -> /home/wink/prgs/pony/ponyc/packages/buffered
Building builtin_test -> /home/wink/prgs/pony/ponyc/packages/builtin_test
Building bureaucracy -> /home/wink/prgs/pony/ponyc/packages/bureaucracy
Building promises -> /home/wink/prgs/pony/ponyc/packages/promises
Building capsicum -> /home/wink/prgs/pony/ponyc/packages/capsicum
Building files -> /home/wink/prgs/pony/ponyc/packages/files
Building cli -> /home/wink/prgs/pony/ponyc/packages/cli
Building collections/persistent -> /home/wink/prgs/pony/ponyc/packages/collections/persistent
Building random -> /home/wink/prgs/pony/ponyc/packages/random
Building crypto -> /home/wink/prgs/pony/ponyc/packages/crypto
Building format -> /home/wink/prgs/pony/ponyc/packages/format
Building debug -> /home/wink/prgs/pony/ponyc/packages/debug
Building glob -> /home/wink/prgs/pony/ponyc/packages/glob
Building regex -> /home/wink/prgs/pony/ponyc/packages/regex
Building net/http -> /home/wink/prgs/pony/ponyc/packages/net/http
Building net -> /home/wink/prgs/pony/ponyc/packages/net
Building net/ssl -> /home/wink/prgs/pony/ponyc/packages/net/ssl
Building ini -> /home/wink/prgs/pony/ponyc/packages/ini
Building itertools -> /home/wink/prgs/pony/ponyc/packages/itertools
Building json -> /home/wink/prgs/pony/ponyc/packages/json
Building logger -> /home/wink/prgs/pony/ponyc/packages/logger
Building math -> /home/wink/prgs/pony/ponyc/packages/math
Building options -> /home/wink/prgs/pony/ponyc/packages/options
Building ponybench -> /home/wink/prgs/pony/ponyc/packages/ponybench
Building term -> /home/wink/prgs/pony/ponyc/packages/term
Building strings -> /home/wink/prgs/pony/ponyc/packages/strings
Building signals -> /home/wink/prgs/pony/ponyc/packages/signals
Building process -> /home/wink/prgs/pony/ponyc/packages/process
Building backpressure -> /home/wink/prgs/pony/ponyc/packages/backpressure
Building serialise -> /home/wink/prgs/pony/ponyc/packages/serialise
Generating
 Reachability
 Selector painting
 Data prototypes
 Data types
 Function prototypes
 Functions
 Descriptors
Optimising
Verifying
Writing ./stdlib.o
Linking ./stdlib

Here is the gdb output from the second run of stdlib in a loop where it hung after 2 minutes:

GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./stdlib...(no debugging symbols found)...done.
Attaching to program: /home/wink/prgs/pony/ponyc/stdlib, process 16467
[New LWP 16469]
[New LWP 16474]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f1ea697f44d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f1ea7af96c0 (LWP 16467) "stdlib" 0x00007f1ea697f44d in pthread_join ()
   from /usr/lib/libpthread.so.0
  2    Thread 0x7f1e9d44c700 (LWP 16469) "stdlib" 0x00005597f96330c9 in run_thread ()
  3    Thread 0x7f1e82c4c700 (LWP 16474) "stdlib" 0x00007f1ea698439d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f1ea7af96c0 (LWP 16467))]
#0  0x00007f1ea697f44d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f1ea697f44d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x00005597f963b3bb in ponyint_thread_join ()
#2  0x00005597f9632193 in ponyint_sched_shutdown ()
#3  0x00005597f9632457 in ponyint_sched_start ()
#4  0x00005597f96320d9 in pony_start ()
#5  0x00005597f9623d3a in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f1e9d44c700 (LWP 16469))]
#0  0x00005597f96330c9 in run_thread ()
(gdb) bt
#0  0x00005597f96330c9 in run_thread ()
#1  0x00007f1ea697e08c in start_thread () from /usr/lib/libpthread.so.0
#2  0x00007f1ea5f4ee1f in clone () from /usr/lib/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f1e82c4c700 (LWP 16474))]
#0  0x00007f1ea698439d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f1ea698439d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x00005597f963b40f in ponyint_thread_suspend ()
#2  0x00005597f96333ee in run_thread ()
#3  0x00007f1ea697e08c in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007f1ea5f4ee1f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
Cannot access memory at address 0xd4b67c01f4a0
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=1 active_scheduler_count=5 scheduler_count_changing=1 detect_quiescence=1 use_yield=257

The results look the "same" to me. The first thread, main, is waiting to join. The second thread is "running". And the third thread is waiting on a condition. Also, in both cases I had System Monitor running and when it hung one thread was consuming 100% of one CPU, probably thread 2.

@dipinhora
Contributor Author

@winksaville Oh well. Thanks for confirming it's not ready. I don't think you made any mistakes in your testing. Yes, it's thread 2. It's likely in an optimized/inlined version of the wake_suspended_threads function.

I just pushed a commit that changes the atomics memory order to be the strictest possible (to try and rule out any multithreading data race related issues). Can you run your test with stdlib again?
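
For readers unfamiliar with memory orders, here is a minimal sketch of what the strictest ordering looks like, assuming "strictest" means sequentially consistent ordering (`memory_order_seq_cst` in C11 `<stdatomic.h>`). It is not the code from the commit; `payload`, `ready`, `producer`, `consume`, and `run_handoff` are hypothetical names used only for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;       /* plain, non-atomic data */
static atomic_bool ready; /* flag that publishes the payload */

static void* producer(void* arg)
{
  (void)arg;
  payload = 42;
  /* The seq_cst store makes the earlier write to payload visible to
   * any thread that later observes ready == true. */
  atomic_store_explicit(&ready, true, memory_order_seq_cst);
  return NULL;
}

static int consume(void)
{
  /* Spin until the flag is set, then read the published data. */
  while(!atomic_load_explicit(&ready, memory_order_seq_cst))
    ;
  return payload;
}

/* Returns the value seen by the consumer, or -1 if the thread could
 * not be created. */
static int run_handoff(void)
{
  pthread_t t;

  payload = 0;
  atomic_store_explicit(&ready, false, memory_order_seq_cst);

  if(pthread_create(&t, NULL, producer, NULL) != 0)
    return -1;

  int seen = consume();
  pthread_join(t, NULL);
  return seen;
}
```

With `memory_order_relaxed` on both the store and the load, the consumer would not be guaranteed to see `payload == 42`, which is the sort of ordering problem the stricter setting is meant to rule out.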

@winksaville
Contributor

Adding the atomics change didn't help; stdlib still hung. Here are the steps I did.

Merged the latest 2472 branch with Atomics into master plus support-openssl_1.1.0 and pushed it to a branch on my fork.

Below is the short log:

wink@wink-desktop:~/prgs/pony/ponyc (debug-2451-fix_pthread_condition_variable_usage_for_dynamic_scheduler_scaling-pr_2472)
$ git --no-pager log -5 --format=short
commit 1c0e24fbc119f2aca44ad7b4df8fdfaba78a9a03 (HEAD -> debug-2451-fix_pthread_condition_variable_usage_for_dynamic_scheduler_scaling-pr_2472)
Author: Wink Saville <wink@saville.com>

    Add support for openssl 1.1.0

commit f45463d0a9e3408a2ac2474c815999c1cd79eab8
Author: Dipin Hora <dipin@sendence.com>

    Atomics memory_order change

commit a042643208872738549b4bac5ccd94327876aacb
Author: Dipin Hora <dipin@sendence.com>

    Fix edge cases around quiescence and dynamic scheduler scaling

commit b99f9545536d684ebcd36a502b94764120c17ee3
Author: Dipin Hora <dipin@sendence.com>

    Fix pthread condition variable usage for dynamic scheduler scaling

commit 28b67dd22e4a91cd736c69b7beaf9e7529c01428 (upstream/master, origin/master, origin/HEAD, dhp/master, master)
Author: Sean T Allen <sean@monkeysnatchbanana.com>

    Test

Removed build and rebuilt compiler and stdlib

wink@wink-desktop:~/prgs/pony/ponyc (debug-2451-fix_pthread_condition_variable_usage_for_dynamic_scheduler_scaling-pr_2472)
$ make default_pic=true config=release use=scheduler_scaling_pthreads OPENSSL=-Dopenssl_1.1.0 stdlib
Makefile:281: WARNING: LLVM 5 support is experimental and may result in decreased performance or crashes
Makefile:296: targeting openssl 1.1.0
mpmcq.c
cpu.c
mutemap.c
start.c
scheduler.c
delta.c
cycle.c
gc.c
serialise.c
actormap.c
trace.c
objectmap.c
asio.c
epoll.c
event.c
messageq.c
actor.c
socket.c
io.c
ssl.c
except_try_catch.ll
posix_except.c
directory.c
time.c
stat.c
paths.c
lsda.c
stdfd.c
ponyassert.c
threads.c
fun.c
stack.c
list.c
hash.c
alloc.c
pool.c
heap.c
pagemap.c
options.c
Linking libponyrt
blake2b-ref.c
Linking libblake2
call.c
type.c
control.c
fun.c
lookup.c
sanitise.c
alias.c
compattype.c
assemble.c
cap.c
reify.c
safeto.c
typeparam.c
subtype.c
matchtype.c
viewpoint.c
postfix.c
array.c
match.c
call.c
reference.c
control.c
ffi.c
operator.c
literal.c
lambda.c
docgen.c
flatten.c
pass.c
serialisers.c
syntax.c
finalisers.c
expr.c
traits.c
scope.c
casemethod.c
sugar.c
names.c
refer.c
import.c
verify.c
genexe.c
genoperator.c
genmatch.c
gencontrol.c
gendebug.cc
gendesc.c
genlib.c
codegen.c
genbox.c
genobj.c
genident.c
genserialise.c
genfun.c
genexpr.c
genname.c
host.cc
genopt.cc
gentrace.c
gentype.c
genreference.c
genprim.c
gencall.c
genheader.c
genjit.c
paint.c
reach.c
subtype.c
paths.c
ifdef.c
package.c
use.c
platformfuns.c
buildflagset.c
program.c
ponyc.c
source.c
frame.c
treecheck.c
ast.c
lexint.c
stringtab.c
parser.c
printbuf.c
symtab.c
bnfprint.c
error.c
lexer.c
token.c
parserapi.c
id.c
Linking libponyc
mpmcq.c
cpu.c
mutemap.c
start.c
scheduler.c
delta.c
cycle.c
gc.c
serialise.c
actormap.c
trace.c
objectmap.c
asio.c
epoll.c
event.c
messageq.c
actor.c
socket.c
io.c
ssl.c
except_try_catch.ll
posix_except.c
directory.c
time.c
stat.c
paths.c
lsda.c
stdfd.c
ponyassert.c
threads.c
fun.c
stack.c
list.c
hash.c
alloc.c
pool.c
heap.c
pagemap.c
options.c
Linking libponyrt-pic
gtest-all.cc
Linking libgtest
gbenchmark_main.cc
gbenchmark-all.cc
Linking libgbenchmark
main.c
Linking ponyc
codegen_optimisation.cc
local_inference.cc
annotations.cc
ffi.cc
codegen_final.cc
lambda.cc
token.cc
codegen_identity.cc
lexer.cc
id.cc
matchtype.cc
scope.cc
finalisers.cc
parse_entity.cc
iftype.cc
codegen_trace.cc
symtab.cc
traits.cc
type_check_bind.cc
option_parser.cc
lexint.cc
literal_inference.cc
codegen.cc
util.cc
recover.cc
program.cc
type_check_subtype.cc
verify.cc
literal_limits.cc
use.cc
badpony.cc
flatten.cc
bare.cc
reach.cc
sugar_expr.cc
dontcare.cc
array.cc
codegen_ffi.cc
signature.cc
buildflagset.cc
sugar.cc
parse_expr.cc
chain.cc
sugar_traits.cc
paint.cc
compiler_serialisation.cc
Linking libponyc.tests
error.cc
util.cc
fun.cc
hash.cc
list.cc
heap.cc
pagemap.cc
pool.cc
Linking libponyrt.tests
common.cc
Linking libponyc.benchmarks
common.cc
hash.cc
heap.cc
pool.cc
Linking libponyrt.benchmarks
build/release-scheduler_scaling_pthreads/ponyc -Dopenssl_1.1.0 --checktree --verify packages/stdlib
Building builtin -> /home/wink/prgs/pony/ponyc/packages/builtin
Building packages/stdlib -> /home/wink/prgs/pony/ponyc/packages/stdlib
Building ponytest -> /home/wink/prgs/pony/ponyc/packages/ponytest
Building time -> /home/wink/prgs/pony/ponyc/packages/time
Building collections -> /home/wink/prgs/pony/ponyc/packages/collections
Building assert -> /home/wink/prgs/pony/ponyc/packages/assert
Building encode/base64 -> /home/wink/prgs/pony/ponyc/packages/encode/base64
Building buffered -> /home/wink/prgs/pony/ponyc/packages/buffered
Building builtin_test -> /home/wink/prgs/pony/ponyc/packages/builtin_test
Building bureaucracy -> /home/wink/prgs/pony/ponyc/packages/bureaucracy
Building promises -> /home/wink/prgs/pony/ponyc/packages/promises
Building capsicum -> /home/wink/prgs/pony/ponyc/packages/capsicum
Building files -> /home/wink/prgs/pony/ponyc/packages/files
Building cli -> /home/wink/prgs/pony/ponyc/packages/cli
Building collections/persistent -> /home/wink/prgs/pony/ponyc/packages/collections/persistent
Building random -> /home/wink/prgs/pony/ponyc/packages/random
Building crypto -> /home/wink/prgs/pony/ponyc/packages/crypto
Building format -> /home/wink/prgs/pony/ponyc/packages/format
Building debug -> /home/wink/prgs/pony/ponyc/packages/debug
Building glob -> /home/wink/prgs/pony/ponyc/packages/glob
Building regex -> /home/wink/prgs/pony/ponyc/packages/regex
Building net/http -> /home/wink/prgs/pony/ponyc/packages/net/http
Building net -> /home/wink/prgs/pony/ponyc/packages/net
Building net/ssl -> /home/wink/prgs/pony/ponyc/packages/net/ssl
Building ini -> /home/wink/prgs/pony/ponyc/packages/ini
Building itertools -> /home/wink/prgs/pony/ponyc/packages/itertools
Building json -> /home/wink/prgs/pony/ponyc/packages/json
Building logger -> /home/wink/prgs/pony/ponyc/packages/logger
Building math -> /home/wink/prgs/pony/ponyc/packages/math
Building options -> /home/wink/prgs/pony/ponyc/packages/options
Building ponybench -> /home/wink/prgs/pony/ponyc/packages/ponybench
Building term -> /home/wink/prgs/pony/ponyc/packages/term
Building strings -> /home/wink/prgs/pony/ponyc/packages/strings
Building signals -> /home/wink/prgs/pony/ponyc/packages/signals
Building process -> /home/wink/prgs/pony/ponyc/packages/process
Building backpressure -> /home/wink/prgs/pony/ponyc/packages/backpressure
Building serialise -> /home/wink/prgs/pony/ponyc/packages/serialise
Generating
 Reachability
 Selector painting
 Data prototypes
 Data types
 Function prototypes
 Functions
 Descriptors
Optimising
Verifying
Writing ./stdlib.o
Linking ./stdlib

Below is the output from "ponyc --version" and "stdlib --ponyversion" showing stdlib was compiled with the above compiler:

wink@wink-desktop:~/prgs/pony/ponyc (debug-2451-fix_pthread_condition_variable_usage_for_dynamic_scheduler_scaling-pr_2472)
$ ./build/release-scheduler_scaling_pthreads/ponyc --version
0.21.2-1c0e24fbc [release]
compiled with: llvm 5.0.1 -- cc (GCC) 7.2.1 20171224
wink@wink-desktop:~/prgs/pony/ponyc (debug-2451-fix_pthread_condition_variable_usage_for_dynamic_scheduler_scaling-pr_2472)
$ ./stdlib --ponyversion
0.21.2-1c0e24fbc [release]
compiled with: llvm 5.0.1 -- cc (GCC) 7.2.1 20171224

I then ran stdlib in the loop (here is the log) and it hung in less than a minute. Below is the gdb info:

GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./stdlib...(no debugging symbols found)...done.
Attaching to program: /home/wink/prgs/pony/ponyc/stdlib, process 18590
[New LWP 18592]
[New LWP 18597]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f93d996344d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f93daadd6c0 (LWP 18590) "stdlib" 0x00007f93d996344d in pthread_join () from /usr/lib/libpthread.so.0
  2    Thread 0x7f93c8430700 (LWP 18592) "stdlib" 0x0000561edf932179 in run_thread ()
  3    Thread 0x7f93adc30700 (LWP 18597) "stdlib" 0x00007f93d996839d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f93daadd6c0 (LWP 18590))]
#0  0x00007f93d996344d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f93d996344d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x0000561edf93a46b in ponyint_thread_join ()
#2  0x0000561edf931253 in ponyint_sched_shutdown ()
#3  0x0000561edf931517 in ponyint_sched_start ()
#4  0x0000561edf931199 in pony_start ()
#5  0x0000561edf922e07 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f93c8430700 (LWP 18592))]
#0  0x0000561edf932179 in run_thread ()
(gdb) bt
#0  0x0000561edf932179 in run_thread ()
#1  0x00007f93d996208c in start_thread () from /usr/lib/libpthread.so.0
#2  0x00007f93d8f32e1f in clone () from /usr/lib/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f93adc30700 (LWP 18597))]
#0  0x00007f93d996839d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f93d996839d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x0000561edf93a4bf in ponyint_thread_suspend ()
#2  0x0000561edf9324a1 in run_thread ()
#3  0x00007f93d996208c in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007f93d8f32e1f in clone () from /usr/lib/libc.so.6
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=1 active_scheduler_count=5 scheduler_count_changing=1 detect_quiescence=1 use_yield=257

@dipinhora
Contributor Author

@winksaville Thanks for testing and confirming that didn't help any.

Frankly, I'm running out of ideas and don't really understand what is going on. One idea that just came to me is that some scheduler threads terminating before other scheduler threads have woken might be screwing things up. I've pushed a commit to try to rule that out by not sending the SCHED_TERMINATE message to the threads until after all threads have been woken. This creates a race condition where a thread might get woken and then go back to sleep before it receives the SCHED_TERMINATE message, but the backtrace should look different if that occurs (and I have an idea for how to deal with that if this does solve the issue you're currently experiencing).

It would be great if you are able to test running stdlib once again.

@dipinhora
Contributor Author

@winksaville Don't bother testing this latest commit. It's got issues (as the CI results show).

@winksaville
Contributor

winksaville commented Jan 9, 2018 via email

@dipinhora
Contributor Author

@winksaville We already have a perfectly functional scheduling mechanism in place (i.e. the normal scheduling without the dynamic scheduler scaling logic) as a starting point. You can try out the equivalent of your suggestion by running with --ponythreads 1.

I have another thought/idea on this and how to fix this dynamic scheduler scaling hanging issue but it's going to require some re-architecting of how it's implemented.

Maybe we should change the default for --ponyminthreads to 9999 for now to effectively disable the dynamic scheduler scaling since fixing the hanging issue seems like it might take a while? As you've confirmed previously, the effective disabling of this feature doesn't show any of the hanging issues and it's much simpler than trying to back out the change.

@ponylang/core what do you think about effectively disabling the dynamic scheduler scaling?

@winksaville
Contributor

SG, I'm going to retest --ponyminthreads, just to make sure :)

@winksaville
Contributor

So I've run pony-ring for an hour and ./stdlib for 45+ minutes based on sha1 1c0e24fbc.

$ ./ring --ponyversion
0.21.2-1c0e24fbc [release]
compiled with: llvm 5.0.1 -- cc (GCC) 7.2.1 20171224
$ ./build/release/ponyc --version
0.21.2-1c0e24fbc [release]
compiled with: llvm 5.0.1 -- cc (GCC) 7.2.1 20171224

@dipinhora, my suggestion is to include just the first two changes, plus defaulting ponyminthreads to 9999, in this change. The Atomics change doesn't seem to make any difference, so I'd skip that one. Let me know if there is anything else you'd like me to do for this round.

@dipinhora dipinhora force-pushed the fix_pthread_scheduler_scaling branch from 89048a3 to 47e8d14 Compare January 10, 2018 04:20
Prior to this commit, the pthreads variant of the dynamic scheduler
scaling functionality didn't use pthread condition variables
correctly for the logic required. It relied on a single pthread
condition variable for all threads to sleep against. Unfortunately,
there's no way to wake up a specific thread when relying on pthread
condition variables and the operating system is free to wake up any
thread that is sleeping against a pthread condition variable. This
resulted in out of order waking of sleeping threads which broke one
of the invariants for how dynamic scheduler scaling is supposed to
work.

This commit fixes the code to create a separate pthread condition
variable for each scheduler thread to ensure that we can wake up a
specific single thread at a time correctly.

This commit also adds some error handling and assertions to help
catch any other logic error that might exist.

Thanks to @winksaville for all his help in testing and debugging
this. It likely wouldn't have been found/fixed so soon if it
weren't for him.

NOTE: While this fixes at least one of the causes of ponylang#2451, I don't
believe it resolves the entire issue. Hopefully @winksaville is
able to test and confirm if the issue still exists or not.
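
The per-thread condition variable approach described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the runtime's actual code: `sleeper_t`, `sleeper_suspend`, `sleeper_wake`, and `wake_in_order` are hypothetical names invented for this example. Each thread sleeps on its own condition variable, so signalling that variable can only wake that one thread, and the waker controls the order.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SLEEPERS 8

// Hypothetical per-thread sleep state (illustrative names only).
typedef struct sleeper_t
{
  pthread_mutex_t mut;
  pthread_cond_t cond;   // one condition variable per thread
  bool wake_requested;   // predicate guarded by mut
  int id;
} sleeper_t;

static pthread_mutex_t order_mut = PTHREAD_MUTEX_INITIALIZER;
static int wake_order[MAX_SLEEPERS];
static int wake_count = 0;

// Block the calling thread until its own condition variable is signalled.
static void sleeper_suspend(sleeper_t* s)
{
  pthread_mutex_lock(&s->mut);
  while(!s->wake_requested)  // loop guards against spurious wakeups
    pthread_cond_wait(&s->cond, &s->mut);
  s->wake_requested = false;
  pthread_mutex_unlock(&s->mut);
}

// Wake exactly the thread that owns `s`; no other thread waits on s->cond.
static void sleeper_wake(sleeper_t* s)
{
  pthread_mutex_lock(&s->mut);
  s->wake_requested = true;
  pthread_cond_signal(&s->cond);
  pthread_mutex_unlock(&s->mut);
}

// Thread body: sleep, then record the order in which this thread woke.
static void* sleeper_main(void* arg)
{
  sleeper_t* s = (sleeper_t*)arg;
  sleeper_suspend(s);
  pthread_mutex_lock(&order_mut);
  wake_order[wake_count++] = s->id;
  pthread_mutex_unlock(&order_mut);
  return NULL;
}

// Start n sleepers, wake them in the order listed in `order` (which must be
// a permutation of 0..n-1), and copy the observed wake order into `seen`.
// Returns 0 on success, -1 if n is out of range.
int wake_in_order(const int* order, int n, int* seen)
{
  sleeper_t sleepers[MAX_SLEEPERS];
  pthread_t threads[MAX_SLEEPERS];
  int rc = 0;

  if(n < 0 || n > MAX_SLEEPERS)
    return -1;

  wake_count = 0;
  for(int i = 0; i < n; i++)
  {
    sleepers[i].wake_requested = false;
    sleepers[i].id = i;
    rc = pthread_mutex_init(&sleepers[i].mut, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&sleepers[i].cond, NULL);
    assert(rc == 0);
    rc = pthread_create(&threads[i], NULL, sleeper_main, &sleepers[i]);
    assert(rc == 0);
  }

  for(int i = 0; i < n; i++)
  {
    sleeper_wake(&sleepers[order[i]]);
    // Join before waking the next thread so the recorded order is deterministic.
    pthread_join(threads[order[i]], NULL);
  }

  for(int i = 0; i < n; i++)
  {
    seen[i] = wake_order[i];
    pthread_cond_destroy(&sleepers[i].cond);
    pthread_mutex_destroy(&sleepers[i].mut);
  }

  (void)rc;
  return 0;
}
```

With a single shared condition variable, `pthread_cond_signal` may wake any one of the waiting threads, so this kind of ordering could not be guaranteed.
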
Prior to this commit, there existed a race condition between
the SCHED_SUSPEND message getting processed and the decrement
of the active_scheduler_count that could result in the runtime
not reaching quiescence properly. There were also a couple of
edge cases around the ack_count that is part of the CNF/ACK
protocol for quiescence where the runtime could potentially think
it had enough acks when it really didn't due to a scheduler thread
dynamically waking or sleeping at the wrong time.

This commit resolves all of the above. It resolves the ack_count
issue by always resetting the ack_count to 0 prior to starting a
new CNF/ACK cycle since every CNF/ACK cycle always asks for all
active scheduler threads to respond with an ACK. It resolves the
SCHED_SUSPEND/active_scheduler_count decrement race condition by
making sure that the active_scheduler_count is decremented prior to
the SCHED_SUSPEND message being sent.
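
The two bookkeeping rules in the commit message above can be modelled in a few lines. This is a simplified, single-threaded sketch under stated assumptions: the struct and function names are hypothetical, and a counter stands in for actually sending SCHED_SUSPEND. The real runtime does this across threads, which this model does not attempt to reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Simplified model of the quiescence bookkeeping (hypothetical names).
typedef struct quiescence_t
{
  uint32_t active_scheduler_count;
  uint32_t ack_count;
  uint32_t suspend_msgs_sent;  // stands in for sending SCHED_SUSPEND
} quiescence_t;

// Begin a new CNF/ACK cycle. ACKs left over from an earlier cycle are
// discarded, because every cycle asks all active schedulers to ACK again.
// Returns the number of ACKs this cycle must collect.
uint32_t cnf_cycle_start(quiescence_t* q)
{
  q->ack_count = 0;
  return q->active_scheduler_count;
}

// Record one ACK. Returns true once every active scheduler has responded.
bool cnf_cycle_ack(quiescence_t* q)
{
  q->ack_count++;
  return q->ack_count == q->active_scheduler_count;
}

// Suspend one scheduler: the active count is decremented first, and only
// then is the SCHED_SUSPEND message "sent".
void scheduler_suspend_one(quiescence_t* q)
{
  assert(q->active_scheduler_count > 1);
  q->active_scheduler_count--;
  q->suspend_msgs_sent++;
}

// Wake one scheduler, making it count toward the next CNF/ACK cycle.
void scheduler_wake_one(quiescence_t* q)
{
  q->active_scheduler_count++;
}
```

In this model, two ACKs collected before a scheduler wakes do not carry over: the next call to `cnf_cycle_start` resets the count, so the new cycle needs an ACK from every scheduler that is active at that point.
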
This effectively disables dynamic scheduler scaling by default.

The main reason for this is issue ponylang#2451 which only surfaces when
dynamic scheduler scaling is used. We can revert this and re-enable
dynamic scheduler scaling by default once the issue is resolved.
@dipinhora dipinhora force-pushed the fix_pthread_scheduler_scaling branch from 47e8d14 to c6f9387 Compare January 10, 2018 04:26
@dipinhora
Contributor Author

@winksaville Thanks for testing again and confirming that the hang issue doesn't exist when the dynamic scheduler scaling is effectively disabled.

I've rebased and force pushed an update removing the atomics change and my last experimental commit. I've also added a commit that defaults ponyminthreads to be the same as ponythreads to effectively disable dynamic scheduler scaling by default.
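
A rough sketch of why a minimum equal to the thread count disables scaling is below. It is not taken from the runtime: both helper names are hypothetical, and the clamping of an oversized minimum (such as the 9999 suggested earlier) down to the thread count is an assumption made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: a requested minimum larger than the number of
// scheduler threads is assumed to be clamped to the thread count.
uint32_t effective_min_threads(uint32_t threads, uint32_t min_threads)
{
  assert(threads > 0);
  return (min_threads > threads) ? threads : min_threads;
}

// Hypothetical check: a scheduler thread may be suspended only while more
// than the minimum number of threads are active.
bool may_suspend_scheduler(uint32_t active, uint32_t min_threads)
{
  return active > min_threads;
}
```

Under these assumptions, with 6 scheduler threads and a minimum of 6 the check never passes, so no thread is ever suspended and the scaling logic is bypassed. With a minimum of 1, as in the gdb output earlier in this thread, threads can be suspended.
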

@dipinhora
Contributor Author

@ponylang/core Please let us know how you would like to proceed regarding this. The situation is the following:

  • Dynamic scheduler scaling mostly works (with both signals and pthreads after the changes in this PR) except for the hangs documented in issue #2451 ("stdlib in a loop hangs").
  • It seems the hangs only occur on shutdown, but it is possible the root cause is something that could allow the dynamic scheduler scaling to misbehave during a running program's lifetime before it reaches quiescence.
  • Disabling dynamic scheduler scaling (by setting ponyminthreads to the same value as ponythreads) avoids issue #2451 because it bypasses all of the dynamic scheduler scaling logic that triggers the hangs.
  • I have an idea for a potential solution to issue #2451, but I'm not certain that it will fix the problem. I'm also not sure how long it will take me to implement this alternate solution.

My recommendation, and @winksaville's, is to disable the dynamic scheduler scaling logic (as the last commit in this PR does) until issue #2451 is resolved.

@dipinhora
Contributor Author

Restarting linux travis builds that failed due to llvm download timeout.

@dipinhora
Contributor Author

Restarting linux travis builds again due to llvm download timeout.

@dipinhora
Contributor Author

Progress.. went from 2 failing jobs to 1 failing job.. Restarting the last linux travis build again due to llvm download timeout.

@dipinhora
Contributor Author

And again........................................

@SeanTAllen SeanTAllen merged commit 26d43da into ponylang:master Jan 10, 2018
ponylang-main added a commit that referenced this pull request Jan 10, 2018
Labels
changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge