Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stdlib in a loop hangs #2451

Closed
winksaville opened this issue Dec 26, 2017 · 31 comments
Closed

stdlib in a loop hangs #2451

winksaville opened this issue Dec 26, 2017 · 31 comments
Labels
triggers release Major issue that when fixed, results in an "emergency" release

Comments

@winksaville
Copy link
Contributor

It appears #2445 isn't solved. Previously I had run for 4 hours but with lots of debug output, yesterday I removed all the debug and then reran the simple while loop test:

while true; do DATE=`date`; echo "Date: ${DATE}"; ./stdlib --sequential; echo "stdlib done"; don

And it hung after 18 minutes, so I used gdb to look at the code here are the logs:

$ sudo gdb ./stdlib 19987
[sudo] password for wink: 
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./stdlib...(no debugging symbols found)...done.
Attaching to program: /home/wink/prgs/pony/ponyc/stdlib, process 19987
[New LWP 19989]
[New LWP 19994]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007efda419143d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7efda530c6c0 (LWP 19987) "stdlib" 0x00007efda419143d in pthread_join ()
   from /usr/lib/libpthread.so.0
  2    Thread 0x7efd9ac5c700 (LWP 19989) "stdlib" 0x000055d003ad4979 in run_thread ()
  3    Thread 0x7efd7045c700 (LWP 19994) "stdlib" 0x00007efda419a89f in do_sigwait ()
   from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7efda530c6c0 (LWP 19987))]
#0  0x00007efda419143d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007efda419143d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x000055d003adccbb in ponyint_thread_join ()
#2  0x000055d003ad3afb in ponyint_sched_shutdown ()
#3  0x000055d003ad3d87 in ponyint_sched_start ()
#4  0x000055d003ad3a3d in pony_start ()
#5  0x000055d003ac56d5 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7efd9ac5c700 (LWP 19989))]
#0  0x000055d003ad4979 in run_thread ()
(gdb) bt
#0  0x000055d003ad4979 in run_thread ()
#1  0x00007efda419008a in start_thread () from /usr/lib/libpthread.so.0
#2  0x00007efda376042f in clone () from /usr/lib/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7efd7045c700 (LWP 19994))]
#0  0x00007efda419a89f in do_sigwait () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007efda419a89f in do_sigwait () from /usr/lib/libpthread.so.0
#1  0x00007efda419a92d in sigwait () from /usr/lib/libpthread.so.0
#2  0x000055d003adcd32 in ponyint_thread_suspend ()
#3  0x000055d003ad4c9d in run_thread ()
#4  0x00007efda419008a in start_thread () from /usr/lib/libpthread.so.0
#5  0x00007efda376042f in clone () from /usr/lib/libc.so.6
(gdb) 

I then ran it again this morning and it hung after about 6 mintes. Below is the gdb output and here are the logs. This is the first time I've seen 8 threads:

$ sudo gdb ./stdlib 8342
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./stdlib...(no debugging symbols found)...done.
Attaching to program: /home/wink/prgs/pony/ponyc/stdlib, process 8342
[New LWP 8343]
[New LWP 8344]
[New LWP 8345]
[New LWP 8346]
[New LWP 8347]
[New LWP 8348]
[New LWP 8349]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f5742d9043d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f5743f0b6c0 (LWP 8342) "stdlib" 0x00007f5742d9043d in pthread_join ()
   from /usr/lib/libpthread.so.0
  2    Thread 0x7f573a05c700 (LWP 8343) "stdlib" 0x00007f574235f596 in epoll_pwait ()
   from /usr/lib/libc.so.6
  3    Thread 0x7f573985b700 (LWP 8344) "stdlib" 0x00007f5742d99230 in nanosleep ()
   from /usr/lib/libpthread.so.0
  4    Thread 0x7f573105b700 (LWP 8345) "stdlib" 0x00007f5742d99230 in nanosleep ()
   from /usr/lib/libpthread.so.0
  5    Thread 0x7f573085b700 (LWP 8346) "stdlib" 0x00007f5742d99230 in nanosleep ()
   from /usr/lib/libpthread.so.0
  6    Thread 0x7f572005b700 (LWP 8347) "stdlib" 0x00007f5742d99230 in nanosleep ()
   from /usr/lib/libpthread.so.0
  7    Thread 0x7f571785b700 (LWP 8348) "stdlib" 0x00007f5742d9989f in do_sigwait ()
   from /usr/lib/libpthread.so.0
  8    Thread 0x7f570f05b700 (LWP 8349) "stdlib" 0x00007f5742d99230 in nanosleep ()
   from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f5743f0b6c0 (LWP 8342))]
#0  0x00007f5742d9043d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f5742d9043d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x000055afc517bcbb in ponyint_thread_join ()
#2  0x000055afc5172afb in ponyint_sched_shutdown ()
#3  0x000055afc5172d87 in ponyint_sched_start ()
#4  0x000055afc5172a3d in pony_start ()
#5  0x000055afc51646d5 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f573a05c700 (LWP 8343))]
#0  0x00007f574235f596 in epoll_pwait () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007f574235f596 in epoll_pwait () from /usr/lib/libc.so.6
#1  0x000055afc5177dbb in ponyint_asio_backend_dispatch ()
#2  0x00007f5742d8f08a in start_thread () from /usr/lib/libpthread.so.0
#3  0x00007f574235f42f in clone () from /usr/lib/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f573985b700 (LWP 8344))]
#0  0x00007f5742d99230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f5742d99230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055afc517fe73 in ponyint_cpu_core_pause ()
#2  0x000055afc51739b2 in run_thread ()
#3  0x00007f5742d8f08a in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007f574235f42f in clone () from /usr/lib/libc.so.6
(gdb) thread 4
[Switching to thread 4 (Thread 0x7f573105b700 (LWP 8345))]
#0  0x00007f5742d99230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f5742d99230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055afc517fe73 in ponyint_cpu_core_pause ()
#2  0x000055afc51739b2 in run_thread ()
#3  0x00007f5742d8f08a in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007f574235f42f in clone () from /usr/lib/libc.so.6
(gdb) thread 5
[Switching to thread 5 (Thread 0x7f573085b700 (LWP 8346))]
#0  0x00007f5742d99230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f5742d99230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055afc517fe73 in ponyint_cpu_core_pause ()
#2  0x000055afc51739b2 in run_thread ()
#3  0x00007f5742d8f08a in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007f574235f42f in clone () from /usr/lib/libc.so.6
(gdb) thread 6
[Switching to thread 6 (Thread 0x7f572005b700 (LWP 8347))]
#0  0x00007f5742d99230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f5742d99230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055afc517fe73 in ponyint_cpu_core_pause ()
#2  0x000055afc51739b2 in run_thread ()
#3  0x00007f5742d8f08a in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007f574235f42f in clone () from /usr/lib/libc.so.6
(gdb) thread 7
[Switching to thread 7 (Thread 0x7f571785b700 (LWP 8348))]
#0  0x00007f5742d9989f in do_sigwait () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f5742d9989f in do_sigwait () from /usr/lib/libpthread.so.0
#1  0x00007f5742d9992d in sigwait () from /usr/lib/libpthread.so.0
#2  0x000055afc517bd32 in ponyint_thread_suspend ()
#3  0x000055afc5173c9d in run_thread ()
#4  0x00007f5742d8f08a in start_thread () from /usr/lib/libpthread.so.0
#5  0x00007f574235f42f in clone () from /usr/lib/libc.so.6
(gdb) thread 8
[Switching to thread 8 (Thread 0x7f570f05b700 (LWP 8349))]
#0  0x00007f5742d99230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f5742d99230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055afc517fe73 in ponyint_cpu_core_pause ()
#2  0x000055afc51739b2 in run_thread ()
#3  0x00007f5742d8f08a in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007f574235f42f in clone () from /usr/lib/libc.so.6
(gdb) 
@mfelsche mfelsche added bug: 1 - needs investigation triggers release Major issue that when fixed, results in an "emergency" release labels Dec 27, 2017
@winksaville
Copy link
Contributor Author

@SeanTAllen any success duplicating this issue?

@dipinhora
Copy link
Contributor

@winksaville I haven't had a chance to look into it yet. Hopefully soon... In relation to that, are there any specifics to reproducing beyond running ./stdlib in a loop? Do I need to use a specific number of ponythreads?

In the meantime:

  • My memory is fuzzy, but I vaguely recall similar stdlib hangs in CI prior to the dynamic scheduler scaling changes. Would it be possible for you to see if you can reproduce this from before that was merged?
  • Also, would it be possible for you to see if you can reproduce this with a debug build of ponyc and then in gdb/lldb print out information about the internal scheduler state?

winksaville added a commit to winksaville/ponyc that referenced this issue Dec 28, 2017
The targets can be used to test issue ponylang#2451

These use the sub-targets stdlib-debug, stdlib to compile stdlib either
in debug or not. And stdlib-loop will execute stdlib in a forever while
loop capturing the output to stdlib-xxxx.txt where xxx is the current
date and time.
@winksaville
Copy link
Contributor Author

I've add PR #2456 which has just failed on my system running test-stdlib-debug. What specific information would you like to have, here are the backtraces for the 3 running threads:

$ sudo gdb ./stdlib 12117
[sudo] password for wink: 
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./stdlib...done.
Attaching to program: /home/wink/prgs/pony/ponyc/stdlib, process 12117
[New LWP 12119]
[New LWP 12124]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f848ad4a43d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f848bec46c0 (LWP 12117) "stdlib" 0x00007f848ad4a43d in pthread_join ()
   from /usr/lib/libpthread.so.0
  2    Thread 0x7f8481815700 (LWP 12119) "stdlib" 0x0000556b9a7c68e9 in run_thread ()
  3    Thread 0x7f8477815700 (LWP 12124) "stdlib" 0x00007f848ad5389f in do_sigwait ()
   from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f848bec46c0 (LWP 12117))]
#0  0x00007f848ad4a43d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f848ad4a43d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x0000556b9a7cec2b in ponyint_thread_join ()
#2  0x0000556b9a7c59f0 in ponyint_sched_shutdown ()
#3  0x0000556b9a7c5ca7 in ponyint_sched_start ()
#4  0x0000556b9a7c590d in pony_start ()
#5  0x0000556b9a7c55c5 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f8481815700 (LWP 12119))]
#0  0x0000556b9a7c68e9 in run_thread ()
(gdb) bt
#0  0x0000556b9a7c68e9 in run_thread ()
#1  0x00007f848ad4908a in start_thread () from /usr/lib/libpthread.so.0
#2  0x00007f848a31942f in clone () from /usr/lib/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f8477815700 (LWP 12124))]
#0  0x00007f848ad5389f in do_sigwait () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f848ad5389f in do_sigwait () from /usr/lib/libpthread.so.0
#1  0x00007f848ad5392d in sigwait () from /usr/lib/libpthread.so.0
#2  0x0000556b9a7ceca2 in ponyint_thread_suspend ()
#3  0x0000556b9a7c6c0d in run_thread ()
#4  0x00007f848ad4908a in start_thread () from /usr/lib/libpthread.so.0
#5  0x00007f848a31942f in clone () from /usr/lib/libc.so.6

Curious, what additional information is available when running stdlib after being compiled with a debug build of ponyc? In the above I used the test-stdlib-debug target which passes the "-d" switch to ponyc when compiling.

Be glad to test any prior build, could you please give me a specific commit. I'm guessing this commit which is the one prior to your #2386 commit, but want to make sure.

@dipinhora
Copy link
Contributor

@winksaville compiling ponyc also compiles libponyrt which is then linked by ponyc into the final binary whenever any pony programs (like stdlib are built). Compiling a program using ponyc --debug does not impact the underlying libponyrt that is linked to but only the relevant code generated for the application level pony source code.

Compiling ponyc with config=debug results in libponyrt being compiled in debug (without optimizations) and with debug symbols which makes it possible to inspect the complete state of the runtime while a libponyrt compiled in release (config=release which is the default) makes it not possible to inspect the runtime in as much detail due to optimizations and lack of debug symbols. If ponyc is compiled with config=debug, there is no need to compile the binary with ponyc --debug in order to debug the runtime (libponyrt) itself. Compiling with ponyc --debug is mainly helpful for debugging the application level logic written in pony.

I hope that the above was helpful (and not just more confusing). I'm sure there's a better way to explain things but I'm unable to manage such an explanation at the moment.

In regards to what information to look at once you're in the debugger, it would be helpful to be able to see the following:

In regards to testing a prior build before the dynamic scheduler scaling changes, yes, that commit prior to #2386 being merged is a good one. Or you can pick one of the official release tags. Either one works.

@winksaville
Copy link
Contributor Author

Makes perfect sense, now that you explain it, txs. I'll rerun the test with ponyc compiled with config=debug and as well as with a prior commit.

@winksaville
Copy link
Contributor Author

whoops, wrong button, reopening.

@winksaville winksaville reopened this Dec 28, 2017
winksaville added a commit to winksaville/ponyc that referenced this issue Dec 29, 2017
The targets can be used to test issue ponylang#2451

These use the sub-targets stdlib-debug, stdlib to compile stdlib either
in debug or not. And stdlib-loop will execute stdlib in a forever while
loop capturing the output to stdlib-xxxx.txt where xxx is the current
date and time.
winksaville added a commit to winksaville/ponyc that referenced this issue Dec 30, 2017
The targets can be used to test issue ponylang#2451

These use the sub-targets stdlib-debug, stdlib to compile stdlib either
in debug or not. And stdlib-loop will execute stdlib in a forever while
loop capturing the output to stdlib-xxxx.txt where xxx is the current
date and time.
winksaville added a commit to winksaville/ponyc that referenced this issue Dec 30, 2017
The targets can be used to test issue ponylang#2451

These use the sub-targets stdlib-debug, stdlib to compile stdlib either
in debug or not. And stdlib-loop will execute stdlib in a forever while
loop capturing the output to stdlib-xxxx.txt where xxx is the current
date and time.
@winksaville
Copy link
Contributor Author

@dipinhora so I've run two branches in my repo.
debug-2451 which is built on master from 12/27 and
debug-2451-0.21.0 which is built on tag 0.21.0 from 12/17.

I built both the following, which builds "all" and then builds "stdlib":

rm -rf build; make config=debug default_pic=true use=openssl_1.1.0 stdlib

Finally I run test-long-term to start the forever loop

make test-long-term

I ran debug-2451-0.21.0 for about 16.5 hours starting yesterday afternoon and finishing this morning, there were no issues. If you wan't to see the 2.4G of uncompressed logs its here.

This moring I ran debug-2451 which has you new scheduler stuff and it ran a couple minutes and then hung. Another words it fails quickly, its logs are here.
I then used gdb to try to get the information you requested:

$ sudo gdb ./stdlib 12196
[sudo] password for wink: 
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./stdlib...done.
Attaching to program: /home/wink/prgs/pony/ponyc/stdlib, process 12196
[New LWP 12198]
[New LWP 12203]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f0e2913743d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f0e2a2b16c0 (LWP 12196) "stdlib" 0x00007f0e2913743d in pthread_join ()
   from /usr/lib/libpthread.so.0
  2    Thread 0x7f0e1fc02700 (LWP 12198) "stdlib" 0x000055b3d7526b7d in wake_suspended_threads
    () at src/libponyrt/sched/scheduler.c:124
  3    Thread 0x7f0dfd402700 (LWP 12203) "stdlib" 0x00007f0e2914089f in do_sigwait ()
   from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f0e2a2b16c0 (LWP 12196))]
#0  0x00007f0e2913743d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f0e2913743d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x000055b3d75348c1 in ponyint_thread_join (thread=139698638956288)
    at src/libponyrt/platform/threads.c:229
#2  0x000055b3d752764a in ponyint_sched_shutdown () at src/libponyrt/sched/scheduler.c:587
#3  0x000055b3d7527b46 in ponyint_sched_start (library=false)
    at src/libponyrt/sched/scheduler.c:718
#4  0x000055b3d75267a5 in pony_start (library=false, language_features=true)
    at src/libponyrt/sched/start.c:168
#5  0x000055b3d7518345 in main ()
(gdb) print *this_scheduler
$1 = {tid = 139698813736640, index = 0, cpu = 0, node = 0, terminate = false, 
  asio_stopped = false, asio_noisy = false, sleep_object = 0, last_victim = 0x0, ctx = {
    scheduler = 0x0, current = 0x7f0e28401c00, 
    trace_object = 0x55b3d752b7df <ponyint_gc_sendobject>, 
    trace_actor = 0x55b3d752bb4f <ponyint_gc_sendactor>, stack = 0x0, acquire = {contents = {
        count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, 
    serialise_size = 0, serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, 
        buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {
    contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x0, 
    tail = {{object = 0x0, counter = 0}, raw = 0x00000000000000000000000000000000}}, mq = {
    head = 0x0, tail = 0x0}}
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f0e1fc02700 (LWP 12198))]
#0  0x000055b3d7526b7d in wake_suspended_threads () at src/libponyrt/sched/scheduler.c:124
124	    if(!atomic_exchange_explicit(&scheduler_count_changing, true,
(gdb) bt
#0  0x000055b3d7526b7d in wake_suspended_threads () at src/libponyrt/sched/scheduler.c:124
#1  0x000055b3d7526f91 in quiescent (sched=0x7f0e28402800, tsc=842543479762183, 
    tsc2=842546926076603) at src/libponyrt/sched/scheduler.c:271
#2  0x000055b3d75271b9 in steal (sched=0x7f0e28402800) at src/libponyrt/sched/scheduler.c:370
#3  0x000055b3d7527461 in run (sched=0x7f0e28402800) at src/libponyrt/sched/scheduler.c:509
#4  0x000055b3d75275d9 in run_thread (arg=0x7f0e28402800)
    at src/libponyrt/sched/scheduler.c:574
#5  0x00007f0e2913608a in start_thread () from /usr/lib/libpthread.so.0
#6  0x00007f0e2870642f in clone () from /usr/lib/libc.so.6
(gdb) print *this_scheduler 
$2 = {tid = 139698638956288, index = 0, cpu = 0, node = 0, terminate = false, 
  asio_stopped = true, asio_noisy = false, sleep_object = 12, last_victim = 0x7f0e28402b80, 
  ctx = {scheduler = 0x7f0e28402800, current = 0x7f0e16be7800, 
    trace_object = 0x55b3d752b7df <ponyint_gc_sendobject>, 
    trace_actor = 0x55b3d752bb4f <ponyint_gc_sendactor>, stack = 0x0, acquire = {contents = {
        count = 0, size = 8, item_bitmap = 0x7f0e16be3b00, buckets = 0x7f0e16be3b08}}, 
    serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 5, ack_token = 48, 
  ack_count = 5, mute_mapping = {contents = {count = 0, size = 0, item_bitmap = 0x0, 
      buckets = 0x0}}, q = {head = 0x7f0df4936300, tail = {{object = 0x7f0df4936300, 
        counter = 1011}, raw = 0x00000000000003f300007f0df4936300}}, mq = {
    head = 0x7f0dd0f53e00, tail = 0x7f0dd156a500}}
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f0dfd402700 (LWP 12203))]
#0  0x00007f0e2914089f in do_sigwait () from /usr/lib/libpthread.so.0
(gdb) print *this_scheduler 
$3 = {tid = 139698060142336, index = 5, cpu = 5, node = 0, terminate = false, 
  asio_stopped = false, asio_noisy = false, sleep_object = 12, last_victim = 0x7f0e284029c0, 
  ctx = {scheduler = 0x7f0e284030c0, current = 0x7f0e16be7800, 
    trace_object = 0x55b3d752b88a <ponyint_gc_recvobject>, 
    trace_actor = 0x55b3d752bbca <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {
        count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, 
    serialise_size = 0, serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, 
        buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {
    contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {
    head = 0x7f0e16b9eae0, tail = {{object = 0x7f0e16b9eae0, counter = 33}, 
      raw = 0x000000000000002100007f0e16b9eae0}}, mq = {head = 0x7f0df4b5b000, 
    tail = 0x7f0e28403d40}}
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=1 active_scheduler_count=5 scheduler_count_changing=1 detect_quiescence=1 use_yield=1

The one thing that doesn't work is getting the queue size, it SIGSEGV's:

(gdb) p mpmcq_size_debug(&this_scheduler->mq)

Thread 3 "stdlib" received signal SIGSEGV, Segmentation fault.
0x000055d9113eae82 in mpmcq_size_debug (q=0x7f58afab1240)
    at src/libponyrt/sched/mpmcq.c:49
49	  while(atomic_load_explicit(&tail->next, memory_order_relaxed) != NULL)
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(mpmcq_size_debug) will be abandoned.
When the function is done executing, GDB will silently stop.

So I'm doing something wrong. Actually, I'm probably doing lots of things wrong. Just let me know what you'd like me to do.

@dipinhora
Copy link
Contributor

Hi @winksaville. Thanks for taking the time to do all the legwork and for confirming that the issue didn't exist prior to the dynamic scheduler scaling code changes.

I tried reproducing but my virtual machine doesn't seem to trigger the issue (or I just didn't wait long enough). I'll have to spin up an AWS instance at some point.

The debug information you provided is very helpful and points to the fact that the last thread is not being woken properly. I'm not sure why that's the case, but the following diff should try and catch that scenario and try again automatically (until it is hopefully successful and cleanly shuts down):

diff --git a/src/common/threads.h b/src/common/threads.h
index 6ed464f..06cb87c 100644
--- a/src/common/threads.h
+++ b/src/common/threads.h
@@ -59,6 +59,6 @@ pony_thread_id_t ponyint_thread_self();
 
 void ponyint_thread_suspend(pony_signal_event_t signal);
 
-void ponyint_thread_wake(pony_thread_id_t thread, pony_signal_event_t signal);
+int ponyint_thread_wake(pony_thread_id_t thread, pony_signal_event_t signal);
 
 #endif
diff --git a/src/libponyrt/platform/threads.c b/src/libponyrt/platform/threads.c
index 52e8a23..a9660a5 100644
--- a/src/libponyrt/platform/threads.c
+++ b/src/libponyrt/platform/threads.c
@@ -276,19 +276,18 @@ void ponyint_thread_suspend(pony_signal_event_t signal)
 #endif
 }
 
-void ponyint_thread_wake(pony_thread_id_t thread, pony_signal_event_t signal)
+int ponyint_thread_wake(pony_thread_id_t thread, pony_signal_event_t signal)
 {
+  int ret;
 #if defined(PLATFORM_IS_WINDOWS)
   (void) thread;
-  SetEvent(signal);
+  ret = SetEvent(signal);
 #elif defined(USE_SCHEDULER_SCALING_PTHREADS)
   (void) thread;
-  int ret;
   // signal condition variable
   ret = pthread_cond_signal(signal);
-  // TODO: What to do if `ret` is an unrecoverable error?
-  (void) ret;
 #else
-  pthread_kill(thread, signal);
+  ret = pthread_kill(thread, signal);
 #endif
+  return ret;
 }
diff --git a/src/libponyrt/sched/scheduler.c b/src/libponyrt/sched/scheduler.c
index ab23773..d942c88 100644
--- a/src/libponyrt/sched/scheduler.c
+++ b/src/libponyrt/sched/scheduler.c
@@ -126,9 +126,16 @@ static void wake_suspended_threads()
     {
       // in case the count changed between the while check and now
       if(get_active_scheduler_count() < scheduler_count)
+      {
         // send signal to wake up next scheduler thread available
-        ponyint_thread_wake(scheduler[get_active_scheduler_count()].tid,
-          scheduler[get_active_scheduler_count()].sleep_object);
+        if(!ponyint_thread_wake(scheduler[get_active_scheduler_count()].tid,
+          scheduler[get_active_scheduler_count()].sleep_object))
+          // if there was an error waking the thread
+          // unlock the bool that controls modifying the active scheduler count
+          // variable.
+          atomic_store_explicit(&scheduler_count_changing, false,
+            memory_order_release);
+      }
       else
         // if there are no scheduler threads left to unlock
         // unlock the bool that controls modifying the active scheduler count
@@ -800,8 +807,13 @@ void ponyint_sched_maybe_wakeup()
     memory_order_acquire))
   {
     // send signal to wake up next scheduler thread available
-    ponyint_thread_wake(scheduler[current_active_scheduler_count].tid,
-      scheduler[current_active_scheduler_count].sleep_object);
+    if(!ponyint_thread_wake(scheduler[current_active_scheduler_count].tid,
+      scheduler[current_active_scheduler_count].sleep_object))
+      // if there was an error waking the thread
+      // unlock the bool that controls modifying the active scheduler count
+      // variable.
+      atomic_store_explicit(&scheduler_count_changing, false,
+        memory_order_release);
   }
 }
 

There's no guarantee that this fixes the issue, but I think it should help. I likely need to go through and add in error handling logic everywhere there's a syscall that can return an error code to catch other similar edge cases.

It would be great if you're able to test with this diff applied to see if it helps any.

Also, re: the segfault when trying to get the queue count, that was my fault. I told you to use the wrong function to print the size of the queue. You should be using p messageq_size_debug(&sched->mq).

@winksaville
Copy link
Contributor Author

Today, I've been trying to simpily stdlib, it now is down to the code below, the big thing was it doesn't require openssl 1.1.0. It failed once after 20minutes and has now just failed a second time but took 55minutes!

So openssl 1.1.0 isn't directly the problem, but there is one other non-standard change and that's that my Arch Linux system runs llvm 5.0. This definitely has issues as the "normal tests" won't complete unless I remove a bunch of tests, see this.

Anyway, I'll apply your patch and see what happens.

use "ponytest"
use buffered = "buffered"
use builtin_test = "builtin_test"
use bureaucracy = "bureaucracy"
use collections = "collections"
use collections_persistent = "collections/persistent"
use files = "files"
use format = "format"
use glob = "glob"

actor Main is TestList
  new create(env: Env) => PonyTest(env, this)
  new make() => None

  fun tag tests(test: PonyTest) =>
    builtin_test.Main.make().tests(test)
    buffered.Main.make().tests(test)
    bureaucracy.Main.make().tests(test)
    collections.Main.make().tests(test)
    collections_persistent.Main.make().tests(test)
    files.Main.make().tests(test)
    format.Main.make().tests(test)

    ifdef not windows then
      // The glob tests are currently very broken on Windows, so ignore them.
      glob.Main.make().tests(test)
    end

@winksaville
Copy link
Contributor Author

winksaville commented Dec 31, 2017

@dipinhora, I applied your patch and stdlib still hung.

But I think you have the logic backwards, your if statement is true on the "good" case i.e. when ponyint_thread_wake returns 0. I changed it to the opposite logic and also added a printf just to see what would happen:

$ git diff HEAD^ src/libponyrt/sched/scheduler.c
diff --git a/src/libponyrt/sched/scheduler.c b/src/libponyrt/sched/scheduler.c
index d942c88d..dbc9c770 100644
--- a/src/libponyrt/sched/scheduler.c
+++ b/src/libponyrt/sched/scheduler.c
@@ -12,6 +12,8 @@
 #include <string.h>
 #include "mutemap.h"
 
+#include <stdio.h>
+
 #define PONY_SCHED_BATCH 100
 
 static DECLARE_THREAD_FN(run_thread);
@@ -128,13 +130,18 @@ static void wake_suspended_threads()
       if(get_active_scheduler_count() < scheduler_count)
       {
         // send signal to wake up next scheduler thread available
-        if(!ponyint_thread_wake(scheduler[get_active_scheduler_count()].tid,
+        if(0 != ponyint_thread_wake(scheduler[get_active_scheduler_count()].tid,
           scheduler[get_active_scheduler_count()].sleep_object))
+        {
+          int err = errno;
           // if there was an error waking the thread
           // unlock the bool that controls modifying the active scheduler count
           // variable.
           atomic_store_explicit(&scheduler_count_changing, false,
             memory_order_release);
+          printf("wake_suspended_threads: errno=%d from ponyint_thread_wake\n",
+              err);
+        }
       }
       else
         // if there are no scheduler threads left to unlock
@@ -807,13 +814,18 @@ void ponyint_sched_maybe_wakeup()
     memory_order_acquire))
   {
     // send signal to wake up next scheduler thread available
-    if(!ponyint_thread_wake(scheduler[current_active_scheduler_count].tid,
+    if(0 != ponyint_thread_wake(scheduler[current_active_scheduler_count].tid,
       scheduler[current_active_scheduler_count].sleep_object))
+    {
+      int err = errno;
       // if there was an error waking the thread
       // unlock the bool that controls modifying the active scheduler count
       // variable.
       atomic_store_explicit(&scheduler_count_changing, false,
         memory_order_release);
+      printf("ponyint_sched_maybe_wakeup: errno=%d from ponyint_thread_wake\n",
+            err);
+    }
   }
 }

As you might expect ponyint_thread_wake always returned 0 and stdlib still hung.

Below is the gdb output with the above patch:

[sudo] password for wink: 
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./stdlib...done.
Attaching to program: /home/wink/prgs/pony/ponyc/stdlib, process 31748
[New LWP 31750]
[New LWP 31755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f126418043d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f12652fa6c0 (LWP 31748) "stdlib" 0x00007f126418043d in pthread_join ()
   from /usr/lib/libpthread.so.0
  2    Thread 0x7f125ac4c700 (LWP 31750) "stdlib" 0x00005604cd2f15bd in wake_suspended_threads () at src/libponyrt/sched/scheduler.c:126
  3    Thread 0x7f123044c700 (LWP 31755) "stdlib" 0x00007f126418989f in do_sigwait ()
   from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f12652fa6c0 (LWP 31748))]
#0  0x00007f126418043d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f126418043d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x00005604cd2ff38d in ponyint_thread_join (thread=139716808984320)
    at src/libponyrt/platform/threads.c:229
#2  0x00005604cd2f20ce in ponyint_sched_shutdown () at src/libponyrt/sched/scheduler.c:601
#3  0x00005604cd2f25ca in ponyint_sched_start (library=false)
    at src/libponyrt/sched/scheduler.c:732
#4  0x00005604cd2f11e5 in pony_start (library=false, language_features=true)
    at src/libponyrt/sched/start.c:168
#5  0x00005604cd2e2d98 in main ()
(gdb) p *this_scheduler 
$1 = {tid = 139716983760576, index = 0, cpu = 0, node = 0, terminate = false, 
  asio_stopped = false, asio_noisy = false, sleep_object = 0, last_victim = 0x0, ctx = {
    scheduler = 0x0, current = 0x7f126344bc00, 
    trace_object = 0x5604cd2f62ab <ponyint_gc_sendobject>, 
    trace_actor = 0x5604cd2f661b <ponyint_gc_sendactor>, stack = 0x0, acquire = {contents = {
        count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, 
    serialise_size = 0, serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, 
        buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {
    contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x0, 
    tail = {{object = 0x0, counter = 0}, raw = 0x00000000000000000000000000000000}}, mq = {
    head = 0x0, tail = 0x0}}
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f125ac4c700 (LWP 31750))]
#0  0x00005604cd2f15bd in wake_suspended_threads () at src/libponyrt/sched/scheduler.c:126
126	    if(!atomic_exchange_explicit(&scheduler_count_changing, true,
(gdb) bt
#0  0x00005604cd2f15bd in wake_suspended_threads () at src/libponyrt/sched/scheduler.c:126
#1  0x00005604cd2f1a15 in quiescent (sched=0x7f126344c800, tsc=1154829614647498, 
    tsc2=1154829617615502) at src/libponyrt/sched/scheduler.c:285
#2  0x00005604cd2f1c3d in steal (sched=0x7f126344c800) at src/libponyrt/sched/scheduler.c:384
#3  0x00005604cd2f1ee5 in run (sched=0x7f126344c800) at src/libponyrt/sched/scheduler.c:523
#4  0x00005604cd2f205d in run_thread (arg=0x7f126344c800)
    at src/libponyrt/sched/scheduler.c:588
#5  0x00007f126417f08a in start_thread () from /usr/lib/libpthread.so.0
#6  0x00007f126374f42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$2 = {tid = 139716808984320, index = 0, cpu = 0, node = 0, terminate = false, 
  asio_stopped = true, asio_noisy = false, sleep_object = 12, last_victim = 0x7f126344c9c0, 
  ctx = {scheduler = 0x7f126344c800, current = 0x7f12383ca600, 
    trace_object = 0x5604cd2f6356 <ponyint_gc_recvobject>, 
    trace_actor = 0x5604cd2f6696 <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {
        count = 0, size = 8, item_bitmap = 0x7f1238333a00, buckets = 0x7f1238333a08}}, 
    serialise_buffer = 0x7f123a372720, serialise_size = 0, serialise = {contents = {
        count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, 
    serialise_alloc = 0x5604cd2e67d0 <serialise_$36$2_apply_oZo>, 
    serialise_alloc_final = 0x5604cd1bf190 <serialise_$36$3_apply_oZo>, 
    serialise_throw = 0x5604cd2033f0 <serialise_$36$4_apply_o>}, block_count = 5, 
  ack_token = 40, ack_count = 5, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7f12099c1a20, tail = {{
        object = 0x7f12099c1a20, counter = 1266}, 
      raw = 0x00000000000004f200007f12099c1a20}}, mq = {head = 0x7f123d5e1c20, 
    tail = 0x7f12099b3b20}}
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f123044c700 (LWP 31755))]
#0  0x00007f126418989f in do_sigwait () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f126418989f in do_sigwait () from /usr/lib/libpthread.so.0
#1  0x00007f126418992d in sigwait () from /usr/lib/libpthread.so.0
#2  0x00005604cd2ff424 in ponyint_thread_suspend (signal=12)
    at src/libponyrt/platform/threads.c:275
#3  0x00005604cd2f1db0 in steal (sched=0x7f126344d0c0) at src/libponyrt/sched/scheduler.c:459
#4  0x00005604cd2f1ee5 in run (sched=0x7f126344d0c0) at src/libponyrt/sched/scheduler.c:523
#5  0x00005604cd2f205d in run_thread (arg=0x7f126344d0c0)
    at src/libponyrt/sched/scheduler.c:588
#6  0x00007f126417f08a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007f126374f42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$3 = {tid = 139716095952640, index = 5, cpu = 5, node = 0, terminate = false, 
  asio_stopped = false, asio_noisy = false, sleep_object = 12, last_victim = 0x7f126344cd40, 
  ctx = {scheduler = 0x7f126344d0c0, current = 0x0, trace_object = 0x0, trace_actor = 0x0, 
    stack = 0x0, acquire = {contents = {count = 0, size = 0, item_bitmap = 0x0, 
        buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {
      contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, 
    serialise_alloc = 0x0, serialise_alloc_final = 0x0, serialise_throw = 0x0}, 
  block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, 
      size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7f126344dd60, tail = {{
        object = 0x7f126344dd60, counter = 0}, raw = 0x000000000000000000007f126344dd60}}, 
  mq = {head = 0x7f12099b3a20, tail = 0x7f126344dd40}}
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=1 active_scheduler_count=5 scheduler_count_changing=1 detect_quiescence=1 use_yield=1

@dipinhora
Copy link
Contributor

@winksaville Yes, I did have the logic backwards on the if conditions. Thanks for correcting the logic and testing.

Still no clue why that last thread is not being woken if the signal is being successfully sent since signals are supposed to not get lost according to my understanding of things.

I know you tested this once previously, but would you be able to test again with make use= scheduler_scaling_pthreads to confirm whether it's an issue that exists regardless of pthreads or signals with all the changes that have been added since your last test? If it works without issue when using pthreads we can switch to using pthreads by default. If it hangs when using pthreads also, then either there are multiple issues or maybe an issue with the logic around the boolean that controls scheduler sleeping/waking.

I'm likely not going to be able to spend time on AWS to reproduce this for at least a week (maybe longer) so I'm mainly going to be limited to some analysis of code and semantics and maybe a suggestion or two in the meantime.

@winksaville
Copy link
Contributor Author

NP, I'll try it in the morning.

@winksaville
Copy link
Contributor Author

Is there a document on the pony runtime, specifically the relationship between the threads, actors and message queues?

Can I ignore GC when delving into these relationships?

@winksaville
Copy link
Contributor Author

@dipinhora, stdlib hung within a few minutes when I built with

make config=debug default_pic=true use=scheduler_scaling_pthreads use=openssl_1.1.0

Here is the gdb output, notice there were 4 threads still alive instead of the usual 3, but I've also see it with as many as 8 threads so I don't beleive it's related to use=scheduler_scaling_pthreads:

GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./stdlib...done.
Attaching to program: /home/wink/prgs/pony/ponyc/stdlib, process 32331
[New LWP 32333]
[New LWP 32337]
[New LWP 32338]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f1ac001c43d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f1ac11966c0 (LWP 32331) "stdlib" 0x00007f1ac001c43d in pthread_join ()
   from /usr/lib/libpthread.so.0
  2    Thread 0x7f1ab6ae8700 (LWP 32333) "stdlib" 0x0000559085d79062 in wake_suspended_threads () at src/libponyrt/sched/scheduler.c:126
  3    Thread 0x7f1a94ae8700 (LWP 32337) "stdlib" 0x00007f1ac002589f in do_sigwait ()
   from /usr/lib/libpthread.so.0
  4    Thread 0x7f1a942e8700 (LWP 32338) "stdlib" 0x00007f1ac002589f in do_sigwait ()
   from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f1ac11966c0 (LWP 32331))]
#0  0x00007f1ac001c43d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f1ac001c43d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x0000559085d86e2d in ponyint_thread_join (thread=139752710768384)
    at src/libponyrt/platform/threads.c:229
#2  0x0000559085d79b6e in ponyint_sched_shutdown () at src/libponyrt/sched/scheduler.c:601
#3  0x0000559085d7a06a in ponyint_sched_start (library=false)
    at src/libponyrt/sched/scheduler.c:732
#4  0x0000559085d78c85 in pony_start (library=false, language_features=true)
    at src/libponyrt/sched/start.c:168
#5  0x0000559085d6a835 in main ()
(gdb) p *this_scheduler 
$1 = {tid = 139752885544640, index = 0, cpu = 0, node = 0, terminate = false, 
  asio_stopped = false, asio_noisy = false, sleep_object = 0, last_victim = 0x0, ctx = {
    scheduler = 0x0, current = 0x7f1abf2e7c00, 
    trace_object = 0x559085d7dd4b <ponyint_gc_sendobject>, 
    trace_actor = 0x559085d7e0bb <ponyint_gc_sendactor>, stack = 0x0, acquire = {
      contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, 
    serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, 
  ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, item_bitmap = 0x0, 
      buckets = 0x0}}, q = {head = 0x0, tail = {{object = 0x0, counter = 0}, 
      raw = 0x00000000000000000000000000000000}}, mq = {head = 0x0, tail = 0x0}}
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f1ab6ae8700 (LWP 32333))]
#0  0x0000559085d79062 in wake_suspended_threads () at src/libponyrt/sched/scheduler.c:126
126	    if(!atomic_exchange_explicit(&scheduler_count_changing, true,
(gdb) bt
#0  0x0000559085d79062 in wake_suspended_threads () at src/libponyrt/sched/scheduler.c:126
#1  0x0000559085d794b5 in quiescent (sched=0x7f1abf2e8800, tsc=1375300254013405, 
    tsc2=1375300255898137) at src/libponyrt/sched/scheduler.c:285
#2  0x0000559085d796dd in steal (sched=0x7f1abf2e8800)
    at src/libponyrt/sched/scheduler.c:384
#3  0x0000559085d79985 in run (sched=0x7f1abf2e8800) at src/libponyrt/sched/scheduler.c:523
#4  0x0000559085d79afd in run_thread (arg=0x7f1abf2e8800)
    at src/libponyrt/sched/scheduler.c:588
#5  0x00007f1ac001b08a in start_thread () from /usr/lib/libpthread.so.0
#6  0x00007f1abf5eb42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$2 = {tid = 139752710768384, index = 0, cpu = 0, node = 0, terminate = false, 
  asio_stopped = true, asio_noisy = false, sleep_object = 12, last_victim = 0x7f1abf2e8b80, 
  ctx = {scheduler = 0x7f1abf2e8800, current = 0x7f1abf2e9a00, 
    trace_object = 0x559085d7ddf6 <ponyint_gc_recvobject>, 
    trace_actor = 0x559085d7e136 <ponyint_gc_recvactor>, stack = 0x0, acquire = {
      contents = {count = 0, size = 8, item_bitmap = 0x7f1a9ca0ba00, 
        buckets = 0x7f1a9ca0ba08}}, serialise_buffer = 0x7f1a9ca118c0, serialise_size = 0, 
    serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, 
    serialise_alloc = 0x559085d6e3c0 <serialise_$36$2_apply_oZo>, 
    serialise_alloc_final = 0x559085ce5960 <serialise_$36$3_apply_oZo>, 
    serialise_throw = 0x559085c64d10 <serialise_$36$4_apply_o>}, block_count = 4, 
  ack_token = 42, ack_count = 4, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7f1a62f6a940, tail = {{
        object = 0x7f1a62f6a940, counter = 958}, 
      raw = 0x00000000000003be00007f1a62f6a940}}, mq = {head = 0x7f1a631a4b40, 
    tail = 0x7f1a631a4ac0}}
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f1a94ae8700 (LWP 32337))]
#0  0x00007f1ac002589f in do_sigwait () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f1ac002589f in do_sigwait () from /usr/lib/libpthread.so.0
#1  0x00007f1ac002592d in sigwait () from /usr/lib/libpthread.so.0
#2  0x0000559085d86ec4 in ponyint_thread_suspend (signal=12)
    at src/libponyrt/platform/threads.c:275
#3  0x0000559085d79850 in steal (sched=0x7f1abf2e8f00)
    at src/libponyrt/sched/scheduler.c:459
#4  0x0000559085d79985 in run (sched=0x7f1abf2e8f00) at src/libponyrt/sched/scheduler.c:523
#5  0x0000559085d79afd in run_thread (arg=0x7f1abf2e8f00)
    at src/libponyrt/sched/scheduler.c:588
#6  0x00007f1ac001b08a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007f1abf5eb42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$3 = {tid = 139752140343040, index = 4, cpu = 4, node = 0, terminate = false, 
  asio_stopped = false, asio_noisy = false, sleep_object = 12, 
  last_victim = 0x7f1abf2e89c0, ctx = {scheduler = 0x7f1abf2e8f00, 
    current = 0x7f1aa52e7e00, trace_object = 0x559085d7ddf6 <ponyint_gc_recvobject>, 
    trace_actor = 0x559085d7e136 <ponyint_gc_recvactor>, stack = 0x0, acquire = {
      contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, 
    serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, 
  ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, item_bitmap = 0x0, 
      buckets = 0x0}}, q = {head = 0x7f1aa5299a20, tail = {{object = 0x7f1aa5299a20, 
        counter = 32}, raw = 0x000000000000002000007f1aa5299a20}}, mq = {
    head = 0x7f1a62f6a8c0, tail = 0x7f1abf2e9d00}}
(gdb) thread 4
[Switching to thread 4 (Thread 0x7f1a942e8700 (LWP 32338))]
#0  0x00007f1ac002589f in do_sigwait () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f1ac002589f in do_sigwait () from /usr/lib/libpthread.so.0
#1  0x00007f1ac002592d in sigwait () from /usr/lib/libpthread.so.0
#2  0x0000559085d86ec4 in ponyint_thread_suspend (signal=12)
    at src/libponyrt/platform/threads.c:275
#3  0x0000559085d79850 in steal (sched=0x7f1abf2e90c0)
    at src/libponyrt/sched/scheduler.c:459
#4  0x0000559085d79985 in run (sched=0x7f1abf2e90c0) at src/libponyrt/sched/scheduler.c:523
#5  0x0000559085d79afd in run_thread (arg=0x7f1abf2e90c0)
    at src/libponyrt/sched/scheduler.c:588
#6  0x00007f1ac001b08a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007f1abf5eb42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$4 = {tid = 139752131954432, index = 5, cpu = 5, node = 0, terminate = false, 
  asio_stopped = false, asio_noisy = false, sleep_object = 12, 
  last_victim = 0x7f1abf2e8b80, ctx = {scheduler = 0x7f1abf2e90c0, 
    current = 0x7f1aa5288600, trace_object = 0x559085d7ddf6 <ponyint_gc_recvobject>, 
    trace_actor = 0x559085d7e136 <ponyint_gc_recvactor>, stack = 0x0, acquire = {
      contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, 
    serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, 
  ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, item_bitmap = 0x0, 
      buckets = 0x0}}, q = {head = 0x7f1aa528b7c0, tail = {{object = 0x7f1aa528b7c0, 
        counter = 47}, raw = 0x000000000000002f00007f1aa528b7c0}}, mq = {
    head = 0x7f1a631cf340, tail = 0x7f1abf2e9d40}}
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=1 active_scheduler_count=4 scheduler_count_changing=1 detect_quiescence=1 use_yield=1

SeanTAllen pushed a commit that referenced this issue Jan 3, 2018
The targets can be used to test issue #2451

These use the sub-targets stdlib-debug, stdlib to compile stdlib either
in debug or not.
@dipinhora
Copy link
Contributor

@winksaville, thanks for taking the time to test out the pthreads variant of dynamic scheduler scaling. However, your make command didn't actually use pthreads because make treats arguments as variable assignments and the last variable assignment wins. In your make commands case, the last assignment to use was openssl_1.1.0 overriding the use=scheduler_scaling_pthreads. We can also see in the backtrace of the suspended threads that the sigwait function is being called on suspend which shouldn't be happening when using pthreads.

In order to use multiple items, you need to use the following make command:

make config=debug default_pic=true use="scheduler_scaling_pthreads openssl_1.1.0"

In regards to your question about documentation of the runtime. I'm not sure specific documentation exists. There are a number of papers about pony, so it is possible one of them covers the runtime (but I haven't read the papers so I could be wrong).

The following is a quick summary of my understanding of the relationship between the threads, actors, message queues and GC in the runtime. Please keep in mind that I've mainly learned things by randomly poking at things and looking at the source code and could very likely be misunderstanding things and providing you incorrect information. Additionally, I'm speaking in regards to the runtime as used by a compiled pony program. There are slight differences when the pony runtime is embedded as a C library (mainly around how the runtime gets started and not how it works internally) but I'm not versed in the nuances involved to speak to them.

To answer your question: Can I ignore GC when delving into these relationships?, yes, you can ignore GC when looking at scheduler threads and how they work. Additionally, the abstractions in the runtime are pretty well defined and that makes it possible to focus on individual bits in isolation. The abstractions are relatively easy to identify based on function naming and file grouping but I ended up jumping between abstractions just following code to trace how things are functioning initially to understand the overall system first (I don't necessarily recommend this, but it's what I did and it happened to work well enough for me). If I were starting again, I'd likely focus on individual modules (scheduler, actor, gc, etc) and treat calls to other modules as "black boxes" that I understood at a conceptual/high level and only look into the details when I was ready for it. Also, for me, a lot of my runtime spelunking was due to trying to solve specific bugs or understand specific behavior and that was both helpful and not helpful. The helpful bit was because it helpful focus my exploring. The unhelpful bit was that most bugs tended to not be limited to only one module so I was jumping all over and that tended to be overwhelming at times.

Ummm... so... I kind of went on a bit of a tangent, but hopefully that was at least somewhat coherent and maybe even useful. 8*)

@winksaville
Copy link
Contributor Author

First, txs so much for the explanation of the runtime, I'll have to study it more but it's a great start.

I'll re- run the test to properly use the "use statement".

I'll also make additional changes if the people would like me to, although I'd like to get openssl 1.1 support in ASAP, as I have to do alot of juggling to test drive n my Arch systems. But if will gladly do what's everyone desires.

@dipinhora
Copy link
Contributor

@winksaville Thank you again for being so very patient and helpful in regards to this while I've been unable to reproduce the issue on my end. I'm still hoping to try and reproduce on AWS at some point but I'm not sure when due to other more pressing concerns (work, etc).

In regards to the openssl 1.1 support, I don't mean to cause additional delay on it. I was thinking of it from the perspective of if I had to use it and I'm 100% sure it would annoy me to have to remember to add in the extra argument for ponyc on every invocation. The default_pic option came about because of a similar --pic scenario when I was trying to get ponyc working on alpine and I only had to deal with that for a few days before it drove me to add in the option.

@winksaville
Copy link
Contributor Author

Here's the latest run which failed after 10secs, so very quickly:

wink@wink-desktop:~/prgs/pony/ponyc (debug-2451):
$ make config=debug default_pic=true use=scheduler_scaling_pthreads OPENSSL=-Dopenssl_1.1.0 test-stdlib
$ make -f minimal-cases/issue-2451/ test-long-term
$ pid=$(ps ax | awk '{if ($5 == "./stdlib") {print $1}}'); sudo gdb ./stdlib $pid
[sudo] password for wink: 
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./stdlib...done.
Attaching to program: /home/wink/prgs/pony/ponyc/stdlib, process 24980
[New LWP 24981]
[New LWP 24982]
[New LWP 24983]
[New LWP 24984]
[New LWP 24985]
[New LWP 24986]
[New LWP 24987]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f59e58ec43d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f59e6a666c0 (LWP 24980) "stdlib" 0x00007f59e58ec43d in pthread_join () from /usr/lib/libpthread.so.0
  2    Thread 0x7f59dcbb9700 (LWP 24981) "stdlib" 0x00007f59e4ebb596 in epoll_pwait () from /usr/lib/libc.so.6
  3    Thread 0x7f59dc3b8700 (LWP 24982) "stdlib" 0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
  4    Thread 0x7f59d3bb8700 (LWP 24983) "stdlib" 0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
  5    Thread 0x7f59d33b8700 (LWP 24984) "stdlib" 0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
  6    Thread 0x7f59c2bb8700 (LWP 24985) "stdlib" 0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
  7    Thread 0x7f59ba3b8700 (LWP 24986) "stdlib" 0x00007f59e58f138d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /usr/lib/libpthread.so.0
  8    Thread 0x7f59b9bb8700 (LWP 24987) "stdlib" 0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt 1
#0  0x00007f59e58ec43d in pthread_join () from /usr/lib/libpthread.so.0
(More stack frames follow...)
(gdb) bt
#0  0x00007f59e58ec43d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x000055cb63b6774e in ponyint_thread_join (thread=140023923705600) at src/libponyrt/platform/threads.c:229
#2  0x000055cb63b5a549 in ponyint_sched_shutdown () at src/libponyrt/sched/scheduler.c:601
#3  0x000055cb63b5aaec in ponyint_sched_start (library=false) at src/libponyrt/sched/scheduler.c:732
#4  0x000055cb63b596be in pony_start (library=false, language_features=true, exit_code=0x0) at src/libponyrt/sched/start.c:168
#5  0x000055cb63b4b27a in main ()
(gdb) p *this_scheduler 
$1 = {tid = 140024098481856, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x0, last_victim = 0x0, ctx = {scheduler = 0x0, current = 0x7f59e4bb7c00, 
    trace_object = 0x55cb63b5e7cf <ponyint_gc_sendobject>, trace_actor = 0x55cb63b5eb3f <ponyint_gc_sendactor>, stack = 0x0, 
    acquire = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, 
    serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {
    contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x0, tail = {{object = 0x0, counter = 0}, 
      raw = 0x00000000000000000000000000000000}}, mq = {head = 0x0, tail = 0x0}}
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f59dcbb9700 (LWP 24981))]
#0  0x00007f59e4ebb596 in epoll_pwait () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007f59e4ebb596 in epoll_pwait () from /usr/lib/libc.so.6
#1  0x000055cb63b61c64 in ponyint_asio_backend_dispatch (arg=0x7f59e4bb8000) at src/libponyrt/asio/epoll.c:215
#2  0x00007f59e58eb08a in start_thread () from /usr/lib/libpthread.so.0
#3  0x00007f59e4ebb42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler
$2 = {tid = 140023932098304, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x0, last_victim = 0x0, ctx = {scheduler = 0x0, current = 0x0, trace_object = 0x0, trace_actor = 0x0, 
    stack = 0x0, acquire = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, 
    serialise_size = 0, serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, 
    serialise_alloc = 0x0, serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, 
  mute_mapping = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x0, tail = {{object = 0x0, 
        counter = 0}, raw = 0x00000000000000000000000000000000}}, mq = {head = 0x0, tail = 0x0}}
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f59dc3b8700 (LWP 24982))]
#0  0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055cb63b6dbdd in ponyint_cpu_core_pause (tsc=499586369066298, tsc2=499911221339868, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000055cb63b59fa5 in quiescent (sched=0x7f59e4bb8800, tsc=499586369066298, tsc2=499911221339868)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000055cb63b5a12a in steal (sched=0x7f59e4bb8800) at src/libponyrt/sched/scheduler.c:384
#4  0x000055cb63b5a3d4 in run (sched=0x7f59e4bb8800) at src/libponyrt/sched/scheduler.c:523
#5  0x000055cb63b5a4ec in run_thread (arg=0x7f59e4bb8800) at src/libponyrt/sched/scheduler.c:588
#6  0x00007f59e58eb08a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007f59e4ebb42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$3 = {tid = 140023923705600, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x55cb63cd8a00 <sleep_cond>, last_victim = 0x7f59e4bb8d40, ctx = {scheduler = 0x7f59e4bb8800, 
    current = 0x7f59e4bb9a00, trace_object = 0x55cb63b5e87a <ponyint_gc_recvobject>, 
    trace_actor = 0x55cb63b5ebba <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 5, ack_token = 9, ack_count = 4, mute_mapping = {contents = {count = 0, size = 8, 
      item_bitmap = 0x7f59d2b71000, buckets = 0x7f59d2b71008}}, q = {head = 0x7f59cab503c0, tail = {{object = 0x7f59cab503c0, 
        counter = 104}, raw = 0x000000000000006800007f59cab503c0}}, mq = {head = 0x7f598ced1f60, tail = 0x7f598ced1f60}}
(gdb) thread 4
[Switching to thread 4 (Thread 0x7f59d3bb8700 (LWP 24983))]
#0  0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055cb63b6dbdd in ponyint_cpu_core_pause (tsc=499586369115812, tsc2=499911219009422, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000055cb63b59fa5 in quiescent (sched=0x7f59e4bb89c0, tsc=499586369115812, tsc2=499911219009422)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000055cb63b5a12a in steal (sched=0x7f59e4bb89c0) at src/libponyrt/sched/scheduler.c:384
#4  0x000055cb63b5a3d4 in run (sched=0x7f59e4bb89c0) at src/libponyrt/sched/scheduler.c:523
#5  0x000055cb63b5a4ec in run_thread (arg=0x7f59e4bb89c0) at src/libponyrt/sched/scheduler.c:588
#6  0x00007f59e58eb08a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007f59e4ebb42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$4 = {tid = 140023781099264, index = 1, cpu = 1, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x55cb63cd8a00 <sleep_cond>, last_victim = 0x7f59e4bb8800, ctx = {scheduler = 0x7f59e4bb89c0, 
    current = 0x7f59e4b8f200, trace_object = 0x55cb63b5e87a <ponyint_gc_recvobject>, 
    trace_actor = 0x55cb63b5ebba <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 8, 
        item_bitmap = 0x7f59cab80900, buckets = 0x7f59cab80908}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {
      contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7f59cab138a0, tail = {{object = 0x7f59cab138a0, counter = 835}, 
      raw = 0x000000000000034300007f59cab138a0}}, mq = {head = 0x7f598d08b9e0, tail = 0x7f598d08b9e0}}
(gdb) thread 5
[Switching to thread 5 (Thread 0x7f59d33b8700 (LWP 24984))]
#0  0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055cb63b6dbdd in ponyint_cpu_core_pause (tsc=499586368700980, tsc2=499911175425200, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000055cb63b59fa5 in quiescent (sched=0x7f59e4bb8b80, tsc=499586368700980, tsc2=499911175425200)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000055cb63b5a12a in steal (sched=0x7f59e4bb8b80) at src/libponyrt/sched/scheduler.c:384
#4  0x000055cb63b5a3d4 in run (sched=0x7f59e4bb8b80) at src/libponyrt/sched/scheduler.c:523
#5  0x000055cb63b5a4ec in run_thread (arg=0x7f59e4bb8b80) at src/libponyrt/sched/scheduler.c:588
#6  0x00007f59e58eb08a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007f59e4ebb42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$5 = {tid = 140023772710656, index = 2, cpu = 2, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x55cb63cd8a00 <sleep_cond>, last_victim = 0x7f59e4bb89c0, ctx = {scheduler = 0x7f59e4bb8b80, 
    current = 0x7f59e4bb9a00, trace_object = 0x55cb63b5e87a <ponyint_gc_recvobject>, 
    trace_actor = 0x55cb63b5ebba <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7f59cab1fbc0, tail = {{object = 0x7f59cab1fbc0, counter = 58}, 
      raw = 0x000000000000003a00007f59cab1fbc0}}, mq = {head = 0x7f59a63a1ce0, tail = 0x7f59a63a1ce0}}
(gdb) thread 6
[Switching to thread 6 (Thread 0x7f59c2bb8700 (LWP 24985))]
#0  0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055cb63b6dbdd in ponyint_cpu_core_pause (tsc=499586369459286, tsc2=499911174290198, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000055cb63b59fa5 in quiescent (sched=0x7f59e4bb8d40, tsc=499586369459286, tsc2=499911174290198)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000055cb63b5a12a in steal (sched=0x7f59e4bb8d40) at src/libponyrt/sched/scheduler.c:384
#4  0x000055cb63b5a3d4 in run (sched=0x7f59e4bb8d40) at src/libponyrt/sched/scheduler.c:523
#5  0x000055cb63b5a4ec in run_thread (arg=0x7f59e4bb8d40) at src/libponyrt/sched/scheduler.c:588
#6  0x00007f59e58eb08a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007f59e4ebb42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$6 = {tid = 140023495886592, index = 3, cpu = 3, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x55cb63cd8a00 <sleep_cond>, last_victim = 0x7f59e4bb8f00, ctx = {scheduler = 0x7f59e4bb8d40, 
    current = 0x7f59e4bb9a00, trace_object = 0x55cb63b5e7cf <ponyint_gc_sendobject>, 
    trace_actor = 0x55cb63b5eb3f <ponyint_gc_sendactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7f59cab1a320, tail = {{object = 0x7f59cab1a320, counter = 77}, 
      raw = 0x000000000000004d00007f59cab1a320}}, mq = {head = 0x7f598d14bfe0, tail = 0x7f598d14bfe0}}
(gdb) thread 7
[Switching to thread 7 (Thread 0x7f59ba3b8700 (LWP 24986))]
#0  0x00007f59e58f138d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f59e58f138d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x000055cb63b677b4 in ponyint_thread_suspend (signal=0x55cb63cd8a00 <sleep_cond>) at src/libponyrt/platform/threads.c:262
#2  0x000055cb63b5a29f in steal (sched=0x7f59e4bb8f00) at src/libponyrt/sched/scheduler.c:459
#3  0x000055cb63b5a3d4 in run (sched=0x7f59e4bb8f00) at src/libponyrt/sched/scheduler.c:523
#4  0x000055cb63b5a4ec in run_thread (arg=0x7f59e4bb8f00) at src/libponyrt/sched/scheduler.c:588
#5  0x00007f59e58eb08a in start_thread () from /usr/lib/libpthread.so.0
#6  0x00007f59e4ebb42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$7 = {tid = 140023353280256, index = 4, cpu = 4, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x55cb63cd8a00 <sleep_cond>, last_victim = 0x7f59e4bb89c0, ctx = {scheduler = 0x7f59e4bb8f00, 
    current = 0x7f59e4b8f200, trace_object = 0x55cb63b5e92c <ponyint_gc_markobject>, 
    trace_actor = 0x55cb63b5ec35 <ponyint_gc_markactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7f59cab58860, tail = {{object = 0x7f59cab58860, counter = 62}, 
      raw = 0x000000000000003e00007f59cab58860}}, mq = {head = 0x7f598d10bfe0, tail = 0x7f59e4bb9d00}}
(gdb) thread 8
[Switching to thread 8 (Thread 0x7f59b9bb8700 (LWP 24987))]
#0  0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f59e58f5230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055cb63b6dbdd in ponyint_cpu_core_pause (tsc=499589768803224, tsc2=499911235364563, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000055cb63b59fa5 in quiescent (sched=0x7f59e4bb90c0, tsc=499589768803224, tsc2=499911235364563)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000055cb63b5a12a in steal (sched=0x7f59e4bb90c0) at src/libponyrt/sched/scheduler.c:384
#4  0x000055cb63b5a3d4 in run (sched=0x7f59e4bb90c0) at src/libponyrt/sched/scheduler.c:523
#5  0x000055cb63b5a4ec in run_thread (arg=0x7f59e4bb90c0) at src/libponyrt/sched/scheduler.c:588
#6  0x00007f59e58eb08a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007f59e4ebb42f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$8 = {tid = 140023344891648, index = 5, cpu = 5, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x55cb63cd8a00 <sleep_cond>, last_victim = 0x7f59e4bb8b80, ctx = {scheduler = 0x7f59e4bb90c0, 
    current = 0x7f59e4bb9a00, trace_object = 0x55cb63b5e87a <ponyint_gc_recvobject>, 
    trace_actor = 0x55cb63b5ebba <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 8, 
        item_bitmap = 0x7f59a9272300, buckets = 0x7f59a9272308}}, serialise_buffer = 0x7f59b132e4c0, serialise_size = 0, 
    serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, 
    serialise_alloc = 0x55cb63b4ee10 <serialise_$36$2_apply_oZo>, 
    serialise_alloc_final = 0x55cb63aabb30 <serialise_$36$3_apply_oZo>, 
    serialise_throw = 0x55cb63adb900 <serialise_$36$4_apply_o>}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {
    contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7f598ce60760, tail = {{
        object = 0x7f598ce60760, counter = 5535}, raw = 0x000000000000159f00007f598ce60760}}, mq = {head = 0x7f598d07d7e0, 
    tail = 0x7f598d07d7e0}}
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=1 active_scheduler_count=5 scheduler_count_changing=0 detect_quiescence=1 use_yield=1

@winksaville
Copy link
Contributor Author

@dipinhora, one thing we could do is you could log into my computer and debug it. We could use tmux and I could watch over your shoulder and I'm sure I'd learn alot.

@winksaville
Copy link
Contributor Author

winksaville commented Jan 6, 2018

So I've modified examples/ring and called it pony-ring. When it hangs the stack traces are similar to, but not exactly like, the one from stdlib above.

I built pony-ring with the same ponyc compiler as I used for stdlib, you can look at the Makefile.

It seems to fail maybe 50% of the time when running --ponynoblock with 1000 rings each with a 1000 Nodes and requesting 10 messages to be forwarded. When it works, as below, it takes about 1.5secs:

$ time ./ring --ponynoblock --size 1000 --count 1000 --pass 10
361842: setup_ring:+
899821667: setup_ring:-
1483219503: start:+
1483445153: start:-
1486456001: complete: DONE

real	0m1.568s
user	0m2.867s
sys	0m0.158s

When it hangs, none of the CPUs are doing much:

$ time ./ring --ponynoblock --size 1000 --count 1000 --pass 10
448403: setup_ring:+
788050521: setup_ring:-
2290864039: start:+
2291044078: start:-
2510876416: complete: DONE

And then in another terminal I run gdb:

$ pid=$(ps ax | awk '{if ($5 == "./ring") {print $1}}'); sudo gdb ./ring $pid
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./ring...done.
Attaching to program: /home/wink/prgs/pony/ring/ring, process 30526
[New LWP 30527]
[New LWP 30528]
[New LWP 30529]
[New LWP 30530]
[New LWP 30531]
[New LWP 30532]
[New LWP 30533]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007fb352ec843d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info thread
  Id   Target Id         Frame 
* 1    Thread 0x7fb3534cd980 (LWP 30526) "ring" 0x00007fb352ec843d in pthread_join () from /usr/lib/libpthread.so.0
  2    Thread 0x7fb34a195700 (LWP 30527) "ring" 0x00007fb352497596 in epoll_pwait () from /usr/lib/libc.so.6
  3    Thread 0x7fb341994700 (LWP 30528) "ring" 0x00007fb352ed1230 in nanosleep () from /usr/lib/libpthread.so.0
  4    Thread 0x7fb341194700 (LWP 30529) "ring" 0x00007fb352ecd38d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /usr/lib/libpthread.so.0
  5    Thread 0x7fb340994700 (LWP 30530) "ring" 0x00007fb352ecd38d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /usr/lib/libpthread.so.0
  6    Thread 0x7fb338194700 (LWP 30531) "ring" 0x00007fb352ecd38d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /usr/lib/libpthread.so.0
  7    Thread 0x7fb31f994700 (LWP 30532) "ring" 0x00007fb352ed1230 in nanosleep () from /usr/lib/libpthread.so.0
  8    Thread 0x7fb31f194700 (LWP 30533) "ring" 0x00007fb352ed1230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fb352ec843d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x000055fb4ab77e82 in ponyint_thread_join (thread=140407876437760) at src/libponyrt/platform/threads.c:229
#2  0x000055fb4ab6b439 in ponyint_sched_shutdown () at src/libponyrt/sched/scheduler.c:601
#3  0x000055fb4ab6b9dc in ponyint_sched_start (library=false) at src/libponyrt/sched/scheduler.c:732
#4  0x000055fb4ab6a5ae in pony_start (library=false, language_features=true, exit_code=0x0) at src/libponyrt/sched/start.c:168
#5  0x000055fb4ab69b09 in main ()
(gdb) p *this_scheduler 
$1 = {tid = 140408173418880, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x0, last_victim = 0x0, ctx = {scheduler = 0x0, current = 0x7fb352193c00, 
    trace_object = 0x55fb4ab6f6bf <ponyint_gc_sendobject>, trace_actor = 0x55fb4ab6fa2f <ponyint_gc_sendactor>, stack = 0x0, 
    acquire = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, 
    serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {
    contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x0, tail = {{object = 0x0, counter = 0}, 
      raw = 0x00000000000000000000000000000000}}, mq = {head = 0x0, tail = 0x0}}
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fb34a195700 (LWP 30527))]
#0  0x00007fb352497596 in epoll_pwait () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007fb352497596 in epoll_pwait () from /usr/lib/libc.so.6
#1  0x000055fb4ab72b54 in ponyint_asio_backend_dispatch (arg=0x7fb352194000) at src/libponyrt/asio/epoll.c:215
#2  0x00007fb352ec708a in start_thread () from /usr/lib/libpthread.so.0
#3  0x00007fb35249742f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$2 = {tid = 140408019048192, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x0, last_victim = 0x0, ctx = {scheduler = 0x0, current = 0x0, trace_object = 0x0, trace_actor = 0x0, 
    stack = 0x0, acquire = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, 
    serialise_size = 0, serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, 
    serialise_alloc = 0x0, serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, 
  mute_mapping = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x0, tail = {{object = 0x0, 
        counter = 0}, raw = 0x00000000000000000000000000000000}}, mq = {head = 0x0, tail = 0x0}}
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fb341994700 (LWP 30528))]
#0  0x00007fb352ed1230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fb352ed1230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055fb4ab7e311 in ponyint_cpu_core_pause (tsc=524066533308789, tsc2=525037243355047, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000055fb4ab6ae95 in quiescent (sched=0x7fb352194800, tsc=524066533308789, tsc2=525037243355047)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000055fb4ab6b01a in steal (sched=0x7fb352194800) at src/libponyrt/sched/scheduler.c:384
#4  0x000055fb4ab6b2c4 in run (sched=0x7fb352194800) at src/libponyrt/sched/scheduler.c:523
#5  0x000055fb4ab6b3dc in run_thread (arg=0x7fb352194800) at src/libponyrt/sched/scheduler.c:588
#6  0x00007fb352ec708a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fb35249742f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$3 = {tid = 140407876437760, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x55fb4ab8a740 <sleep_cond>, last_victim = 0x7fb352194b80, ctx = {scheduler = 0x7fb352194800, 
    current = 0x7fb33f6d7200, trace_object = 0x55fb4ab6f76a <ponyint_gc_recvobject>, 
    trace_actor = 0x55fb4ab6faaa <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 3, ack_token = 5, ack_count = 2, mute_mapping = {contents = {count = 0, size = 8, 
      item_bitmap = 0x7fb2b1f4c700, buckets = 0x7fb2b1f4c708}}, q = {head = 0x7fb2e6125c20, tail = {{object = 0x7fb2e6125c20, 
        counter = 2018039}, raw = 0x00000000001ecaf700007fb2e6125c20}}, mq = {head = 0x7fb2e55616a0, tail = 0x7fb2e55616a0}}
(gdb) thread 4
[Switching to thread 4 (Thread 0x7fb341194700 (LWP 30529))]
#0  0x00007fb352ecd38d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fb352ecd38d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x000055fb4ab77ee8 in ponyint_thread_suspend (signal=0x55fb4ab8a740 <sleep_cond>) at src/libponyrt/platform/threads.c:262
#2  0x000055fb4ab6b18f in steal (sched=0x7fb3521949c0) at src/libponyrt/sched/scheduler.c:459
#3  0x000055fb4ab6b2c4 in run (sched=0x7fb3521949c0) at src/libponyrt/sched/scheduler.c:523
#4  0x000055fb4ab6b3dc in run_thread (arg=0x7fb3521949c0) at src/libponyrt/sched/scheduler.c:588
#5  0x00007fb352ec708a in start_thread () from /usr/lib/libpthread.so.0
#6  0x00007fb35249742f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$4 = {tid = 140407868049152, index = 1, cpu = 1, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x55fb4ab8a740 <sleep_cond>, last_victim = 0x7fb352194800, ctx = {scheduler = 0x7fb3521949c0, 
    current = 0x7fb33f3c2000, trace_object = 0x55fb4ab6f76a <ponyint_gc_recvobject>, 
    trace_actor = 0x55fb4ab6faaa <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fb352195c60, tail = {{object = 0x7fb352195c60, counter = 0}, 
      raw = 0x000000000000000000007fb352195c60}}, mq = {head = 0x7fb2e62c92e0, tail = 0x7fb352195c40}}
(gdb) thread 5
[Switching to thread 5 (Thread 0x7fb340994700 (LWP 30530))]
#0  0x00007fb352ecd38d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fb352ecd38d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x000055fb4ab77ee8 in ponyint_thread_suspend (signal=0x55fb4ab8a740 <sleep_cond>) at src/libponyrt/platform/threads.c:262
#2  0x000055fb4ab6b18f in steal (sched=0x7fb352194b80) at src/libponyrt/sched/scheduler.c:459
#3  0x000055fb4ab6b2c4 in run (sched=0x7fb352194b80) at src/libponyrt/sched/scheduler.c:523
#4  0x000055fb4ab6b3dc in run_thread (arg=0x7fb352194b80) at src/libponyrt/sched/scheduler.c:588
#5  0x00007fb352ec708a in start_thread () from /usr/lib/libpthread.so.0
#6  0x00007fb35249742f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$5 = {tid = 140407859660544, index = 2, cpu = 2, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x55fb4ab8a740 <sleep_cond>, last_victim = 0x7fb352194800, ctx = {scheduler = 0x7fb352194b80, 
    current = 0x7fb33f800200, trace_object = 0x55fb4ab6f76a <ponyint_gc_recvobject>, 
    trace_actor = 0x55fb4ab6faaa <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fb352195ca0, tail = {{object = 0x7fb352195ca0, counter = 0}, 
      raw = 0x000000000000000000007fb352195ca0}}, mq = {head = 0x7fb2e4a62d00, tail = 0x7fb352195c80}}
(gdb) thread 6
[Switching to thread 6 (Thread 0x7fb338194700 (LWP 30531))]
#0  0x00007fb352ecd38d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fb352ecd38d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x000055fb4ab77ee8 in ponyint_thread_suspend (signal=0x55fb4ab8a740 <sleep_cond>) at src/libponyrt/platform/threads.c:262
#2  0x000055fb4ab6b18f in steal (sched=0x7fb352194d40) at src/libponyrt/sched/scheduler.c:459
#3  0x000055fb4ab6b2c4 in run (sched=0x7fb352194d40) at src/libponyrt/sched/scheduler.c:523
#4  0x000055fb4ab6b3dc in run_thread (arg=0x7fb352194d40) at src/libponyrt/sched/scheduler.c:588
#5  0x00007fb352ec708a in start_thread () from /usr/lib/libpthread.so.0
#6  0x00007fb35249742f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$6 = {tid = 140407717054208, index = 3, cpu = 3, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x55fb4ab8a740 <sleep_cond>, last_victim = 0x7fb352194800, ctx = {scheduler = 0x7fb352194d40, 
    current = 0x7fb33ffd2a00, trace_object = 0x55fb4ab6f76a <ponyint_gc_recvobject>, 
    trace_actor = 0x55fb4ab6faaa <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fb352195ce0, tail = {{object = 0x7fb352195ce0, counter = 0}, 
      raw = 0x000000000000000000007fb352195ce0}}, mq = {head = 0x7fb352195cc1, tail = 0x7fb352195cc0}}
(gdb) thread 7
[Switching to thread 7 (Thread 0x7fb31f994700 (LWP 30532))]
#0  0x00007fb352ed1230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fb352ed1230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055fb4ab7e311 in ponyint_cpu_core_pause (tsc=524058342006298, tsc2=525037274072892, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000055fb4ab6ae95 in quiescent (sched=0x7fb352194f00, tsc=524058342006298, tsc2=525037274072892)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000055fb4ab6b01a in steal (sched=0x7fb352194f00) at src/libponyrt/sched/scheduler.c:384
#4  0x000055fb4ab6b2c4 in run (sched=0x7fb352194f00) at src/libponyrt/sched/scheduler.c:523
#5  0x000055fb4ab6b3dc in run_thread (arg=0x7fb352194f00) at src/libponyrt/sched/scheduler.c:588
#6  0x00007fb352ec708a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fb35249742f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$7 = {tid = 140407306012416, index = 4, cpu = 4, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x55fb4ab8a740 <sleep_cond>, last_victim = 0x7fb352194800, ctx = {scheduler = 0x7fb352194f00, 
    current = 0x7fb340000000, trace_object = 0x55fb4ab6f76a <ponyint_gc_recvobject>, 
    trace_actor = 0x55fb4ab6faaa <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fb352195d20, tail = {{object = 0x7fb352195d20, counter = 0}, 
      raw = 0x000000000000000000007fb352195d20}}, mq = {head = 0x7fb352195d01, tail = 0x7fb352195d00}}
(gdb) thread 9
Unknown thread 9.
(gdb) thread 8
[Switching to thread 8 (Thread 0x7fb31f194700 (LWP 30533))]
#0  0x00007fb352ed1230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fb352ed1230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000055fb4ab7e311 in ponyint_cpu_core_pause (tsc=524058341899037, tsc2=525037304521937, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000055fb4ab6ae95 in quiescent (sched=0x7fb3521950c0, tsc=524058341899037, tsc2=525037304521937)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000055fb4ab6b01a in steal (sched=0x7fb3521950c0) at src/libponyrt/sched/scheduler.c:384
#4  0x000055fb4ab6b2c4 in run (sched=0x7fb3521950c0) at src/libponyrt/sched/scheduler.c:523
#5  0x000055fb4ab6b3dc in run_thread (arg=0x7fb3521950c0) at src/libponyrt/sched/scheduler.c:588
#6  0x00007fb352ec708a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fb35249742f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$8 = {tid = 140407297623808, index = 5, cpu = 5, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x55fb4ab8a740 <sleep_cond>, last_victim = 0x7fb352194800, ctx = {scheduler = 0x7fb3521950c0, 
    current = 0x7fb340002000, trace_object = 0x55fb4ab6f76a <ponyint_gc_recvobject>, 
    trace_actor = 0x55fb4ab6faaa <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fb352195d60, tail = {{object = 0x7fb352195d60, counter = 0}, 
      raw = 0x000000000000000000007fb352195d60}}, mq = {head = 0x7fb352195d41, tail = 0x7fb352195d40}}
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=1 active_scheduler_count=3 scheduler_count_changing=0 detect_quiescence=1 use_yield=1

If I don't use --ponynoblock I've only seen it hang once but it takes variable length of time to finish and the long time between the "DONE" message and "Main._final" there is one CPU is running at 100%, the others are fairly idle. Below are a series of runs that completed successfully, notice how variable the time is:

$ time ./ring --size 1000 --count 1000 --pass 10
417475: setup_ring:+
845354503: setup_ring:-
1810889495: start:+
1811099975: start:-
1919948667: complete: DONE
7195668261: Main._final  #!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#

real	0m8.034s
user	0m9.783s
sys	0m0.194s
wink@wink-desktop:~/prgs/pony/ring (master)
$ time ./ring --size 1000 --count 1000 --pass 10
843540: setup_ring:+
896792331: setup_ring:-
1611610445: start:+
1611838134: start:-
1626888716: complete: DONE
44201697508: Main._final  #!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#

real	0m45.029s
user	0m48.017s
sys	0m0.207s
wink@wink-desktop:~/prgs/pony/ring (master)
$ time ./ring --size 1000 --count 1000 --pass 10
835820: setup_ring:+
1092404434: setup_ring:-
1508799324: start:+
1509029059: start:-
1510842954: complete: DONE
9524480269: Main._final  #!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#

real	0m10.521s
user	0m17.781s
sys	0m0.221s
wink@wink-desktop:~/prgs/pony/ring (master)
$ time ./ring --size 1000 --count 1000 --pass 10
121768: setup_ring:+
893868815: setup_ring:-
1618022886: start:+
1618248422: start:-
1636603891: complete: DONE
72080148168: Main._final  #!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#

real	1m13.095s
user	1m16.154s
sys	0m0.178s

And here is the the one hang without --ponynoblock:

$ time ./ring --size 1000 --count 1000 --pass 10
1234088: setup_ring:+
843499706: setup_ring:-
1788991456: start:+
1789207916: start:-
1889877268: complete: DONE

Of course it never gets to _Main._final because it was hung and eventually the all the CPU's become idle, although for several minutes one CPU was at 100%. Here is the stack traces from gdb:

$ pid=$(ps ax | awk '{if ($5 == "./ring") {print $1}}'); sudo gdb ./ring $pid
[sudo] password for wink: 
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./ring...done.
Attaching to program: /home/wink/prgs/pony/ring/ring, process 5633
[New LWP 5634]
[New LWP 5635]
[New LWP 5636]
[New LWP 5637]
[New LWP 5638]
[New LWP 5639]
[New LWP 5640]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007fe2c30c643d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7fe2c36cb980 (LWP 5633) "ring" 0x00007fe2c30c643d in pthread_join () from /usr/lib/libpthread.so.0
  2    Thread 0x7fe2ba393700 (LWP 5634) "ring" 0x00007fe2c2695596 in epoll_pwait () from /usr/lib/libc.so.6
  3    Thread 0x7fe2b9b92700 (LWP 5635) "ring" 0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
  4    Thread 0x7fe2b1392700 (LWP 5636) "ring" 0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
  5    Thread 0x7fe2b0b92700 (LWP 5637) "ring" 0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
  6    Thread 0x7fe2b0392700 (LWP 5638) "ring" 0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
  7    Thread 0x7fe29fb92700 (LWP 5639) "ring" 0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
  8    Thread 0x7fe297392700 (LWP 5640) "ring" 0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fe2c36cb980 (LWP 5633))]
#0  0x00007fe2c30c643d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fe2c30c643d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x000056281d93de82 in ponyint_thread_join (thread=140611755255552) at src/libponyrt/platform/threads.c:229
#2  0x000056281d931439 in ponyint_sched_shutdown () at src/libponyrt/sched/scheduler.c:601
#3  0x000056281d9319dc in ponyint_sched_start (library=false) at src/libponyrt/sched/scheduler.c:732
#4  0x000056281d9305ae in pony_start (library=false, language_features=true, exit_code=0x0) at src/libponyrt/sched/start.c:168
#5  0x000056281d92fb09 in main ()
(gdb) p *this_scheduler 
$1 = {tid = 140611918018944, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x0, last_victim = 0x0, ctx = {scheduler = 0x0, current = 0x7fe2c2391c00, 
    trace_object = 0x56281d9356bf <ponyint_gc_sendobject>, trace_actor = 0x56281d935a2f <ponyint_gc_sendactor>, stack = 0x0, 
    acquire = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, 
    serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {
    contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x0, tail = {{object = 0x0, counter = 0}, 
      raw = 0x00000000000000000000000000000000}}, mq = {head = 0x0, tail = 0x0}}
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fe2ba393700 (LWP 5634))]
#0  0x00007fe2c2695596 in epoll_pwait () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007fe2c2695596 in epoll_pwait () from /usr/lib/libc.so.6
#1  0x000056281d938b54 in ponyint_asio_backend_dispatch (arg=0x7fe2c2392000) at src/libponyrt/asio/epoll.c:215
#2  0x00007fe2c30c508a in start_thread () from /usr/lib/libpthread.so.0
#3  0x00007fe2c269542f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$2 = {tid = 140611763648256, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x0, last_victim = 0x0, ctx = {scheduler = 0x0, current = 0x0, trace_object = 0x0, trace_actor = 0x0, 
    stack = 0x0, acquire = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, 
    serialise_size = 0, serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, 
    serialise_alloc = 0x0, serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, 
  mute_mapping = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x0, tail = {{object = 0x0, 
        counter = 0}, raw = 0x00000000000000000000000000000000}}, mq = {head = 0x0, tail = 0x0}}
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fe2b9b92700 (LWP 5635))]
#0  0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000056281d944311 in ponyint_cpu_core_pause (tsc=596188669040802, tsc2=597734127685315, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000056281d930e95 in quiescent (sched=0x7fe2c2392800, tsc=596188669040802, tsc2=597734127685315)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000056281d93101a in steal (sched=0x7fe2c2392800) at src/libponyrt/sched/scheduler.c:384
#4  0x000056281d9312c4 in run (sched=0x7fe2c2392800) at src/libponyrt/sched/scheduler.c:523
#5  0x000056281d9313dc in run_thread (arg=0x7fe2c2392800) at src/libponyrt/sched/scheduler.c:588
#6  0x00007fe2c30c508a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fe2c269542f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$3 = {tid = 140611755255552, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x56281d950740 <sleep_cond>, last_victim = 0x7fe2c2392d40, ctx = {scheduler = 0x7fe2c2392800, 
    current = 0x7fe2624c2600, trace_object = 0x56281d93576a <ponyint_gc_recvobject>, 
    trace_actor = 0x56281d935aaa <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 5, ack_token = 2, ack_count = 0, mute_mapping = {contents = {count = 0, size = 8, 
      item_bitmap = 0x7fe222160100, buckets = 0x7fe222160108}}, q = {head = 0x7fe246efe1a0, tail = {{object = 0x7fe246efe1a0, 
        counter = 1729764}, raw = 0x00000000001a64e400007fe246efe1a0}}, mq = {head = 0x7fe2448c2840, tail = 0x7fe2448c2840}}
(gdb) thread 4
[Switching to thread 4 (Thread 0x7fe2b1392700 (LWP 5636))]
#0  0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000056281d944311 in ponyint_cpu_core_pause (tsc=597281550997329, tsc2=597734139684451, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000056281d930e95 in quiescent (sched=0x7fe2c23929c0, tsc=597281550997329, tsc2=597734139684451)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000056281d93101a in steal (sched=0x7fe2c23929c0) at src/libponyrt/sched/scheduler.c:384
#4  0x000056281d9312c4 in run (sched=0x7fe2c23929c0) at src/libponyrt/sched/scheduler.c:523
#5  0x000056281d9313dc in run_thread (arg=0x7fe2c23929c0) at src/libponyrt/sched/scheduler.c:588
#6  0x00007fe2c30c508a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fe2c269542f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$4 = {tid = 140611612649216, index = 1, cpu = 1, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x56281d950740 <sleep_cond>, last_victim = 0x7fe2c2392800, ctx = {scheduler = 0x7fe2c23929c0, 
    current = 0x7fe2c2393a00, trace_object = 0x56281d93576a <ponyint_gc_recvobject>, 
    trace_actor = 0x56281d935aaa <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fe246c6ae40, tail = {{object = 0x7fe246c6ae40, counter = 647149}, 
      raw = 0x000000000009dfed00007fe246c6ae40}}, mq = {head = 0x7fe246efbe80, tail = 0x7fe246efbe80}}
(gdb) thread 5
[Switching to thread 5 (Thread 0x7fe2b0b92700 (LWP 5637))]
#0  0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000056281d944311 in ponyint_cpu_core_pause (tsc=596182514718585, tsc2=597734103009175, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000056281d930e95 in quiescent (sched=0x7fe2c2392b80, tsc=596182514718585, tsc2=597734103009175)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000056281d93101a in steal (sched=0x7fe2c2392b80) at src/libponyrt/sched/scheduler.c:384
#4  0x000056281d9312c4 in run (sched=0x7fe2c2392b80) at src/libponyrt/sched/scheduler.c:523
#5  0x000056281d9313dc in run_thread (arg=0x7fe2c2392b80) at src/libponyrt/sched/scheduler.c:588
#6  0x00007fe2c30c508a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fe2c269542f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$5 = {tid = 140611604260608, index = 2, cpu = 2, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x56281d950740 <sleep_cond>, last_victim = 0x7fe2c2392800, ctx = {scheduler = 0x7fe2c2392b80, 
    current = 0x7fe2aee00000, trace_object = 0x56281d93576a <ponyint_gc_recvobject>, 
    trace_actor = 0x56281d935aaa <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fe29f35fee0, tail = {{object = 0x7fe29f35fee0, counter = 67}, 
      raw = 0x000000000000004300007fe29f35fee0}}, mq = {head = 0x7fe246efa360, tail = 0x7fe246efa360}}
(gdb) thread 6
[Switching to thread 6 (Thread 0x7fe2b0392700 (LWP 5638))]
#0  0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000056281d944311 in ponyint_cpu_core_pause (tsc=596182514716209, tsc2=597734113374607, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000056281d930e95 in quiescent (sched=0x7fe2c2392d40, tsc=596182514716209, tsc2=597734113374607)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000056281d93101a in steal (sched=0x7fe2c2392d40) at src/libponyrt/sched/scheduler.c:384
#4  0x000056281d9312c4 in run (sched=0x7fe2c2392d40) at src/libponyrt/sched/scheduler.c:523
#5  0x000056281d9313dc in run_thread (arg=0x7fe2c2392d40) at src/libponyrt/sched/scheduler.c:588
#6  0x00007fe2c30c508a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fe2c269542f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$6 = {tid = 140611595872000, index = 3, cpu = 3, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x56281d950740 <sleep_cond>, last_victim = 0x7fe2c2392f00, ctx = {scheduler = 0x7fe2c2392d40, 
    current = 0x7fe2aee00600, trace_object = 0x56281d93576a <ponyint_gc_recvobject>, 
    trace_actor = 0x56281d935aaa <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fe2c2393ce0, tail = {{object = 0x7fe2c2393ce0, counter = 0}, 
      raw = 0x000000000000000000007fe2c2393ce0}}, mq = {head = 0x7fe246ef8720, tail = 0x7fe246ef8720}}
(gdb) thread 7
[Switching to thread 7 (Thread 0x7fe29fb92700 (LWP 5639))]
#0  0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000056281d944311 in ponyint_cpu_core_pause (tsc=596182459076237, tsc2=597734118973388, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000056281d930e95 in quiescent (sched=0x7fe2c2392f00, tsc=596182459076237, tsc2=597734118973388)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000056281d93101a in steal (sched=0x7fe2c2392f00) at src/libponyrt/sched/scheduler.c:384
#4  0x000056281d9312c4 in run (sched=0x7fe2c2392f00) at src/libponyrt/sched/scheduler.c:523
#5  0x000056281d9313dc in run_thread (arg=0x7fe2c2392f00) at src/libponyrt/sched/scheduler.c:588
#6  0x00007fe2c30c508a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fe2c269542f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$7 = {tid = 140611319047936, index = 4, cpu = 4, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x56281d950740 <sleep_cond>, last_victim = 0x7fe2c23930c0, ctx = {scheduler = 0x7fe2c2392f00, 
    current = 0x7fe2af9d4200, trace_object = 0x56281d93576a <ponyint_gc_recvobject>, 
    trace_actor = 0x56281d935aaa <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x7fe296b62920, tail = {{object = 0x7fe296b62920, counter = 64}, 
      raw = 0x000000000000004000007fe296b62920}}, mq = {head = 0x7fe246ef6220, tail = 0x7fe246ef6220}}
(gdb) thread 8
[Switching to thread 8 (Thread 0x7fe297392700 (LWP 5640))]
#0  0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007fe2c30cf230 in nanosleep () from /usr/lib/libpthread.so.0
#1  0x000056281d944311 in ponyint_cpu_core_pause (tsc=596188669040481, tsc2=597734067316562, yield=true)
    at src/libponyrt/sched/cpu.c:348
#2  0x000056281d930e95 in quiescent (sched=0x7fe2c23930c0, tsc=596188669040481, tsc2=597734067316562)
    at src/libponyrt/sched/scheduler.c:299
#3  0x000056281d93101a in steal (sched=0x7fe2c23930c0) at src/libponyrt/sched/scheduler.c:384
#4  0x000056281d9312c4 in run (sched=0x7fe2c23930c0) at src/libponyrt/sched/scheduler.c:523
#5  0x000056281d9313dc in run_thread (arg=0x7fe2c23930c0) at src/libponyrt/sched/scheduler.c:588
#6  0x00007fe2c30c508a in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007fe2c269542f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$8 = {tid = 140611176441600, index = 5, cpu = 5, node = 0, terminate = false, asio_stopped = false, asio_noisy = false, 
  sleep_object = 0x56281d950740 <sleep_cond>, last_victim = 0x7fe2c2392800, ctx = {scheduler = 0x7fe2c23930c0, 
    current = 0x7fe24989ea00, trace_object = 0x56281d93576a <ponyint_gc_recvobject>, 
    trace_actor = 0x56281d935aaa <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, 
        size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, 
    serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 1, size = 8, 
      item_bitmap = 0x7fe28eabc100, buckets = 0x7fe28eabc108}}, q = {head = 0x7fe28eb25880, tail = {{object = 0x7fe28eb25880, 
        counter = 67}, raw = 0x000000000000004300007fe28eb25880}}, mq = {head = 0x7fe246ef44c0, tail = 0x7fe246ef44c0}}
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=1 active_scheduler_count=6 scheduler_count_changing=0 detect_quiescence=1 use_yield=1
(gdb) 

Anyway, I hope this is the same problem as stdlib, as its obviously much simpler.

@winksaville
Copy link
Contributor Author

Oh, here is the output from the hang without --ponynoblock after quiting gdb and then hitting Ctrl-C on the ./ring run. I think that "user 5m35.268s" represents the approx 5m33s of time it must have spent with the one CPU at 100%:

$ time ./ring --size 1000 --count 1000 --pass 10
1234088: setup_ring:+
843499706: setup_ring:-
1788991456: start:+
1789207916: start:-
1889877268: complete: DONE


real	24m33.797s
user	5m35.268s
sys	0m0.843s

@dipinhora
Copy link
Contributor

@winksaville Thanks for your continued work on this and for your offer regarding the tmux session. The bottleneck for both your suggestion and my testing on AWS is time on my side. I am unable to dedicate enough time to properly debug this at the moment (which my guess is at least a good few hours if not a day or two worth of banging my head against it). Also, I'm not sure how much you'd learn from watching me work. It's mostly a lot of "googling to figure out how to use lldb properly, looking at source code, scratching my head as to what could be happening, taking a semi-random guess, making a change, compiling and rerunning to see if it worked or not" repeated over and over and over again until I somehow get lucky. As @SeanTAllen will confirm, I'm a database guy, not a programmer. 8*P (sorry for the inside joke, couldn't help myself).

From what I can tell, there are at least two or three different issues based on the backtraces you've provided so far.

  • The first is where there's an out of order suspended thread (i.e. index = 4 is suspended but index = 5 is not). This should not be possible given how the scheduler_count_changing variable is supposed to gate things.
  • The second is where during shutdown, every thread but one is successfully woken up and shuts down and index = 0 is stuck in wake_suspended_threads because that last thread is still sleeping even though it was sent a wake signal. I'm not sure why/how this is happening but my suspicion is that this is the same root cause as the previous one.
  • The third is where the no threads are suspended but there is still a hang (your last backtrace). My guess is this happens because there's an edge case around quiescence and thread waking/sleeping that I missed.

I've managed to reproduce the issue with the pthreads version of things in my linux vm. I'll try and bang my head against it in order to figure out where I went wrong.

In the meantime, if you have time, it would be great if you could run your tests with the following two scenarios to help rule out some things (ideally both with pthreads and signals):

  • Run the programs with --ponyminthreads 9999 (this effectively disabled dynamic scheduler scaling) to confirm that there isn't a larger issue introduced accidentally (unless I screwed something up, this should behave exactly the same as before the dynamic scheduler scaling logic was merged)
  • Run the programs with the following diff applied to rule out atomics/memory order related issues:
diff --git a/src/common/threads.h b/src/common/threads.h
index 6ed464f..06cb87c 100644
--- a/src/common/threads.h
+++ b/src/common/threads.h
@@ -59,6 +59,6 @@ pony_thread_id_t ponyint_thread_self();
 
 void ponyint_thread_suspend(pony_signal_event_t signal);
 
-void ponyint_thread_wake(pony_thread_id_t thread, pony_signal_event_t signal);
+int ponyint_thread_wake(pony_thread_id_t thread, pony_signal_event_t signal);
 
 #endif
diff --git a/src/libponyrt/platform/threads.c b/src/libponyrt/platform/threads.c
index 52e8a23..a9660a5 100644
--- a/src/libponyrt/platform/threads.c
+++ b/src/libponyrt/platform/threads.c
@@ -276,19 +276,18 @@ void ponyint_thread_suspend(pony_signal_event_t signal)
 #endif
 }
 
-void ponyint_thread_wake(pony_thread_id_t thread, pony_signal_event_t signal)
+int ponyint_thread_wake(pony_thread_id_t thread, pony_signal_event_t signal)
 {
+  int ret;
 #if defined(PLATFORM_IS_WINDOWS)
   (void) thread;
-  SetEvent(signal);
+  ret = SetEvent(signal);
 #elif defined(USE_SCHEDULER_SCALING_PTHREADS)
   (void) thread;
-  int ret;
   // signal condition variable
   ret = pthread_cond_signal(signal);
-  // TODO: What to do if `ret` is an unrecoverable error?
-  (void) ret;
 #else
-  pthread_kill(thread, signal);
+  ret = pthread_kill(thread, signal);
 #endif
+  return ret;
 }
diff --git a/src/libponyrt/sched/scheduler.c b/src/libponyrt/sched/scheduler.c
index ab23773..fb366c2 100644
--- a/src/libponyrt/sched/scheduler.c
+++ b/src/libponyrt/sched/scheduler.c
@@ -122,19 +122,26 @@ static void wake_suspended_threads()
   while (get_active_scheduler_count() < scheduler_count)
   {
     if(!atomic_exchange_explicit(&scheduler_count_changing, true,
-      memory_order_acquire))
+      memory_order_seq_cst))
     {
       // in case the count changed between the while check and now
       if(get_active_scheduler_count() < scheduler_count)
+      {
         // send signal to wake up next scheduler thread available
-        ponyint_thread_wake(scheduler[get_active_scheduler_count()].tid,
-          scheduler[get_active_scheduler_count()].sleep_object);
+        if(ponyint_thread_wake(scheduler[get_active_scheduler_count()].tid,
+          scheduler[get_active_scheduler_count()].sleep_object))
+          // if there was an error waking the thread
+          // unlock the bool that controls modifying the active scheduler count
+          // variable.
+          atomic_store_explicit(&scheduler_count_changing, false,
+            memory_order_seq_cst);
+      }
       else
         // if there are no scheduler threads left to unlock
         // unlock the bool that controls modifying the active scheduler count
         // variable.
         atomic_store_explicit(&scheduler_count_changing, false,
-          memory_order_release);
+          memory_order_seq_cst);
     }
   }
 }
@@ -407,6 +414,7 @@ static pony_actor_t* steal(scheduler_t* sched)
     if (!block_sent)
     {
       uint32_t current_active_scheduler_count = get_active_scheduler_count();
+
       if (steal_attempts < current_active_scheduler_count)
       {
         steal_attempts++;
@@ -421,7 +429,7 @@ static pony_actor_t* steal(scheduler_t* sched)
           && (current_active_scheduler_count > min_scheduler_count)
           && (!sched->terminate)
           && !atomic_exchange_explicit(&scheduler_count_changing, true,
-            memory_order_acquire))
+            memory_order_seq_cst))
         {
           // let sched 0 know we're suspending
           send_msg(sched->index, 0, SCHED_SUSPEND, 0);
@@ -439,7 +447,7 @@ static pony_actor_t* steal(scheduler_t* sched)
           // unlock the bool that controls modifying the active scheduler count
           // variable
           atomic_store_explicit(&scheduler_count_changing, false,
-            memory_order_release);
+            memory_order_seq_cst);
 
           // sleep waiting for signal to wake up again
           ponyint_thread_suspend(sched->sleep_object);
@@ -455,7 +463,7 @@ static pony_actor_t* steal(scheduler_t* sched)
           // variable. this is because the signalling thread locks the control
           // variable before signalling
           atomic_store_explicit(&scheduler_count_changing, false,
-            memory_order_release);
+            memory_order_seq_cst);
 
           // dtrace resume notification
           DTRACE1(THREAD_RESUME, (uintptr_t)sched);
@@ -797,11 +805,26 @@ void ponyint_sched_maybe_wakeup()
   // if we have some schedulers that are sleeping, wake one up
   if((current_active_scheduler_count < scheduler_count) &&
     !atomic_exchange_explicit(&scheduler_count_changing, true,
-    memory_order_acquire))
+    memory_order_seq_cst))
   {
-    // send signal to wake up next scheduler thread available
-    ponyint_thread_wake(scheduler[current_active_scheduler_count].tid,
-      scheduler[current_active_scheduler_count].sleep_object);
+    // in case the count changed between the while check and now
+    if(get_active_scheduler_count() < scheduler_count)
+    {
+      // send signal to wake up next scheduler thread available
+      if(ponyint_thread_wake(scheduler[get_active_scheduler_count()].tid,
+        scheduler[get_active_scheduler_count()].sleep_object))
+        // if there was an error waking the thread
+        // unlock the bool that controls modifying the active scheduler count
+        // variable.
+        atomic_store_explicit(&scheduler_count_changing, false,
+          memory_order_seq_cst);
+    }
+    else
+      // if there are no scheduler threads left to unlock
+      // unlock the bool that controls modifying the active scheduler count
+      // variable.
+      atomic_store_explicit(&scheduler_count_changing, false,
+        memory_order_seq_cst);
   }
 }
 

@winksaville
Copy link
Contributor Author

winksaville commented Jan 6, 2018 via email

@dipinhora
Copy link
Contributor

I reproduced using stdlib (mainly because I hadn't seen your pony-ring update at that time) with the following variant of your looping test (when compiling with pthreads; I have not been able to reproduce when compiling with signals so far but I'm assuming that's due to the fact I'm running in a VM with only 2 cores):

while true; do ./stdlib --sequential --only=builtin --ponythreads 6 > /dev/null; done

No clue how long it took as I wasn't timing it and reproducing was more important but I don't think it took more than 5 - 10 mins most of the time. I ignored all standard output because my goal was to inspect the program state via lldb. Also, I made sure to add debug statements via standard error so they would show if they occurred (mainly for identifying edge cases/unexpected behavior that will become asserts). I checked to see if it was hung or not by polling ps (ps aux|grep stdlib) to see if the pid for stdlib was still changing or not.

Also, good news... I apparently don't know how to use pthreads condition variables correctly and that is why the pthreads variant was hanging so much quicker/easier than the signals variant. I have a fix that uses pthread condition variables correctly, with a bunch of asserts added for which I will open a PR soon.

dipinhora added a commit to dipinhora/ponyc that referenced this issue Jan 6, 2018
Prior to this commit, the pthreads variant of the dynamic scheduler
scaling functionality didn't use pthread condition variables
correctly for the logic required. It relied on a single pthread
condition variable for all threads to sleep against. Unfortunately,
there's no way to wake up a specific thread when relying on pthread
condition variables and the operating system is free to wake up any
thread that is sleeping against a pthread condition variable. This
resulted in out of order waking of sleeping threads which broke one
of the invariants for how dynamic scheduler scaling is supposed to
work.

This commit fixes the code to create a separate pthread condition
variable for each scheduler thread ensure that we can wake up a
specific single thread at a time correctly.

This commit also adds some error handling and asssertions to help
catch any other logic error that might exist.

Thanks to @winksaville for all his help in testing and debugging
this. It likely wouldn't have been found/fixed so soon if it
weren't for him.

NOTE: While this fixes at least one of the causes of ponylang#2451, I don't
believe it resolves the entire issue. Hopefully @winksaville is
able to test and confirm if the issue still exists or not.
@dipinhora
Copy link
Contributor

@winksaville I've opened up the PR (#2472) regarding the pthreads condition variable fix. It would be awesome if you're able to test it out.

As an aside, to avoid any confusion regarding my "inside joke" from my comment last night, the inside joke was the comment about @SeanTAllen confirming that I'm a database guy. The rest of everything I wrote is actually my process and not a joke... 8*/

@dipinhora
Copy link
Contributor

@winksaville the atomics patch to apply on top of the PR #2472 commit for testing:

diff --git a/src/libponyrt/sched/scheduler.c b/src/libponyrt/sched/scheduler.c
index 6800ab7..f0bb952 100644
--- a/src/libponyrt/sched/scheduler.c
+++ b/src/libponyrt/sched/scheduler.c
@@ -118,7 +118,7 @@ static void wake_suspended_threads()
   while (get_active_scheduler_count() < scheduler_count)
   {
     if(!atomic_exchange_explicit(&scheduler_count_changing, true,
-      memory_order_acquire))
+      memory_order_seq_cst))
     {
       // in case the count changed between the while check and now
       if(get_active_scheduler_count() < scheduler_count)
@@ -130,14 +130,14 @@ static void wake_suspended_threads()
           // unlock the bool that controls modifying the active scheduler count
           // variable.
           atomic_store_explicit(&scheduler_count_changing, false,
-            memory_order_release);
+            memory_order_seq_cst);
       }
       else
         // if there are no scheduler threads left to unlock
         // unlock the bool that controls modifying the active scheduler count
         // variable.
         atomic_store_explicit(&scheduler_count_changing, false,
-          memory_order_release);
+          memory_order_seq_cst);
     }
   }
 }
@@ -429,7 +429,7 @@ static pony_actor_t* steal(scheduler_t* sched)
           && (current_active_scheduler_count > min_scheduler_count)
           && (!sched->terminate)
           && !atomic_exchange_explicit(&scheduler_count_changing, true,
-            memory_order_acquire))
+            memory_order_seq_cst))
         {
           // let sched 0 know we're suspending
           send_msg(sched->index, 0, SCHED_SUSPEND, 0);
@@ -451,13 +451,13 @@ static pony_actor_t* steal(scheduler_t* sched)
           // unlock the bool that controls modifying the active scheduler count
           // variable
           atomic_store_explicit(&scheduler_count_changing, false,
-            memory_order_release);
+            memory_order_seq_cst);
 
           // sleep waiting for signal to wake up again
           ponyint_thread_suspend(sched->sleep_object);
 
           bool scc = atomic_load_explicit(&scheduler_count_changing,
-            memory_order_acquire);
+            memory_order_seq_cst);
 
           // make sure scheduler_count_changing is true
           pony_assert(scc);
@@ -478,7 +478,7 @@ static pony_actor_t* steal(scheduler_t* sched)
           // variable before signalling
           scc = false;
           atomic_store_explicit(&scheduler_count_changing, scc,
-            memory_order_release);
+            memory_order_seq_cst);
 
           // dtrace resume notification
           DTRACE1(THREAD_RESUME, (uintptr_t)sched);
@@ -814,7 +814,7 @@ void ponyint_sched_maybe_wakeup()
   // if we have some schedulers that are sleeping, wake one up
   if((current_active_scheduler_count < scheduler_count) &&
     !atomic_exchange_explicit(&scheduler_count_changing, true,
-    memory_order_acquire))
+    memory_order_seq_cst))
   {
     // in case the count changed between the while check and now
     if(get_active_scheduler_count() < scheduler_count)
@@ -826,14 +826,14 @@ void ponyint_sched_maybe_wakeup()
         // unlock the bool that controls modifying the active scheduler count
         // variable.
         atomic_store_explicit(&scheduler_count_changing, false,
-          memory_order_release);
+          memory_order_seq_cst);
     }
     else
       // if there are no scheduler threads left to unlock
       // unlock the bool that controls modifying the active scheduler count
       // variable.
       atomic_store_explicit(&scheduler_count_changing, false,
-        memory_order_release);
+        memory_order_seq_cst);
   }
 }
 

dipinhora added a commit to dipinhora/ponyc that referenced this issue Jan 7, 2018
Prior to this commit, the pthreads variant of the dynamic scheduler
scaling functionality didn't use pthread condition variables
correctly for the logic required. It relied on a single pthread
condition variable for all threads to sleep against. Unfortunately,
there's no way to wake up a specific thread when relying on pthread
condition variables and the operating system is free to wake up any
thread that is sleeping against a pthread condition variable. This
resulted in out of order waking of sleeping threads which broke one
of the invariants for how dynamic scheduler scaling is supposed to
work.

This commit fixes the code to create a separate pthread condition
variable for each scheduler thread ensure that we can wake up a
specific single thread at a time correctly.

This commit also adds some error handling and asssertions to help
catch any other logic error that might exist.

Thanks to @winksaville for all his help in testing and debugging
this. It likely wouldn't have been found/fixed so soon if it
weren't for him.

NOTE: While this fixes at least one of the causes of ponylang#2451, I don't
believe it resolves the entire issue. Hopefully @winksaville is
able to test and confirm if the issue still exists or not.
@dipinhora
Copy link
Contributor

@winksaville I've updated PR #2472 to fix some edge cases around quiescence. I think I might have done a force push out of habit so keep that in mind when you pull/merge.

Given that I was able to reproduce the issue (at least for pthreads) and fix some issues and add some assertions in the PR, it would be great if you could do the following as a next step:

Assuming the debug tests are successful and nothing hangs and you have time:

@winksaville
Copy link
Contributor Author

@dipinhora, I ran pony-ring and it hung after about 20minutes, I then tried to replicate after rebuilding ponyc and ring again its run 2hrs without hanging. So I may have made a mistake the first run. In the second run I definitlely build ponyc with "make default_pic=true config=debug use=scheduler_scaling_pthreads" and ./ring was with run with the defaults in pony-ring's Makefile "./ring --ponynoblock --size 1000 --count 1000 --pass 10". I believe I did the same for the first run, but its possible I didn't.

Anyway, the second run continues, and below is the output from gdb on the first run and here is the log file:

GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./ring...done.
Attaching to program: /home/wink/prgs/pony/ring/ring, process 27079
[New LWP 27081]
[New LWP 27084]
[New LWP 27085]
[New LWP 27086]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007faa289e843d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7faa28fed980 (LWP 27079) "ring" 0x00007faa289e843d in pthread_join () from /usr/lib/libpthread.so.0
  2    Thread 0x7faa1f4b4700 (LWP 27081) "ring" 0x000056469863a988 in wake_suspended_threads ()
    at src/libponyrt/sched/scheduler.c:120
  3    Thread 0x7faa15cb4700 (LWP 27084) "ring" 0x00007faa289ed38d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /usr/lib/libpthread.so.0
  4    Thread 0x7faa0d4b4700 (LWP 27085) "ring" 0x00007faa289ed38d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /usr/lib/libpthread.so.0
  5    Thread 0x7faa0ccb4700 (LWP 27086) "ring" 0x00007faa289ed38d in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /usr/lib/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7faa28fed980 (LWP 27079))]
#0  0x00007faa289e843d in pthread_join () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007faa289e843d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x0000564698647ff3 in ponyint_thread_join (thread=140368646194944) at src/libponyrt/platform/threads.c:229
#2  0x000056469863b4e8 in ponyint_sched_shutdown () at src/libponyrt/sched/scheduler.c:630
#3  0x000056469863bb1d in ponyint_sched_start (library=false) at src/libponyrt/sched/scheduler.c:755
#4  0x000056469863a59e in pony_start (library=false, language_features=true, exit_code=0x0)
    at src/libponyrt/sched/start.c:168
#5  0x0000564698639af9 in main ()
(gdb) p *this_scheduler 
$1 = {tid = 140368808958336, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = false, 
  asio_noisy = false, sleep_object = 0x0, last_victim = 0x0, ctx = {scheduler = 0x0, current = 0x7faa27cb3800, 
    trace_object = 0x56469863f830 <ponyint_gc_sendobject>, trace_actor = 0x56469863fba0 <ponyint_gc_sendactor>, 
    stack = 0x0, acquire = {contents = {count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, 
    serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {count = 0, size = 0, item_bitmap = 0x0, 
        buckets = 0x0}}, serialise_alloc = 0x0, serialise_alloc_final = 0x0, serialise_throw = 0x0}, 
  block_count = 0, ack_token = 0, ack_count = 0, mute_mapping = {contents = {count = 0, size = 0, 
      item_bitmap = 0x0, buckets = 0x0}}, q = {head = 0x0, tail = {{object = 0x0, counter = 0}, 
      raw = 0x00000000000000000000000000000000}}, mq = {head = 0x0, tail = 0x0}}
(gdb) thread 2
[Switching to thread 2 (Thread 0x7faa1f4b4700 (LWP 27081))]
#0  0x000056469863a988 in wake_suspended_threads () at src/libponyrt/sched/scheduler.c:120
120	    if(!atomic_exchange_explicit(&scheduler_count_changing, true,
(gdb) bt
#0  0x000056469863a988 in wake_suspended_threads () at src/libponyrt/sched/scheduler.c:120
#1  0x000056469863adda in quiescent (sched=0x7faa27cb4800, tsc=1059165154515495, tsc2=1059165155734026)
    at src/libponyrt/sched/scheduler.c:286
#2  0x000056469863b002 in steal (sched=0x7faa27cb4800) at src/libponyrt/sched/scheduler.c:385
#3  0x000056469863b373 in run (sched=0x7faa27cb4800) at src/libponyrt/sched/scheduler.c:552
#4  0x000056469863b48b in run_thread (arg=0x7faa27cb4800) at src/libponyrt/sched/scheduler.c:617
#5  0x00007faa289e708a in start_thread () from /usr/lib/libpthread.so.0
#6  0x00007faa27fb742f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$2 = {tid = 140368646194944, index = 0, cpu = 0, node = 0, terminate = false, asio_stopped = true, 
  asio_noisy = false, sleep_object = 0x7faa27cb4400, last_victim = 0x7faa27cb49c0, ctx = {
    scheduler = 0x7faa27cb4800, current = 0x7fa9b64c6a00, trace_object = 0x56469863f8db <ponyint_gc_recvobject>, 
    trace_actor = 0x56469863fc1b <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {
        count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 3, ack_token = 10, ack_count = 3, 
  mute_mapping = {contents = {count = 0, size = 8, item_bitmap = 0x7fa9849cbc00, buckets = 0x7fa9849cbc08}}, q = {
    head = 0x7fa99485ee00, tail = {{object = 0x7fa99485ee00, counter = 5577}, 
      raw = 0x00000000000015c900007fa99485ee00}}, mq = {head = 0x7fa99485ec40, tail = 0x7fa997acf320}}
(gdb) thread 3
[Switching to thread 3 (Thread 0x7faa15cb4700 (LWP 27084))]
#0  0x00007faa289ed38d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007faa289ed38d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x0000564698648059 in ponyint_thread_suspend (signal=0x7faa27cb44c0) at src/libponyrt/platform/threads.c:262
#2  0x000056469863b1cc in steal (sched=0x7faa27cb4d40) at src/libponyrt/sched/scheduler.c:477
#3  0x000056469863b373 in run (sched=0x7faa27cb4d40) at src/libponyrt/sched/scheduler.c:552
#4  0x000056469863b48b in run_thread (arg=0x7faa27cb4d40) at src/libponyrt/sched/scheduler.c:617
#5  0x00007faa289e708a in start_thread () from /usr/lib/libpthread.so.0
#6  0x00007faa27fb742f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$3 = {tid = 140368486811392, index = 3, cpu = 3, node = 0, terminate = false, asio_stopped = false, 
  asio_noisy = false, sleep_object = 0x7faa27cb44c0, last_victim = 0x7faa27cb49c0, ctx = {
    scheduler = 0x7faa27cb4d40, current = 0x7fa9b5af2e00, trace_object = 0x56469863f8db <ponyint_gc_recvobject>, 
    trace_actor = 0x56469863fc1b <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {
        count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, 
  mute_mapping = {contents = {count = 0, size = 8, item_bitmap = 0x7fa9fdbca800, buckets = 0x7fa9fdbca808}}, q = {
    head = 0x7fa997a91c60, tail = {{object = 0x7fa997a91c60, counter = 1646}, 
      raw = 0x000000000000066e00007fa997a91c60}}, mq = {head = 0x7fa997ad2d80, tail = 0x7fa99485f3c0}}
(gdb) thread 4
[Switching to thread 4 (Thread 0x7faa0d4b4700 (LWP 27085))]
#0  0x00007faa289ed38d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007faa289ed38d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x0000564698648059 in ponyint_thread_suspend (signal=0x7faa27cb4500) at src/libponyrt/platform/threads.c:262
#2  0x000056469863b1cc in steal (sched=0x7faa27cb4f00) at src/libponyrt/sched/scheduler.c:477
#3  0x000056469863b373 in run (sched=0x7faa27cb4f00) at src/libponyrt/sched/scheduler.c:552
#4  0x000056469863b48b in run_thread (arg=0x7faa27cb4f00) at src/libponyrt/sched/scheduler.c:617
#5  0x00007faa289e708a in start_thread () from /usr/lib/libpthread.so.0
#6  0x00007faa27fb742f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$4 = {tid = 140368344205056, index = 4, cpu = 4, node = 0, terminate = false, asio_stopped = false, 
  asio_noisy = false, sleep_object = 0x7faa27cb4500, last_victim = 0x7faa27cb49c0, ctx = {
    scheduler = 0x7faa27cb4f00, current = 0x7fa9b61f7e00, trace_object = 0x56469863f8db <ponyint_gc_recvobject>, 
    trace_actor = 0x56469863fc1b <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {
        count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, 
  mute_mapping = {contents = {count = 0, size = 8, item_bitmap = 0x7fa9efe0d300, buckets = 0x7fa9efe0d308}}, q = {
    head = 0x7fa994844000, tail = {{object = 0x7fa994844000, counter = 1489}, 
      raw = 0x00000000000005d100007fa994844000}}, mq = {head = 0x7fa99485ed80, tail = 0x7fa99485f340}}
(gdb) thread 5
[Switching to thread 5 (Thread 0x7faa0ccb4700 (LWP 27086))]
#0  0x00007faa289ed38d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007faa289ed38d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x0000564698648059 in ponyint_thread_suspend (signal=0x7faa27cb4540) at src/libponyrt/platform/threads.c:262
#2  0x000056469863b1cc in steal (sched=0x7faa27cb50c0) at src/libponyrt/sched/scheduler.c:477
#3  0x000056469863b373 in run (sched=0x7faa27cb50c0) at src/libponyrt/sched/scheduler.c:552
#4  0x000056469863b48b in run_thread (arg=0x7faa27cb50c0) at src/libponyrt/sched/scheduler.c:617
#5  0x00007faa289e708a in start_thread () from /usr/lib/libpthread.so.0
#6  0x00007faa27fb742f in clone () from /usr/lib/libc.so.6
(gdb) p *this_scheduler 
$5 = {tid = 140368335816448, index = 5, cpu = 5, node = 0, terminate = false, asio_stopped = false, 
  asio_noisy = false, sleep_object = 0x7faa27cb4540, last_victim = 0x7faa27cb4d40, ctx = {
    scheduler = 0x7faa27cb50c0, current = 0x7faa0f77ea00, trace_object = 0x56469863f8db <ponyint_gc_recvobject>, 
    trace_actor = 0x56469863fc1b <ponyint_gc_recvactor>, stack = 0x0, acquire = {contents = {count = 0, size = 0, 
        item_bitmap = 0x0, buckets = 0x0}}, serialise_buffer = 0x0, serialise_size = 0, serialise = {contents = {
        count = 0, size = 0, item_bitmap = 0x0, buckets = 0x0}}, serialise_alloc = 0x0, 
    serialise_alloc_final = 0x0, serialise_throw = 0x0}, block_count = 0, ack_token = 0, ack_count = 0, 
  mute_mapping = {contents = {count = 0, size = 8, item_bitmap = 0x7fa9e7fdfc00, buckets = 0x7fa9e7fdfc08}}, q = {
    head = 0x7fa997a9a780, tail = {{object = 0x7fa997a9a780, counter = 1401}, 
      raw = 0x000000000000057900007fa997a9a780}}, mq = {head = 0x7fa99485e9c0, tail = 0x7fa99485f2c0}}
(gdb) printf "asio_cpu=%d scheduler_count=%d min_scheduler_count=%d active_scheduler_count=%d scheduler_count_changing=%d detect_quiescence=%d use_yield=%d\n", asio_cpu,scheduler_count,min_scheduler_count,active_scheduler_count,scheduler_count_changing,detect_quiescence,use_yield
asio_cpu=-1 scheduler_count=6 min_scheduler_count=1 active_scheduler_count=3 scheduler_count_changing=1 detect_quiescence=1 use_yield=1

@dipinhora
Copy link
Contributor

@winksaville Thanks for the update. The hang/backtrace you're showing here looks like the original hang/backtrace from the signal version (where scheduler_count_changing=1 and scheduler with index = 0 is in wake_suspended_threads for termination but a thread doesn't seem to have woken and reset scheduler_count_changing=0 as it should). Hopefully you're able to reproduce this again (whether with pthreads or signals).

It seems the PR has significantly improved things for at least the pthreads implementation. Can you comment on the PR with your findings (ideally about both the pthreads and signals implementations) on whether it improves things or not and we can hopefully get it merged (assuming it improves things) while we continue to find/fix other issues that cause the hanging behavior?

dipinhora added a commit to dipinhora/ponyc that referenced this issue Jan 10, 2018
This effectively disables dynamic scheduler scaling by default.

The main reason for this is issue ponylang#2451 which only surfaces when
dynamic scheduler scaling is used. We can revert this and re-enable
dynamic scheduler scaling by default once the issue is resolved.
dipinhora added a commit to dipinhora/ponyc that referenced this issue Jan 10, 2018
Prior to this commit, the pthreads variant of the dynamic scheduler
scaling functionality didn't use pthread condition variables
correctly for the logic required. It relied on a single pthread
condition variable for all threads to sleep against. Unfortunately,
there's no way to wake up a specific thread when relying on pthread
condition variables and the operating system is free to wake up any
thread that is sleeping against a pthread condition variable. This
resulted in out of order waking of sleeping threads which broke one
of the invariants for how dynamic scheduler scaling is supposed to
work.

This commit fixes the code to create a separate pthread condition
variable for each scheduler thread ensure that we can wake up a
specific single thread at a time correctly.

This commit also adds some error handling and asssertions to help
catch any other logic error that might exist.

Thanks to @winksaville for all his help in testing and debugging
this. It likely wouldn't have been found/fixed so soon if it
weren't for him.

NOTE: While this fixes at least one of the causes of ponylang#2451, I don't
believe it resolves the entire issue. Hopefully @winksaville is
able to test and confirm if the issue still exists or not.
dipinhora added a commit to dipinhora/ponyc that referenced this issue Jan 10, 2018
This effectively disables dynamic scheduler scaling by default.

The main reason for this is issue ponylang#2451 which only surfaces when
dynamic scheduler scaling is used. We can revert this and re-enable
dynamic scheduler scaling by default once the issue is resolved.
SeanTAllen pushed a commit that referenced this issue Jan 10, 2018
Prior to this commit, the pthreads variant of the dynamic scheduler
scaling functionality didn't use pthread condition variables
correctly for the logic required. It relied on a single pthread
condition variable for all threads to sleep against. Unfortunately,
there's no way to wake up a specific thread when relying on pthread
condition variables and the operating system is free to wake up any
thread that is sleeping against a pthread condition variable. This
resulted in out of order waking of sleeping threads which broke one
of the invariants for how dynamic scheduler scaling is supposed to
work.

This commit fixes the code to create a separate pthread condition
variable for each scheduler thread ensure that we can wake up a
specific single thread at a time correctly.

This commit also adds some error handling and asssertions to help
catch any other logic error that might exist.

Thanks to @winksaville for all his help in testing and debugging
this. It likely wouldn't have been found/fixed so soon if it
weren't for him.

NOTE: While this fixes at least one of the causes of #2451, I don't
believe it resolves the entire issue. Hopefully @winksaville is
able to test and confirm if the issue still exists or not.
SeanTAllen pushed a commit that referenced this issue Jan 10, 2018
This effectively disables dynamic scheduler scaling by default.

The main reason for this is issue #2451 which only surfaces when
dynamic scheduler scaling is used. We can revert this and re-enable
dynamic scheduler scaling by default once the issue is resolved.
dipinhora added a commit to dipinhora/ponyc that referenced this issue Jan 10, 2018
Change dynamic scheduler scaling implementation in order to resolve
the hangs encountered in ponylang#2451.

The previous implementation assumed that signalling to wake a thread
was a reliable operation. Apparently, that's not necessarily true
(see https://en.wikipedia.org/wiki/Spurious_wakeup and
https://askldjd.com/2010/04/24/the-lost-wakeup-problem/). Seeing
as we couldn't find any other explanation for why the previous
implementation was experiencing hangs, I've assumed it is either
because of lost wake ups or spurious wake ups and redesigned the
logic accordingly.

Now, when a thread is about to suspend, it will decrement the
`active_scheduler_count` and then suspend. When it wakes up, it will
check to see if the `active_scheduler_count` is at least as big as
its `index`. If the `active_scheduler_count` isn't big enough, the
thread will suspend itself again immediately. If it is big enough,
it will resume. Threads no longer modify `active_scheduler_count`
when they wake up.

`active_scheduler_count` must now be modified by the thread that is
waking up another thread prior to sending the wake up notification.
Additionally, since we're now assuming that wake up signals can be
lost, we now send multiple wake up notifications just in case. While
this is somewhat wasteful, it is better than being in a situation
where some threads aren't woken up at all (i.e. a hang).

This commit also includes a change inspired by ponylang#2474. Now, *all*
scheduler threads can suspend as long as there is at least one
noisy actor registered with the ASIO subsystem. If there are no
noisy actors registered with the ASIO subsystem then scheduler 0
is not allowed to suspend itself.
dipinhora added a commit to dipinhora/ponyc that referenced this issue Jan 14, 2018
Change dynamic scheduler scaling implementation in order to resolve
the hangs encountered in ponylang#2451.

The previous implementation assumed that signalling to wake a thread
was a reliable operation. Apparently, that's not necessarily true
(see https://en.wikipedia.org/wiki/Spurious_wakeup and
https://askldjd.com/2010/04/24/the-lost-wakeup-problem/). Seeing
as we couldn't find any other explanation for why the previous
implementation was experiencing hangs, I've assumed it is either
because of lost wake ups or spurious wake ups and redesigned the
logic accordingly.

Now, when a thread is about to suspend, it will decrement the
`active_scheduler_count` and then suspend. When it wakes up, it will
check to see if the `active_scheduler_count` is at least as big as
its `index`. If the `active_scheduler_count` isn't big enough, the
thread will suspend itself again immediately. If it is big enough,
it will resume. Threads no longer modify `active_scheduler_count`
when they wake up.

`active_scheduler_count` must now be modified by the thread that is
waking up another thread prior to sending the wake up notification.
Additionally, since we're now assuming that wake up signals can be
lost, we now send multiple wake up notifications just in case. While
this is somewhat wasteful, it is better than being in a situation
where some threads aren't woken up at all (i.e. a hang).

This commit also includes a change inspired by ponylang#2474. Now, *all*
scheduler threads can suspend as long as there is at least one
noisy actor registered with the ASIO subsystem. If there are no
noisy actors registered with the ASIO subsystem then scheduler 0
is not allowed to suspend itself.
dipinhora added a commit to dipinhora/ponyc that referenced this issue Jan 19, 2018
Change dynamic scheduler scaling implementation in order to resolve
the hangs encountered in ponylang#2451.

The previous implementation assumed that signalling to wake a thread
was a reliable operation. Apparently, that's not necessarily true
(see https://en.wikipedia.org/wiki/Spurious_wakeup and
https://askldjd.com/2010/04/24/the-lost-wakeup-problem/). Seeing
as we couldn't find any other explanation for why the previous
implementation was experiencing hangs, I've assumed it is either
because of lost wake ups or spurious wake ups and redesigned the
logic accordingly.

Now, when a thread is about to suspend, it will decrement the
`active_scheduler_count` and then suspend. When it wakes up, it will
check to see if the `active_scheduler_count` is at least as big as
its `index`. If the `active_scheduler_count` isn't big enough, the
thread will suspend itself again immediately. If it is big enough,
it will resume. Threads no longer modify `active_scheduler_count`
when they wake up.

`active_scheduler_count` must now be modified by the thread that is
waking up another thread prior to sending the wake up notification.
Additionally, since we're now assuming that wake up signals can be
lost, we now send multiple wake up notifications just in case. While
this is somewhat wasteful, it is better than being in a situation
where some threads aren't woken up at all (i.e. a hang).

Additionally, only use `scheduler_count_changing` for `signals`
implementation of dynamic scheduler scaling. `pthreads`
implementation now uses a mutex (`sched_mut`) in its place.
We also now change logic to only unlock mutex in `pthreads`
implementation once threads have been woken to avoid potential
lost wake ups. This isn't an issue for the `signals` implementation
and the unlocking of `scheduler_count_changing` can remain where it
is prior to threads being woken up.

This commit also splits out scheduler block/unblock message handling
logic into their own functions (this is so that sched 0 can call those
functions directly instead of sending messages to itself).

This commit also includes a change inspired by ponylang#2474. Now, *all*
scheduler threads can suspend as long as there is at least one
noisy actor registered with the ASIO subsystem. If there are no
noisy actors registered with the ASIO subsystem then scheduler 0
is not allowed to suspend itself because it is reponsible for
quiescence detection.

Lastly, this commit adds logic to allow a scheduler thread to suspend
even if it has already sent a scheduler block message so that we can
now suspend scheduler threads in most scenarios.
SeanTAllen pushed a commit that referenced this issue Jan 20, 2018
Change dynamic scheduler scaling implementation in order to resolve
the hangs encountered in #2451.

The previous implementation assumed that signalling to wake a thread
was a reliable operation. Apparently, that's not necessarily true
(see https://en.wikipedia.org/wiki/Spurious_wakeup and
https://askldjd.com/2010/04/24/the-lost-wakeup-problem/). Seeing
as we couldn't find any other explanation for why the previous
implementation was experiencing hangs, I've assumed it is either
because of lost wake ups or spurious wake ups and redesigned the
logic accordingly.

Now, when a thread is about to suspend, it will decrement the
`active_scheduler_count` and then suspend. When it wakes up, it will
check to see if the `active_scheduler_count` is at least as big as
its `index`. If the `active_scheduler_count` isn't big enough, the
thread will suspend itself again immediately. If it is big enough,
it will resume. Threads no longer modify `active_scheduler_count`
when they wake up.

`active_scheduler_count` must now be modified by the thread that is
waking up another thread prior to sending the wake up notification.
Additionally, since we're now assuming that wake up signals can be
lost, we now send multiple wake up notifications just in case. While
this is somewhat wasteful, it is better than being in a situation
where some threads aren't woken up at all (i.e. a hang).

Additionally, only use `scheduler_count_changing` for `signals`
implementation of dynamic scheduler scaling. `pthreads`
implementation now uses a mutex (`sched_mut`) in its place.
We also now change logic to only unlock mutex in `pthreads`
implementation once threads have been woken to avoid potential
lost wake ups. This isn't an issue for the `signals` implementation
and the unlocking of `scheduler_count_changing` can remain where it
is prior to threads being woken up.

This commit also splits out scheduler block/unblock message handling
logic into their own functions (this is so that sched 0 can call those
functions directly instead of sending messages to itself).

This commit also includes a change inspired by #2474. Now, *all*
scheduler threads can suspend as long as there is at least one
noisy actor registered with the ASIO subsystem. If there are no
noisy actors registered with the ASIO subsystem then scheduler 0
is not allowed to suspend itself because it is reponsible for
quiescence detection.

Lastly, this commit adds logic to allow a scheduler thread to suspend
even if it has already sent a scheduler block message so that we can
now suspend scheduler threads in most scenarios.
@dipinhora
Copy link
Contributor

@winksaville now that #2483 has been merged, can this issue be closed based on all your testing of that PR?

@winksaville
Copy link
Contributor Author

@dipinhora, yes closing as I'm unable to reproduce this error with #2483 merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triggers release Major issue that when fixed, results in an "emergency" release
Projects
None yet
Development

No branches or pull requests

3 participants