NonDaemonicSpawnProcess hangs at exit #1497

Closed
albertz opened this issue Jan 18, 2024 · 2 comments

albertz commented Jan 18, 2024

One worker looks like this:

Thread 1130071 (idle): "MainThread"
    __call__ (returnn/returnn/util/multi_proc_non_daemonic_spawn.py:145)

This is the atexit handler, blocked at this line:

                os.waitpid(self.proc_pid, 0)

So it hangs waiting for some sub proc, after having sent SIGINT to it.
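
To check whether a child really ignores the SIGINT, the blocking os.waitpid(pid, 0) can be swapped for a bounded poll. This is only a diagnostic sketch, not the handler from multi_proc_non_daemonic_spawn.py; interrupt_and_wait and its timeout parameter are made-up names for illustration:

```python
import os
import signal
import time


def interrupt_and_wait(pid: int, timeout: float = 5.0) -> bool:
    """Send SIGINT to the child `pid`, then wait at most `timeout` seconds for it.

    Hypothetical helper. Returns True if the child exited and was reaped,
    False if it is still running when the timeout expires.
    """
    os.kill(pid, signal.SIGINT)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # With WNOHANG, waitpid returns (0, 0) right away while the child is alive,
        # instead of blocking like os.waitpid(pid, 0) does.
        reaped_pid, _status = os.waitpid(pid, os.WNOHANG)
        if reaped_pid == pid:
            return True
        time.sleep(0.05)
    return False
```

A False return means the child is still running after the timeout, which is exactly the situation where the blocking call would hang.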

Looking at the proc tree (after I sent a few SIGINTs to some of the MPD workers, which are now in defunct state):

1130071          \_ python3.11
1130231          |   \_ python3.11
1130232          |   \_ watch memory
1130350          |   \_ MPD worker 0 <defunct>
1130353          |   \_ MPD worker 1 <defunct>
1130354          |   \_ MPD worker 2 <defunct>
1130355          |   \_ MPD worker 3 <defunct>
1130811          |   \_ python3.11
1131110          |   \_ MPD worker 0
1131111          |   \_ MPD worker 1
1131112          |   \_ MPD worker 2
1131114          |   \_ MPD worker 3
1131433          |   \_ MPD worker 0
1131434          |   \_ MPD worker 1
1131435          |   \_ MPD worker 2
1131467          |   \_ MPD worker 3 <defunct>
1131835          |   \_ TDL worker 0
1132208          |   |   \_ MPD worker 0
1132314          |   |   \_ MPD worker 1
1132420          |   |   \_ MPD worker 2
1132528          |   |   \_ MPD worker 3
1136336          |   \_ TDL worker 0
1136702          |   |   \_ MPD worker 0
1136806          |   |   \_ MPD worker 1
1136929          |   |   \_ MPD worker 2
1137038          |   |   \_ MPD worker 3
1137253          |   \_ TDL worker 0
1137614          |       \_ MPD worker 0
1137718          |       \_ MPD worker 1
1137845          |       \_ MPD worker 2
1137969          |       \_ MPD worker 3

As the main proc hangs in waitpid, maybe it is waiting for one of the TDL workers.

The last TDL worker:

$ py-spy dump -p 1137253
Process 1137253: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=24, pipe_handle=184) --multiprocessing-fork
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/bin/python3.11)

Thread 1137253 (idle): "MainThread"
    poll (multiprocessing/popen_fork.py:27)
    wait (multiprocessing/popen_fork.py:43)
    join (multiprocessing/process.py:149)
    join (returnn/returnn/util/multi_proc_non_daemonic_spawn.py:66)
    _exit_function (multiprocessing/util.py:357)
    _bootstrap (multiprocessing/process.py:317)
    _main (multiprocessing/spawn.py:133)
    spawn_main (multiprocessing/spawn.py:120)
    <module> (<string>:1)
Thread 1138171 (idle): "Thread-1 (_serve)"
    accept (socket.py:294)
    accept (multiprocessing/connection.py:608)
    accept (multiprocessing/connection.py:462)
    _serve (multiprocessing/resource_sharer.py:138)
    run (threading.py:975)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)

So this one also waits for some sub proc. But its sub procs all look like this:

Thread 1137967 (idle): "MainThread"
    _recv (multiprocessing/connection.py:378)
    _recv_bytes (multiprocessing/connection.py:413)
    recv (multiprocessing/connection.py:249)
    _worker_proc_loop (returnn/returnn/datasets/multi_proc.py:240)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:133)
    spawn_main (multiprocessing/spawn.py:120)
    <module> (<string>:1)

Originally posted by @albertz in #1496 (comment)

albertz commented Jan 18, 2024

So, in _bootstrap, you can see that there is an early call to _exit_function, which terminates all daemonic procs and then joins all procs (daemonic and non-daemonic). This call to _exit_function happens before any of our atexit handlers run. (I'm not sure why they do this early call. The atexit handlers would run in reverse order, so our atexit handlers would run first and do proper cleanup of our procs, and _exit_function would run at some later point, when everything is already cleaned up.) So this hangs: all our procs are non-daemonic, so nothing terminates them, they just continue to run, and the join never returns.
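
The ordering can be made visible with a small script. The sketch below is a demo under assumptions, not code from RETURNN: it writes a throwaway script (DEMO_SCRIPT, report, child_main and run_demo are illustrative names) that registers an atexit handler in the parent and in a spawned child, and each handler prints multiprocessing.util.is_exiting(). Given the early _exit_function call in _bootstrap, the child's handler should see True, because the exit function has already run by then. The parent's handler runs before _exit_function and should see False.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# The spawn start method re-imports the main module in the child,
# so the demo lives in its own script file rather than in this module.
DEMO_SCRIPT = textwrap.dedent(
    '''
    import atexit
    import multiprocessing
    import multiprocessing.util


    def report(label):
        # Print whether multiprocessing already considers this process to be exiting.
        print(f"{label}: is_exiting={multiprocessing.util.is_exiting()}", flush=True)


    def child_main():
        # Registered inside the child; runs at interpreter exit, after _bootstrap returned.
        atexit.register(report, "child")


    if __name__ == "__main__":
        atexit.register(report, "parent")
        proc = multiprocessing.get_context("spawn").Process(target=child_main)
        proc.start()
        proc.join()
    '''
)


def run_demo() -> str:
    """Run DEMO_SCRIPT in a fresh interpreter and return what it printed."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        script_path = os.path.join(tmp_dir, "exit_order_demo.py")
        with open(script_path, "w") as f:
            f.write(DEMO_SCRIPT)
        result = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=60,
            check=True,
        )
    return result.stdout


if __name__ == "__main__":
    print(run_demo(), end="")
```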

albertz commented Jan 18, 2024

Our solution/workaround now: The atexit handler that does not get to run in time is the one for our own custom Process class. So we add an extra check in its overridden join function for whether we are currently exiting (luckily, multiprocessing.util.is_exiting() exists for exactly this purpose). If so, we directly send a SIGINT to the proc, just as our atexit handler would do.
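
A rough sketch of that check, under assumptions: InterruptOnExitJoinMixin and NonDaemonicSpawnProcessSketch are hypothetical names, and the real class in multi_proc_non_daemonic_spawn.py may differ in details such as how it tracks the pid or handles errors.

```python
import multiprocessing.context
import multiprocessing.util
import os
import signal


class InterruptOnExitJoinMixin:
    """Hypothetical mixin: join() interrupts the child if the process is exiting."""

    def join(self, timeout=None):
        # is_exiting() becomes True once multiprocessing's _exit_function has started,
        # which includes the early call at the end of _bootstrap in a child process.
        if multiprocessing.util.is_exiting() and self.is_alive():
            # Same signal the atexit handler would have sent, just sent earlier.
            os.kill(self.pid, signal.SIGINT)
        return super().join(timeout)


class NonDaemonicSpawnProcessSketch(InterruptOnExitJoinMixin, multiprocessing.context.SpawnProcess):
    """Stand-in for illustration only; not RETURNN's actual class."""
```

Outside of interpreter exit, join() behaves as usual. During exit, the child gets the SIGINT before the join blocks, so a child that reacts to SIGINT can no longer keep _exit_function waiting forever.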
