MultiProc goes into infinite loop (resource management related) #2548
The job that MRIQC is running forever on this particular occasion is …
More info: if I run the same workflow with the Linear plugin, it works. If I increase the amount of available RAM to 15 GB, it also works.
My bet is that, due to a MemoryError, one process had to be killed. In the latest refactor of MultiProc, we tried to make sure that killed child processes were reported. However (and this may be a design problem of multiprocessing), if a worker is killed, a new one is spun up silently (https://github.com/python/cpython/blob/master/Lib/multiprocessing/pool.py#L406). I guess the node kept running as a zombie for a little while, since the pool was running in non-daemon mode. I'll start writing a unit test for this.
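Meanwhile, here is a minimal sketch of the underlying behavior in plain `multiprocessing`, outside Nipype (an illustration, not the unit test; the private attribute `pool._pool` is used only to discover the worker's PID):

```python
# Sketch: when a Pool worker is SIGKILLed, the Pool silently replaces it
# and the pending result never completes -- so a caller polling the result
# waits forever.
import os
import signal
import time
from multiprocessing import Pool


def hang():
    print('worker pid:', os.getpid(), flush=True)
    time.sleep(3600)  # simulate a long-running node


if __name__ == '__main__':
    pool = Pool(processes=1)
    result = pool.apply_async(hang)
    time.sleep(1)                                # let the worker start
    os.kill(pool._pool[0].pid, signal.SIGKILL)   # emulate the OOM killer
    time.sleep(1)                                # pool re-spawns a worker meanwhile
    print('replacement worker pid:', pool._pool[0].pid)
    result.wait(timeout=5)                       # the original task is simply lost...
    print('task finished?', result.ready())     # ...so this prints False
```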
WDYT @satra?
A proof of concept that successfully replicates this problem:

```python
from nipype.pipeline import engine as pe
from nipype.interfaces import utility as niu

wf = pe.Workflow('testworkflow')

def _print():
    import os
    from time import sleep
    while True:
        print(os.getpid())
        sleep(10)

wf.add_nodes([
    pe.Node(niu.Function(function=_print), name='node1'),
    pe.Node(niu.Function(function=_print), name='node2')
])

wf.run('MultiProc', plugin_args={'n_procs': 2})
```

Then I killed one (or both) of the PIDs printed during runtime. Killing the processes left Nipype polling indefinitely, without any error.
Okay, since the PID of the worker and the node are the same, I wanted to check whether this happens because of the death of the worker, the child, or both:

```python
from nipype.pipeline import engine as pe
from nipype.interfaces.base import CommandLine

wf = pe.Workflow('testworkflow')
wf.add_nodes([
    pe.Node(CommandLine(command='bash',
                        args='-c "while true; do echo $$: running node1; sleep 10; done"',
                        terminal_output='stream'),
            name='node1'),
    pe.Node(CommandLine(command='bash',
                        args='-c "while true; do echo $$: running node2; sleep 10; done"',
                        terminal_output='stream'),
            name='node2')
])

wf.run('MultiProc', plugin_args={'n_procs': 2})
```
https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures - that is a very recent addition to Python (it won't work on 2.7 or 3.4 and earlier, as far as I know), but they are trying to move everyone off of 2.7.
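For context, a minimal sketch of why `concurrent.futures` would help here: if a worker of a `ProcessPoolExecutor` dies abruptly, pending futures fail loudly with `BrokenProcessPool` instead of being silently dropped (the private `executor._processes` — a `{pid: Process}` dict on Python >= 3.5 — is used only to find a worker PID for the demo):

```python
# Sketch: ProcessPoolExecutor detects abrupt worker deaths -- the pending
# future raises BrokenProcessPool instead of hanging forever.
import os
import signal
import time
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def hang():
    print('worker pid:', os.getpid(), flush=True)
    time.sleep(3600)


if __name__ == '__main__':
    executor = ProcessPoolExecutor(max_workers=1)
    future = executor.submit(hang)
    time.sleep(1)                              # let the worker start
    victim = next(iter(executor._processes))   # private attr, demo only
    os.kill(victim, signal.SIGKILL)
    try:
        future.result()
    except BrokenProcessPool:
        print('worker died -> loud failure, no infinite polling')
```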
There is a 2.7 backport, if we want to move to `concurrent.futures`.
About the backport: I've used it before. One early question I have is: is it possible to change the context? (e.g., we run the pool in non-daemon mode). Then, I have to say, this change will probably mean a deep rewrite of MultiProc, and I think we have enough refactoring with nipype-2.0. With the bug report (https://bugs.python.org/issue22393), the reporter also filed a patch. We could test this patch in nipype and distribute a patched multiprocessing.Pool. WDYT?
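On the context question, a sketch of what the standard library allows (note that the `mp_context` argument to `ProcessPoolExecutor` only exists from Python 3.7 on, and whether the 2.7 backport supports anything similar is an assumption that would need checking):

```python
# Sketch: selecting a multiprocessing start method / context explicitly.
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':
    # plain multiprocessing: contexts exist from Python 3.4 on
    ctx = mp.get_context('forkserver')
    pool = ctx.Pool(processes=2)
    pool.close()
    pool.join()

    # concurrent.futures: mp_context is only accepted from Python 3.7 on
    executor = ProcessPoolExecutor(max_workers=2, mp_context=ctx)
    executor.shutdown()
```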
Hmm, I'm seeing that in https://bugs.python.org/issue22393 the author says the patch is prone to https://bugs.python.org/issue6721 - deadlocks when combining forking and logging (which we do extensively). This would point back at `concurrent.futures`. An alternative option is to deprecate MultiProc: if Python >= 3.5, then concurrent.futures is used; otherwise, the old MultiProc is used and a big fat WARNING is issued. Opinions?
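A sketch of what that deprecation could look like (all names here are hypothetical, not existing Nipype API):

```python
import sys
import warnings


def select_multiproc_plugin():
    """Hypothetical selector implementing the proposal above."""
    if sys.version_info >= (3, 5):
        # hypothetical plugin built on concurrent.futures
        return 'MultiProcFutures'
    warnings.warn(
        'MultiProc on Python < 3.5 cannot reliably detect killed workers '
        'and may poll forever; consider upgrading to Python >= 3.5.',
        DeprecationWarning, stacklevel=2)
    return 'MultiProc'  # legacy multiprocessing.Pool-based plugin
```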
BTW, I will remove my self-assignment because I don't think I can undertake this issue at the moment. If nobody takes it, I can reconsider. /cc @chrisfilo
No worries - thanks for your investigation. This is really helpful. Another alternative is to switch to Dask for local multiprocessing.
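For reference, a minimal sketch of that alternative (assuming `dask.distributed` as the backend; how it would be wired into Nipype's plugin interface is left open):

```python
# Sketch: local multiprocessing via a Dask distributed client. If a worker
# process dies, the scheduler retries the task and eventually raises
# KilledWorker instead of hanging.
from dask.distributed import Client


def run_node(i):
    return i * 2  # stand-in for running a Nipype node


if __name__ == '__main__':
    client = Client(n_workers=2, threads_per_worker=1)  # local process workers
    futures = client.map(run_node, range(4))
    print(client.gather(futures))  # [0, 2, 4, 6]
    client.close()
```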
Marking this 1.1 since it might take some more concerted effort and it seems a big-enough change to justify a minor version.
I've seen a couple more MRIQC jobs stuck with those symptoms: one on …
Potentially relevant: https://neurostars.org/t/fmriprep-v1-0-12-hanging/1661
This PR relates to nipy#2700, and should fix the problem underlying nipy#2548. I first considered adding a control thread that monitors the `Pool` of workers, but that would require a large overhead, keeping track of PIDs and polling very often. Just adding the core file of [bpo-22393](python/cpython#10441) should fix nipy#2548.
Actual behavior
On particular occasions, MultiProc claims a job is running but actually goes into an infinite loop and never finishes.
Expected behavior
Tasks should either run and finish or an exception should be raised if there aren't enough resources to run the task.
How to replicate the behavior
Run MRIQC 0.10.4 on https://openneuro.org/datasets/ds001338/versions/00001 with Docker allocated 7 GB of RAM and 2 CPUs.
Full command to use