
MultiProc goes into infinite loop (resource management related) #2548


Closed
chrisgorgo opened this issue Apr 22, 2018 · 15 comments

@chrisgorgo
Member

Actual behavior

On certain occasions MultiProc claims a job is running, but it actually goes into an infinite loop and never finishes.

Expected behavior

Tasks should either run and finish or an exception should be raised if there aren't enough resources to run the task.

How to replicate the behavior

Run MRIQC 0.10.4 on https://openneuro.org/datasets/ds001338/versions/00001 with Docker allocated 7 GB of RAM and 2 CPUs.

Full command to use

mriqc //d/data/ParanoiaStory_20180421_154028/ParanoiaStory/ //d/scratch/mriqc_out participant --participant_label tb3858 --n_procs 2 --verbose
@chrisgorgo chrisgorgo added the bug label Apr 22, 2018
@chrisgorgo
Member Author

The job that MRIQC runs forever on in this particular case is compute_tsnr.

@chrisgorgo
Member Author

More info: if I run the same workflow with the Linear plugin, it works. If I increase the amount of available RAM to 15 GB, it also works.

@oesteban oesteban self-assigned this Apr 23, 2018
@oesteban
Contributor

oesteban commented Apr 23, 2018

My bet is that, due to a MemoryError, one process had to be killed. In the latest refactor of MultiProc, we tried to make sure that killed child processes were reported. However (and this may be a design problem of multiprocessing), if a worker is killed, a new one is spun up silently (https://github.com/python/cpython/blob/master/Lib/multiprocessing/pool.py#L406). I guess the node kept running as a zombie for a little while, since the pool was running in non-daemon mode.
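
For reference, a minimal standalone sketch (plain multiprocessing, not nipype; it peeks at the private Pool._pool attribute just to locate a worker PID and assumes a POSIX system) that reproduces the silent respawn:

import os
import signal
import time
from multiprocessing import Pool

def busy(_):
    # stand-in for a long-running node
    time.sleep(60)
    return os.getpid()

if __name__ == '__main__':
    pool = Pool(processes=2)
    result = pool.map_async(busy, [0, 1])
    time.sleep(1)                        # let both workers pick up a task
    victim = pool._pool[0].pid           # private attribute, used only to find a worker PID
    os.kill(victim, signal.SIGKILL)      # simulate the OOM killer
    time.sleep(1)                        # give the pool time to maintain itself
    print([w.pid for w in pool._pool])   # a fresh worker has silently replaced the victim
    print(result.ready())                # False: the killed task is never reported back
    pool.terminate()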

I'll start writing a unit test for this:

  1. Spin up a workflow, and run MultiProc with n_procs=2.
  2. Start two endless nodes, which store their PID in a common place.
  3. Query the PIDs of the workers.
  4. Kill one worker (use the PIDs from step 2 to make sure none of the tasks are killed directly).
  5. Check whether Nipype ends with an error or keeps going silently.
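
A rough sketch of such a test (assuming a POSIX system and that psutil is available; the watchdog helper and node names are illustrative, not existing nipype test code). Note that for Function nodes the worker and the node share a PID, so killing a child of the main process kills both:

import os
import signal
import threading
import time

import psutil
from nipype.pipeline import engine as pe
from nipype.interfaces import utility as niu

def _endless():
    # step 2: a node that never finishes
    from time import sleep
    while True:
        sleep(1)

def _kill_one_worker(delay=20):
    # steps 3-4: the MultiProc workers are children of the main process;
    # kill one of them once the workflow is running
    time.sleep(delay)
    workers = psutil.Process(os.getpid()).children()
    if workers:
        os.kill(workers[0].pid, signal.SIGKILL)

wf = pe.Workflow('killtest')
wf.add_nodes([
    pe.Node(niu.Function(function=_endless), name='node1'),
    pe.Node(niu.Function(function=_endless), name='node2'),
])

threading.Thread(target=_kill_one_worker, daemon=True).start()
# step 5: this call should raise once the bug is fixed; today it polls forever
wf.run('MultiProc', plugin_args={'n_procs': 2})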

WDYT @satra?

@oesteban
Contributor

A proof of concept that successfully replicates this problem:

from nipype.pipeline import engine as pe
from nipype.interfaces import utility as niu

wf = pe.Workflow('testworkflow')
def _print():
    import os
    from time import sleep
    while True:
        print(os.getpid())
        sleep(10)

wf.add_nodes([
    pe.Node(niu.Function(function=_print), name='node1'),
    pe.Node(niu.Function(function=_print), name='node2')
])
wf.run('MultiProc', plugin_args={'n_procs': 2})

Then I killed one (or both) of the PIDs printed during runtime. After killing the processes, Nipype kept polling indefinitely without error.

@oesteban
Contributor

Okay, since the PID of the worker and the node are the same, I wanted to check whether this happens because of the death of the worker, the child, or both:

from nipype.pipeline import engine as pe
from nipype.interfaces.base import CommandLine

wf = pe.Workflow('testworkflow')

wf.add_nodes([
    pe.Node(CommandLine(command='bash', terminal_output='stream',
                        args='-c "while true; do echo $$: running node1; sleep 10; done"'),
            name='node1'),
    pe.Node(CommandLine(command='bash', terminal_output='stream',
                        args='-c "while true; do echo $$: running node2; sleep 10; done"'),
            name='node2')
])
wf.run('MultiProc', plugin_args={'n_procs': 2})

Each of node1 and node2 forks a command /usr/bin/sh -c 'bash -c "while true; do echo $$: running node1; sleep 10; done"', which in turn forks and starts bash -c "while true; do echo $$: running node1; sleep 10; done".

  • If I kill any of the workers: everything keeps running as if nothing happened.
  • If I kill the child of the process forked by Nipype (without /usr/bin/sh): everything keeps running as if nothing happened. The process persists as a zombie.
  • If I kill the process directly forked by Nipype (with /usr/bin/sh): the process is killed, the child is finally killed too (and the zombie cleared up), and Nipype is informed, producing:
    RuntimeError: Command:
    bash -c "while true; do echo $$: running node2; sleep 10; done"
    Standard output:
    26044: running node2
    ...
    26044: running node2
    Standard error:
    Return code: -9
    
    Meanwhile, the other node keeps going:
    180423-15:52:00,776 workflow DEBUG:
    	 Tasks currently running: 1. Pending: 1.
    180423-15:52:00,777 workflow INFO:
    	 [MultiProc] Running 1 tasks, and 0 jobs ready. Free memory (GB): 28.00/28.20, Free processors: 1/2.
                         Currently running:
                           * testworkflow.node1
    180423-15:52:02,774 workflow DEBUG:
    	 Progress: 2 jobs, 1/1/0 (done/running/ready), 1/0 (pending_tasks/waiting).
    180423-15:52:02,774 workflow DEBUG:
    	 Tasks currently running: 1. Pending: 1.
    


@satra
Member

satra commented Apr 24, 2018

https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures - that is a very recent addition to Python (it won't work on 2.7 or 3.4 and earlier, as far as I know), but they are trying to move everyone off of 2.7.

@effigies
Member

There is a 2.7 backport, if we want to move to concurrent.futures: https://pypi.org/project/futures/

@oesteban
Contributor

About the backport:

The ProcessPoolExecutor class has known (unfixable) problems on Python 2 and should not be relied on for mission critical work.

I've used concurrent.futures and it is very convenient.

One early question I have is: is it possible to change the context? (e.g. we use the 'forkserver' context in FMRIPREP to save memory). I think it is, but we would need to check first.
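
For what it's worth, a minimal sketch (not nipype code) of both points: Python 3.7+ lets you pass an mp_context to ProcessPoolExecutor, so the 'forkserver' start method stays available, and a worker that dies abruptly surfaces as a BrokenProcessPool error instead of hanging:

import multiprocessing as mp
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def die():
    # stand-in for a worker killed by the OOM killer
    os.kill(os.getpid(), signal.SIGKILL)

if __name__ == '__main__':
    ctx = mp.get_context('forkserver')   # mp_context support requires Python 3.7+
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        future = executor.submit(die)
        try:
            future.result()
        except BrokenProcessPool:
            print('the dead worker is reported instead of silently replaced')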

Then, I have to say, this change will probably mean a deep rewrite of MultiProc, and I think we have enough refactoring with nipype-2.0.

Along with the bug report (https://bugs.python.org/issue22393), the reporter also filed a patch. We could test this patch in nipype and distribute a patched multiprocessing.Pool. WDYT?

@oesteban
Contributor

Hmm, I'm seeing that in https://bugs.python.org/issue22393 the author says the patch is prone to https://bugs.python.org/issue6721 - deadlocks when combining forking and logging (which we do extensively).

This would point at concurrent.futures again.

An alternative option is to deprecate MultiProc: if Python >= 3.5, concurrent.futures is used; otherwise, the old MultiProc is used and a big fat WARNING is issued.
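
A hypothetical sketch of that gate (the helper name is illustrative, not actual nipype API):

import sys
import warnings

def choose_pool_backend():
    # hypothetical helper: pick the pool implementation backing MultiProc
    if sys.version_info >= (3, 5):
        from concurrent.futures import ProcessPoolExecutor
        return ProcessPoolExecutor       # reports dead workers via BrokenProcessPool
    warnings.warn('Python < 3.5: falling back to multiprocessing.Pool; workers '
                  'killed by the OS will not be detected and the workflow may '
                  'hang indefinitely.', RuntimeWarning)
    from multiprocessing import Pool
    return Pool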

Opinions?

@oesteban
Contributor

BTW, I will remove my self-assignment because I don't think I can undertake this issue at the moment. If nobody takes it, I can reconsider. /cc @chrisfilo

@oesteban oesteban removed their assignment Apr 24, 2018
@chrisgorgo
Member Author

No worries - thanks for your investigation. This is really helpful.

Another alternative is to switch to Dask for local multiprocessing.

@effigies
Member

Marking this 1.1 since it might take some more concerted effort and it seems a big-enough change to justify a minor version.

@chrisgorgo
Member Author

I've seen a couple more MRIQC jobs stuck with these symptoms:

one on workflow_enumerator.funcMRIQC.ReportsWorkflow.SpikesFinderBgMask and another on workflow_enumerator.funcMRIQC.ComputeIQMs.metadata

@chrisgorgo
Member Author

Potentially relevant: https://neurostars.org/t/fmriprep-v1-0-12-hanging/1661

oesteban added a commit to oesteban/nipype that referenced this issue Nov 9, 2018
This PR relates to nipy#2700, and should fix the problem
underlying nipy#2548.

I first considered adding a control thread that monitors
the `Pool` of workers, but that would require a large overhead
keeping track of PIDs and polling very often.

Just adding the core file of [bpo-22393](python/cpython#10441)
should fix nipy#2548