Ability to run Arboreto using a multiprocessing pool in place of Dask #21

cflerin · 2020-01-16T15:25:00Z

The ability to run Arboreto across multiple nodes with Dask is extremely powerful, but the implementation has caused a lot of issues for me (and others, it seems). In many cases I had serious problems with the Dask client: it would sometimes appear to keep computing for days, or quit halfway through a run with a cryptic error.

In practice, I have only ever used a single node to run GRNBoost2, and it's still quite fast, even for tens to hundreds of thousands of cells. Therefore, I thought this multiprocessing implementation might be useful. I've been using it extensively, and it's quite reliable. In many cases, the compute time is actually slightly shorter when using multiprocessing (perhaps due to some Dask overhead?).

Summary of changes:

Setting the client_or_address parameter to 'multiprocessing' in either of the grnboost2 or genie3 functions will run these algorithms using a multiprocessing pool. The number of workers is specified with the multiprocessing_workers parameter.
- parameters added to grnboost2, genie3, and diy functions
- logic added to select Dask or multiprocessing in diy function
- added run_arboreto_mp, which sets up a multiprocessing pool and calculates links for each target gene separately
- replaced pd.DataFrame as_matrix with to_numpy (minor fix)
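
For readers unfamiliar with the pattern, here is a minimal sketch of "one task per target gene, fanned out with Pool.starmap". It is not the code from this PR: `pearson`, `links_for_target` and `infer_links` are hypothetical names, and a plain correlation score stands in for the real per-target regression. The expression data is a toy dict mapping gene name to a list of values.

```python
import random
from multiprocessing import Pool


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def links_for_target(expression, tf_names, target):
    """Toy stand-in for the per-target model: score every TF against one target."""
    return [
        (tf, target, abs(pearson(expression[tf], expression[target])))
        for tf in tf_names
        if tf != target
    ]


def infer_links(expression, tf_names, workers=2):
    """Build one argument tuple per target gene and run them in a process pool."""
    # Note: every task tuple carries the whole expression dict, so it is
    # pickled and sent to a worker once per task.
    tasks = [(expression, tf_names, target) for target in expression]
    with Pool(processes=workers) as pool:
        per_target = pool.starmap(links_for_target, tasks)
    links = [link for chunk in per_target for link in chunk]
    return sorted(links, key=lambda link: link[2], reverse=True)


if __name__ == "__main__":
    rng = random.Random(777)
    tf_a = [rng.random() for _ in range(50)]
    tf_b = [rng.random() for _ in range(50)]
    expression = {
        "TF_A": tf_a,
        "TF_B": tf_b,
        "GENE_1": [2 * a + 0.1 * rng.random() for a in tf_a],
        "GENE_2": [1 - b + 0.1 * rng.random() for b in tf_b],
    }
    for link in infer_links(expression, ["TF_A", "TF_B"])[:3]:
        print(link)
```

Because each target is scored independently, the result does not depend on which worker handles which gene, only on the inputs to each task.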

As a check, with a fixed seed the multiprocessing implementation produces the same results as Dask:

# test data:
wget https://raw.githubusercontent.com/aertslab/SCENICprotocol/master/example/allTFs_hg38.txt
wget https://raw.githubusercontent.com/aertslab/SCENICprotocol/master/example/expr_mat.loom

pip install --force-reinstall git+https://github.com/cflerin/arboreto@multiprocessing

# python:
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2
import loompy as lp
import pandas as pd

lf = lp.connect('expr_mat.loom', mode='r', validate=False)
ex_matrix = pd.DataFrame(lf[:, :], index=lf.ra.Gene, columns=lf.ca.CellID).T
lf.close()

tf_names = load_tf_names('allTFs_hg38.txt')

# Dask test:
network = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names,
                    seed=777,
                    verbose=True)

>>> network.head()
        TF  target  importance
27   RPS4X   RPL30   57.537719
665   SPI1    CSTA   55.521603
27   RPS4X  EEF1A1   54.931686
27   RPS4X   RPS14   53.646867
692  RPL35    RPL3   52.932191

# multiprocessing test:
networkMP = grnboost2(expression_data=ex_matrix,
                      tf_names=tf_names,
                      client_or_address='multiprocessing',
                      multiprocessing_workers=7,
                      seed=777,
                      verbose=True)

>>> networkMP.head()
        TF  target  importance
27   RPS4X   RPL30   57.537719
665   SPI1    CSTA   55.521603
27   RPS4X  EEF1A1   54.931686
27   RPS4X   RPS14   53.646867
692  RPL35    RPL3   52.932191
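
Comparing the first five rows by eye only goes so far. A small helper along the following lines could compare two link lists in full. This is a sketch, not part of the PR: `networks_match` is a hypothetical name, and it works on plain (TF, target, importance) tuples so that it has no dependencies. The sample rows are the five shown above.

```python
import math


def networks_match(rows_a, rows_b, rel_tol=1e-9):
    """True if both link lists hold the same (TF, target) pairs with matching importances."""
    a = sorted(rows_a)  # sort so that row order does not matter
    b = sorted(rows_b)
    if len(a) != len(b):
        return False
    return all(
        tf1 == tf2 and tg1 == tg2 and math.isclose(imp1, imp2, rel_tol=rel_tol)
        for (tf1, tg1, imp1), (tf2, tg2, imp2) in zip(a, b)
    )


dask_rows = [
    ("RPS4X", "RPL30", 57.537719),
    ("SPI1", "CSTA", 55.521603),
    ("RPS4X", "EEF1A1", 54.931686),
    ("RPS4X", "RPS14", 53.646867),
    ("RPL35", "RPL3", 52.932191),
]
mp_rows = list(reversed(dask_rows))  # same links, different order

print(networks_match(dask_rows, mp_rows))
```

With the real DataFrames, `network.itertuples(index=False, name=None)` yields tuples in this shape, so the call would be `networks_match(network.itertuples(index=False, name=None), networkMP.itertuples(index=False, name=None))`.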


cflerin · 2020-02-19T09:45:05Z

After some additional testing, it looks like this implementation is not ideal in terms of memory usage with larger matrices: the expression matrix is copied to each new process instead of being shared with the parent process. A better implementation can be found at aertslab/pySCENIC#140, which replaces this approach with a stand-alone script that imports the relevant Arboreto/pySCENIC functions. The test results are the same as described above.

cflerin added 3 commits January 16, 2020 15:19

Added option to run with multiprocessing in place of Dask

34545c2

- parameters added to grnboost2, genie3, and diy functions
- logic added to select Dask or multiprocessing in diy
- added run_arboreto_mp to setup a mp pool and calculate links for each target gene separately

Replaced as_matrix with to_numpy for pd.DataFrame conversion

ccd38f4

Cleanup of starmap arguments for multiprocessing

5b18bf1

cflerin closed this Feb 19, 2020
