Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to run Arboreto using a multiprocessing pool in place of Dask #21

Closed
wants to merge 3 commits into from

Conversation

cflerin
Copy link
Contributor

@cflerin cflerin commented Jan 16, 2020

The ability to run Arboreto across multiple nodes in Dask is extremely powerful, but the implementation has caused lots of issues for me (and others, it seems). In a lot of cases, I had massive issues with the Dask client -- it would sometimes seem to go on computing for days, or just quit halfway through a run with a cryptic error.

In practice, I have only ever used a single node to run GRNBoost2, and it's still quite fast, even for 10s to 100s of thousands of cells. Therefore, I thought this multiprocessing implementation might be useful. I've been using it extensively, and it's quite reliable. In many cases, the compute time is actually slightly shorter when using multiprocessing (perhaps due to some Dask overhead?).

Summary of changes:

  • Setting the client_or_address parameter to 'multiprocessing' in either of the grnboost2 or genie3 functions will run these algorithms using a multiprocessing pool. The number of workers is specified with the multiprocessing_workers parameter.
    • parameters added to grnboost2, genie3, and diy functions
    • logic added to select Dask or multiprocessing in diy function
    • added run_arboreto_mp to do the work of setting up a multiprocessing pool and calculate links for each target gene separately
  • Replaced pd DataFrame as_matrix with to_numpy (minor fix)

As a check, the multiprocessing implementation produces the same results as when using Dask, using a fixed seed:

# test data:
wget https://raw.githubusercontent.com/aertslab/SCENICprotocol/master/example/allTFs_hg38.txt
wget https://raw.githubusercontent.com/aertslab/SCENICprotocol/master/example/expr_mat.loom

pip install --force-reinstall git+https://github.com/cflerin/arboreto@multiprocessing
# python:
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2
import loompy as lp
import pandas as pd

lf = lp.connect('expr_mat.loom', mode='r', validate=False )
ex_matrix = pd.DataFrame( lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID ).T
lf.close()

tf_names = load_tf_names('allTFs_hg38.txt')

# Dask test:
network = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names,
                    seed=777,
                    verbose=True)

>>> network.head()
        TF  target  importance
27   RPS4X   RPL30   57.537719
665   SPI1    CSTA   55.521603
27   RPS4X  EEF1A1   54.931686
27   RPS4X   RPS14   53.646867
692  RPL35    RPL3   52.932191

# multiprocessing test:
networkMP = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names,
                    client_or_address='multiprocessing',
                    multiprocessing_workers=7,
                    seed=777,
                    verbose=True)

>>> networkMP.head()
        TF  target  importance
27   RPS4X   RPL30   57.537719
665   SPI1    CSTA   55.521603
27   RPS4X  EEF1A1   54.931686
27   RPS4X   RPS14   53.646867
692  RPL35    RPL3   52.932191

- parameters added to grnboost2, genie3, and diy functions
- logic added to select Dask or multiprocessing in diy
- added run_arboreto_mp to setup a mp pool and calculate links for each
target gene separately
@cflerin
Copy link
Contributor Author

cflerin commented Feb 19, 2020

After some additional testing, it looks like this implementation is not ideal in terms of memory usage with larger matrices (it causes the expression matrix to be copied to each new process instead of using shared memory with the parent process). A better implementation can be found at aertslab/pySCENIC#140, and replaces this with a stand-alone script which imports the relevant Arboreto/pySCENIC functions. The test results are the same as described above.

@cflerin cflerin closed this Feb 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant