Ability to run Arboreto using a multiprocessing pool in place of Dask #21
The ability to run Arboreto across multiple nodes in Dask is extremely powerful, but the implementation has caused lots of issues for me (and others, it seems). In a lot of cases, I had massive issues with the Dask client -- it would sometimes seem to go on computing for days, or just quit halfway through a run with a cryptic error.
In practice, I have only ever used a single node to run GRNBoost2, and it's still quite fast, even for tens to hundreds of thousands of cells. Therefore, I thought this multiprocessing implementation might be useful. I've been using it extensively, and it's quite reliable. In many cases, the compute time is actually slightly shorter when using multiprocessing (perhaps due to some Dask overhead?).
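To make the idea concrete, here is a minimal, generic sketch of the pattern this PR relies on: one task per target gene, dispatched to a `multiprocessing.Pool`, with a per-target seed so the output does not depend on the number of workers. This is not the PR's code. `score_target` and `infer_links` are hypothetical names, the expression matrix is a plain list of rows, and the scoring is a toy covariance measure standing in for the GRNBoost2/GENIE3 regressors.

```python
import multiprocessing as mp
import random


def score_target(task):
    """Toy stand-in for per-target inference (NOT GRNBoost2/GENIE3)."""
    expression, target_idx, seed = task
    # Seed from (seed, target) so the result does not depend on the worker.
    rng = random.Random(seed * 1_000_003 + target_idx)
    n_cells = len(expression)
    n_genes = len(expression[0])
    target = [row[target_idx] for row in expression]
    target_mean = sum(target) / n_cells
    links = []
    for reg_idx in range(n_genes):
        if reg_idx == target_idx:
            continue  # skip self-links
        reg = [row[reg_idx] for row in expression]
        reg_mean = sum(reg) / n_cells
        cov = sum(
            (r - reg_mean) * (t - target_mean) for r, t in zip(reg, target)
        ) / n_cells
        # Small seeded jitter mimics a stochastic learner.
        links.append((reg_idx, target_idx, abs(cov) + rng.uniform(0.0, 1e-6)))
    return links


def infer_links(expression, n_workers=2, seed=0):
    """Compute (regulator, target, score) links, one pool task per target gene."""
    tasks = [(expression, t, seed) for t in range(len(expression[0]))]
    with mp.Pool(processes=n_workers) as pool:
        per_target = pool.map(score_target, tasks)  # map preserves task order
    return [link for links in per_target for link in links]


if __name__ == "__main__":
    data_rng = random.Random(1)
    expression = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
    # Same seed, different worker counts: the links should be identical.
    assert infer_links(expression, 1, seed=42) == infer_links(expression, 3, seed=42)
    print(len(infer_links(expression, 2, seed=42)), "links")
```

Because each task carries its own seed and `Pool.map` returns results in task order, changing `n_workers` only changes how the work is scheduled, not what is computed.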
Summary of changes:
- Setting the `client_or_address` parameter to `'multiprocessing'` in either of the `grnboost2` or `genie3` functions will run these algorithms using a multiprocessing pool. The number of workers is specified with the `multiprocessing_workers` parameter.
- Added `run_arboreto_mp` to do the work of setting up a multiprocessing pool and calculating links for each target gene separately.
- Replaced `as_matrix` with `to_numpy` (minor fix).

As a check, the multiprocessing implementation produces the same results as when using Dask, using a fixed seed: