A quicker end to staged parallelism #369
I have begun work on this in the …

Super excited about #370! Most of the work and most of the gains!

The next step now is to make the … Two things I learned so far in the …

A very important piece I forgot: we should also use a form of … On … I should implement this before I debug the …
I'm now seeing some overhead for targets that are cheap to compute but require large-ish input data:

library(tidyverse)
library(drake)

N <- 500

gen_data <- function() {
  tibble(a = seq_len(N), b = 1, c = 2, d = 3)
}

plan_data <- drake_plan(
  data = gen_data()
)

plan_sub <-
  gen_data() %>%
  transmute(
    target = paste0("data", a),
    command = paste0("data[", a, ", ]")
  )

plan <-
  bind_rows(plan_data, plan_sub)

system.time(drake::make(plan, jobs = 8))

CRAN version with GitHub revision 08c67c8: 68 seconds user time, 30 seconds elapsed. Is this …
Your reprex is a perfect demonstration of what we're potentially sacrificing in #369. And yes, I did expect to have more overhead, especially at first. Staged parallelism excels when we have a large number of conditionally independent targets (i.e. tall dependency graphs). In the CRAN version of … (see lines 55 to 75 at revision 7fe45f0):
If there is a way for the master to jump straight to the idle workers, I suspect/hope the master will be faster. Part of the trouble is that I am using …
In the code chunk above, I don't understand what happens if … We're seeing a lot of user time spent for the GitHub version; this can't be just because the master is busy. Are the workers using busy waiting? Profiling the master loop shows that most of the time is spent in … Can you implement an efficient "enumerate all idle workers" with storr? Perhaps if you had a storr namespace for each state, each worker could register/unregister itself from that namespace when its state changes?
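A minimal sketch of that namespace-per-state idea, assuming a file-based storr cache shared by the master and the workers; the namespace names and worker ID below are made up for illustration.

library(storr)

cache <- storr_rds(tempfile())  # stand-in for a cache both processes can reach

# Worker side: on a state change, move your own key from the old state's
# namespace to the new state's namespace.
set_worker_state <- function(cache, worker_id, from, to) {
  cache$del(worker_id, namespace = from)
  cache$set(worker_id, Sys.time(), namespace = to)
}

# Master side: enumerating idle workers becomes one list() call on a single
# namespace instead of a scan over every worker's state key.
idle_workers <- function(cache) {
  cache$list(namespace = "workers_idle")
}

set_worker_state(cache, "worker_1", from = "workers_busy", to = "workers_idle")
idle_workers(cache)  # "worker_1"

With storr_rds(), each state change costs a couple of filesystem operations, so whether this beats the current polling would need profiling.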
That means all targets have unmet dependencies.
I suspect many of them are. I am only just wrapping up the initial implementation, so I have not spent a lot of time profiling.
Thanks so much! That really narrows down the focus.
Sounds easy to implement. By the way, I am about to add additional …

For anyone else looking on, I have added back staged parallelism in some additional backends: … All the code for staged parallelism is in …

Update: we have a new timing vignette. See the strategy section to see how …

We now have a new and improved parallelism guide, much clearer and up to date. This issue was quite the exciting sprint. I am really looking forward to testing and iterating on the newly reworked backends. Please help spread the word.

To recap: …

FYI: @krlmlr, coordination among persistent workers seems to be much faster now after the work I did this weekend. Scheduling is now based on …

Things just got even faster in fea4acb. We'll let it play out, but there is a good chance we'll find we won't need staged parallelism anymore.
system.time(drake::make(plan, parallelism = "mclapply_staged", jobs = 8))
#> target data
#>    user  system elapsed
#>   2.638   0.133   4.490

options(clustermq.scheduler = "multicore")
system.time(drake::make(plan, parallelism = "clustermq", jobs = 8, verbose = FALSE))
#> Submitting 8 worker jobs (ID: 6150) ...
#> Master: [11.3s 97.5% CPU]; Worker: [avg 1.8% CPU, max 311.3 Mb]
#>    user  system elapsed
#>  13.854   0.523  14.698

options(clustermq.scheduler = "multicore")
system.time(drake::make(plan, parallelism = "clustermq_staged", jobs = 8, verbose = FALSE))
#> Submitting 8 worker jobs (ID: 6752) ...
#> Running 1 calculations (1 calls/chunk) ...
#> Running 500 calculations (1 calls/chunk) ...
#> Master: [4.7s 86.1% CPU]; Worker: [avg 4.7% CPU, max 310.8 Mb]
#>    user  system elapsed
#>   6.465   0.299   7.415

On the other hand, I am not sure how much …
The problem

With the exception of make(parallelism = "future") and make(parallelism = "Makefile"), drake has an embarrassing approach to parallelism, both literally and figuratively. It divides the dependency graph into conditionally independent stages and then runs those stages sequentially. Targets only run in parallel within those stages. In other words, drake waits for the slowest target to finish before moving to the next stage.

At the time, I could not find an R package to traverse an arbitrary network of jobs in parallel. (Stop me right here if such a tool actually exists!) Parallel computing tools like foreach, parLapply, mclapply, and future_lapply all assume the workers can act independently. So I thought staged parallelism was natural. However, I was thinking within the constraints of the available technology, and parallel efficiency suffered.
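To make the bottleneck concrete, here is a toy sketch of staged scheduling, not drake's actual internals: targets are grouped by longest-path depth in the dependency graph, each group runs through parallel::mclapply(), and a whole group must finish before the next one starts. The graph and build_target() are invented for illustration.

library(igraph)
library(parallel)

# Toy dependency graph: d needs b and c, which both need a.
g <- graph_from_literal(a -+ b, a -+ c, b -+ d, c -+ d)

build_target <- function(name) {  # stand-in for running a target's command
  Sys.sleep(runif(1))             # one slow target stalls its whole stage
  name
}

# Stage = longest-path depth from the roots, so the members of a stage are
# conditionally independent of one another.
depth <- setNames(rep(1L, vcount(g)), names(V(g)))
for (v in names(topo_sort(g))) {
  parents <- names(neighbors(g, v, mode = "in"))
  if (length(parents) > 0) depth[v] <- max(depth[parents]) + 1L
}
stages <- split(names(depth), depth)  # "a", then c("b", "c"), then "d"

# Stages run sequentially; only targets within a stage run in parallel.
for (stage in stages) {
  mclapply(stage, build_target, mc.cores = 2)
}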
Long-term work

In late February, I started the workers package. If you give it a job network and a parallel computing technology, the goal is to traverse the graph using a state-of-the-art algorithm. It has a proof-of-concept implementation and an initial design specification, but it has a long way to go. At this stage, it really needs an R-focused message queue. I was hoping liteq would work, but due to an apparent race condition and some speed/overhead issues on my end, I am stuck for the time being.
Now

I say we rework the make(parallelism = "mclapply") and make(parallelism = "parLapply") backends natively. As I see it, make(parallelism = "mclapply", jobs = 2) should fork 3 processes: 2 workers and 1 master. The master sends jobs to the workers in the correct order at the correct time, and the workers keep coming back for more without stopping. The goal is to minimize idleness and overhead.

The implementation will be similar to make(parallelism = "future", caching = "worker") (see future.R). I think the implementation should go into drake itself because workers is not mature enough, at least until I get the chance to do an extensive lit review of parallel graph algorithms.
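For illustration, a rough sketch of that master loop under the assumptions above, not drake's code: the helpers next_idle_worker(), send_target(), and collect_result() are hypothetical placeholders for whatever the real inter-process plumbing turns out to be.

# deps: named list mapping each target to the names of its dependencies.
master_loop <- function(deps, next_idle_worker, send_target, collect_result) {
  done    <- character(0)
  running <- character(0)
  while (length(done) < length(deps)) {
    # A target is ready when all of its dependencies are already built.
    ready <- setdiff(
      names(deps)[vapply(deps, function(d) all(d %in% done), logical(1))],
      c(done, running)
    )
    # Hand out ready targets immediately; no waiting for a whole stage.
    while (length(ready) > 0) {
      worker <- next_idle_worker()  # NULL when every worker is busy
      if (is.null(worker)) break
      send_target(worker, ready[[1]])
      running <- c(running, ready[[1]])
      ready   <- ready[-1]
    }
    finished <- collect_result()    # block until some worker reports back
    running  <- setdiff(running, finished)
    done     <- c(done, finished)
  }
  done
}

With jobs = 2, the master would fork the two workers up front (e.g. with parallel::mcparallel()) and then run a loop like this itself.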
To-do's:

- run_lapply() (Just loop through a toposort of the targets; see the sketch after this list.)
- run_mclapply()
- run_parLapply()
- run_future_lapply() (We still might want persistent workers on a cluster.)
- Use on.exit(set_done(...)) to handle failures on both the master process and the workers.
- Deprecate parallel_stages(), max_useful_jobs(), and rate_limiting_times().
- Update predict_runtime().
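A minimal sketch of the run_lapply() item above, assuming an igraph dependency graph whose vertex names are targets and a hypothetical build_target() function:

library(igraph)

run_lapply <- function(graph, build_target) {
  # A topological sort guarantees each target's dependencies are built first.
  lapply(names(topo_sort(graph, mode = "out")), build_target)
}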