within-target parallelism using doSNOW hangs unexpectedly #925
I will take a closer look when I get back in the office on July 8. But for now, would you try make(lock_envir = FALSE)? Related: any reason in particular why you are using doSNOW and parallel socket clusters? I almost always recommend furrr + future.callr for multicore parallelism. It tends to be more reliable than the base parallel package.
Rather, it tends to be more reliable... Typing on a phone.

Also related: #675.
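For reference, a minimal sketch of the furrr + future.callr combination (f and the inputs here are hypothetical stand-ins; a fuller example appears later in this thread):

library(furrr)
library(future.callr)
future::plan(callr, workers = 4L)  # each worker is a fresh callr R process, not a PSOCK socket
f <- function(x) x^2               # hypothetical stand-in for a real task
results <- future_map(1:10, f)     # parallel drop-in replacement for purrr::map()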
Hi Will. I tried make() with that call before submitting the issue, but I still got the intermittent error. Honestly, I am not sure yet why it is happening, since I was already using the doSNOW framework before without problems. Also, I am not hitting memory or CPU limits, so it does not appear to be related to the system crashing. Basically, I am using the doSNOW framework because it is the simplest way I know of to implement a progress bar.
When you supply a config argument to make(), it overrides the values of the other arguments. Would you try either just make(config = config), or no config argument at all? NB: drake_config() can take all the non-config arguments of make().
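For example (a sketch; the one-target plan is hypothetical and stands in for your real plan):

library(drake)
plan <- drake_plan(y = sqrt(2))  # hypothetical one-target plan
# Either route every setting through drake_config()...
config <- drake_config(plan, lock_envir = FALSE)
make(config = config)
# ...or skip the config object and pass the arguments to make() directly:
make(plan, lock_envir = FALSE)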
TL;DR: please load your packages outside your functions. More info: https://ropenscilabs.github.io/drake-manual/projects.html. What is happening: make() locks the global environment so the workflow cannot invalidate itself mid-build, and the library() calls inside do_task_parallel() then fail because they try to add bindings to that locked environment.
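As a minimal illustration of environment locking in base R (presumably what drake:::lock_environment() uses under the hood):

e <- new.env()
lockEnvironment(e)         # freeze the set of bindings in e
assign("x", 1, envir = e)  # any new binding now fails
#> Error in assign("x", 1, envir = e) :
#>   cannot add bindings to a locked environment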
So if you call do_task_parallel() with the packages loaded inside, it fails. Reprex:

do_task_parallel <- function(
task,
list_of_arguments,
iterator,
cpu_cores = 6,
export_packages = c("pROC", "caret", "dplyr", "xgboost", "tidyr"),
export_functions = NULL,
iteractive_arguments = NULL
) {
# List of objects must contain all datasets and all functions necessary to realize the parallel computation.
library(parallel)
library(doSNOW)
library(foreach)
# Creates progress bar
print("Creating progress bar...")
opts <- list(progress = function(n) {
setTxtProgressBar(
txtProgressBar(min = 1, max = length(iterator), style = 3),
n
)
})
# Creates cluster and defines number of threads
print("Creating cluster...")
cpu_cores <- cpu_cores
cl <- makePSOCKcluster(cpu_cores)
registerDoSNOW(cl)
# Exports additional functions
print("Exporting objects...")
clusterExport(cl, export_functions)
# Execute the operations
print("Starting parallel computations...")
object <-
foreach(
x = 1:length(iterator),
.options.snow = opts,
.packages = export_packages
) %dopar% {
do.call(task, c(list_of_arguments, lapply(iteractive_arguments, `[[`, x)))
}
# Stop cluster and free memory
stopCluster(cl)
gc()
# Return matches and metrics
return(object)
}
my_function <- function(arg1, arg2, arg3) {
matrix(runif(arg1), ncol = arg2) + arg3
}
# Basic iterator
iterator <- 1:100
# This is what drake does to avoid self-invalidating workflows:
drake:::lock_environment(globalenv())
# Computation outside drake
matrix_dbs <- do_task_parallel(
task = my_function,
list_of_arguments = list(
arg1 = 10^4,
arg2 = 500
),
iterator = iterator,
iteractive_arguments = list(
arg3 = iterator
),
cpu_cores = 4
)
#> Error: package or namespace load failed for 'parallel':
#> .onLoad failed in loadNamespace() for 'parallel', details:
#> call: sample.int(.Machine$integer.max - 1L, 1L)
#> error: cannot add bindings to a locked environment

Created on 2019-07-02 by the reprex package (v0.3.0)

With packages loaded outside the function, the same code works:

# Packages should go outside the function.
library(parallel)
library(doSNOW)
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: snow
#>
#> Attaching package: 'snow'
#> The following objects are masked from 'package:parallel':
#>
#> clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#> clusterExport, clusterMap, clusterSplit, makeCluster,
#> parApply, parCapply, parLapply, parRapply, parSapply,
#> splitIndices, stopCluster
library(foreach)
do_task_parallel <- function(
task,
list_of_arguments,
iterator,
cpu_cores = 6,
export_packages = c("pROC", "caret", "dplyr", "xgboost", "tidyr"),
export_functions = NULL,
iteractive_arguments = NULL
) {
# Disabling progress bar for the reprex
opts <- list()
# print("Creating progress bar...")
# opts <- list(progress = function(n) {
# setTxtProgressBar(
# txtProgressBar(min = 1, max = length(iterator), style = 3),
# n
# )
# })
# Creates cluster and defines number of threads
print("Creating cluster...")
cpu_cores <- cpu_cores
cl <- makePSOCKcluster(cpu_cores)
registerDoSNOW(cl)
# Exports additional functions
print("Exporting objects...")
clusterExport(cl, export_functions)
# Execute the operations
print("Starting parallel computations...")
object <-
foreach(
x = 1:length(iterator),
.options.snow = opts,
.packages = export_packages
) %dopar% {
do.call(task, c(list_of_arguments, lapply(iteractive_arguments, `[[`, x)))
}
# Stop cluster and free memory
stopCluster(cl)
gc()
# Return matches and metrics
return(object)
}
my_function <- function(arg1, arg2, arg3) {
matrix(runif(arg1), ncol = arg2) + arg3
}
# Basic iterator
iterator <- 1:100
# This is what drake does to avoid self-invalidating workflows:
drake:::lock_environment(globalenv())
# Computation outside drake
length(
do_task_parallel(
task = my_function,
list_of_arguments = list(
arg1 = 10^4,
arg2 = 500
),
iterator = iterator,
iteractive_arguments = list(
arg3 = iterator
),
cpu_cores = 4
)
)
#> [1] "Creating cluster..."
#> [1] "Exporting objects..."
#> [1] "Starting parallel computations..."
#> [1] 100

Created on 2019-07-02 by the reprex package (v0.3.0)
So I recommend:

1. Loading your packages outside your functions, as in the second reprex above.
2. Calling either just make(config = config) or make() with no config argument at all.
3. Trying make(lock_envir = FALSE).
4. Trying furrr + future.callr instead of doSNOW.
If all that fails, let me know and we can continue troubleshooting.
Hopefully #927 makes (1) clearer. Also, when you post code, would you run it through a code styler first?
Great. I was not aware of that. Thank you for letting me know.
This is interesting, but as I said before, I don't receive an error. Instead, the computation just hangs unfinished. It doesn't return an error or anything. It just stops at a given percentage of completion and that is it. Later, I will try to investigate this further and see if I can come back with a better reprex.
Sure, I can do that. Overall I follow the tidyverse/Google style of code. I just put the commas first because it makes it easier to change the order of the arguments in functions while also making them harder to forget. I also prefer to align my arguments with the beginning of the call; I find it easier to see where each function begins and ends.
What else could make doSNOW hang? Maybe another possibility is that you ran out of memory at some point. Possibly relevant: https://ropenscilabs.github.io/drake-manual/memory.html.
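For example, the garbage_collection argument of make() (which the test cases later in this thread also use) runs gc() between targets to release memory sooner; plan here is assumed to be your real plan:

library(drake)
make(plan, garbage_collection = TRUE)  # run gc() after each target builds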
This I don't know and will have to investigate, because it never happened to me before using drake.
That was not the case. I had at least 10 GB of free RAM while running the plan. I know that because I monitor memory and CPU load when executing heavy calculations (conky ftw).
By the way, furrr has its own progress bar:

library(furrr)
#> Loading required package: future
library(future.callr)
future::plan(callr, workers = 2L)
x <- rep(0.1, 100)
f <- function(x) {
Sys.sleep(x)
Sys.getpid()
}
pids <- future_map_int(x, f, .progress = TRUE)
#>
Progress: ───────────────────────────────────────────────── 100%
table(pids)
#> pids
#> 13760 13767
#> 50 50
Hi Will, thanks for the suggestions. I am back to running my code at the stage it was before #926, and now the error is happening consistently in my routine. Do you know any tool that I can use to debug inside drake or inside a parallel loop to see what is happening? Something that lets me track the communication between the master and the slaves. Also, what do you specifically not like about the doSNOW framework?
Sorry, I do not know of a debugger that works inside parallel/multicore tasks. Otherwise, browser() or drake_debug() might help. doSNOW relies on parallel socket (PSOCK) clusters, which have known non-drake glitches and limitations, as well as clashes with environment locking in drake. What happens if you use furrr + future.callr instead of doSNOW? What happens if you set lock_envir = FALSE? Environment locking can cause counterintuitive hidden problems unrelated to library(parallel). See #929.
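For example, a sketch of those helpers, plus diagnose() from later in this thread (assuming a plan with a failing target named temp; see ?drake_debug for the exact signature):

library(drake)
config <- drake_config(plan)        # plan is assumed to be your real plan
drake_debug(temp, config = config)  # re-run temp's command with debugging enabled
diagnose(temp)                      # after a failed make(), retrieve the stored error object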
I finally could get some response from my session. When I changed the cluster call to

cl <- makePSOCKcluster(cpu_cores, timeout = 180, outfile = "")

(in other words, making the sockets print to the master's screen), I got:

Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted

What is strange is that this only happens when calling my function inside drake.
Changing lock_envir to FALSE did not solve it. I am now investigating this error response on Stack Overflow.

I will try this later.
Thanks for trying, this is helpful. Unfortunately, I do not have immediate answers on the PSOCK/serialization issues. Odd that it only clashes with drake. Are you using other forms of HPC? Do you set the parallelism or jobs_preprocess arguments of make() or drake_config()? Would you post a link to your Stack Overflow question? I would like to follow it.

Also, would you post a sessionInfo()?
Hi Will. So, I tried two other parallel backends.

Using doFuture with a multiprocess plan:

registerDoFuture()
cl <- makeCluster(cpu_cores, timeout = timeout, outfile = here("code/log.txt"))
future::plan(multiprocess, workers = cpu_cores)

returned the following error:

Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize

Using doFuture with a callr plan:

registerDoFuture()
cl <- makeCluster(cpu_cores, timeout = timeout, outfile = here("code/log.txt"))
future::plan(callr, workers = cpu_cores)

returned the following error, after updating some targets:

callr failed, could not start R, exited with non-zero status, has crashed or was killed

The OS or something else appears to be killing the slaves, but I am not sure why.
I did not set those.
The results from sessionInfo():

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=pt_BR.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=pt_BR.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=pt_BR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] doSNOW_1.0.16 snow_0.4-3 tidyselect_0.2.5 doFuture_0.8.0 iterators_1.0.10 foreach_1.4.4
[7] globals_0.12.4 drake_7.3.0 visNetwork_2.0.7 fst_0.9.0 feather_0.3.3 mlflow_1.0.0
[13] here_0.1 askpass_1.1 RPostgreSQL_0.6-2 DBI_1.0.0 tidyr_0.8.3 readr_1.3.1
[19] lubridate_1.7.4 stringr_1.4.0 stringi_1.4.3 dplyr_0.8.1 data.table_1.12.2 reshape2_1.4.3
[25] future.callr_0.4.0 future_1.14.0
loaded via a namespace (and not attached):
[1] nlme_3.1-139 httr_1.4.0 rprojroot_1.3-2 tools_3.6.0 backports_1.1.4 utf8_1.1.4
[7] R6_2.4.0 rpart_4.1-15 lazyeval_0.2.2 colorspace_1.4-1 nnet_7.3-12 withr_2.1.2
[13] processx_3.3.1 compiler_3.6.0 cli_1.1.0 swagger_3.9.2 forge_0.2.0 scales_1.0.0
[19] callr_3.2.0 digest_0.6.19 ini_0.3.1 base64enc_0.1-3 pkgconfig_2.0.2 htmltools_0.3.6
[25] htmlwidgets_1.3 rlang_0.3.4 rstudioapi_0.10 generics_0.0.2 jsonlite_1.6 ModelMetrics_1.2.2
[31] magrittr_1.5 Matrix_1.2-17 Rcpp_1.0.1 munsell_0.5.0 fansi_0.4.0 reticulate_1.12
[37] pROC_1.14.0 yaml_2.2.0 MASS_7.3-51.1 storr_1.2.1 plyr_1.8.4 recipes_0.1.5
[43] grid_3.6.0 listenv_0.7.0 promises_1.0.1 crayon_1.3.4 lattice_0.20-38 splines_3.6.0
[49] hms_0.4.2 zeallot_0.1.0 ps_1.3.0 pillar_1.4.1 igraph_1.2.4.1 base64url_1.4
[55] xgboost_0.82.1 stats4_3.6.0 codetools_0.2-16 glue_1.3.1 vctrs_0.1.0 httpuv_1.5.1
[61] gtable_0.3.0 openssl_1.3 purrr_0.3.2 assertthat_0.2.1 ggplot2_3.1.1 gower_0.2.0
[67] prodlim_2018.04.18 later_0.8.0 class_7.3-15 survival_2.43-3 timeDate_3043.102 tibble_2.1.2
[73] lava_1.6.5 caret_6.0-84 ipred_0.9-9

Both frameworks worked fine outside drake.

I have not yet posted a question on Stack Overflow, but I gave it an extensive search. I found this answer from Steve Weston:
But honestly, I don't know what may be making the OS kill the processes started by callr.
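One quick way to check whether the kernel (e.g., the out-of-memory killer) is terminating the worker processes, assuming a Linux box where the kernel log and syslog are readable:

system("dmesg | grep -i -E 'killed process|out of memory' | tail")  # kernel ring buffer
system("grep -i 'killed process' /var/log/syslog | tail")           # Ubuntu syslog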
Thanks, this is helpful. Bizarre, but helpful. Would you post the version of the code you tried with future.callr? I will soon regain access to my Ubuntu 18.04 desktop, and I intend to try it out there.
Test cases for doSNOW:

library(doSNOW)
library(drake)
library(foreach)
library(parallel)
do_task_parallel <- function(
task,
list_of_arguments,
iterator,
cpu_cores,
export_packages = c("pROC", "caret", "dplyr", "xgboost", "tidyr"),
export_functions = NULL,
iteractive_arguments = NULL
) {
opts <- list(progress = function(n) {
setTxtProgressBar(
txtProgressBar(min = 1, max = length(iterator), style = 3),
n
)
})
cpu_cores <- cpu_cores
cl <- makePSOCKcluster(cpu_cores)
on.exit(stopCluster(cl))
registerDoSNOW(cl)
clusterExport(cl, export_functions)
object <- foreach(
x = 1:length(iterator),
.options.snow = opts,
.packages = export_packages
) %dopar% {
do.call(task, c(list_of_arguments, lapply(iteractive_arguments, `[[`, x)))
}
gc()
return(object)
}
my_function <- function(arg1, arg2, arg3) {
my_matrix <- matrix(runif(arg1), ncol = arg2) + arg3
return(my_matrix)
}
plan <- drake_plan(
temp = do_task_parallel(
task = my_function,
list_of_arguments = list(
arg1 = 10L ^ 8L,
arg2 = 500L
),
iterator = seq_len(100L),
iteractive_arguments = list(
arg3 = seq_len(100L)
),
cpu_cores = 4L
)
)
make(plan, console_log_file = "dosnow.log", garbage_collection = TRUE) and library(drake)
library(furrr)
library(future.callr)
future::plan(callr, workers = 4L)
my_function <- function(arg1, arg2, arg3) {
on.exit(gc())
my_matrix <- matrix(runif(arg1), ncol = arg2) + arg3
return(my_matrix)
}
plan <- drake_plan(
temp = future_map(
.x = seq_len(100L),
.f = my_function,
arg1 = 10L ^ 8L,
arg2 = 500L,
.progress = TRUE
)
)
make(plan, console_log_file = "callr.log", garbage_collection = TRUE) I tried running both on a Macbook with 16 GB RAM. For target temp
|============== | 20%
fail temp
Error: Target `temp` failed. Call `diagnose(temp)` for details. Error message:
vector memory exhausted (limit reached?)

and for future.callr:

Progress: ─────────────────────────────────────────────────── 100%
fail temp
Error: Target `temp` failed. Call `diagnose(temp)` for details. Error message:
vector memory exhausted (limit reached?)
Execution halted

I am not sure this is what you are experiencing, but it is one (remote) possibility.
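For a sense of scale: each call to my_function() in these test cases allocates a matrix of 10^8 doubles, so the arithmetic alone explains the exhaustion:

10^8 * 8 / 1e9        # ~0.8 GB per iteration
100 * 10^8 * 8 / 1e9  # ~80 GB if all 100 results are held in memory at once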
Hi Will, I read my system's log file today. It appears that the following is happening: since drake runs non-stop from one calculation to the next, and my code involves some seriously heavy parallel calculations, the CPU is overheating, which makes the OS throttle it. After that, there appears to be some conflict with the power management of Plasma (I use Ubuntu + Plasma), which apparently tries to use the same ports that the parallel packages are using. I got the following messages:
"Jul 6 18:06:51 me kernel: [22878.142071] CPU0: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142073] CPU2: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142074] CPU6: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142075] CPU7: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142077] CPU3: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142078] CPU5: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142079] CPU1: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.146032] CPU0: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146033] CPU7: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146033] CPU4: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146034] CPU1: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146035] CPU5: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146036] CPU3: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146037] CPU6: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146037] CPU2: Package temperature/speed normal"
"Jul 6 18:07:39 me dbus-daemon[1091]: [system] Activating service name='org.kde.powerdevil.backlighthelper' requested by ':1.60' (uid=1000 pid=2994 comm=\"/usr/lib/x86_64-linux-gnu/libexec/org_kde_powerdev\" label=\"unconfined\") (using servicehelper)"
"Jul 6 18:07:39 me org.kde.powerdevil.backlighthelper: QDBusArgument: read from a write-only object"
"Jul 6 18:07:39 me org.kde.powerdevil.backlighthelper: message repeated 2 times: [ QDBusArgument: read from a write-only object]" I will try to disable the powerdevil to see if the clashes stop happening. |
Wow, I have never seen that one before. Are you overclocking? What are the hardware specs? When you run the R process that calls drake, would nice -19 help? I am closing the issue because the difficulties look specific to your rig. Still interested to see how you address this on your end. I will continue to work on improving drake's performance.
Your computer may actually be on fire, but since you use

Not that this necessarily addresses the overheating, but
😅 Hahaha. So, just to give you some feedback: I disabled powerdevil, but that still did not solve the problem. So, it was probably some specific thing about my setup that was breaking the code. I am sorry to have bothered you with this, and thanks again for all the help!
Drive-by comment: #925 (comment) happens often with regular use; I would not worry about it (I think it's Intel Turbo Boost starting and getting stopped again). Also note that the CPUs are throttled for a total of 4 ms, so not really throttled at all.
Prework

I have read and agree to drake's code of conduct.

Description
When executing a drake plan using make() with a within-target parallel function, the R session seems to hang unexpectedly.
The error occurs intermittently, so it is difficult to describe the problem. Usually the master R worker just hangs for a long period before starting the computations (see below, just after print("Starting parallel computations...")), or during the parallel backend at a given rate of completion (e.g., it always removes all load from the slaves after completing 60% of the tasks, with the master hanging for a long while using one thread at full computational power).

This kind of behaviour does not occur when I try to build the targets outside drake (but it does occur when I try to build them individually using drake_build(target, config)). As the process hangs without returning an error, it is hard to say what is going wrong. Also, the stop flag that RStudio usually shows on the right while executing code disappears completely. I only know that the code is still running by looking at the CPU loads using software such as conky.

Also, this kind of behaviour only occurs with functions that are CPU / memory intensive.
I'm using a wrapper to create embarrassingly parallel execution of code that must be repeated for several datasets/classes in my problem.
Generic execution function
Example of use
This example does not reproduce the problem consistently. Sometimes I have to run it twice in a row before it hangs in the cluster-creation step. In my application, it usually hangs in the foreach loop.
Expected output
Drake would consistently return the parallel computation output.