Best practice for using max 2 cores on CRAN? #5658
I don't think it is that recent. Here is what the package does now: in essence, in every example file or unit test file (i.e. files that CRAN runs) I explicitly call a function that throttles cores down to 2.

So this gets us the max of 2 in everything that CRAN runs, yet it can be overridden (if, say, your CI has more cores). Importantly, it leaves general startup alone, so normal users are not affected and get full speed, while CRAN gets its limit. I can detail that some more if you care; it is in the package sources.
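In code form, the pattern described above might look like the sketch below. This is illustrative only: `throttle_cores()` is a hypothetical name, not the actual function in Dirk's package, and the override mechanism is an assumption based on his description.

```r
# Illustrative sketch: cap data.table at 2 threads for CRAN runs,
# while allowing an explicit override (e.g. from CI) via OMP_NUM_THREADS.
throttle_cores <- function(default = 2L) {
  n <- default
  omp <- Sys.getenv("OMP_NUM_THREADS", unset = "")
  if (nzchar(omp)) n <- as.integer(omp)   # honour an explicit override
  data.table::setDTthreads(n)
  invisible(n)
}
```

Called at the top of every example and test file, this throttles only what CRAN runs and leaves interactive sessions untouched.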
Also, reading the DT NEWS file can point to the commits that set that up for DT.
@eddelbuettel thanks for the quick feedback; that approach seems similar to what I did, but more flexible. You wrote that CRAN sets option Ncpu -- do you know where that is documented? On https://cran.r-project.org/web/packages/policies.html I see "If running a package uses multiple threads/cores it must never use more than two simultaneously: the check farm is a shared resource and will typically be running many checks simultaneously." but no mention of option Ncpu. Also, I was thinking that if CRAN indeed sets option Ncpu, then the data.table default number of threads should respect that (this avoids having to change example/test code in 1000+ packages with a hard dependency on data.table). @jangorecki I checked https://github.com/Rdatatable/data.table/blob/master/NEWS.md but I did not see any mention of commits; can you please clarify?
An alternative to option Ncpu would be to just detect whether we are on CRAN, for example using the code below, and then throttle to two cores for the data.table default.

```r
> testthat:::on_cran
function ()
{
    !env_var_is_true("NOT_CRAN")
}
```

I do not see any mention of the NOT_CRAN env var in the CRAN repository policy either; do you know where that is documented?
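For illustration, a throttle conditional on that convention might look like this. It is a sketch: it assumes the testthat/devtools convention of setting NOT_CRAN to "true" in local and CI runs, so its absence is treated as "on CRAN".

```r
# Sketch: assume we are on CRAN unless NOT_CRAN is set to "true"
# (the convention testthat and devtools use), then throttle data.table.
if (!identical(Sys.getenv("NOT_CRAN"), "true")) {
  data.table::setDTthreads(2L)
}
```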
Relying on the absence of NOT_CRAN seems fragile. I do not recall where they document it, but it is documented somewhere that they allow two threads, and AFAICR the Ncpu option reflects that.
There is a related R-devel thread, https://stat.ethz.ch/pipermail/r-devel/2021-November/081289.html, that explains the env var in question.
`?tools::check_packages_in_dir` says the `Ncpus` option is used to specify how many packages are checked in parallel.
But note that what you quoted does not limit it to package checks, as your statement implies, but to general 'checking'. Which is where we started: how to play nice at CRAN and not exceed limits.
This check is implemented via this code: https://github.com/wch/r-source/blob/1c0545ba7c6c07e8c358eda552b875b1e4d6826d/src/library/tools/R/check.R#L4123, which gets the 2.5 ratio from an environment variable.
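Conceptually, the check flags an example whenever its total CPU time exceeds the threshold multiple of its elapsed (wall-clock) time. A rough sketch of the logic, not the actual tools internals:

```r
# Rough sketch of the timing check, not the actual check.R code.
# An example is flagged when total CPU time > 2.5x elapsed time,
# which indicates more than ~2 cores were busy on average.
flagged <- function(user, system, elapsed, threshold = 2.5) {
  (user + system) / elapsed > threshold
}
flagged(1.579, 0.11, 0.273)  # the textrecipes case below: TRUE (ratio ~6.2)
```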
Also, from the data.table sources: line 95 at commit 8803918.

Recall that my approach was about taking the lower limit from Ncpu and OMP_NUM_THREADS (and I no longer recall why I do not / did not also set OMP_THREAD_LIMIT). Also: data.table/man/openmp-utils.Rd, line 27 at commit 8803918.
I also ran into this on a submission this week, and so did others recently (see the r-devel thread where Dirk gives similar guidance). I was confused, as I assumed this would always be handled on the data.table side. I wonder if the CRAN test machine is no longer setting the OMP_THREAD_LIMIT environment variable.
I haven't found anything useful in the NEWS file either.
Pondering this a little more, I think this may not really belong here. What may help @tdhock, @TimTaylor, and likely others is a new helper function to throttle cores when running tests or examples in their packages. And those packages may well use data.table.
#3300 quotes an email from Brian Ripley saying that CRAN sets OMP_THREAD_LIMIT=2, but that was back in 2019, so I suspect that @TimTaylor is right that CRAN is no longer doing that, but they are clearly setting the environment variable that enforces the 2.5 CPU-to-elapsed ratio.
#3300 (comment) is an issue originally opened in 2019, with a new comment in July 2023 quoting Uwe Ligges saying that CRAN no longer sets OMP_THREAD_LIMIT.
Your call between a second-hand quote (!!) in a 2019 ticket (!!) on one hand and the CRAN Repository Policy on the other. The latter still says:

> If running a package uses multiple threads/cores it must never use more than two simultaneously: the check farm is a shared resource and will typically be running many checks simultaneously.
Is there something against placing the following in data.table's startup code?

```r
if (any(grepl("_R_CHECK", names(Sys.getenv()), fixed = TRUE))) {
  setDTthreads(2)
}
```

It would save all dependent package maintainers from having to do that down the road. But maybe I'm missing something!
The impression I get is that CRAN is nudging packages to default to being single (or very low) threaded unless the package user explicitly says otherwise. There are related discussions as to whether users are even aware of the default behaviour in #5620 (comment). EDIT: adding a link to the most recent R-package-devel thread, with associated comments from Uwe, Dirk, et al.: https://stat.ethz.ch/pipermail/r-package-devel/2023q3/009454.html
@lrberge That is pretty clever, but it would bite folks like me who have values in a dotfile for such variables:

```r
> grep("_R_", names(Sys.getenv()), value = TRUE)
[1] "_R_CHECK_COMPILATION_FLAGS_KNOWN_" "_R_CHECK_TESTS_NLINES_"
```

It would be better if we could nudge Uwe towards setting the OMP variable on his machine! He is the one who wants the lower core and process count there!
@eddelbuettel how come you have "_R_CHECK" env vars lying around?! :-)

@TimTaylor: the current default for many packages with multithreading is usually to max out resources (or close to it), which in many cases isn't the best.

```r
# A)
library(data.table)
setDTthreads(percent = 0.5)

# B)
library(data.table)
```

Running A) instead of B) is really a pain and nobody will do it: it follows that R functions with multithreading disabled will look slow on large tasks compared to other, non-R software. So in my view: a) using all available (or many) threads as a default isn't great; b) using single-threaded mode as the default and nudging users to change it has big caveats and isn't really user-friendly. So maybe the way is c)? c) "smart" thread setting, with the number of threads decided using the parameters of the task at hand (a bit like the …).
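As an illustration of option c), the thread count could be derived from the size of the task. A hypothetical sketch: `smart_threads()` and the rows-per-thread heuristic are invented for illustration, not an existing data.table mechanism.

```r
# Hypothetical sketch of "smart" thread selection: use more threads only
# when the task is large enough to amortise the parallel overhead.
smart_threads <- function(nrows, rows_per_thread = 1e6) {
  wanted <- ceiling(nrows / rows_per_thread)
  max(1L, min(data.table::getDTthreads(), as.integer(wanted)))
}
smart_threads(1e5)   # small task: 1 thread
smart_threads(1e8)   # large task: capped at the current data.table limit
```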
A) is already the default; there is no need to specify that value.
With my "work" hat on: I disagree. We build software for 'big enough' problems, and e.g. our underlying library maxes out by default. I strongly believe that is the right thing. (And in our package I put a throttler in, as described above.) So I actually hold both views: I like what data.table does by default. Back to: how do we get Uwe to change his mind? 😁
The most critical OpenMP pieces of DT do not speed up linearly with the number of cores. We observed that the difference between 50% and 100% was not very big, and decided to stay with the 50% default to avoid problems on shared machines, which have been reported a couple of times already. There are other pieces of DT which scale more linearly, like froll, and for those 100% would be a better default.
In addition to what @jangorecki says: it's best to minimise changes to global state from the examples. Start your examples with …
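One plausible shape for that advice (a sketch, not a quote from the thread): it relies on setDTthreads() returning the previous setting, so an example can throttle itself and then restore the user's state.

```r
#' \dontshow{.old_threads <- data.table::setDTthreads(2)}
#' # ... example code ...
#' \dontshow{data.table::setDTthreads(.old_threads)}
```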
@jangorecki, your suggested solution does not work in my use case. With tidymodels/textrecipes#251 I still get the CRAN submission NOTE:

```
* checking examples ... [17s/12s] NOTE
Examples with CPU time > 2.5 times elapsed time
                 user system elapsed ratio
step_dummy_hash 1.579   0.11   0.273 6.187
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ... [24s/19s] OK
  Running ‘testthat.R’ [24s/18s]
```

I don't know if it is because I don't use {data.table} directly but instead use a package, {text2vec}, that uses {data.table}. Last release I ended up simply deleting the documentation for the affected functions, but that is not a long-term solution.
@EmilHvitfeldt it is hard for me to imagine why that could still cause a problem, unless there is some nested parallelism, like mclapply + data.table; in that case, the topmost parallel code should contain a call that sets any nested code to be single-threaded, not just for data.table but for any multithreaded computations. Are you able to narrow down where the parallel processing comes from?
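A sketch of what that looks like in practice: when forking at the top level, pin the nested libraries to one thread inside each worker (the worker body here is illustrative).

```r
library(parallel)
library(data.table)

# Top-level parallelism via mclapply; each forked worker disables
# data.table's own threading so the two levels do not multiply.
res <- mclapply(1:2, function(i) {
  setDTthreads(1L)            # no nested threading inside the worker
  dt <- data.table(x = 1:10)  # ... worker code using data.table ...
  dt[, sum(x)] + i
}, mc.cores = 2L)
```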
@EmilHvitfeldt - Are you seeing this on all platforms? I have not dug further, but text2vec has this in its startup code:

```r
.onLoad = function(libname, pkgname) {
  n_cores = 1L
  if (.Platform$OS.type == "unix")
    n_cores = parallel::detectCores(logical = FALSE)
  options("text2vec.mc.cores" = n_cores)

  logger = lgr::get_logger('text2vec')
  logger$set_threshold('info')
  assign('logger', logger, envir = parent.env(environment()))
}
```

As @jangorecki alludes, could there be more going on than just data.table?
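For comparison, a more conservative default in that spot could consult the same limits discussed above. A hedged sketch, not a change actually adopted by text2vec:

```r
# Sketch of a CRAN-friendlier default for the snippet above: respect the
# Ncpus option when present and guard against detectCores() returning NA.
n_cores <- 1L
if (.Platform$OS.type == "unix") {
  detected <- parallel::detectCores(logical = FALSE)
  if (is.na(detected)) detected <- 1L
  n_cores <- getOption("Ncpus", detected)
}
options("text2vec.mc.cores" = n_cores)
```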
It again demonstrates that CRAN was not really helpful here, insisting that each and every package fix this on its own.
Given that they are considering a general solution to the problem, waiting for that before rigidly enforcing the 2-core thing would have seemed sensible to me...
@cdriveraus News to me. Can you supply a reference to back up that claim? The last we all saw was Uwe Ligges telling everybody to set up 'max 2 cores' in each package needing it.
@eddelbuettel I thought it was mentioned somewhere on the pkg-devel list at one point, but I'm guessing you're more familiar with it than me, so I'm assuming I misinterpreted / imagined one of the posts I read...
One of the frustrating parts of this problem is that I'm not able to reproduce it locally. Unless broken, the {text2vec} parallelism is only used for certain functions which I'm not using (as far as I know). I am only seeing it on the Debian incoming check machine. Right now my examples have the following, to no avail:

```r
#' \dontshow{library(data.table)}
#' \dontshow{data.table::setDTthreads(2)}
#' \dontshow{Sys.setenv("OMP_THREAD_LIMIT" = 2)}
#' \dontshow{library(text2vec)}
#' \dontshow{options("text2vec.mc.cores" = 1)}
```

Also, I very much appreciate all the help that is coming in this thread!
I think there is something else going on on CRAN than just the …
Yes, the problem is not with data.table but with OpenMP (or other multithreading code). It is interesting that even in your case, where it uses a single thread by default, the problem still occurs. In that case I don't think there is any reliable solution that package maintainers can employ.
Not sure this is a proper solution, but this is how I handle the issue within the webseq package. The package is not yet on CRAN, but it does pass checks.

In each of my functions that use parallelisation, I set the default value for the number of cores from the "Ncpu" option. Within a regular R session I do not set "Ncpu", so R will not find it and will fall back to using many cores by default. However, I start each of my test files by setting "Ncpu" to 2.

Do you folks think this will work on CRAN? Thanks.
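As described, the pattern presumably looks something like the following. The function name and body are hypothetical reconstructions; webseq's actual code may differ.

```r
# Hypothetical reconstruction of the described pattern: default the core
# count to the "Ncpu" option, falling back to all detected cores.
my_parallel_fun <- function(x,
                            cores = getOption("Ncpu", parallel::detectCores())) {
  parallel::mclapply(x, function(el) el, mc.cores = cores)
}

# At the top of each test file:
options(Ncpu = 2L)
```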
DT uses setDTthreads() rather than global options. Please read the manual.
R CMD check also runs example() for all your help pages, so any examples that use parallel processing must pass mc_cores = at most 2. Do not set options(Ncpu=2L) in examples, because the user may have set it to a different value they are more comfortable with, and running such an example would change the user's observable global state.

What will your code do if parallel::detectCores() returns NA?
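On the last point, a defensive guard is straightforward; a sketch using only base R:

```r
# parallel::detectCores() is documented to possibly return NA;
# fall back to a single core in that case.
n_cores <- parallel::detectCores(logical = FALSE)
if (is.na(n_cores)) n_cores <- 1L
```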
Thanks @aitap! Oh, some of my examples were wrapped in …
You may want to look into the parallelly package's availableCores() function.
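For reference, parallelly::availableCores() always returns at least 1, never NA, which makes it a drop-in for the guard sketched above:

```r
# availableCores() respects settings such as options(mc.cores = ...),
# CRAN check limits, and various CI environment variables.
n_cores <- parallelly::availableCores()
```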
CRAN has a policy that limits the number of cores that can be used during package checks, and they have recently implemented a mechanism for checking compliance. Upon a recent CRAN submission of one of my R packages which imports data.table, I got the following rejection message from CRAN:

> These messages can be suppressed by adding data.table::setDTthreads(1) at the start of each example.

Is there another/recommended way to avoid using too many cores when checking a CRAN package which uses data.table? The default number of threads in data.table is half of the available cores, which is over the CRAN limit (max 2 cores, I believe).

Also, is there documentation about this somewhere? It would probably be good to mention this in the datatable-importing vignette.