Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] tests and examples should not use more than 2 threads #5102

Closed
Tracked by #5153
jameslamb opened this issue Mar 30, 2022 · 11 comments · Fixed by #6226
Closed
Tracked by #5153

[R-package] tests and examples should not use more than 2 threads #5102

jameslamb opened this issue Mar 30, 2022 · 11 comments · Fixed by #6226

Comments

@jameslamb
Copy link
Collaborator

Description

CRAN's submission policies at https://cran.r-project.org/web/packages/policies.html include the following guidance

If running a package uses multiple threads/cores it must never use more than two simultaneously: the check farm is a shared resource and will typically be running many checks simultaneously.

Currently, most of {lightgbm}'s examples and tests do not explicitly set the number of threads to use, which means LightGBM defaults to using whatever the result of omp_get_num_threads() is.

if (num_threads > 0) {
omp_set_num_threads(num_threads);
} else {
omp_set_num_threads(default_omp_num_threads);
}

Work to be done

To ensure that {lightgbm} is always respectful of the CRAN check farm, all of its examples and tests should be changed to default to using 2 threads when run on CRAN.

Similar to the work done for #4862

VERBOSITY <- as.integer(
Sys.getenv("LIGHTGBM_TEST_VERBOSITY", "-1")
)

this should be done in a way that allows LightGBM's CI to override that behavior and use all available CPUs in its CI environments (which might sometimes be more than 2).

NUM_THREADS <- as.integer(
    Sys.getenv("LGB_NUM_THREADS", "2")
)

References

Opened as a result of #4972 (comment).

@jameslamb
Copy link
Collaborator Author

Thanks for those links!

But since the environment variable CI is such a general-purpose name, and since a CRAN rejection is so expensive (in terms of our time and the risk of CRAN archiving the package), I'd feel safer with a LightGBM-specific environment variable that is very very unlikely to be defined on CRAN.

Maybe I'm just being paranoid, I've just developed this habit of trying to minimize as much as possible assumptions about what CRAN will do or what their check environment will look like.

@StrikerRUS
Copy link
Collaborator

Haha, OK! 😄

@jmoralez
Copy link
Collaborator

I was going to add num_threads in params for all tests but I thought maybe we could set OMP_NUM_THREADS=2 instead. Running OMP_NUM_THREADS=2 Rscript testthat.R restricts the number of threads to 2 since the default is 0 and that means "use what openmp says" and when that variable is set it will use that number of threads. However I don't think this is possible on CRAN so I was trying Sys.setenv(OMP_NUM_THREADS = 2) on testthat.R but that doesn't seem to work. WDYT @jameslamb?

@jmoralez
Copy link
Collaborator

Hmm seems like that is expected https://stackoverflow.com/q/27319619/19410760 and what would be required for that to work is something like what data.table does https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/setDTthreads

@jameslamb
Copy link
Collaborator Author

Hmm seems like that is expected

Yeah there is some interesting sourcing/loading of R code done by testthat::test_check() and similar, and I don't think we can rely on environment variables set by Sys.setenv() flowing through to that sourced code.

something like what data.table does

What {data.table} does with SetDTthreads() is similar to what has been recommended (but not yet implemented) in #4705.

They allocate a "data.table-specific number of threads" variable in the process.

https://github.com/Rdatatable/data.table/blob/9e6e45301ea89227414a4f6df1ffc679c5c7ef1c/src/openmp-utils.c#L8-L9

And then setDTthreads() modifies that based on environment variables and passed values.

https://github.com/Rdatatable/data.table/blob/9e6e45301ea89227414a4f6df1ffc679c5c7ef1c/src/openmp-utils.c#L138-L140

And then all OpenMP code in the package references that, instead of using openmp_get_threads()

https://github.com/Rdatatable/data.table/blob/9e6e45301ea89227414a4f6df1ffc679c5c7ef1c/src/cj.c#L23

https://github.com/Rdatatable/data.table/blob/9e6e45301ea89227414a4f6df1ffc679c5c7ef1c/src/openmp-utils.c#L60-L70

Instead of doing that, LightGBM modifies OMP_NUM_THREADS (which is a value that affects all OpenMP code running in the same process and I think any of its child processes).

For #4705, LightGBM should be probably be modified to do something like what {data.table} does.

WDYT?

Maybe for now, we could use R's options() functionality to do this?

Like this in testthat.R:

options("lightgbm.num.testing.threads" = 2)

And then adding something like the following near the beginning of lgb.get.default.num.threads()

lgb.get.default.num.threads <- function() {

threads_from_opts <- options()[["lightgbm.num.testing.threads"]]
if (!is.null(threads_from_opts)) {
    return (as.integer(threads_from_opts))
}

I see mentions in https://cran.r-project.org/doc/manuals/r-release/R-exts.html that encourage the use of options() during tests, so I think it's probably ok with CRAN.

I think that might work more reliably with {testthat} than environment variables? And it would have the benefit of not requiring changes to every test case. If that approach works, we'd probably also need similar code in vignettes.

For examples in docs, I think it would be fine to just explicitly pass num_threads = 2 as literal code in docs.

Could you try that?

@jmoralez
Copy link
Collaborator

I think that doesn't work because lgb.get.default.num.threads is only called by lightgbm, not by lgb.train :(

@jameslamb
Copy link
Collaborator Author

hmmmm ok, and I realize now that even for lightgbm() that might not work, since tests that pass num_threads = {something} to lightgbm() will also not end up calling that function

if (is.null(num_threads)) {
num_threads <- lgb.get.default.num.threads()
}

What about adding instead something like the following in lgb.train() and lgb.cv()?

# LightGBM-internal fix to comply with CRAN policy of only using up to 2 threads in tests and example
threads_from_opts <- options()[["lightgbm.num.testing.threads"]]
if (!is.null(threads_from_opts)) {
    params[["num_threads"]] <- threads_from_opts
}

And now that I see it...maybe we should use a name that makes it clearer that this isn't intended to be used by users. Like lightgbm.internal.override.num.threads or something like that.

I think that could work as a minimally-invasive change that could be reverted once #4705 is addressed.

@jameslamb
Copy link
Collaborator Author

Based on #5367 (comment), I am going to pick this up. Assigning it to myself to claim it so no one else spends time on it.

@jmoralez
Copy link
Collaborator

Hey @jameslamb. I was thinking that since we can't set OMP_NUM_THREADS in the test file itself we could maybe use system or something similar so that we can set the env variable before the process starts. I tried this by renaming testthat.R to _testthat.R and then defining testthat.R as:

system("OMP_NUM_THREADS=1 Rscript _testthat.R")

and tried different values for the variable (by calling time Rscript testthat.R) and it seems to work (1 shows less cpu usage than 4).

This is way too hacky and may not work but we could maybe try something in that direction.

@jameslamb
Copy link
Collaborator Author

Interesting idea @jmoralez ! I hadn't considered that.

I would expect it to work in local testing, but I don't think we should pursue it as a way to satisfy CRAN.

  1. that will cause Rscript to be looked up on PATH, and there's no guarantee that that'll be the same build of R used by R CMD check (e.g., in some environments the binary might be called RDscript or similar)
  2. that still doesn't help with the issue of examples in documentation

I think we should pursue more permanent solutions, on the C++ side of LightGBM, to more tightly control the number of threads used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment