Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LGBM_DatasetCreateFromCSC does not allow thread control #4598

Closed
david-cortes opened this issue Sep 7, 2021 · 4 comments
Closed

LGBM_DatasetCreateFromCSC does not allow thread control #4598

david-cortes opened this issue Sep 7, 2021 · 4 comments
Labels

Comments

@david-cortes
Copy link
Contributor

The C api has a function named LGBM_DatasetCreateFromCSC which runs a parallel openmp block with the default number of threads, which will typically be the maximum number of available threads.

I think it'd be better and more transparent if the number of threads were user-controllable, allowing a parameter int nthreads in the function signature instead, just like it does for other functions within the package.

This function is also problematic for the R package as it gets called by LGBM_DatasetCreateFromCSC_R, which is used in one of the R tests (currently skipped because it fails some checks). The CRAN policies state that examples and tests are not supposed to use more than 1 or 2 threads, and this thus prevents the R test from becoming enabled once the other issues it raises become solved.

@jameslamb jameslamb changed the title LGBM_DatasetCreateFromCSC does not allow thread control [R-package] LGBM_DatasetCreateFromCSC does not allow thread control Sep 7, 2021
@jameslamb
Copy link
Collaborator

Thanks very much for reporting this. Can you please provide specific links / quotes for what you're referring to when you say "the CRAN policies"?

@david-cortes
Copy link
Contributor Author

The CRAN Repository Policy:
https://cran.r-project.org/web/packages/policies.html

Says the following:

If running a package uses multiple threads/cores it must never use more than two simultaneously: the check farm is a shared resource and will typically be running many checks simultaneously.

Some of the configs they test on might even crash the R process if you attempt to do some multi-threaded computation with the maximum number of threads.


The issue is BTW not just about the R package since this function is used by other interfaces too.

@jameslamb jameslamb changed the title [R-package] LGBM_DatasetCreateFromCSC does not allow thread control LGBM_DatasetCreateFromCSC does not allow thread control Apr 10, 2022
@jameslamb
Copy link
Collaborator

I think it'd be better and more transparent if the number of threads were user-controllable, allowing a parameter int nthreads in the function signature instead, just like it does for other functions within the package.

None of the Dataset creation functions in LightGBM's C API have nthreads in their function signatures.

LBGM_DatasetCreate function signatures in C API (click me)

LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromFile(const char* filename,
const char* parameters,
const DatasetHandle reference,
DatasetHandle* out);

LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromSampledColumn(double** sample_data,
int** sample_indices,
int32_t ncol,
const int* num_per_col,
int32_t num_sample_row,
int32_t num_total_row,
const char* parameters,
DatasetHandle* out);

LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromCSR(const void* indptr,
int indptr_type,
const int32_t* indices,
const void* data,
int data_type,
int64_t nindptr,
int64_t nelem,
int64_t num_col,
const char* parameters,
const DatasetHandle reference,
DatasetHandle* out);

LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromCSRFunc(void* get_row_funptr,
int num_rows,
int64_t num_col,
const char* parameters,
const DatasetHandle reference,
DatasetHandle* out);

LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromCSC(const void* col_ptr,
int col_ptr_type,
const int32_t* indices,
const void* data,
int data_type,
int64_t ncol_ptr,
int64_t nelem,
int64_t num_row,
const char* parameters,
const DatasetHandle reference,
DatasetHandle* out);

LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromMat(const void* data,
int data_type,
int32_t nrow,
int32_t ncol,
int is_row_major,
const char* parameters,
const DatasetHandle reference,
DatasetHandle* out);

LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromMats(int32_t nmat,
const void** data,
int data_type,
int32_t* nrow,
int32_t ncol,
int is_row_major,
const char* parameters,
const DatasetHandle reference,
DatasetHandle* out);

if the number of threads were user-controllable
runs a parallel openmp block with the default number of threads, which will typically be the maximum number of available threads

The number of threads used in Dataset construction is user-controllable, via parameter num_threads.

As you can see in the links above, all of the LGBM_DatasetCreate* functions support passing parameters.

In each of those methods, the num_threads found in parameters is passed to OMP_SET_NUM_THREADS(), which will set the result of any later call to omp_get_num_threads() (and implicitly set the number of threads used for OpenMP parallel regions that don't explicitly set a number of threads, as noted at https://www.openmp.org/spec-html/5.0/openmpsu110.html).

For example, in LGBM_DatasetCreateFromCSC(), the parameter value num_threads is passed to OMP_SET_NUM_THREADS().

LightGBM/src/c_api.cpp

Lines 1326 to 1329 in b462d0a

auto param = Config::Str2Map(parameters);
Config config;
config.Set(param);
OMP_SET_NUM_THREADS(config.num_threads);

Given all that....I don't think it is accurate that "LGBM_DatasetCreateFromCSC does not allow thread control".

Passing num_thread=1 (or any of the aliases of num_thread listed at https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_threads) through params will change the number of threads used in Dataset construction.

The following example demonstrates construction of a LightGBM Dataset from a CSC matrix in the R package, as of latest master (b462d0a).

# test.R
library(lightgbm)
library(Matrix)

X <- matrix(
    data = rnorm(1e8)
    , nrow = 1e5
)
Xcsc <- as(X, "CsparseMatrix")
num_threads <- as.integer(Sys.getenv("EXAMPLE_NTHREAD"))

start_time <- Sys.time()
dtrain <- lightgbm::lgb.Dataset(
    data = Xcsc
    , params = list(num_threads = num_threads)
)
dtrain$construct()
print(sprintf("--- construction (num_threads=%d) ---", num_threads))
print(Sys.time() - start_time)

On my Mac, I ran this script 3 times each with num_threads 1, 2, and 4. As you can see below, higher values of that parameter resulted in faster construction.

EXAMPLE_NTHREAD=1 Rscript --vanilla ./test.R
# [1] "--- construction (num_threads=1) ---"
# Time difference of 17.88016 secs
# Time difference of 17.72601 secs
# Time difference of 17.60532 secs

EXAMPLE_NTHREAD=2 Rscript --vanilla ./test.R
# [1] "--- construction (num_threads=2) ---"
# Time difference of 10.77466 secs
# Time difference of 10.12384 secs
# Time difference of 10.49387 secs

EXAMPLE_NTHREAD=4 Rscript --vanilla ./test.R
# [1] "--- construction (num_threads=4) ---"
# Time difference of 7.150842 secs
# Time difference of 7.223807 secs
# Time difference of 7.122708 secs

Therefore, I think that the proposal in this issue (add an nthreads argument to the signature of LGBM_DatasetCreate* functions) is just one possible solution to #4705, and that this issue should be closed to consolidate discussion of the problem "thread control in LightGBM without globally altering OpenMP's default number of threads" in one place.

@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

2 participants