[gpu] modify histogram construction (fixes #4946) #5819
Closed
fixes #4946
Reason for Change:
When X_train is larger than about (18,000,000, 1,000), LightGBM on GPU fails with an error like this:
[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at LightGBM/src/treelearner/serial_tree_learner.cpp, line 686
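For context, a minimal Python sketch of the kind of setup that hits this. The dataset size, parameters, and random data below are illustrative assumptions, not taken from the actual job that failed:

# Hypothetical reproduction sketch; requires a GPU-enabled LightGBM build and a
# very large amount of RAM for a dense matrix of this size (in practice the data
# would come from a real dataset or be generated in chunks).
import numpy as np
import lightgbm as lgb

n_rows, n_cols = 18_000_000, 1_000   # approximate size at which the failure was observed
X_train = np.random.rand(n_rows, n_cols).astype(np.float32)
y_train = np.random.rand(n_rows)

params = {"objective": "regression", "device_type": "gpu"}
# With the unpatched histogram256.cl, training at this scale is where the
# "Check failed: (best_split_info.left_count) > (0)" error was reported.
booster = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=10)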
Solution:
I was able to successfully run a large dataset after a change to src/treelearner/ocl/histogram256.cl.
The kernel is defined as follows (with the #ifdef constants removed here for clarity):
__kernel void histogram256(
__global const uchar4* feature_data_base,
__constant const uchar4* restrict feature_masks __attribute__((max_constant_size(65536))),
const data_size_t feature_size,
__global const data_size_t* data_indices,
const data_size_t num_data,
const score_t const_hessian,
__global const score_t* ordered_gradients, // <----- change to : __global const * ordered_gradients
__global char* restrict output_buf,
__global volatile int * sync_counters,
__global acc_type* restrict hist_buf_base
)
However, if you redefine ordered_gradients as __global const * ordered_gradients, the type is filled in from context and the large training set runs. At first, I thought score_t was defined differently in the OpenCL code and the C++ code, but I verified that both are floats.
To validate the results, I ran two smaller dummy datasets with ordered_gradients defined explicitly and implicitly, and compared the resulting model files: they were identical.
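For reference, the comparison looked roughly like the sketch below: the same seeded dummy data trained once against each build, with the saved model text files diffed afterwards. The file names, dataset size, and parameters here are placeholders, not the actual ones used:

# Hypothetical validation sketch: run train_and_save() once per LightGBM build
# (one built with the original kernel, one with ordered_gradients redefined),
# then compare the text dumps of the two models.
import numpy as np
import lightgbm as lgb

def train_and_save(model_path: str) -> None:
    rng = np.random.default_rng(seed=0)      # fixed seed so both runs see identical data
    X = rng.random((10_000, 50))
    y = rng.random(10_000)
    booster = lgb.train(
        {"objective": "regression", "device_type": "gpu"},
        lgb.Dataset(X, label=y),
        num_boost_round=50,
    )
    booster.save_model(model_path)

# Run against each build, then check the dumps are byte-identical:
#   train_and_save("model_original_kernel.txt")   # build with the unmodified histogram256.cl
#   train_and_save("model_modified_kernel.txt")   # build with ordered_gradients redefined
#   same = open("model_original_kernel.txt").read() == open("model_modified_kernel.txt").read()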
It's not yet clear to me why the change allows the program to finish; I suspect there must be some kind of type difference. Any ideas would be welcome.