Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[gpu] modify histogram construction (fixes #4946) #5819

Closed
wants to merge 1 commit into from

Conversation

tolleybot
Copy link

@tolleybot tolleybot commented Apr 4, 2023

fixes #4946

Reason for Change:

when the X_train is more than (1800w, 1000), lgb-gpu will has a bug like this:
[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at LightGBM/src/treelearner/serial_tree_learner.app, line 686

Solution:

I was able to successfully run a large dataset with a change to src/treelearner/ocl/histogram256.cl. I

This function is defined as the following (with #ifdef constants removed for this post for clarity)

__kernel void histogram256(
__global const uchar4* feature_data_base,
__constant const uchar4* restrict feature_masks attribute((max_constant_size(65536))),
const data_size_t feature_size,
__global const data_size_t* data_indices,
const data_size_t num_data,
const score_t const_hessian,
__global const score_t* ordered_gradients, // <----- change to : __global const * ordered_gradients
__global char* restrict output_buf,
__global volatile int * sync_counters,
__global acc_type* restrict hist_buf_base
)
However, if you redefine ordered_gradients as __global const * ordered_gradients, the context will fill in the type, and the large training set runs. At first, I thought score_t was defined differently in the OpenCL code and C++, but I verified that they are both floats.
In order to validate the results. I ran two smaller dummy datasets with ordered_gradients defined explicitly and not. I compared the resulting model files and found that they were the same.

It's not yet clear to me why the change allows the program to finish I suspect there must be some type of difference. Any ideas would be welcome.

@jameslamb jameslamb changed the title This change fixes issue #4946, https://github.com/microsoft/LightGBM/… [gpu] modify histogram construction (fixes #4946) Apr 6, 2023
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to contribute to LightGBM.

I don't know much about OpenCL so can't provide a very helpful review here, but I can say with high confidence that this change shouldn't be accepted as-is.

The CI jobs for the GPU build suggest that this seriously breaks something about the data flow through the library.

For example: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=14510&view=logs&j=582743ec-83fe-58c4-937e-4197b5816801

>           assert r2_score(y, preds) > 0.86
E           assert -14777383275568.44 > 0.86

@jameslamb
Copy link
Collaborator

I'm closing this due to lack of response. We'd welcome contributions in the future if you have time to work with us.

Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Check failed: (best_split_info.left_count) > (0)
2 participants