
lightgbm.basic.LightGBMError: Bug in GPU histogram! split 11937: 12, smaller_leaf: 10245, larger_leaf: 1704 #2793

Open
pseudotensor opened this issue Feb 21, 2020 · 39 comments

@pseudotensor

version: 2.3.2

[LightGBM] [Fatal] Bug in GPU histogram! split 11937: 12, smaller_leaf: 10245, larger_leaf: 1704

Traceback (most recent call last):
  File "lgb_prefit_4ff5fa97-86b3-420c-aa87-5f01abcc18c3.py", line 10, in <module>
    model.fit(X, y, sample_weight=sample_weight, init_score=init_score, eval_set=eval_set, eval_names=valid_X_features, eval_sample_weight=eval_sample_weight, eval_init_score=init_score, eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds, feature_name=X_features, verbose=verbose_fit)
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 818, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 610, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 250, in train
    booster.update(fobj=fobj)
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2106, in update
    ctypes.byref(is_finished)))
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 46, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: Bug in GPU histogram! split 11937: 12, smaller_leaf: 10245, larger_leaf: 1704

script and pickle file:

lgbm_histbug.zip

@sh1ng need help checking whether this is fixed in an even later master.

@StrikerRUS StrikerRUS added the bug label Feb 22, 2020
@guolinke
Collaborator

guolinke commented Feb 22, 2020

I think the latest master branch will not produce this error anymore, as cnt was removed from the histogram.

But this is still a potential bug in the GPU learner. Ping @huanzhang12

@sh1ng
Contributor

sh1ng commented Feb 22, 2020

On master

[LightGBM] [Fatal] Check failed: best_split_info.right_count > 0 at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 706 .

Traceback (most recent call last):
  File "lgbm_histbug.py", line 8, in <module>
    model.fit(X, y, sample_weight=sample_weight, init_score=init_score, eval_set=eval_set, eval_names=valid_X_features, eval_sample_weight=eval_sample_weight, eval_init_score=init_score, eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds, feature_name=X_features, verbose=verbose_fit)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 829, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 614, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 250, in train
    booster.update(fobj=fobj)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2145, in update
    ctypes.byref(is_finished)))
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 46, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: Check failed: best_split_info.right_count > 0 at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 706 .

@guolinke
Collaborator

It is still a GPU bug.
Ping @huanzhang12

@pseudotensor
Author

@guFalcon @huanzhang12 FYI, we are tracking a major accuracy issue with the latest LightGBM compared to before. This is just a heads-up; perhaps it's related to this issue. We'll post a separate issue once we have a moment to generate an MRE.

@guolinke
Collaborator

Thanks @pseudotensor, does the accuracy issue reproduce on CPU?

@guolinke
Collaborator

BTW, maybe this is related: #2811

@pseudotensor
Author

pseudotensor commented Feb 24, 2020

#2813, yes, that's a CPU run. The same setup on GPU hits this GPU histogram bug, so it can't be run.

But I think the GPU histogram bug occurs more generally than the accuracy issue #2813.
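A sketch of how we flip between the two runs; only the device changes, and everything else (base_params, X, y) is assumed from our setup, not shown here:

    import lightgbm as lgb

    # base_params, X, y come from the setup described above (assumed here)
    for device in ("cpu", "gpu"):
        model = lgb.LGBMClassifier(device_type=device, **base_params)
        model.fit(X, y)  # the "gpu" run dies in the GPU histogram check; "cpu" completes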

@guolinke
Collaborator

I think this may be fixed by #2811 too.

@guolinke
Collaborator

So on the latest master branch, the CPU version is okay, while the GPU version fails?

@sh1ng
Contributor

sh1ng commented Feb 27, 2020

@guolinke correct

Stack trace of the error:

/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py:893: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
  .format(key))
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 22008
[LightGBM] [Info] Number of data points in the train set: 1348045, number of used features: 150
[LightGBM] [Info] Using GPU Device: GeForce MX150, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 138 dense feature groups (179.98 MB) transferred to GPU in 0.273129 secs. 1 sparse feature groups
[LightGBM] [Info] Start training from score -11.811581
[LightGBM] [Info] Start training from score -7.921803
[LightGBM] [Info] Start training from score -0.432866
[LightGBM] [Info] Start training from score -1.142893
[LightGBM] [Info] Start training from score -3.439298
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Fatal] Check failed: best_split_info.left_count > 0 at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 702 .

Traceback (most recent call last):
  File "lgb_accuracyissue.py", line 14, in <module>
    eval_init_score=init_score, eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds, feature_name=X_features, verbose=verbose_fit)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 829, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 614, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 250, in train
    booster.update(fobj=fobj)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2145, in update
    ctypes.byref(is_finished)))
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 46, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: Check failed: best_split_info.left_count > 0 at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 702 .

@sh1ng
Contributor

sh1ng commented Feb 27, 2020

Just letting you know that I'm unable to reproduce the issue with the dataset originally provided, but it's easily reproducible with the data from #2813.

@imatiach-msft
Contributor

@guolinke I'm trying to track down an issue where, after upgrading mmlspark to the latest master branch, I am seeing a similar error. Any recommendations for code/commits I should look into to investigate what might be the root cause?

[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 12422...
[LightGBM] [Info] Binding port 12422 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 12426...
[LightGBM] [Info] Binding port 12426 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 610, number of negative: 762
[LightGBM] [Info] Number of positive: 610, number of negative: 762
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000514 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 916
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000664 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 916
[LightGBM] [Info] Number of data points in the train set: 686, number of used features: 4
[LightGBM] [Info] Number of data points in the train set: 686, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.438776 -> initscore=-0.246133
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.450437 -> initscore=-0.198904
[LightGBM] [Info] Start training from score -0.222518
[LightGBM] [Info] Start training from score -0.222518
[LightGBM] [Info] Finished linking network in 0.003935 seconds
[LightGBM] [Fatal] Check failed: best_split_info.left_count > 0 at /home/ilya/LightGBM/src/treelearner/serial_tree_learner.cpp, line 709 .

20/02/29 00:35:01 WARN LightGBMClassifier: LightGBM reached early termination on one worker, stopping training on worker. This message should rarely occur

@guolinke
Collaborator

Could it run with only one node?

@imatiach-msft
Contributor

@guolinke amazing insight! I tried 1 node instead of 2 and almost all of my tests passed (except one test that depends on the number of nodes, which is expected).


Here is the output from the same test as above (except it was successful):

[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000942 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 327
[LightGBM] [Info] Number of data points in the train set: 106, number of used features: 9
[LightGBM] [Info] Start training from score -1.572397
[LightGBM] [Info] Start training from score -1.618917
[LightGBM] [Info] Start training from score -2.024382
[LightGBM] [Info] Start training from score -1.955389
[LightGBM] [Info] Start training from score -1.890850
[LightGBM] [Info] Start training from score -1.773067
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[... warning repeated 40 times ...]
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002017 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 327
[LightGBM] [Info] Number of data points in the train set: 106, number of used features: 9
[LightGBM] [Info] Start training from score -1.572397
[LightGBM] [Info] Start training from score -1.618917
[LightGBM] [Info] Start training from score -2.024382
[LightGBM] [Info] Start training from score -1.955389
[LightGBM] [Info] Start training from score -1.890850
[LightGBM] [Info] Start training from score -1.773067
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[... warning repeated 60 times ...]
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000835 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 327
[LightGBM] [Info] Number of data points in the train set: 106, number of used features: 9
[LightGBM] [Info] Start training from score -1.572397
[LightGBM] [Info] Start training from score -1.618917
[LightGBM] [Info] Start training from score -2.024382
[LightGBM] [Info] Start training from score -1.955389
[LightGBM] [Info] Start training from score -1.890850
[LightGBM] [Info] Start training from score -1.773067
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[... warning repeated 38 times ...]
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001298 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 327
[LightGBM] [Info] Number of data points in the train set: 106, number of used features: 9
[LightGBM] [Info] Using GOSS
[LightGBM] [Info] Start training from score -1.572397
[LightGBM] [Info] Start training from score -1.618917
[LightGBM] [Info] Start training from score -2.024382
[LightGBM] [Info] Start training from score -1.955389
[LightGBM] [Info] Start training from score -1.890850
[LightGBM] [Info] Start training from score -1.773067
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[... warning repeated 40 times ...]

@imatiach-msft
Contributor

Note: both the failing and successful runs above are from this commit from 2/21:
"Better documentation for Contributing (#2781)"
I'm currently working back through older versions/commits of LightGBM to see which commit causes the tests to fail, but it's a slow process to build and update the jar and rerun the tests. I'm skipping small batches of commits at a time, but I might switch to a binary search to make this optimal, since it looks like the issue goes back before 2/21.
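For reference, git can automate exactly this binary search; a minimal sketch, where the good/bad endpoints are illustrative and each step means rebuilding the jar and rerunning the failing test:

    git bisect start
    git bisect bad HEAD      # current master: tests fail
    git bisect good v2.3.1   # an older tag where tests pass
    # git checks out a midpoint commit; rebuild, rerun the test, then mark it:
    git bisect good          # or: git bisect bad
    # repeat until git prints the first bad commit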

@guolinke
Collaborator

guolinke commented Mar 1, 2020

@imatiach-msft you can try the commit (509c2e5) and its parent (bc7bc4a)

@imatiach-msft
Contributor

@guolinke you're right, it looks like the issue is with commit (509c2e5).
I validated that including that commit causes the error, and removing it fixes the issue.

@guolinke
Collaborator

guolinke commented Mar 2, 2020

@imatiach-msft could you share the data (and config) with me for debugging?

@imatiach-msft
Contributor

@guolinke I'm running the mmlspark Scala tests; maybe I can try to create an example that you can easily run?
You can find the LightGBM classifier tests here:
https://github.com/Azure/mmlspark/blob/master/src/test/scala/com/microsoft/ml/spark/lightgbm/split1/VerifyLightGBMClassifier.scala

The first test that failed is the one below, but I tried several others and they failed as well:
https://github.com/Azure/mmlspark/blob/master/src/test/scala/com/microsoft/ml/spark/lightgbm/split1/VerifyLightGBMClassifier.scala#L169

The compressed file with most datasets used in mmlspark can be found here:
https://mmlspark.blob.core.windows.net/installers/datasets-2020-01-20.tgz

@StrikerRUS StrikerRUS mentioned this issue May 11, 2020
@guolinke
Collaborator

guolinke commented Aug 6, 2020

@shiyu1994 can you help investigate this too?
You can start from @imatiach-msft's test.

@sh1ng
Contributor

sh1ng commented Oct 5, 2020

Still happens in version 3.0

lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 630

https://github.com/h2oai/h2o4gpu/blob/master/tests/python/open_data/gbm/test_lightgbm.py#L265-L284

@shiyu1994
Collaborator

> @shiyu1994 can you help investigate this too?
> You can start from @imatiach-msft's test.

Ok.

@imatiach-msft
Contributor

@shiyu1994 @guolinke FYI, my issue was resolved after my fix #3110, but it sounds like others are still encountering issues similar to the one I had.

@diditforlulz273

I have this issue with the CPU learner, not the GPU one. It appeared after upgrading from 2.3.1 to 3.0.0 and makes every test with a tiny testing dataset fail for exactly the same reason:

lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/python-package/compile/src/treelearner/serial_tree_learner.cpp, line 630 .

@guolinke
Collaborator

@diditforlulz273 Could you try the latest master branch?
If the problem still exists, please create a new issue; it would be better if you can provide a reproducible example.
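Even a small self-contained skeleton like the following helps; this is only a sketch with synthetic stand-in data, which the real triggering data would replace:

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = rng.integers(0, 3, size=1000)  # stand-in for the data that triggers the check

    params = {
        "objective": "multiclass",
        "num_class": 3,
        "device_type": "cpu",  # the report above hits this on the CPU learner
        "verbosity": -1,
    }
    lgb.train(params, lgb.Dataset(X, y), num_boost_round=50)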

@diditforlulz273

@guolinke I've just built it from the latest master branch; it still fails. I'll try to extract a minimal reproducible example and create an issue.

@grasevski

+1, this bug makes LightGBM GPU useless. Still happens to me on the latest master.

@asimraja77

Hi,
I'm using the GPU setting and have the same issue. I tried deterministic=True, but it did not solve the problem. I saw that LightGBM v3.2.0 may fix this defect. I have a few questions:

  1. In the v3.2.0 release thread, I noticed that this bug (lightgbm.basic.LightGBMError: Bug in GPU histogram! split 11937: 12, smaller_leaf: 10245, larger_leaf: 1704 #2793) is not in bold. Does this mean that it may not be fixed until a later release?
  2. Does a fix exist in a non-release (build-from-source) option? If so, can you please guide me to it?
  3. Assuming a fix is part of the v3.2.0 release, is this release about to happen? I noticed that v3.1.1 was released 3 months ago.

I apologize if my questions are a bit out of bounds.
Best regards

@nightflight-dk

nightflight-dk commented Sep 29, 2021

It's unfortunate that a known issue of this severity has been left open for over 1.5 years. The error affects every other attempt to train on GPUs when using the latest 'stable' bits in the Business Division (Dynamics). I can help with a business case from inside Microsoft to push this if necessary. My alias: dakowalc. Thanks

@guolinke
Collaborator

guolinke commented Oct 1, 2021

Thank you @nightflight-dk. Actually, we have rewritten the LightGBM GPU version, and the previous OpenCL and CUDA versions will be deprecated; refer to PR #4528.

@nightflight-dk

nightflight-dk commented Oct 6, 2021

Great to hear the GPU acceleration is under further development, @guolinke. I have just tested the code from PR #4528; unfortunately, it's affected by the same bug, triggering the same assert error in the serial_tree_learner (even in data-parallel execution, device=cuda / device=gpu).
Please suggest a workaround or an older version that is not affected (if any). Thanks.

@guolinke
Collaborator

guolinke commented Oct 6, 2021

cc @shiyu1994 for the bug above.

@shiyu1994
Collaborator

I will double-check that. But the new CUDA tree learner reuses no training logic from the old serial tree learner or the old CUDA tree learner: only the initialization code in serial_tree_learner.cpp is executed when the new CUDA tree learner is used, and that never touches the check that raises the error in this issue. Since the errors come from the source code of the old CUDA tree learner and the training part of the serial tree learner, I think it is unlikely that the new CUDA version would produce the same bug.

@shiyu1994
Collaborator

@nightflight-dk Thanks for testing. It would be really appreciated if you could provide the error log from the new CUDA version. :)

@shiyu1994
Collaborator

In addition, the new CUDA versions in the PRs so far do not support distributed training, so if distributed training is enabled, LightGBM will fall back to the old CUDA version.

@nightflight-dk

@shiyu1994 @guolinke After disabling distributed training (tree_learner: serial), the latest bits from PR #4528 finish training without issues. Moreover, GPU utilization appears dramatically improved (mean up to ca. 50%, from 2%). Well done.
Is there an ETA for merging PR #4528 into master? It would help our planning. Also, if you plan data-parallel GPU or multi-GPU support, please point out the items for us to track. Happy to help with testing. Please keep up the good work. Thanks a lot. - dakowalc, Business 360 AI team
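For anyone else hitting this, the working configuration is roughly the following sketch; the objective is simplified and train_data stands in for our actual lgb.Dataset:

    import lightgbm as lgb

    params = {
        "objective": "binary",     # illustrative; ours differs
        "device_type": "cuda",     # the new CUDA tree learner from PR #4528
        "tree_learner": "serial",  # distributed training is not supported there yet
    }
    booster = lgb.train(params, train_data, num_boost_round=100)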

@shiyu1994
Collaborator

@nightflight-dk Thanks for giving it a try. Since #4528 is a very large PR, we plan to decompose it into several parts and merge them one by one. We expect to finish the merge process by the end of this month.
Multi-GPU and distributed training will be added after #4528 is merged. I will point that out once PRs are open for it.

@pavlexander

Since there hasn't been any activity for a year, I would like to bring this topic up again.

I'm on version 3.3.3, Python, training on GPU on Windows.

The issue has been bugging me for the past two days. The dataset is 500k rows with 1500 features. There seems to be some correlation with the min_gain_to_split parameter: with a value of 1 I have not yet seen any errors, but with a value of 0 (the default) it crashes quite often. Take this with caution, since I have not run enough tests yet.

It crashed with these parameter sets:

{'learning_rate': 0.43467624523546383, 'max_depth': 8, 'num_leaves': 201, 'feature_fraction': 0.9, 'bagging_fraction': 0.7000000000000001, 'bagging_freq': 8}

{'learning_rate': 0.021403440298427053, 'max_depth': 2, 'num_leaves': 176, 'lambda_l1': 3.8066251775052895, 'lambda_l2': 1.08526150100961e-08, 'feature_fraction': 0.6, 'bagging_fraction': 0.9, 'bagging_freq': 6}

{'learning_rate': 0.3493368922746614, 'max_depth': 6, 'num_leaves': 109, 'lambda_l1': 4.506588272812341e-05, 'lambda_l2': 2.5452579091348995e-07, 'feature_fraction': 0.7000000000000001, 'bagging_fraction': 1.0, 'bagging_freq': 6, 'min_gain_to_split': 0}

{'learning_rate': 0.17840010040986135, 'max_depth': 12, 'num_leaves': 251, 'lambda_l1': 0.004509589012189404, 'lambda_l2': 3.882151732343819e-08, 'feature_fraction': 0.30000000000000004, 'bagging_fraction': 1.0, 'bagging_freq': 8, 'min_gain_to_split': 0}

the code is:

    params = {
        'device_type': "gpu",
        'objective': 'multiclass',  # 
        'metric': 'multi_logloss',  # 
        "boosting_type": "gbdt",
        "num_class": 3,
        'random_state': 123,
        'verbosity': -1,  # hides "No further splits with positive gain, best gain: -inf" warnings
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.9, log=True),  # 0.1
        'max_depth': trial.suggest_int('max_depth', 2, 12),
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),  # def 31
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),  # 0
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),  # 0
        'feature_fraction': trial.suggest_float('feature_fraction', 0.1, 1.0, step=0.1),  # 1
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.1, 1.0, step=0.1),  # 1
        'bagging_freq': trial.suggest_int('bagging_freq', 0, 10),  # 0
        'min_gain_to_split': trial.suggest_int('min_gain_to_split', 0, 5),
    }

with a few changes here and there

exception is:

[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at D:\a\1\s\python-package\compile\src\treelearner\serial_tree_learner.cpp, line 653 .

[W 2022-11-07 09:49:32,774] Trial 49 failed because of the following error: LightGBMError('Check failed: (best_split_info.left_count) > (0) at D:\\a\\1\\s\\python-package\\compile\\src\\treelearner\\serial_tree_learner.cpp, line 653 .\n')
Traceback (most recent call last):
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "D:\dev\Pycharm2022\LearningCNN\test9_realData4_optuna.py", line 174, in objective
    model = lgb.train(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\engine.py", line 292, in train
    booster.update(fobj=fobj)
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 3021, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at D:\a\1\s\python-package\compile\src\treelearner\serial_tree_learner.cpp, line 653 .

Traceback (most recent call last):
  File "D:\dev\Pycharm2022\LearningCNN\test9_realData4_optuna.py", line 237, in <module>
    study.optimize(objective, n_trials=_NUMBER_OF_TRIALS)
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\study.py", line 419, in optimize
    _optimize(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 66, in _optimize
    _optimize_sequential(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 160, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 234, in _run_trial
    raise func_err
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "D:\dev\Pycharm2022\LearningCNN\test9_realData4_optuna.py", line 174, in objective
    model = lgb.train(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\engine.py", line 292, in train
    booster.update(fobj=fobj)
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 3021, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at D:\a\1\s\python-package\compile\src\treelearner\serial_tree_learner.cpp, line 653 .


Process finished with exit code 1

I am using optuna for optimization so the set of parameters is always different.

I tried different split ratios (0.19/0.20/0.21), which does not seem to fix anything:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.19, random_state=42, shuffle=True)

I also experimented with the amount of data (600_000/600_001/200_001). Nothing seems to fix the issue. Can a fix be expected in the next major release? I see that the topic is still active.
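In the meantime, this sketch shows the two workarounds I'm leaning on, assuming the Optuna setup above (catch is a real study.optimize parameter):

    import lightgbm as lgb

    # 1) Constrain the search so min_gain_to_split stays >= 1, which in my runs
    #    avoided the crash (a workaround, not a fix):
    #        'min_gain_to_split': trial.suggest_int('min_gain_to_split', 1, 5),

    # 2) Let the study survive sporadic crashes instead of aborting:
    study.optimize(objective, n_trials=_NUMBER_OF_TRIALS,
                   catch=(lgb.basic.LightGBMError,))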

@JisongXie

JisongXie commented Dec 13, 2022

I built the Docker image with this dockerfile.gpu, and I encountered this issue too.

LightGBMError: Check failed: (best_split_info.left_count) > (0) at /usr/local/src/lightgbm/LightGBM/src/treelearner/serial_tree_learner.cpp, line 653 .
