
[GPU/OpenCL] Can use at most 8.3GB GPU memory #4480

Closed
xingyuansun opened this issue Jul 19, 2021 · 7 comments

@xingyuansun

Hi, thanks for the package! I noticed an issue similar to #3899. I am using LightGBM 3.2.1 with an NVIDIA Tesla V100-SXM2-16GB (16 GB memory). Running the code below, the first iteration takes only 8299 MB of GPU memory, so the second one should also fit, since the number of data points increases only slightly. Instead, it fails with the error message below. Could someone let me know if there is an internal memory limit in the LightGBM library? Thanks very much for the help!

Code:

import lightgbm as lgb
import numpy as np


if __name__ == '__main__':
    for n_data in [1040000, 1100000]:
        n_feature = 8000
        print(n_data, n_feature)
        x = np.random.rand(n_data, n_feature)
        y = np.random.rand(n_data)
        model = lgb.LGBMRegressor(device_type='gpu')
        model.fit(X=x, y=y, eval_set=[(x, y)], verbose=1)

Error message:

terminate called after throwing an instance of 'boost::wrapexcept<boost::compute::opencl_error>'
  what():  Memory Object Allocation Failure
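
For a rough sense of scale, here is a back-of-the-envelope estimate of the dense feature matrix size for both runs. The 1-byte-per-value figure and the helper name `est_gpu_mb` are my assumptions (the GPU trainer bins features, and a 256-bin index fits in one byte), not part of LightGBM's API:

```python
# Back-of-the-envelope GPU memory estimate for the binned dense feature matrix.
# Assumption (mine, not LightGBM's documented behavior): ~1 byte per
# (row, feature) value, since a 256-bin index fits in 8 bits.
def est_gpu_mb(n_rows, n_features, bytes_per_value=1):
    """Estimated size of the binned dense feature matrix in MiB."""
    return n_rows * n_features * bytes_per_value / 1024**2

for n_data in (1_040_000, 1_100_000):
    print(n_data, f"{est_gpu_mb(n_data, 8000):.0f} MiB")
```

The two runs land just below and just above 8 GiB (8192 MiB), which lines up with the first fitting and the second failing.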
@Anstinus

Anstinus commented Dec 5, 2021

Got the same issue.

I'm using the C API. When the training dataset is too large, the program simply crashes when I call 'LGBM_BoosterCreate()'.

With 7.5GB data, it works fine:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 830854
[LightGBM] [Info] Number of data points in the train set: 2413300, number of used features: 3273
[LightGBM] [Info] Using requested OpenCL platform 0 device 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 3090, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 3273 dense feature groups (7539.72 MB) transferred to GPU in 2.319649 secs. 0 sparse feature groups
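
As a cross-check (my arithmetic, not LightGBM output): the logged 7539.72 MB is close to rows × dense features × 1 byte, consistent with one byte per binned value for the 256-bin kernel; the small gap is presumably padding or per-group overhead:

```python
# Cross-check the logged transfer size: rows x dense features x 1 byte.
# The 1-byte-per-value figure is an assumption (a 256-bin index fits in 8 bits).
rows, dense_features = 2_413_300, 3_273
mb = rows * dense_features / 1024**2
print(f"{mb:.2f} MiB")  # close to the logged 7539.72 MB
```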

If I add more data (an estimated ~8.9 GB), it crashes before reaching that last line:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 939820
[LightGBM] [Info] Number of data points in the train set: 2413300, number of used features: 3702
[LightGBM] [Info] Using requested OpenCL platform 0 device 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 3090, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
=> Crashes here; no further log output.

And it happens whether I increase the number of features or the number of data points, as long as the total size exceeds ~8 GB.
I'm using:
LightGBM v3.3.1 C API
CUDA 11.5.56
RTX 3090 with 24 GB memory

@chixujohnny

I have this same problem!!

My company has 2 types of GPU: V100-32G and A100-40G.

When X_train is larger than shape=(8,500,000, 1000), LightGBM-GPU hits the same problem. It looks like an OOM, but I'm not sure. The GPU memory usage is about 8.3 GB, the same as yours.
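
That shape (8.5 million rows × 1000 features) fits the same pattern. Assuming ~1 byte per binned value (my assumption, not a documented constant), the matrix is around 7.9 GiB, right in the range where the other reports start failing:

```python
# Size of an 8.5M-row x 1000-feature binned matrix at ~1 byte per value
# (the byte-per-value figure is an assumption, not a documented constant).
rows, features = 8_500_000, 1_000
gib = rows * features / 1024**3
print(f"{gib:.2f} GiB")
```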

@Anstinus

I see this bug was fixed by #4928.
I verified it on my machine with more than 9 GB of GPU memory, and it worked smoothly.
I think this issue can be closed now.

@StrikerRUS
Collaborator

@Anstinus

I verified it on my machine with more than 9GB GPU memory and it worked smoothly.

Thank you so much for getting back and sharing this observation!

@111qqz

111qqz commented Jan 31, 2023

May I ask which release includes this fix? I'm using v3.3.5, and the problem can still be reproduced.

@jameslamb
Collaborator

It will be in release v4.0.0. You can follow #5153 to be notified when that release is published.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023