
[GPU/OpenCL] Can use at most 8.3GB GPU memory #4480

Closed
xingyuansun opened this issue Jul 19, 2021 · 7 comments

@xingyuansun

Hi, thanks for the package! I noticed an issue similar to #3899. I am using LightGBM 3.2.1 with an NVIDIA Tesla V100-SXM2-16GB (16 GB memory). Running the code below, the first iteration takes only 8299 MB of GPU memory, so the second one should also fit, since the number of data points increases only slightly. Instead, it fails with the error message below. Could someone let me know if there is an internal memory limit in the LightGBM library? Thanks very much for the help!

Code:

import lightgbm as lgb
import numpy as np


if __name__ == '__main__':
    for n_data in [1040000, 1100000]:
        n_feature = 8000
        print(n_data, n_feature)
        x = np.random.rand(n_data, n_feature)
        y = np.random.rand(n_data)
        model = lgb.LGBMRegressor(device_type='gpu')
        model.fit(X=x, y=y, eval_set=[(x, y)], verbose=1)

Error message:

terminate called after throwing an instance of 'boost::wrapexcept<boost::compute::opencl_error>'
  what():  Memory Object Allocation Failure
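
For a rough sense of scale, here is a back-of-the-envelope estimate of the dense feature matrix size for both runs. The 1-byte-per-value figure and the helper name `est_gpu_mb` are my assumptions (the GPU trainer bins features, and a 256-bin index fits in one byte), not part of LightGBM's API:

```python
# Back-of-the-envelope GPU memory estimate for the binned dense feature matrix.
# Assumption (mine, not LightGBM's documented behavior): ~1 byte per
# (row, feature) value, since a 256-bin index fits in 8 bits.
def est_gpu_mb(n_rows, n_features, bytes_per_value=1):
    """Estimated size of the binned dense feature matrix in MiB."""
    return n_rows * n_features * bytes_per_value / 1024**2

for n_data in (1_040_000, 1_100_000):
    print(n_data, f"{est_gpu_mb(n_data, 8000):.0f} MiB")
```

The two runs land just below and just above 8 GiB (8192 MiB), which lines up with the first fitting and the second failing.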
@Anstinus

Anstinus commented Dec 5, 2021

Got the same issue.

I'm using the C API. When the training dataset is too large, the program simply crashes when I call 'LGBM_BoosterCreate()'.

With 7.5GB data, it works fine:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 830854
[LightGBM] [Info] Number of data points in the train set: 2413300, number of used features: 3273
[LightGBM] [Info] Using requested OpenCL platform 0 device 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 3090, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 3273 dense feature groups (7539.72 MB) transferred to GPU in 2.319649 secs. 0 sparse feature groups
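
As a cross-check (my arithmetic, not LightGBM output): the logged 7539.72 MB is close to rows × dense features × 1 byte, consistent with one byte per binned value for the 256-bin kernel; the small gap is presumably padding or per-group overhead:

```python
# Cross-check the logged transfer size: rows x dense features x 1 byte.
# The 1-byte-per-value figure is an assumption (a 256-bin index fits in 8 bits).
rows, dense_features = 2_413_300, 3_273
mb = rows * dense_features / 1024**2
print(f"{mb:.2f} MiB")  # close to the logged 7539.72 MB
```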

If I add more data (an estimated ~8.9 GB), it crashes before reaching that last line:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 939820
[LightGBM] [Info] Number of data points in the train set: 2413300, number of used features: 3702
[LightGBM] [Info] Using requested OpenCL platform 0 device 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 3090, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
=> Crashes here; no further log output.

And it happens whether I increase the number of features or the number of data points, as long as the total size exceeds ~8 GB.
I'm using:
LightGBM v3.3.1 C API
CUDA 11.5.56
RTX 3090 with 24 GB memory

@chixujohnny

I have this same problem!!

My company has 2 types of GPU: V100-32G and A100-40G.

When X_train is larger than shape=(8,500,000, 1000), LightGBM-GPU hits the same problem. It looks like an OOM, but I'm not sure. The GPU memory usage is about 8.3 GB, the same as yours.
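
That shape (8.5 million rows × 1000 features) fits the same pattern. Assuming ~1 byte per binned value (my assumption, not a documented constant), the matrix is around 7.9 GiB, right in the range where the other reports start failing:

```python
# Size of an 8.5M-row x 1000-feature binned matrix at ~1 byte per value
# (the byte-per-value figure is an assumption, not a documented constant).
rows, features = 8_500_000, 1_000
gib = rows * features / 1024**3
print(f"{gib:.2f} GiB")
```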

@Anstinus

I see this bug was fixed by #4928.
I verified it on my machine with more than 9 GB of GPU memory, and it worked smoothly.
I think this issue can be closed now.

@StrikerRUS
Collaborator

@Anstinus

I verified it on my machine with more than 9GB GPU memory and it worked smoothly.

Thank you so much for getting back and sharing this observation!

@111qqz

111qqz commented Jan 31, 2023

May I ask which release includes this fix? I'm using v3.3.5, and the problem can still be reproduced.

@jameslamb
Collaborator

It will be in release v4.0.0. You can follow #5153 to be notified when that release is published.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023