Training many models with gpu_hist in Optuna yields ‘parallel_for failed: out of memory’ #6225
Let me dive into Optuna. In the meantime, could you please share the shape of your data?
Thanks for the effort.
Any update on this issue?
Could you please provide a more complete script that I can run? I can't guess your configuration.
This is my code, which stops after 28 rounds with the errors stated above.
Running the script; will see what happens later.
Are you sure that your script is correct? I'm not familiar with Optuna, but this message seems abnormal:
Hmm, never saw that before. I will check and let you know.
So, the code above works for me until it hits the stated error.
Still giving me the same error (in round i=31):
I was able to get a different error after 8 hours...
Do you think it could be hardware-related?
I can't make any guess at the moment.
I have been facing similar issues when running xgb.cv with Optuna. On closer inspection, I saw that this was because the GPU was going out of memory. This was confirmed when I lowered the cv to 2 folds (which fits two instances of my data on my GPU) and removed the n_jobs=-1 flag (no parallelization); it then ran without issues. So most probably Optuna is trying to train multiple models in parallel and the GPU runs out of memory in that case.
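To illustrate, here is a minimal sketch of running the trials sequentially so that only one gpu_hist model is resident on the GPU at a time. The dataset, parameter ranges, and trial count are hypothetical stand-ins, not the original configuration.

```python
import optuna
import xgboost as xgb
from sklearn.datasets import make_classification

# Hypothetical toy data standing in for the real training set.
X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

def objective(trial):
    params = {
        "tree_method": "gpu_hist",
        "objective": "binary:logistic",
        "eval_metric": "logloss",
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "eta": trial.suggest_float("eta", 1e-3, 0.3, log=True),
    }
    # A small nfold keeps the number of GPU-resident copies of the data low.
    cv_results = xgb.cv(params, dtrain, num_boost_round=100, nfold=2)
    return cv_results["test-logloss-mean"].iloc[-1]

study = optuna.create_study(direction="minimize")
# n_jobs=1 runs the trials sequentially, so only one model lives on the GPU at a time.
study.optimize(objective, n_trials=50, n_jobs=1)
```

Optuna's `study.optimize` defaults to `n_jobs=1`, so the GPU contention usually comes from explicitly passing `n_jobs=-1`; keeping trials sequential (or otherwise serializing GPU access) avoids stacking several trials' data on the device at once.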
I am running into this issue too.
I have a similar issue using scikit-learn RFE and hyperopt when trying to run multiple iterations of GPU model training. Is there any way to make the sklearn API garbage-collect the GPU memory?
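As far as I know there is no dedicated sklearn-API call for this; the commonly reported workaround is to drop all references to the fitted model between iterations and force a garbage-collection pass, which releases the GPU memory held by the underlying Booster. A minimal sketch with hypothetical toy data:

```python
import gc

import xgboost as xgb
from sklearn.datasets import make_classification

# Hypothetical toy data standing in for the real feature matrix.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

for _ in range(100):
    model = xgb.XGBClassifier(tree_method="gpu_hist", n_estimators=100)
    model.fit(X, y)
    # ... score the model and record the results here ...
    # Drop every reference to the model so the underlying Booster (and the
    # GPU memory it holds) can be reclaimed before the next iteration.
    del model
    gc.collect()
```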
Hi, I am having an issue with XGBClassifier on GPU running out of memory, and I tried to implement a workaround by saving the model, deleting it, and loading it back in (see the sketch at the end of this comment).
I am on xgb 1.3.0 and the models are very small. I am running a hyperparameter optimization with Optuna with a 1000x bootstrapping CV in each iteration. After 50-120 Optuna iterations, it throws the error:
and
Looking at nvidia-smi, it only uses a constant ~210 MB… (RTX TITAN)
My parameter space looks like this:
I thought this was related to issue #4668, but I am not sure about that anymore.
BTW, everything works fine running the same code on the CPU. Other libraries like RAPIDS cuML work fine on the GPU.
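For reference, a minimal sketch of the save/delete/reload workaround mentioned above; the file name and toy data are hypothetical, and the real code runs inside the Optuna/bootstrap loop.

```python
import gc

import xgboost as xgb
from sklearn.datasets import make_classification

# Hypothetical toy data; the real pipeline trains inside the bootstrapped CV.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

model = xgb.XGBClassifier(tree_method="gpu_hist")
model.fit(X, y)

# Persist the trained model, then drop the Python object so the GPU memory
# held by its Booster is released.
model.save_model("model.json")
del model
gc.collect()

# Reload the model from disk when it is needed again.
model = xgb.XGBClassifier()
model.load_model("model.json")
```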