
Xgboost GPU models do not release memory after training #3045

Closed
tRosenflanz opened this issue Jan 17, 2018 · 14 comments
@tRosenflanz

tRosenflanz commented Jan 17, 2018

Xgboost doesn't release GPU memory after training or predicting on large data.
Every further rerun of .fit allocates more memory until the kernel eventually crashes because the GPU runs out of memory.

Environment info

Operating System: Ubuntu 16.04 on PowerPC

Compiler:

Package used (python/R/jvm/C++): python

xgboost version used:

If installing from source, please provide

  1. The commit hash (git rev-parse HEAD) 84ab74f

If you are using python package, please provide

  1. The python version and distribution: Python 2.7.12

Steps to reproduce

The following code should not cause issues, but it runs out of memory if you run it twice. You may have to decrease the repeat factor for the data depending on how much GPU memory you have (16 GB in my case).

import numpy as np
import xgboost as xgb
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.datasets import dump_svmlight_file
from sklearn.externals import joblib
from sklearn.metrics import precision_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# use DMatrix for xgboost
dtrain = xgb.DMatrix(X_train.repeat(300000,axis=0), label=y_train.repeat(300000))
dtest = xgb.DMatrix(X_test.repeat(300000,axis=0), label=y_test.repeat(300000))

# set xgboost params
param = {
    'tree_method': 'gpu_exact',
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # the training step for each iteration
    'silent': 1,  # logging mode - quiet
    'objective': 'multi:softprob',  # error evaluation for multiclass training
    'num_class': 3,  # the number of classes that exist in this dataset
    'n_jobs': 10}
num_round = 20  # the number of training iterations

#------------- numpy array ------------------
# training and testing - numpy matrices
bst = xgb.train(param, dtrain, num_round)
preds = bst.predict(dtest)

tRosenflanz changed the title from "Xgboost models do not release memory after training" to "Xgboost GPU models do not release memory after training" on Jan 17, 2018
@RAMitchell
Member

RAMitchell commented Jan 19, 2018

Can you try calling bst.__delete__() after each round? Python is garbage collected so it may keep the booster object around. If the error persists after this then it may be a bug.
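For concreteness, a minimal sketch of this suggestion, reusing the dtrain/dtest/num_round variables from the repro above; param_sets is a hypothetical list of parameter dicts, and del bst invokes the booster's __del__ hook:

import gc
import xgboost as xgb

for params in param_sets:  # hypothetical list of parameter dicts
    bst = xgb.train(params, dtrain, num_round)
    preds = bst.predict(dtest)
    del bst        # drop the booster so its GPU allocations can be freed
    gc.collect()   # force collection rather than waiting for the garbage collector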

@RAMitchell
Member

Closing as no response. Can reopen if the issue persists.

@tRosenflanz
Author

Sorry, I haven't gotten around to testing it on the original system. I will give it a try and see what happens.

@tRosenflanz
Author

tRosenflanz commented Jan 23, 2018

Okay, I have called bst.__del__() and it seems to work. Two things to note:

  • If bst.__del__() is called before .predict, the kernel dies and the core is dumped (it makes sense that this won't work, but I assume the kernel death could be prevented by some check).
  • The Booster keeps the training data on the GPU until you call __del__(), which means that if your training + inference data exceed GPU memory you will get an OOM even though the individual datasets might fit into memory. That seems limiting, since there is no need to keep the training data in GPU memory after training is completed. The .predict() method, on the other hand, purges the data after the call.

This raises a question - is there any way to purge the data off the GPU but keep the trained model?

P.S. I am by no means an expert in how things are handled in this amazing package. I will understand if it is necessary to keep the training data after .fit is complete.

@RAMitchell
Member

Saving the model, deleting the booster, and then loading the model again should achieve this.
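A minimal sketch of that save/delete/reload sequence, reusing the variables from the repro above (the file name 'model.bin' is arbitrary):

import gc
import xgboost as xgb

bst = xgb.train(param, dtrain, num_round)
bst.save_model('model.bin')                # persist the trained model to disk
del bst                                    # drop the booster holding GPU memory
gc.collect()

bst = xgb.Booster(model_file='model.bin')  # reload the model without the training data
preds = bst.predict(dtest)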

@tRosenflanz
Author

Sounds good, thanks for the help!

@jpbowman01

jpbowman01 commented Feb 14, 2018

I am having what appears to be the same problem, but using R. I'm not sure what the equivalent of "deleting the booster" in R would be, since what is returned in R is considered a model object. There also does not appear to be a close match to the bst.__del__() call in Python. Any suggestions for what might work in a similar manner to purge the data off the GPU would be much appreciated.

Since this is a closely-related issue, I'm hoping to piggyback on this ticket rather than opening a nearly-duplicate ticket.

@khotilov
Member

@jpbowman01 "deleting the booster" in R would be

rm(bst)
gc()

@aliyesilkanat

I have the same problem.
I tried the delete trick, but it does not work:

bst.__delete__()

'Booster' object has no attribute '__delete__'

@se-l

se-l commented May 17, 2018

@aliyesilkanat there is a typo above; it needs to be bst.__del__().

@se-l

se-l commented May 17, 2018

Nonetheless, it is not working for me: single process, applying .__del__(), and I also see in nvidia-smi that the GPU memory is being cleared, yet I still always run into this issue, even predictably. I have compiled with different NVIDIA drivers, GCCs, Linux headers, and CMake versions. I don't understand why this issue is closed.

@caolanko

@se-l, I had the same problem and was able to solve it by calling gc.collect() after the __del__() call.

@aviolov

aviolov commented Sep 26, 2018

I also have this problem on a Windows machine, with xgboost 0.7 and tree_method='gpu_hist'. The GPU memory does not get released if, for example, xgbRegressor.fit finishes successfully but some post-processing results in a Python error.

del xgbRegressor
gc.collect()

does not seem to release the GPU memory (but a kernel restart does).

@vss888

vss888 commented Dec 12, 2018

Trying to call bst.__del__(), I get an exception:

'XGBRegressor' object has no attribute '__del__'

I run my models with {'predictor':'cpu_predictor'} (in part due to issue #3756), and so would like to free GPU memory as soon as training is finished. That way I would be able to test more hyper-parameter sets in parallel.
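For the sklearn wrapper, a possible workaround along the same lines as the booster case; this is only a sketch, assuming a version that exposes get_booster() and forwards tree_method/predictor as booster parameters, and the file name is arbitrary. It reuses X_train and y_train from the repro above: save the underlying booster, drop the regressor, and collect garbage before the next fit.

import gc
import xgboost as xgb

model = xgb.XGBRegressor(tree_method='gpu_hist', predictor='cpu_predictor')
model.fit(X_train, y_train)

model.get_booster().save_model('model.bin')  # keep the trained model on disk
del model                                    # drop the wrapper and its booster
gc.collect()                                 # encourage the GPU allocations to be freed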

lock bot locked as resolved and limited conversation to collaborators Mar 12, 2019