
Xgboost GPU models do not release memory after training #3045

Closed
tRosenflanz opened this issue Jan 17, 2018 · 14 comments
@tRosenflanz

tRosenflanz commented Jan 17, 2018

Xgboost doesn't release GPU memory after training or predicting on large data.
Every further rerun of .fit allocates more memory until the kernel eventually crashes because the GPU runs out of memory.

Environment info

Operating System: Ubuntu 16.04 on PowerPC

Compiler:

Package used (python/R/jvm/C++): python

xgboost version used:

If installing from source, please provide

  1. The commit hash (git rev-parse HEAD) 84ab74f

If you are using python package, please provide

  1. The python version and distribution: Python 2.7.12

Steps to reproduce

The following code should not cause issues, but it runs out of memory if you run it twice. You may have to decrease the repeat factor for the data depending on how much GPU memory you have (16 GB in my case).

import numpy as np
import xgboost as xgb
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.datasets import dump_svmlight_file
from sklearn.externals import joblib
from sklearn.metrics import precision_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# use DMatrix for xgboost
dtrain = xgb.DMatrix(X_train.repeat(300000,axis=0), label=y_train.repeat(300000))
dtest = xgb.DMatrix(X_test.repeat(300000,axis=0), label=y_test.repeat(300000))

# set xgboost params
param = {
    'tree_method': 'gpu_exact',
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # the training step for each iteration
    'silent': 1,  # logging mode - quiet
    'objective': 'multi:softprob',  # error evaluation for multiclass training
    'num_class': 3,  # the number of classes that exist in this dataset
    'n_jobs': 10}
num_round = 20  # the number of training iterations

#------------- numpy array ------------------
# training and testing - numpy matrices
bst = xgb.train(param, dtrain, num_round)
preds = bst.predict(dtest)

tRosenflanz changed the title from "Xgboost models do not release memory after training" to "Xgboost GPU models do not release memory after training" on Jan 17, 2018
@RAMitchell
Member

RAMitchell commented Jan 19, 2018

Can you try calling bst.__delete__() after each round? Python is garbage collected so it may keep the booster object around. If the error persists after this then it may be a bug.
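For concreteness, a minimal sketch of this suggestion, reusing the dtrain/dtest/num_round variables from the repro above; param_sets is a hypothetical list of parameter dicts, and del bst invokes the booster's __del__ hook:

import gc
import xgboost as xgb

for params in param_sets:  # hypothetical list of parameter dicts
    bst = xgb.train(params, dtrain, num_round)
    preds = bst.predict(dtest)
    del bst        # drop the booster so its GPU allocations can be freed
    gc.collect()   # force collection rather than waiting for the garbage collector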

@RAMitchell
Member

Closing as no response. Can reopen if the issue persists.

@tRosenflanz
Author

Sorry, I haven't gotten around to testing it on the original system. I will give it a try and see what happens.

@tRosenflanz
Author

tRosenflanz commented Jan 23, 2018

Okay, I have called bst.__del__() and it seems to work. Two things to note:

  • If bst.__del__() is called before .predict, the kernel dies and the core is dumped (it makes sense that this won't work, but I assume the kernel death could be prevented by some check).
  • The Booster keeps the training data on the GPU until you call __del__(), which means that if your training + inference data exceed GPU memory you will get an OOM even though the individual datasets might fit into memory. That seems limiting, since there is no need to keep the training data in GPU memory after training is completed. The .predict() method, on the other hand, purges the data after the call.

This raises a question - is there any way to purge the data off the GPU but keep the trained model?

P.S. I am by no means an expert in how things are handled in this amazing package. I will understand if it is necessary to keep the training data after .fit is complete.

@RAMitchell
Member

Saving the model, deleting the booster, and then loading the model again should achieve this.
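A minimal sketch of that save/delete/reload sequence, reusing the variables from the repro above (the file name 'model.bin' is arbitrary):

import gc
import xgboost as xgb

bst = xgb.train(param, dtrain, num_round)
bst.save_model('model.bin')                # persist the trained model to disk
del bst                                    # drop the booster holding GPU memory
gc.collect()

bst = xgb.Booster(model_file='model.bin')  # reload the model without the training data
preds = bst.predict(dtest)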

@tRosenflanz
Author

Sounds good, thanks for the help!

@jpbowman01

jpbowman01 commented Feb 14, 2018

I am having what appears to be the same problem, but using R. I'm not sure what the equivalent of "deleting the booster" in R would be, since what is returned in R is considered a model object. There also does not appear to be a close match to the bst.__del__() call in Python. Any suggestions for what might work in a similar manner to purge the data off the GPU would be much appreciated.

Since this is a closely-related issue, I'm hoping to piggyback on this ticket rather than opening a nearly-duplicate ticket.

@khotilov
Member

@jpbowman01 "deleting the booster" in R would be

rm(bst)
gc()

@aliyesilkanat

I have the same problem.
I tried the delete trick, but it does not work:

bst.__delete__()

'Booster' object has no attribute '__delete__'

@se-l

se-l commented May 17, 2018

@aliyesilkanat there is a typo above; it needs to be bst.__del__().

@se-l

se-l commented May 17, 2018

Nonetheless, it is not working for me: single process, applying .__del__(), and I also see in nvidia-smi that the GPU memory is being cleared, yet I still always run into this issue, even predictably. I have compiled with different NVIDIA drivers, GCCs, Linux headers, and CMake versions. I don't understand why this issue is closed.

@caolanko

@se-l, I had the same problem and was able to solve it by calling gc.collect() after the __del__() call.

@aviolov

aviolov commented Sep 26, 2018

I also have this problem on a Windows machine, with xgboost 0.7 and tree_method='gpu_hist'. The GPU memory does not get released if, for example, xgbRegressor.fit finishes successfully but some post-processing results in a Python error.

del xgbRegressor
gc.collect()

does not seem to release the GPU memory (but a kernel restart does).

@vss888

vss888 commented Dec 12, 2018

Trying to call bst.__del__(), I get an exception:

'XGBRegressor' object has no attribute '__del__'

I run my models with {'predictor':'cpu_predictor'} (in part due to issue #3756), and so would like to free GPU memory as soon as training is finished. That way I would be able to test more hyper-parameter sets in parallel.
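For the sklearn wrapper, a possible workaround along the same lines as the booster case; this is only a sketch, assuming a version that exposes get_booster() and forwards tree_method/predictor as booster parameters, and the file name is arbitrary. It reuses X_train and y_train from the repro above: save the underlying booster, drop the regressor, and collect garbage before the next fit.

import gc
import xgboost as xgb

model = xgb.XGBRegressor(tree_method='gpu_hist', predictor='cpu_predictor')
model.fit(X_train, y_train)

model.get_booster().save_model('model.bin')  # keep the trained model on disk
del model                                    # drop the wrapper and its booster
gc.collect()                                 # encourage the GPU allocations to be freed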

lock bot locked as resolved and limited conversation to collaborators Mar 12, 2019