
Allow releasing GPU memory #4668

Closed
RAMitchell opened this issue Jul 16, 2019 · 11 comments

Comments

@RAMitchell
Member

One common piece of feedback we receive about the GPU algorithms is that memory is not released after training. It may be possible to release memory by deleting the booster object, but this is not a great user experience.

See
#4018
#3083
#2663
#3045

The reason we have not implemented this already is that the internal C++ code does not actually know when training is finished. The language bindings drive training one iteration at a time, and I don't believe the GPU training code has any way to tell whether another iteration is expected.
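
For context, this is roughly what the Python binding's training loop looks like (a simplified sketch, not the actual implementation); the C++ core only ever sees one boosting round at a time:

import xgboost as xgb

def simple_train(params, dtrain, num_boost_round):
    # Simplified stand-in for xgboost.train: the binding drives boosting
    # round by round, so the core cannot tell which call is the last one.
    bst = xgb.Booster(params, [dtrain])
    for i in range(num_boost_round):
        bst.update(dtrain, i)
    return bst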

I see a few solutions:

  1. We try to use some heuristic internally to decide if it is a good time to free all memory from inside GBTree.
  2. We implement an API function for cleanup. This function could be specific to GPU memory, or it could just be a general hint for xgboost to delete any working memory or temporary data structures. I do not like this option, as it will propagate through the entire code base: the learner, booster, updaters, and predictor would all have to implement these methods.
  3. We implement a method in the language bindings where the booster object serializes itself and then deserializes from disk. Doing this will clear all temporary data structures and should leave the booster in a usable state to resume training or do prediction.

I am leaning towards option 3), but I think it relies on #3980 to make sure all parameters are correctly saved. It may still be possible to do this with the current serialization without any unexpected side effects from parameters not all being saved.
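
On the Python side, a sketch of option 3) could look something like the following (the helper name is illustrative, and it assumes booster pickling keeps working as it does today):

import pickle

def release_training_memory(booster):
    # Round-trip the booster through serialization; the reloaded copy keeps the
    # trained trees but should drop temporary (GPU) training buffers.
    return pickle.loads(pickle.dumps(booster))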

@trivialfis @sriramch @rongou

@trivialfis
Member

Or we could pass num_boost_round to C++?

@seanthegreat7

For those looking for a quick workaround until this is fixed properly, check my solution here.

@trivialfis
Member

@seanthegreat7 Thanks. That's actually an interesting workaround.
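
For anyone who cannot follow the link, here is a minimal sketch of that style of workaround (not the exact linked code; the function names are illustrative): run the GPU training in a short-lived child process so all device memory is released when the process exits, then reload the saved model in the parent.

import multiprocessing as mp

import xgboost as xgb

def _fit_and_save(params, X, y, model_path):
    clf = xgb.XGBClassifier(**params)
    clf.fit(X, y)
    clf.get_booster().save_model(model_path)  # persist before the process exits

def train_in_child_process(params, X, y, model_path='gpu_model.bin'):
    ctx = mp.get_context('spawn')  # fresh process, so no inherited CUDA state
    p = ctx.Process(target=_fit_and_save, args=(params, X, y, model_path))
    p.start()
    p.join()
    booster = xgb.Booster()
    booster.load_model(model_path)  # the parent only ever holds the CPU-side model
    return booster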

@Lauler

Lauler commented Jul 18, 2019

None of the workarounds seem to be working on Windows 10. I tried deleting and reloading the booster object (it still crashed).

I also tried predicting in a subprocess similar to @seanthegreat7 (but for R instead of Python). The subprocess just ran indefinitely without finishing.

A solution for this issue would be greatly appreciated!

@jtromans

I'm finding this very difficult, especially when performing a wide parameter search in a loop of some kind.

For example:

exp_models= []
for cnt, mdl_version in enumerate(range(200)):
    clf = xgb.XGBClassifier(booster='gbtree', objective='binary:logistic', 
                tree_method='gpu_hist', n_gpus=1, gpu_id=1, n_estimators=30) 
    trained_model = clf.fit(X_train, y_train, verbose=False)
    exp_models.append(trained_model)

This will crash, since I guess each trained_model hangs around on the GPU indefinitely. Alternatively, if I use exp_models.append(trained_model.get_booster().copy()), all is well.
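
For clarity, the variant that works for me looks roughly like this:

exp_models = []
for cnt in range(200):
    clf = xgb.XGBClassifier(booster='gbtree', objective='binary:logistic',
                            tree_method='gpu_hist', n_gpus=1, gpu_id=1, n_estimators=30)
    clf.fit(X_train, y_train, verbose=False)
    # Keep a copied Booster rather than the fitted estimator, so the estimator
    # (and whatever it holds on the GPU) is free to be collected.
    exp_models.append(clf.get_booster().copy())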

However, I'm also running into the same issue when submitting numerous jobs via a Dask scheduler (note: not dask-xgboost).

In both cases I eventually get:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: out of memory

I don't have a view on the best solution, but I would love to see this resolved.

@aviolov

aviolov commented Oct 28, 2019

My hack is to do this:

import pickle
import tempfile

import xgboost

xgbPredictor = xgboost.XGBRegressor(**self.xgb_params)
xgbPredictor.fit(Xs, ys)

# This hack should only be used if tree_method is gpu_hist or gpu_exact.
if self.xgb_params['tree_method'].startswith('gpu'):
    # Round-trip the fitted model through pickle; the reloaded copy no longer
    # holds on to the GPU-side training buffers.
    with tempfile.TemporaryFile() as dump_file:
        pickle.dump(xgbPredictor, dump_file)
        dump_file.seek(0)
        self.predictor_ = pickle.load(dump_file)
else:
    self.predictor_ = xgbPredictor

and it has solved my GPU memory leak.

@paantya

paantya commented Mar 2, 2020

Wouldn't it be easier to implement a function like the one in PyTorch?
For example:
torch.cuda.empty_cache()

@trivialfis
Member

It wouldn't be easier, but that's an option.

@paantya

paantya commented Mar 5, 2020

@trivialfis Do you (or someone else) plan to fix this problem at all?
Is it not like this in Dask and Spark?

@maxmetzger

I am running into this same issue when training many small gpu_hist models.

@trivialfis
Member

Could you please open a new issue?
