GPU memory is not released after training with {'predictor':'cpu_predictor'} #4018

Closed
vss888 opened this issue Dec 24, 2018 · 9 comments

@vss888

vss888 commented Dec 24, 2018

It seems reasonable to claim that there is no justifiable reason to keep GPU memory allocated after training when xgboost is used with predictor='cpu_predictor' (please correct me if I am wrong), so I was wondering if you could put this on the list of features to be implemented.

It would make the process of hyper-parameter optimization much more efficient, since the bottleneck (at least in my use case) is the available GPU memory: if GPU memory were released after training, many more models could be trained/tested in parallel on the same GPU.

This is related to issues #3045 and #3083.

@trivialfis
Member

@vss888 This requires us to design and implement a global GPU memory management system inside XGBoost, which is never a trivial thing to do, especially when memory can be distributed across multiple GPUs. Alternatively, we could add a ReleaseMemory method to every component inside XGBoost, which is definitely not preferred given the amount of boilerplate code it would require.

The methods used in deep learning frameworks, which have a distributable computation graph, do not apply to tree boosting algorithms. More thought is needed.

@trivialfis
Member

trivialfis commented Dec 24, 2018

@vss888 I tried to see how bad the situation is on the master branch with a dataset of shape (200000, 3000): it takes up 2 GB of GPU memory during training, but only 159 MB remains allocated after deleting the trained model (in Python, del model). Are you sure that your bottleneck lies in memory usage?
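
Roughly, the kind of check I ran looks like the sketch below (not my exact script; the synthetic data here is smaller so it runs quickly, and memory is read from nvidia-smi):

```python
import gc
import subprocess

import numpy as np
import xgboost as xgb

def gpu_mem_used_mb():
    # Memory currently in use on GPU 0, as reported by nvidia-smi.
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'])
    return int(out.decode().splitlines()[0])

# Synthetic data; the real test used a (200000, 3000) dataset.
X = np.random.rand(200000, 300).astype(np.float32)
y = np.random.rand(200000).astype(np.float32)

model = xgb.XGBRegressor(tree_method='gpu_hist', n_estimators=10)
model.fit(X, y)
print('after training :', gpu_mem_used_mb(), 'MB')

del model
gc.collect()
print('after del model:', gpu_mem_used_mb(), 'MB')
```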

@vss888
Author

vss888 commented Dec 24, 2018

@trivialfis Thank you very much for giving the feature a thought! My hope was that an xgboost model object keeps track of the GPU memory it allocated, and so releasing it would be as easy as calling cudaFree(allocatedGPUChunk) for all chunks.

Looking at my larger data set, the shape is (101750935, 6) (4 features, 1 target, 1 weight) and it uses about 4 GB of GPU memory, which means I can train at most 4 such models in parallel (Tesla P100). Smaller data sets are about 10 times smaller, and then I can train up to 40 such models in parallel. The GPU is used pretty lightly from the computing point of view: the highest utilization I have seen with multiple models trained in parallel was 27%, and most of the time it is 0% or only some single-digit percentage as reported by nvidia-smi.

In the same Python process, a model object is recreated a number of times (currently 10+). I do not call del model; instead I just reassign a new model to the same variable (i.e. model = xgb.XGBRegressor(...)), but I do call import gc; gc.collect() after such a reassignment to (hopefully) release any memory associated with the old model. I will try calling del model before assigning a new model to the variable, but as far as I know, del only decreases the reference count of an object in Python, and the actual destructor is called by the garbage collector.

@vss888
Author

vss888 commented Dec 24, 2018

@trivialfis Plus, I cannot delete a model until all the predictions are finished, and so the GPU memory remains allocated.

One possible solution would be to add a release_gpu_memory(self) method in Python, which would simply create a new model with predictor='cpu_predictor' and then somehow enforce deletion of the old model.
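
Something along these lines is what I have in mind (a rough, untested sketch; release_gpu_memory is hypothetical and not part of the current API, and it assumes Booster.load_model accepts the raw buffer produced by save_raw):

```python
import gc
import xgboost as xgb

def release_gpu_memory(model):
    # Hypothetical helper: copy the trained trees into a fresh booster
    # configured for CPU prediction, then drop the GPU-backed object so its
    # device allocations can be garbage-collected.
    raw = model.get_booster().save_raw()              # serialized trees, host memory
    cpu_booster = xgb.Booster(params={'predictor': 'cpu_predictor'})
    cpu_booster.load_model(bytearray(raw))            # rebuild without touching the GPU
    del model
    gc.collect()
    return cpu_booster
```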

@trivialfis
Member

@vss888

I understand your use cases now. :)

My hope was that an xgboost model object keeps track of the GPU memory it allocated, and so releasing it would be as easy as calling cudaFree(allocatedGPUChunk) for all chunks.

Yap, I thought about that but gave up on the idea. Here is the problem: inside XGBoost, every component allocates memory as needed (objectives, metrics, updaters, ...), and the memory is allocated on different GPUs from different threads.

If you let another class (in the OOP sense) delete the memory, ownership of that memory becomes a problem, especially with multi-threading. If there's a bug caused by use after free, all we get is an issue on GitHub with a segfault message. Such bugs are very hard to prevent, and even harder to debug since many issues do not provide a reproducible script. The simplest way to think about it is to ask how to write a test that shows the code is correct. I can't think of one. :(

I will put more thought into this as I go along with refactoring the current GPU code base. But the simplest workaround for you might be to save the model first and then delete it.
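
For example, something like this sketch of the workaround (it assumes a fitted XGBRegressor named model and a feature matrix X_test already exist):

```python
import gc
import xgboost as xgb

# Persist the trained model, drop the GPU-backed object, and reload it later
# for CPU-side prediction.
model.get_booster().save_model('model.bin')   # `model` assumed to be a fitted XGBRegressor
del model
gc.collect()

cpu_booster = xgb.Booster()
cpu_booster.load_model('model.bin')
preds = cpu_booster.predict(xgb.DMatrix(X_test))   # `X_test` assumed to be your features
```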

@redpoint13

Has there been any progress, or are there any new options?

@jonimatix

Yes, this is a teething problem.

@RAMitchell
Member

Let's continue this discussion in #4668. I think this needs to be a priority for us.

@seanthegreat7

For those looking for a quick workaround until this is fixed properly, check my solution here.
