GPU memory is not released after training with {'predictor':'cpu_predictor'} #4018

Closed
vss888 opened this issue Dec 24, 2018 · 9 comments

@vss888

vss888 commented Dec 24, 2018

It seems reasonable to claim that there is no justifiable reason to keep GPU memory allocated after training when xgboost is used with predictor='cpu_predictor' (please correct me if I am wrong), so I was wondering if you could put this on the list of features to be implemented.

It would make the process of hyper-parameter optimization much more efficient, since the bottleneck (at least in my use case) is the available GPU memory: if GPU memory were released after training, many more models could be trained/tested in parallel on the same GPU.

This is related to issues #3045 and #3083.

@trivialfis
Member

@vss888 This requires us to design and implement a global GPU memory management system inside XGBoost, which is never a trivial thing to do, especially when memory can be distributed across multiple GPUs. Alternatively, we could add a ReleaseMemory method to every component inside XGBoost, which is definitely not preferred given the amount of boilerplate code it would require.

The methods used in deep learning frameworks, which have a distributable computation graph, do not apply to tree boosting algorithms. More thought is needed.

@trivialfis
Member

trivialfis commented Dec 24, 2018

@vss888 I tried to see how bad the situation is on the master branch with a dataset of shape (200000, 3000): it takes up 2 GB of GPU memory during training, but only 159 MB remains allocated after deleting the trained model (in Python, del model). Are you sure that your bottleneck lies in memory usage?
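
Roughly, the kind of check I ran looks like the sketch below (not my exact script; the synthetic data here is smaller so it runs quickly, and memory is read from nvidia-smi):

```python
import gc
import subprocess

import numpy as np
import xgboost as xgb

def gpu_mem_used_mb():
    # Memory currently in use on GPU 0, as reported by nvidia-smi.
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'])
    return int(out.decode().splitlines()[0])

# Synthetic data; the real test used a (200000, 3000) dataset.
X = np.random.rand(200000, 300).astype(np.float32)
y = np.random.rand(200000).astype(np.float32)

model = xgb.XGBRegressor(tree_method='gpu_hist', n_estimators=10)
model.fit(X, y)
print('after training :', gpu_mem_used_mb(), 'MB')

del model
gc.collect()
print('after del model:', gpu_mem_used_mb(), 'MB')
```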

@vss888
Author

vss888 commented Dec 24, 2018

@trivialfis Thank you very much for giving the feature a thought! My hope was that an xgboost model object keeps track of the GPU memory it allocated, and so releasing it would be as easy as calling cudaFree(allocatedGPUChunk) for all chunks.

Looking at my larger data set, the shape is (101750935, 6) (4 features, 1 target, 1 weight) and it uses about 4 GB of GPU memory, which means I can train at most 4 such models in parallel (Tesla P100). Smaller data sets are about 10 times smaller, and then I can train up to 40 such models in parallel. The GPU is used pretty lightly from the computing point of view: the highest utilization I have seen with multiple models trained in parallel was 27%, and most of the time it is 0% or only some single-digit percentage as reported by nvidia-smi.

In the same Python process, a model object is recreated a number of times (currently 10+). I do not call del model; instead I just reassign a new model to the same variable (i.e. model = xgb.XGBRegressor(...)), but I do call import gc; gc.collect() after such a reassignment to (hopefully) release any memory associated with the old model. I will try calling del model before assigning a new model to the variable, but as far as I know, del only decreases the reference count of an object in Python, and the actual destructor is called by the garbage collector.

@vss888
Author

vss888 commented Dec 24, 2018

@trivialfis Plus, I cannot delete a model until all the predictions are finished, and so the GPU memory remains allocated.

One possible solution would be to add a release_gpu_memory(self) method in Python, which would simply create a new model with predictor='cpu_predictor' and then somehow enforce deletion of the old model.
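
Something along these lines is what I have in mind (a rough, untested sketch; release_gpu_memory is hypothetical and not part of the current API, and it assumes Booster.load_model accepts the raw buffer produced by save_raw):

```python
import gc
import xgboost as xgb

def release_gpu_memory(model):
    # Hypothetical helper: copy the trained trees into a fresh booster
    # configured for CPU prediction, then drop the GPU-backed object so its
    # device allocations can be garbage-collected.
    raw = model.get_booster().save_raw()              # serialized trees, host memory
    cpu_booster = xgb.Booster(params={'predictor': 'cpu_predictor'})
    cpu_booster.load_model(bytearray(raw))            # rebuild without touching the GPU
    del model
    gc.collect()
    return cpu_booster
```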

@trivialfis
Member

@vss888

I understand your use cases now. :)

My hope was that an xgboost model object keeps track of the GPU memory it allocated, and so releasing it would be as easy as calling cudaFree(allocatedGPUChunk) for all chunks.

Yap, I thought about that but gave up on the idea. Here is the problem: inside XGBoost, every component allocates memory as needed (objectives, metrics, updaters, ...), and the memory is allocated on different GPUs from different threads.

If you let another class (in the OOP sense) delete the memory, ownership of that memory becomes a problem, especially with multi-threading. If there's a bug caused by use after free, all we get is an issue on GitHub with a segfault message. Such bugs are very hard to prevent, and even harder to debug since many issues do not provide a reproducible script. The simplest way to think about it is to ask how to write a test that shows the code is correct. I can't think of one. :(

I will put more thought into this as I go along with refactoring the current GPU code base. But the simplest workaround for you might be to save the model first and then delete it.
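
For example, something like this sketch of the workaround (it assumes a fitted XGBRegressor named model and a feature matrix X_test already exist):

```python
import gc
import xgboost as xgb

# Persist the trained model, drop the GPU-backed object, and reload it later
# for CPU-side prediction.
model.get_booster().save_model('model.bin')   # `model` assumed to be a fitted XGBRegressor
del model
gc.collect()

cpu_booster = xgb.Booster()
cpu_booster.load_model('model.bin')
preds = cpu_booster.predict(xgb.DMatrix(X_test))   # `X_test` assumed to be your features
```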

@redpoint13

Has there been any progress, or are there any new options?

@jonimatix

Yes, this is a teething problem.

@RAMitchell
Member

Let's continue this discussion in #4668. I think this needs to be a priority for us.

@seanthegreat7

For those looking for a quick workaround until this is fixed properly, check my solution here.
