GPU out of memory when using SupervisedTrainer in a loop #3423
Comments
Hi @tao558, thanks for your experiments and feedback. Thanks in advance.
I don't think so. Replacing the trainer with a classic PyTorch training loop doesn't give out-of-memory issues.
Meaning just comment out …?
Interestingly, setting …
Hi @tao558, I also don't understand why you created the optimizer inside the loop. Thanks.
This is all in the context of k-fold cross validation: I need to make a new optimizer for each fold, a new trainer to run the training, and to run that trainer once per fold. Moving the optimizer outside of the loop causes a memory leak anyway.
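For context, the per-fold setup described above can be sketched with a plain index-partitioning helper. This is a minimal sketch: the k_fold_splits helper and the build_for_fold step are illustrative assumptions, not MONAI API.

```python
import random

def k_fold_splits(indices, k):
    """Yield (train, val) index lists for k-fold cross validation."""
    indices = list(indices)
    random.shuffle(indices)
    # Deal indices round-robin into k folds of (near) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, val

# Each fold gets a fresh model, optimizer, and trainer, as described above:
for train_idx, val_idx in k_fold_splits(range(10), k=5):
    # model, optimizer, trainer = build_for_fold(train_idx, val_idx)  # hypothetical
    # trainer.run()
    pass
```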
I think what's happening is that the instances of the trainer being created still aren't getting cleaned up after each loop. Once you assign a new value to the variable, the old trainer may linger until the garbage collector runs. We have a garbage collector handler that may work, but you probably want to collect before each new iteration. PyTorch also likes to cache things a lot, so calling the CUDA cache-clearing function may help too.

If that still doesn't work, then this comes down to a memory-retention problem within PyTorch that I've observed in the past and haven't resolved. I've seen this in notebooks: after clearing local variables, PyTorch still doesn't give back 100% of its allocated VRAM. This may be related to how Jupyter keeps history internally, but it may also be the way PyTorch internals keep memory.

The solution then is to run each iteration in a separate process: have a shell script call your script as many times as you like, passing in the iteration number, and store results in numbered files. This is more complex and requires loading your dataset each time, but it will isolate runs from one another and ensure cleanup.
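The cleanup pattern suggested above can be demonstrated without MONAI. The Trainer class below is a hypothetical stand-in (not the real SupervisedTrainer): its handlers hold a back-reference to the trainer, creating exactly the kind of reference cycle that plain refcounting cannot free, so the object is only reclaimed promptly when the name is dropped and the cycle collector runs.

```python
import gc
import weakref

class Trainer:
    """Stand-in for a trainer whose event handlers keep a back-reference,
    forming a reference cycle that refcounting alone cannot break."""
    def __init__(self):
        self.handlers = [lambda: self]  # closure captures self -> cycle

refs = []
for fold in range(3):
    trainer = Trainer()
    refs.append(weakref.ref(trainer))
    # ... trainer.run() would go here ...
    del trainer    # drop the last name referring to the trainer
    gc.collect()   # break the cycle now instead of at an arbitrary time
    # With PyTorch you would also call torch.cuda.empty_cache() here
    # to return cached allocator blocks to the driver.

print(all(r() is None for r in refs))  # → True: every trainer was reclaimed
```

Without the explicit gc.collect(), the cycles would keep each trainer (and anything it references, such as GPU tensors) alive until the collector happens to run.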
Thanks for the detailed response, I'll give your suggestions a shot. Is there a better way to do k-fold cross validation with MONAI that I'm unaware of? Thanks again.
Hi @tao558, I think maybe you can leverage something in this tutorial. Thanks.
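The process-isolation approach suggested earlier (one fold per process, so the driver reclaims all GPU memory when the process exits) can be sketched as below. The train_fold.py script and its arguments are hypothetical; a stand-in command is used so the sketch runs anywhere.

```python
import subprocess
import sys

for fold in range(3):
    # In practice you would launch a hypothetical per-fold script:
    #   subprocess.run([sys.executable, "train_fold.py", "--fold", str(fold)], check=True)
    # Stand-in command so this sketch is self-contained and runnable:
    result = subprocess.run(
        [sys.executable, "-c", f"print('fold {fold} done')"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip())
```

Each child process loads the dataset and builds its own trainer, writes results to a numbered file, and exits; the operating system then guarantees that all of its memory, including VRAM, is released.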
Describe the bug
Constructing and running a SupervisedTrainer in a loop eventually leads to GPU out of memory. See the example below.

To Reproduce
Around the 4th iteration or so, I get RuntimeError: CUDA out of memory. If this doesn't happen for anyone trying out the example, try increasing the NUM_IMAGES variable or the number of iterations of the loop. I know that there are a few common causes for out-of-memory issues in PyTorch, outlined here, but I can't really find where I'm doing any of these things. I've tried calling del trainer and moving the initialization of the model inside the loop and deleting it afterwards. Calling gc.collect() works, which makes me think that there is some kind of circular reference holding up the garbage collection. I'm not convinced that this isn't user error, though.

Environment
ubuntu 18.04, python 3.8
Ensuring you use the relevant python executable, please paste the output of:
================================
Printing MONAI config...
MONAI version: 0.8.0
Numpy version: 1.21.2
Pytorch version: 1.10.0+cu102
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 714d00d
Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: NOT INSTALLED or UNKNOWN VERSION.
scikit-image version: 0.18.3
Pillow version: 8.4.0
Tensorboard version: NOT INSTALLED or UNKNOWN VERSION.
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.10.1+cu102
tqdm version: 4.62.3
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: NOT INSTALLED or UNKNOWN VERSION.
pandas version: 1.3.3
einops version: NOT INSTALLED or UNKNOWN VERSION.
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
================================
Printing system config...
psutil required for print_system_info
================================
Printing GPU config...
Num GPUs: 1
Has CUDA: True
CUDA version: 10.2
cuDNN enabled: True
cuDNN version: 7605
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70']
GPU 0 Name: Quadro T2000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 16
GPU 0 Total memory (GB): 3.8
GPU 0 CUDA capability (maj.min): 7.5
Additional context
Originally used for k-fold cross validation.