
GPU out of memory when using SupervisedTrainer in a loop #3423

Closed

tao558 opened this issue Nov 30, 2021 · 9 comments


tao558 commented Nov 30, 2021

Describe the bug
Constructing and running a SupervisedTrainer in a loop eventually leads to the GPU running out of memory. See the example below.

To Reproduce

import segmentation_models_pytorch as smp
import torch
from torch import optim, nn
from monai.engines import SupervisedTrainer
from monai.data import DataLoader, ArrayDataset
import gc


NETWORK_INPUT_SHAPE = (1, 256, 256)
NUM_IMAGES = 50

def get_xy():
    xs = [256 * torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    ys = [torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    return xs, ys


def get_data_loader():
    x, y = get_xy()
    dataset = ArrayDataset(x, seg=y)
    loader = DataLoader(dataset, batch_size=16)
    return loader


def get_model():
    return smp.Unet(
        encoder_weights="imagenet", in_channels=1, classes=2, activation=None
    )

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    train_loader = get_data_loader()
    model = get_model()

    for i in range(50):
        print(f"On iteration {i}")

        model.to(device)
        optimizer = optim.Adam(model.parameters())

        trainer = SupervisedTrainer(
            device=device,
            max_epochs=10,
            train_data_loader=train_loader,
            network=model,
            optimizer=optimizer,
            loss_function=nn.CrossEntropyLoss(),
            prepare_batch=lambda batchdata, device, non_blocking: (
                batchdata[0].to(device),
                batchdata[1].squeeze(1).to(device, dtype=torch.long),
            ),
        )

        trainer.run()
        # gc.collect()

Around the 4th iteration or so, I get RuntimeError: CUDA out of memory. If this doesn't happen for anyone trying out the example, try increasing the NUM_IMAGES variable or the number of iterations of the loop. I know that there are a few common causes of out-of-memory issues in PyTorch, outlined here, but I can't find where I'm doing any of these things. I've tried calling del trainer, and I've tried moving the initialization of the model inside the loop and deleting it afterwards. Calling gc.collect() works, which makes me think that there is some kind of circular reference holding up the garbage collection. I'm not convinced that this isn't user error, though.

Environment
ubuntu 18.04, python 3.8

Ensuring you use the relevant python executable, please paste the output of:

python -c 'import monai; monai.config.print_debug_info()'

================================
Printing MONAI config...

MONAI version: 0.8.0
Numpy version: 1.21.2
Pytorch version: 1.10.0+cu102
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 714d00d

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: NOT INSTALLED or UNKNOWN VERSION.
scikit-image version: 0.18.3
Pillow version: 8.4.0
Tensorboard version: NOT INSTALLED or UNKNOWN VERSION.
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.10.1+cu102
tqdm version: 4.62.3
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: NOT INSTALLED or UNKNOWN VERSION.
pandas version: 1.3.3
einops version: NOT INSTALLED or UNKNOWN VERSION.
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.

For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================
Printing system config...

psutil required for print_system_info

================================
Printing GPU config...

Num GPUs: 1
Has CUDA: True
CUDA version: 10.2
cuDNN enabled: True
cuDNN version: 7605
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70']
GPU 0 Name: Quadro T2000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 16
GPU 0 Total memory (GB): 3.8
GPU 0 CUDA capability (maj.min): 7.5

Additional context
Originally used for k-fold cross validation.

Nic-Ma (Contributor) commented Dec 1, 2021

Hi @tao558,

Thanks for your experiments and feedback.
Is it possible that there is some memory leaking issue in segmentation_models_pytorch?
Could you please help test again without any network computation?

Thanks in advance.

tao558 (Author) commented Dec 1, 2021

Is it possible that there is some memory leaking issue in segmentation_models_pytorch?

I don't think so. Replacing the trainer with the classic "pytorch training loop" doesn't cause out-of-memory issues.

Could you please help test again without any network computation?

Meaning just comment out trainer.run()? There's no memory leak in that case.
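
For context, the comparison above refers to something like the following plain PyTorch loop, assuming the model, train_loader, optimizer and device from the repro script are in scope; this is a sketch of the pattern, not the exact code that was tested:

loss_fn = nn.CrossEntropyLoss()
for epoch in range(10):
    for img, seg in train_loader:  # ArrayDataset batches come as [image, seg] pairs
        img = img.to(device)
        seg = seg.squeeze(1).to(device, dtype=torch.long)
        optimizer.zero_grad()
        loss = loss_fn(model(img), seg)
        loss.backward()
        optimizer.step()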

tao558 (Author) commented Dec 1, 2021

Interestingly, setting trainer.optimizer = None at the end of the loop reduces the memory overhead substantially. Memory usage is still climbing, but I'm past iteration 15 with no memory issue yet. This may have actually fixed the leak.
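
For reference, a minimal sketch of the cleanup this describes, placed at the end of the loop body from the repro script (trainer and gc are the names used there):

trainer.run()
# Break the trainer -> optimizer -> parameter/state tensor chain, then
# force a collection so the GPU memory can actually be released.
trainer.optimizer = None
del trainer
gc.collect()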

Nic-Ma (Contributor) commented Dec 1, 2021

Hi @tao558,

I also don't understand: why do you create the optimizer and the trainer 50 times, and call run() 50 times?

Thanks.

tao558 (Author) commented Dec 2, 2021

This is all in the context of k-fold cross validation. So I need to make a new optimizer for each fold, a new trainer to run the training, and run the trainer for each fold. Moving the optimizer outside of the loop causes a memory leak, anyway.
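
To make the usage pattern concrete, a rough sketch of such a k-fold loop follows; sklearn's KFold and the get_fold_loader helper are assumptions for illustration, not part of the original script:

import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, val_idx) in enumerate(kfold.split(np.arange(NUM_IMAGES))):
    train_loader = get_fold_loader(train_idx)    # hypothetical helper building a DataLoader for these indices
    optimizer = optim.Adam(model.parameters())   # fresh optimizer for each fold
    trainer = SupervisedTrainer(                 # prepare_batch omitted for brevity; see the repro script
        device=device,
        max_epochs=10,
        train_data_loader=train_loader,
        network=model,
        optimizer=optimizer,
        loss_function=nn.CrossEntropyLoss(),
    )
    trainer.run()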

Nic-Ma (Contributor) commented Dec 3, 2021

Interesting finding!
CC @ericspod @rijobro @wyli for visibility.

Thanks.

ericspod (Member) commented Dec 5, 2021

I think what's happening is that the instances of the trainer being created still aren't getting cleaned up after each loop. Once you assign a new value to trainer the previous object should be collectable, but the Python garbage collector won't do that until it needs to, and it decides that need based on allocated system memory, not GPU memory. The trainer object keeps its reference to the optimizer, which keeps references to tensors; this is a big source of the retained memory that you're seeing and why clearing the optimizer attribute has an effect.

We have a garbage collector handler that may work, but you probably want to collect before run, so try calling collect twice in a row before run to see if that works. Doing it twice gives the collector the chance to find second-level connected objects that become collectable after the first call.
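
As a sketch of that suggestion (build_trainer is a hypothetical factory wrapping the SupervisedTrainer setup from the repro script):

import gc

for i in range(50):
    # Two passes: the first may free objects whose release makes further
    # objects (e.g. the previous trainer's optimizer state) unreachable,
    # and the second pass then collects those.
    gc.collect()
    gc.collect()

    trainer = build_trainer()  # hypothetical factory; see the repro script for the actual arguments
    trainer.run()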

Pytorch also likes to cache things a lot, so calling torch.cuda.empty_cache() may also help. Some of the CUDA semantics on this are discussed here; one idea mentioned is to set the environment variable PYTORCH_NO_CUDA_MEMORY_CACHING=1 to disable caching and see if that helps.
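
A small sketch of those two caching-related ideas (the helper name is made up; the environment variable has to be set before the process starts, e.g. in the shell):

import gc
import torch

# Run as: PYTORCH_NO_CUDA_MEMORY_CACHING=1 python your_script.py
# to disable the caching allocator entirely.

def release_cached_gpu_memory():
    # Drop unreachable Python objects first, then ask PyTorch's caching
    # allocator to return unused cached blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()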

If that still doesn't work, then this comes down to a memory retention problem within Pytorch that I've observed in the past and haven't resolved. I've seen this in notebooks such that, after clearing local variables, Pytorch still doesn't give back 100% of its allocated VRAM; this may be related to how Jupyter keeps history and such internally, but it may also be the way Pytorch internals keep memory. The solution then is to run each iteration in a separate process, i.e. having a shell script that calls your script as many times as you like, passing in the iteration number, and storing results in numbered files. This is more complex and requires loading your dataset each time, but it will isolate runs from one another and ensure cleanup.
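
A rough sketch of that per-process pattern, using a Python driver rather than a shell script (the script name and command-line flag are assumptions):

import subprocess
import sys

# Each iteration/fold runs in its own process, so all of its GPU memory
# is returned to the driver when that process exits.
for i in range(50):
    subprocess.run(
        [sys.executable, "train_one_fold.py", "--iteration", str(i)],
        check=True,
    )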

tao558 (Author) commented Dec 13, 2021

Thanks for the detailed response, I'll give your suggestions a shot. Is there a better way to do k-fold cross validation with MONAI that I'm unaware of?

Thanks again.

Nic-Ma (Contributor) commented Dec 13, 2021

Hi @tao558,

I think maybe you can leverage something in this tutorial:
https://github.com/Project-MONAI/tutorials/blob/master/modules/cross_validation_models_ensemble.ipynb

Thanks.
