Memory issue when training 1024 resolution #33
Comments
Did some investigation on the tensorflow model; it turns out the problem occurs when saving snapshot images. There is probably some kind of memory leak when saving large images in visualize. It is fine when not saving images. Will investigate further later.
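A quick way to confirm that diagnosis is to log the process's resident memory right before and after the snapshot-saving step each tick. A minimal sketch, assuming psutil is available (it is not a dependency of the repo):

```python
import os
import psutil  # assumption: psutil is installed separately

_proc = psutil.Process(os.getpid())

def log_ram(tag=""):
    # Print this process's resident memory. Calling log_ram("before save") and
    # log_ram("after save") around the image-export call each tick shows
    # whether system RAM growth tracks the snapshot saving.
    rss_gb = _proc.memory_info().rss / 1024 ** 3
    print(f"[mem] {tag}: {rss_gb:.2f} GB RSS")
```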
Hi, thanks for reaching out! I noticed indeed that the visualization takes a lot of RAM, but I haven't tracked down the issue yet since it's not a stateful module, so I'm not sure where specifically it could leak memory. However, I think the problem is that when it builds a visualization it holds 28 model outputs in memory at the same time (including stacks of attention maps), so reducing the grid size of the saved images should help. I'll be making a couple of changes so that memory consumption is reduced by default, and I'm looking forward to hearing if you find anything further!
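A minimal sketch of the kind of change described above, not the repo's actual code: build the snapshot grid in small chunks so that only a few full model outputs are alive at once, and skip the attention maps entirely. `generate` and `save_grid` are hypothetical stand-ins for the repo's generator call and image-grid writer.

```python
import numpy as np

def save_snapshot_grid(generate, save_grid, grid_size=(4, 4), chunk=4):
    # Only `chunk` full model outputs (with their intermediates) are alive at
    # a time; just the plain image arrays are accumulated for the final grid.
    total = grid_size[0] * grid_size[1]
    images = []
    for start in range(0, total, chunk):
        n = min(chunk, total - start)
        out = generate(n, return_att=False)  # skip attention-map stacks
        images.append(np.asarray(out))
        del out  # drop references so this chunk can be freed before the next
    save_grid(np.concatenate(images, axis=0), grid_size=grid_size)
```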
Cool, I will try that tomorrow and keep investigating :)
Btw, I'm currently running fine with only saving output images and not saving attention maps.
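A rough illustration of that kind of configuration (the option names here are illustrative assumptions, not the repo's actual flags):

```python
# Illustrative visualization defaults only; real option names may differ.
vis_defaults = dict(
    vis_images=True,   # plain generator outputs: comparatively cheap to save
    vis_maps=False,    # per-layer attention-map stacks: the memory-heavy part
    grid_size=(3, 3),  # a smaller grid keeps fewer outputs alive at once
)
```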
I'll update the default options accordingly so that people won't run into memory issues. Thank you for opening this issue!
I'm trying to train on a 1024x1024 dataset on a V100 GPU.
I tried both the tensorflow version and the pytorch version.
Despite setting batch-gpu to 1, the tensorflow version always runs out of system RAM (after the first tick, with 51 GB of total system RAM), and
the pytorch version always runs out of CUDA memory (before the first tick).
Here are my training settings:
Also, I always encounter the warning:
tcmalloc: large alloc
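As an aside, that message is informational: gperftools' tcmalloc prints a report for any single allocation above TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD. Raising the threshold only keeps the log readable and does not address the memory growth itself. A small wrapper sketch, assuming run_network.py is the training entry point:

```python
import os
import subprocess
import sys

# Raise tcmalloc's large-allocation report threshold to 8 GiB. The variable is
# read when the allocator initializes, so it has to be set in the child
# process's environment rather than from inside the running training script.
env = dict(os.environ, TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=str(8 * 1024 ** 3))
subprocess.run(
    [sys.executable, "run_network.py", *sys.argv[1:]],  # assumed entry point
    env=env,
    check=True,
)
```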