
Memory issue when training 1024 resolution #33

Closed
BlueberryGin opened this issue Feb 15, 2022 · 5 comments


BlueberryGin commented Feb 15, 2022

I'm trying to train on a 1024x1024 dataset on a V100 GPU.
I tried both the TensorFlow version and the PyTorch version.
Despite setting batch-gpu to 1, the TensorFlow version always runs out of system RAM (after the first tick; total system RAM is 51 GB), and
the PyTorch version always runs out of CUDA memory (before the first tick).

Here are my training settings:

python run_network.py --train --metrics 'none' --gpus 0 --batch-gpu 1 --resolution 1024 \
 --ganformer-default --expname art1 --dataset 1024art

Also, I always encounter the warning:
tcmalloc: large alloc

@BlueberryGin (Author)

Did some investigation on the TensorFlow model; it turns out the problem occurs when saving snapshot images. There is probably some kind of memory leak when saving large images in visualize. Training is fine when not saving images.

Will investigate further later
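For anyone following along, here is a minimal sketch of how that investigation could be done (illustrative only, not part of the repository): log the process RSS right before and after the snapshot-saving step to confirm it is the part of the tick where memory grows. The `save_snapshot_images` call site named in the comments is a hypothetical placeholder.

```python
# Hedged sketch, assuming psutil is installed: log resident memory around the
# snapshot-saving step to see whether it is responsible for the RAM growth.
import os
import psutil

def log_rss(tag):
    # Report the current process's resident set size in GB
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"[{tag}] resident memory: {rss_gb:.2f} GB")

# Hypothetical call site around the visualization step:
# log_rss("before snapshot")
# save_snapshot_images(...)   # whatever routine writes the image grid
# log_rss("after snapshot")
```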

dorarad (Owner) commented Feb 15, 2022

Hi, thanks for reaching out! I have indeed noticed that the visualization takes a lot of RAM, but I haven't tracked down the issue yet, since it's not a stateful module and so I'm not sure where specifically it could lead to a memory leak. However, I think the problem is that when it builds a visualization it holds 28 model outputs in memory at the same time (including stacks of attention maps), so reducing the grid size of the saved images here:
https://github.com/dorarad/gansformer/blob/main/training/misc.py#L306
could mitigate the issue.
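As a rough illustration of that mitigation (this is a hedged sketch, not the actual code at training/misc.py#L306; the function name `setup_snapshot_grid` and the grid dimensions are assumptions): shrinking the grid directly reduces how many model outputs, and their attention-map stacks, are alive at once.

```python
# Minimal sketch: a 2x2 snapshot grid keeps only 4 outputs in memory
# instead of, e.g., 7x4 = 28 at 1024x1024 resolution.
import numpy as np

def setup_snapshot_grid(dataset_images, gw=2, gh=2):
    # Stack only gw * gh real images to seed the snapshot grid
    grid = np.stack([dataset_images[i] for i in range(gw * gh)])
    return (gw, gh), grid
```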

I'll be making a couple of changes so that memory consumption is reduced by default, and I'm looking forward to hearing if you happen to find anything further!

@BlueberryGin (Author)

Cool, I will try that tomorrow and keep investigating :)
Thanks!

@BlueberryGin (Author)

Btw, training is currently running fine when I save only the output images and not the attention maps.
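A hedged sketch of that workaround, not the project's actual code (the function and helpers below are illustrative assumptions): write the generator outputs to disk but skip the attention-map stacks entirely, since those dominate memory at 1024x1024.

```python
# Sketch: save only the generated images; attention maps are skipped by default.
from PIL import Image

def save_outputs_only(fakes, out_dir, save_maps=False, maps=None):
    # `fakes` is assumed to be a batch of HxWx3 uint8 arrays
    for i, img in enumerate(fakes):
        Image.fromarray(img).save(f"{out_dir}/fake_{i:03d}.png")
    if save_maps and maps is not None:
        # Attention maps would be written here; left off by default,
        # matching the workaround described in this thread.
        pass
```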

dorarad (Owner) commented Feb 21, 2022

I'll update the default options accordingly so that people won't run into memory issues. Thank you for opening this issue!

dorarad closed this as completed on Feb 21, 2022
dorarad pinned this issue on Feb 21, 2022