Memory issue when training 1024 resolution #33
Comments
Did some investigation on the tensorflow model; it turns out the problem occurs when saving snapshot images. There is probably some kind of memory leak when saving large images in visualize. It is fine when not saving images. Will investigate further later.
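A quick way to confirm that diagnosis is to log the process's resident memory right before and after the snapshot-saving step each tick. A minimal sketch, assuming psutil is available (it is not a dependency of the repo):

```python
import os
import psutil  # assumption: psutil is installed separately

_proc = psutil.Process(os.getpid())

def log_ram(tag=""):
    # Print this process's resident memory. Calling log_ram("before save") and
    # log_ram("after save") around the image-export call each tick shows
    # whether system RAM growth tracks the snapshot saving.
    rss_gb = _proc.memory_info().rss / 1024 ** 3
    print(f"[mem] {tag}: {rss_gb:.2f} GB RSS")
```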
Hi, thanks for reaching out! I noticed indeed that the visualization takes a lot of RAM, but I haven't tracked down the issue yet since it's not a stateful module, so I'm not sure where specifically it could leak memory. However, I think the problem is that when it builds a visualization it holds 28 model outputs in memory at the same time (including stacks of attention maps), so reducing the grid size of the saved images should help. I'll be making a couple of changes so that memory consumption is reduced by default, and I'm looking forward to hearing if you find anything further!
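A minimal sketch of the kind of change described above, not the repo's actual code: build the snapshot grid in small chunks so that only a few full model outputs are alive at once, and skip the attention maps entirely. `generate` and `save_grid` are hypothetical stand-ins for the repo's generator call and image-grid writer.

```python
import numpy as np

def save_snapshot_grid(generate, save_grid, grid_size=(4, 4), chunk=4):
    # Only `chunk` full model outputs (with their intermediates) are alive at
    # a time; just the plain image arrays are accumulated for the final grid.
    total = grid_size[0] * grid_size[1]
    images = []
    for start in range(0, total, chunk):
        n = min(chunk, total - start)
        out = generate(n, return_att=False)  # skip attention-map stacks
        images.append(np.asarray(out))
        del out  # drop references so this chunk can be freed before the next
    save_grid(np.concatenate(images, axis=0), grid_size=grid_size)
```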
Cool, I will try that tomorrow and keep investigating :)
Btw, I'm currently running fine with only saving output images and not saving attention maps.
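A rough illustration of that kind of configuration (the option names here are illustrative assumptions, not the repo's actual flags):

```python
# Illustrative visualization defaults only; real option names may differ.
vis_defaults = dict(
    vis_images=True,   # plain generator outputs: comparatively cheap to save
    vis_maps=False,    # per-layer attention-map stacks: the memory-heavy part
    grid_size=(3, 3),  # a smaller grid keeps fewer outputs alive at once
)
```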
I'll update the default options accordingly so that people won't run into memory issues. Thank you for opening this issue!
I'm trying to train on a 1024x1024 dataset on a V100 GPU.
I tried both the tensorflow version and the pytorch version.
Despite setting batch-gpu to 1, the tensorflow version always runs out of system RAM (after the first tick, with 51 GB of total system RAM), and
the pytorch version always runs out of CUDA memory (before the first tick).
Here are my training settings:
Also, I always encounter the warning:
tcmalloc: large alloc
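As an aside, that message is informational: gperftools' tcmalloc prints a report for any single allocation above TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD. Raising the threshold only keeps the log readable and does not address the memory growth itself. A small wrapper sketch, assuming run_network.py is the training entry point:

```python
import os
import subprocess
import sys

# Raise tcmalloc's large-allocation report threshold to 8 GiB. The variable is
# read when the allocator initializes, so it has to be set in the child
# process's environment rather than from inside the running training script.
env = dict(os.environ, TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=str(8 * 1024 ** 3))
subprocess.run(
    [sys.executable, "run_network.py", *sys.argv[1:]],  # assumed entry point
    env=env,
    check=True,
)
```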