probable memory leak in cpu ram #1
Comments
It goes without saying that I am using 1 GPU, which seems to be fully used (up to 100%) during all epochs.
Hi @gillmac13. Yes, I also noticed the memory leak during training, but haven't had enough time to fix it due to other tasks recently. I'll try to get it done ASAP and let you know when it's finished. Sorry for the trouble.
Many thanks! Evidently some objects in memory are not released when needed. Since I only have 1 worker selected, I thought setting "use_multiprocessing" to True would help release the old objects without other consequences such as data duplication. I do get a cryptic warning about "causing non-deterministic deadlocks", but with 1 worker? Good news: launching a training session with "use_multiprocessing=True" works with minimal memory leak and comparable accuracy results!
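For what it's worth, the reason "use_multiprocessing=True" helps here can be demonstrated without Keras at all: memory allocated inside a worker process is returned to the OS when that process exits, while memory allocated in the parent tends to stay in its footprint. A minimal stdlib sketch (the function names are mine, purely illustrative; `ru_maxrss` is reported in kilobytes on Linux):

```python
import multiprocessing as mp
import resource


def allocate_and_measure(n_bytes):
    """Allocate a large buffer in the current process and return the
    process's peak RSS (KB on Linux). The buffer stands in for the
    per-epoch data a Keras generator keeps alive."""
    buf = bytearray(n_bytes)  # held until this process exits
    assert len(buf) == n_bytes
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def parent_growth_after_child_alloc(n_bytes):
    """Run the allocation in a one-worker pool, like a multiprocessing
    generator worker. Returns (child_peak_rss, parent_rss_growth)."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    with mp.Pool(processes=1) as pool:
        child_peak = pool.apply(allocate_and_measure, (n_bytes,))
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return child_peak, after - before
```

Because the buffer lives and dies inside the child, the parent's footprint should stay roughly flat while the child's peak includes the full allocation; this is the effect the multiprocessing generator workers exploit.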
I must update the above statement: there is a residual leak despite the multiprocessing scheme. It allowed me to reach epoch 56 (val accuracy of 0.8, as expected), but the process footprint reached 13 GB (it started at 5.4 GB at the first epoch).
Hi @gillmac13, many thanks for the info you provided. I've committed a fix for the memory leak. You can pull the latest code and give it a try.
Hi @david8862, perfect! It works. And thanks for fixing this promptly.
Hi @gillmac13, many thanks for sharing. The main target of my work is to deploy CNN models to IoT/embedded platforms, so I focus more on lightweight models. But I'll also check these enhancements later and try to pick them up if they fit my platform.
Hi David,
I have tried to run a training session; at this point I would like to compare the performance of your stacked-hourglass-keypoint-detection solution with another solution I am familiar with in PyTorch (Deep HRNet).
My dataset is an extension of MPII (same format) with a lot of proprietary images and annotations. So it's easy to run:
$ python3 train.py
because all defaults are appropriate.
However, on my first trial, the running process was "Killed" at epoch 25/100. The accuracy curve looked OK (0.69 at that point). So I retried, but this time I monitored CPU and RAM usage. Watching top, I found that roughly 1 GB of free RAM is lost at each epoch. The loss is not gradual over the course of an epoch; it happens as a lump at the beginning of each new epoch.
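Rather than eyeballing top, the per-epoch jump can also be logged from inside the training process itself, which makes the stepwise pattern easy to confirm. A framework-free sketch (class and method names are mine; to hook it into Keras you would call `on_epoch_begin` from a `keras.callbacks.Callback`, kept out of this snippet so it stays runnable without TensorFlow):

```python
import resource


def rss_kb():
    """Peak resident set size of the current process, in KB on Linux."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


class EpochMemoryLogger:
    """Records peak RSS at the start of each epoch; a series that grows
    in steps at epoch boundaries is the signature of the leak above."""

    def __init__(self):
        self.samples = []  # list of (epoch, peak_rss_kb)

    def on_epoch_begin(self, epoch):
        self.samples.append((epoch, rss_kb()))

    def growth_per_epoch_kb(self):
        """Average growth in peak RSS per epoch over the recorded run."""
        if len(self.samples) < 2:
            return 0
        return (self.samples[-1][1] - self.samples[0][1]) / (len(self.samples) - 1)
```

An average growth of ~1,000,000 KB per epoch would match the roughly 1 GB per epoch observed with top.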
I changed --batch_size to 8, and the leak dropped to about 0.5 GB per epoch.
Have you ever experienced this, and do you have an idea of how to fix it?
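Since the leak scales with batch size, one common interim mitigation (not a real fix) is to force a garbage-collection pass at each epoch boundary: batch buffers kept alive only by reference cycles are never freed by reference counting alone, but `gc.collect()` does reclaim them. A framework-free sketch of the mechanism (in Keras the `gc.collect()` call would go in a callback's `on_epoch_end`; the helper names are mine):

```python
import gc


def make_cyclic_garbage(n):
    """Create n pairs of objects that reference each other, mimicking
    batch buffers kept alive by reference cycles: reference counting
    alone never frees these, only the cycle collector does."""
    for _ in range(n):
        a, b = [], []
        a.append(b)
        b.append(a)  # a <-> b cycle; both become unreachable on return


def on_epoch_end_cleanup():
    """Run at each epoch boundary; returns how many unreachable
    objects the collector found."""
    return gc.collect()
```

If the per-epoch growth shrinks after adding such a pass, the leak is cyclic Python garbage; if it does not, the memory is being retained by live references (or by native buffers), and only a code fix helps.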
I am running on Ubuntu 18.04 with TF 2.1.