
probable memory leak in cpu ram #1

Open
gillmac13 opened this issue Mar 6, 2020 · 7 comments
@gillmac13

Hi David,

I have tried to run a training session, at this point I would like to compare performance between your stacked-hourglass-keypoint-detection solution with another solution I am familiar with in pytorch (deep hrnet).

My dataset is an extension of MPII (same format) with a lot of proprietary images and annotations. So it's easy to run:
$ python3 train.py
because all defaults are appropriate.

However, on my first trial the running process was "Killed" at epoch 25/100. The accuracy curve looked OK (0.69 at that point), so I retried, this time monitoring CPU and RAM usage. I found that at each epoch (watching "top"), free RAM drops by roughly 1 GB. The loss is not gradual over the course of an epoch; it happens as a lump at the beginning of each new epoch.
I changed --batch_size to 8, and the leak shrinks to about 0.5 GB per epoch.

Have you ever experienced this, and do you have an idea of how to fix it?
I am running on Ubuntu 18.04 and TF 2.1.
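For anyone reproducing this, a quick way to log the per-epoch growth from inside Python (rather than eyeballing "top") is to read the process's peak RSS via the standard-library resource module. This is just an illustration of the measurement, not code from this repo; the LambdaCallback hook in the comment is how one might wire it into Keras:

```python
import resource

def peak_rss_gb():
    """Peak resident set size of this process, in GB.

    Note: on Linux, ru_maxrss is reported in kilobytes
    (on macOS it is reported in bytes).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6

# Example hook: print it at the start of every epoch via a Keras callback:
# tf.keras.callbacks.LambdaCallback(
#     on_epoch_begin=lambda epoch, logs:
#         print(f"epoch {epoch}: peak RSS {peak_rss_gb():.2f} GB"))
```

If the value printed at each epoch boundary keeps climbing, the leaked memory survives across epochs, which matches the "lump at the beginning of a new epoch" behavior described above.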

@gillmac13
Author

It goes without saying that I am using a single GPU, which seems to be fully used (up to 100%) throughout all epochs.

@david8862
Owner

Hi @gillmac13. Yes, I also noticed the mem leak during training, but haven't had enough time to fix it due to other tasks recently. I will try to get it done ASAP and let you know when it's finished. Sorry for the trouble.

@gillmac13
Author

gillmac13 commented Mar 6, 2020

Many thanks!
I may have a hint towards a solution.
When I take "eval_callback" off the list of callbacks, there is only a small mem leak during each epoch.
When I put it back, the mem leak is much larger (about 3x) and occurs both during an epoch and during the evaluation.

Evidently some objects in memory are not released when they should be. Since I only have 1 worker selected, I thought setting "use_multiprocessing" to True would help release the old objects without other consequences such as data duplication. I do get a cryptic warning about "causing nondeterministic deadlocks", but with only 1 worker?

Good news: launching a training session with "use_multiprocessing=True" works with minimal memory leak and with comparable accuracy results!
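For reference, the change amounts to passing the use_multiprocessing flag to fit (supported in TF 2.1 when data comes through a keras.utils.Sequence); the point is that the data-loading worker then lives in a separate process whose memory the OS reclaims, instead of accumulating in the main process. A hedged sketch; the function name and arguments are illustrative, not from this repo:

```python
def fit_with_multiprocessing(model, train_seq, val_seq, epochs=100):
    """Run data loading in a separate worker process so its memory is
    reclaimed by the OS rather than accumulating in the main process.

    train_seq / val_seq are assumed to be tf.keras.utils.Sequence objects.
    """
    return model.fit(
        train_seq,
        validation_data=val_seq,
        epochs=epochs,
        workers=1,                 # single worker, as in the run above
        use_multiprocessing=True,  # worker lives in its own process
    )
```

With workers=1 the "nondeterministic deadlocks" warning is mostly about generators that are not picklable or not process-safe; a Sequence avoids that.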

@gillmac13
Author

gillmac13 commented Mar 7, 2020

I must update the statement above: there is a residual leak despite the multiprocessing scheme. It allowed me to reach epoch 56 (val accuracy of 0.8, as expected), but the process footprint had grown to 13 GB (from 5.4 GB at the 1st epoch).
One way to train further is to resume from the latest saved checkpoint. Trying that...
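A resume looks roughly like the sketch below; the function and argument names are hypothetical, and if the checkpoint was saved as a full model containing custom losses or layers, load_model would also need the matching custom_objects argument:

```python
def resume_from_checkpoint(checkpoint_path, train_seq, epochs_done, total_epochs):
    """Reload the latest saved checkpoint and continue training from
    where the killed run stopped."""
    from tensorflow.keras.models import load_model

    # Restores the weights (and the optimizer state, if the checkpoint
    # is a full saved model rather than weights only).
    model = load_model(checkpoint_path)
    model.fit(
        train_seq,
        initial_epoch=epochs_done,  # e.g. 56, so logs and LR schedules line up
        epochs=total_epochs,
        workers=1,
        use_multiprocessing=True,
    )
    return model
```

Passing initial_epoch keeps the epoch numbering (and anything keyed to it, such as learning-rate schedules) consistent with the interrupted run.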

@david8862
Owner

Hi @gillmac13, many thanks for the info you provided. I've committed a fix for the mem leak. You can pull the latest code and give it a try.

@gillmac13
Author

Hi @david8862,

Perfect! It works. And thanks for fixing this promptly.
My task now is to evolve "good" models. So far HG2 seems a bit light for my dataset (82% accuracy); some PyTorch implementations of HG2-8 claim accuracies of up to 90% on MPII, like here:
https://github.com/crockwell/pytorch_stacked_hourglass_cutout
or even here (but no code is available, and it's an evolution of HG):
https://openreview.net/pdf?id=HkM3vjCcF7

@david8862
Owner

Hi @gillmac13, many thanks for sharing. The main target of my work is to deploy CNN models on IoT/embedded platforms, so I focus more on lightweight models. But I'll also check these enhancements later and try to pick them up if they fit my platform.
