probable memory leak in cpu ram #1
Comments
It goes without saying that I am using 1 GPU, which seems to be fully used (up to 100%) during all epochs.
Hi @gillmac13. Yes, I also noticed the memory leak during training, but haven't had enough time to fix it due to other tasks recently. I'll try to get it done ASAP and let you know when it's finished. Sorry for the trouble.
Many thanks! Evidently some objects in memory are not released when needed. Since I only have 1 worker selected, I thought setting "use_multiprocessing" to True would help release the old objects without other consequences such as data duplication. I do get a cryptic warning about "causing non-deterministic deadlocks", but with 1 worker? Good news: launching a training session with "use_multiprocessing=True" works with minimal memory leak and comparable accuracy results!
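For what it's worth, the reason "use_multiprocessing=True" helps here can be demonstrated without Keras at all: memory allocated inside a worker process is returned to the OS when that process exits, while memory allocated in the parent tends to stay in its footprint. A minimal stdlib sketch (the function names are mine, purely illustrative; `ru_maxrss` is reported in kilobytes on Linux):

```python
import multiprocessing as mp
import resource


def allocate_and_measure(n_bytes):
    """Allocate a large buffer in the current process and return the
    process's peak RSS (KB on Linux). The buffer stands in for the
    per-epoch data a Keras generator keeps alive."""
    buf = bytearray(n_bytes)  # held until this process exits
    assert len(buf) == n_bytes
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def parent_growth_after_child_alloc(n_bytes):
    """Run the allocation in a one-worker pool, like a multiprocessing
    generator worker. Returns (child_peak_rss, parent_rss_growth)."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    with mp.Pool(processes=1) as pool:
        child_peak = pool.apply(allocate_and_measure, (n_bytes,))
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return child_peak, after - before
```

Because the buffer lives and dies inside the child, the parent's footprint should stay roughly flat while the child's peak includes the full allocation; this is the effect the multiprocessing generator workers exploit.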
I must update the above statement: there is a residual leak despite the multiprocessing scheme. It allowed me to reach epoch 56 (val accuracy of 0.8, as expected), but the process footprint reached 13 GB (it started at 5.4 GB at the first epoch).
Hi @gillmac13, many thanks for the info you provided. I've committed a fix for the memory leak. You can pull the latest code and give it a try.
Hi @david8862, perfect! It works. And thanks for fixing this promptly.
Hi @gillmac13, many thanks for sharing. The main target of my work is to deploy CNN models to IoT/embedded platforms, so I focus more on lightweight models. But I'll also check these enhancements later and try to pick them up if they fit my platform.
Hi David,
I have tried to run a training session; at this point I would like to compare the performance of your stacked-hourglass-keypoint-detection solution with another solution I am familiar with in PyTorch (Deep HRNet).
My dataset is an extension of MPII (same format) with a lot of proprietary images and annotations. So it's easy to run:
$ python3 train.py
because all defaults are appropriate.
However, on my first trial, the running process was "Killed" at epoch 25/100. The accuracy curve looked OK (0.69 at that point). So I retried, but this time I monitored CPU and RAM usage. Watching top, I found that roughly 1 GB of free RAM is lost at each epoch. The loss is not gradual over the course of an epoch; it happens as a lump at the beginning of each new epoch.
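Rather than eyeballing top, the per-epoch jump can also be logged from inside the training process itself, which makes the stepwise pattern easy to confirm. A framework-free sketch (class and method names are mine; to hook it into Keras you would call `on_epoch_begin` from a `keras.callbacks.Callback`, kept out of this snippet so it stays runnable without TensorFlow):

```python
import resource


def rss_kb():
    """Peak resident set size of the current process, in KB on Linux."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


class EpochMemoryLogger:
    """Records peak RSS at the start of each epoch; a series that grows
    in steps at epoch boundaries is the signature of the leak above."""

    def __init__(self):
        self.samples = []  # list of (epoch, peak_rss_kb)

    def on_epoch_begin(self, epoch):
        self.samples.append((epoch, rss_kb()))

    def growth_per_epoch_kb(self):
        """Average growth in peak RSS per epoch over the recorded run."""
        if len(self.samples) < 2:
            return 0
        return (self.samples[-1][1] - self.samples[0][1]) / (len(self.samples) - 1)
```

An average growth of ~1,000,000 KB per epoch would match the roughly 1 GB per epoch observed with top.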
I changed --batch_size to 8, and the leak dropped to about 0.5 GB per epoch.
Have you ever experienced this, and do you have an idea of how to fix it?
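Since the leak scales with batch size, one common interim mitigation (not a real fix) is to force a garbage-collection pass at each epoch boundary: batch buffers kept alive only by reference cycles are never freed by reference counting alone, but `gc.collect()` does reclaim them. A framework-free sketch of the mechanism (in Keras the `gc.collect()` call would go in a callback's `on_epoch_end`; the helper names are mine):

```python
import gc


def make_cyclic_garbage(n):
    """Create n pairs of objects that reference each other, mimicking
    batch buffers kept alive by reference cycles: reference counting
    alone never frees these, only the cycle collector does."""
    for _ in range(n):
        a, b = [], []
        a.append(b)
        b.append(a)  # a <-> b cycle; both become unreachable on return


def on_epoch_end_cleanup():
    """Run at each epoch boundary; returns how many unreachable
    objects the collector found."""
    return gc.collect()
```

If the per-epoch growth shrinks after adding such a pass, the leak is cyclic Python garbage; if it does not, the memory is being retained by live references (or by native buffers), and only a code fix helps.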
I am running on Ubuntu 18.04 with TF 2.1.