fix import error, Copy-on-read Overhead ( called memory leak in repository ) and slightly refactor dist_utils.py for improved readability #418

Open
wants to merge 4 commits into
base: main

@int11 int11 commented Aug 15, 2024

Currently, memory usage explodes in the COCO dataset class.

#93 #172 #207

The cause is copy-on-read of forked CPython objects.
If you want to explore this problem, see the blog post Demystify-RAM-Usage-in-Multiprocess-DataLoader.

The CocoDetection_share_memory class uses less total PSS memory than the repository's current COCO dataset class.
This can be verified with memory_check.py.

@int11 int11 changed the title fix import error, Copy-on-read Overhead ( called memory leak in repository ) and refactor dist_utils.py slightly for improved readability fix import error, Copy-on-read Overhead ( called memory leak in repository ) and slightly refactor dist_utils.py for improved readability Aug 15, 2024
@lyuwenyu
Owner

lyuwenyu commented Aug 16, 2024

Thanks for your PR addressing these problems.

Regarding the memory leak problem:

If no evaluation is performed during training, memory does not grow continuously. So I wonder whether the problem is really in the dataloader rather than in the COCO evaluation?

@int11
Author

int11 commented Aug 16, 2024

It's a multiprocessing problem that occurs when the dataloader has num_workers > 0.
See this PyTorch issue: pytorch/pytorch#13246

As a result, both the train and evaluation datasets cause unnecessary memory usage.
The evaluation dataset also causes unnecessary memory growth, but not as much as the train dataset.
The important thing is that, as far as I know, memory stops growing once the workers have accessed the entire dataset.

It sounds strange that there is no problem without the evaluation operation.
I have already confirmed the huge memory increase in the COCO dataset.
This has nothing to do with the train or evaluation datasets/operations. Any dataset that holds CPython objects has this problem.

Even the memory_check.py code that I provided performs neither the train nor the evaluation operation.
It only fakes reading the data by calling pickle.dumps.
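
The copy-on-read overhead described here comes from every small Python object carrying its own reference count, so merely reading an item in a forked worker writes to the memory page that holds it. A minimal sketch of one common mitigation, assuming the annotations can be pickled: pack all items into a single contiguous buffer so that workers only touch one object header. The class name `PackedList` is hypothetical and is not the `CocoDetection_share_memory` implementation from this PR; the linked blog post describes a similar idea built on NumPy arrays and torch tensors, while this version uses only the standard library.

```python
import pickle
from array import array


class PackedList:
    """Read-only list that keeps all items in one contiguous bytes buffer.

    Hypothetical helper for illustration only.
    """

    def __init__(self, items):
        # Serialize every item once, up front, in the parent process.
        chunks = [pickle.dumps(item, protocol=pickle.HIGHEST_PROTOCOL) for item in items]
        # offsets[i] .. offsets[i + 1] is the byte range of item i.
        self._offsets = array("q", [0])
        for chunk in chunks:
            self._offsets.append(self._offsets[-1] + len(chunk))
        # One large bytes object instead of many small refcounted objects.
        self._buffer = b"".join(chunks)

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start, end = self._offsets[index], self._offsets[index + 1]
        # Deserializing creates fresh objects owned by the calling process;
        # the packed buffer contents are only read, never written.
        return pickle.loads(memoryview(self._buffer)[start:end])
```

Wrapping a dataset's annotation list in such a container before the DataLoader forks its workers should keep the packed data in pages the workers can share, at the cost of one `pickle.loads` call per access.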

@lyuwenyu lyuwenyu force-pushed the main branch 6 times, most recently from c9c9e18 to 55ca98f Compare August 22, 2024 05:30
@VladKatsman

VladKatsman commented Aug 22, 2024

From what I see, the problem starts in the synchronization and accumulation part of det_solver.py.
CocoDetection_share_memory doesn't help me.

    print("Averaged stats:", metric_logger)
    if coco_evaluator is not None:
        coco_evaluator.synchronize_between_processes()

    # accumulate predictions from all images
    if coco_evaluator is not None:
        coco_evaluator.accumulate()
        coco_evaluator.summarize()


@int11
Author

int11 commented Aug 25, 2024

From what I see, the problem starts in the synchronization and accumulation part of det_solver.py. CocoDetection_share_memory doesn't help me.

@VladKatsman
Please show me the memory usage of the two classes in your environment.
In my case, using CocoDetection_share_memory improves overall memory efficiency by about 1.5x.

main(dataset_class=CocoDetection, range_num=30000)

  time      PID  rss    pss    uss     shared    shared_file
------  -------  -----  -----  ------  --------  -------------
 55122  1993233  3.2G   1.7G   892.9M  2.3G      44.2M
 55122  1993491  2.9G   1.5G   835.0M  2.1G      50.4M
 55122  1993496  3.4G   2.1G   1.4G    2.0G      50.4M
totle pss : 5.367GB
iteration : 920 / 937, time : 10.729
  time      PID  rss    pss    uss     shared    shared_file
------  -------  -----  -----  ------  --------  -------------
 55133  1993233  3.1G   1.6G   763.3M  2.3G      44.2M
 55133  1993491  2.9G   1.5G   833.1M  2.0G      50.4M
 55133  1993496  3.1G   1.8G   1.0G    2.0G      50.4M
totle pss : 4.899GB
iteration : 930 / 937, time : 10.298

main(dataset_class=CocoDetection_share_memory, share_memory=False, range_num=30000)

  time      PID  rss    pss    uss     shared    shared_file
------  -------  -----  -----  ------  --------  -------------
 57902  2003746  3.2G   1.7G   899.0M  2.3G      43.9M
 57902  2004024  2.9G   1.5G   864.9M  2.0G      50.8M
 57902  2004029  3.2G   1.9G   1.2G    2.0G      50.8M
totle pss : 5.138GB
iteration : 910 / 937, time : 11.612
  time      PID  rss    pss    uss     shared    shared_file
------  -------  -----  -----  ------  --------  -------------
 57914  2003746  3.1G   1.6G   792.5M  2.3G      43.9M
 57914  2004024  2.8G   1.5G   852.8M  2.0G      50.8M
 57914  2004029  2.9G   1.6G   864.9M  2.0G      50.8M
totle pss : 4.706GB
iteration : 920 / 937, time : 11.366
  time      PID  rss    pss    uss     shared    shared_file
------  -------  -----  -----  ------  --------  -------------
 57925  2003746  3.2G   1.8G   932.8M  2.3G      43.9M
 57925  2004024  2.9G   1.6G   903.0M  2.0G      50.8M
 57925  2004029  2.8G   1.5G   854.4M  2.0G      50.8M
totle pss : 4.880GB
iteration : 930 / 937, time : 11.577

main(dataset_class=CocoDetection_share_memory, share_memory=True, range_num=30000)

  time      PID  rss    pss      uss     shared    shared_file
------  -------  -----  -------  ------  --------  -------------
 58961  2010117  2.1G   1.6G     1.3G    745.5M    44.9M
 58961  2010422  1.5G   1010.2M  765.1M  764.0M    51.1M
 58961  2010427  1.5G   1010.5M  765.8M  763.3M    51.2M
totle pss : 3.550GB
iteration : 910 / 937, time : 10.558
  time      PID  rss    pss      uss     shared    shared_file
------  -------  -----  -------  ------  --------  -------------
 58972  2010117  1.8G   1.3G     1.1G    745.5M    44.9M
 58972  2010422  1.5G   1010.2M  765.1M  764.0M    51.1M
 58972  2010427  1.5G   1010.5M  765.8M  763.3M    51.2M
totle pss : 3.254GB
iteration : 920 / 937, time : 11.179
  time      PID  rss    pss      uss     shared    shared_file
------  -------  -----  -------  ------  --------  -------------
 58982  2010117  2.1G   1.6G     1.3G    745.5M    44.9M
 58982  2010422  1.6G   1.1G     900.5M  764.0M    51.1M
 58982  2010427  1.5G   1010.5M  765.8M  763.3M    51.2M
totle pss : 3.682GB
iteration : 930 / 937, time : 9.704
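
The PSS totals in the snapshots above are sums of per-process PSS (proportional set size), which splits each shared page evenly among the processes that map it and so avoids counting shared memory once per process the way RSS does. A minimal sketch of how such a total could be gathered on Linux, assuming the kernel exposes `/proc/<pid>/smaps_rollup`; the helper names are hypothetical, and this is not necessarily how memory_check.py computes its numbers.

```python
from pathlib import Path


def parse_pss_kb(smaps_rollup_text):
    """Return the Pss value in kB from the text of /proc/<pid>/smaps_rollup."""
    for line in smaps_rollup_text.splitlines():
        # Match "Pss:" exactly; "Pss_Anon:", "Pss_File:" etc. are sub-totals.
        if line.startswith("Pss:"):
            return int(line.split()[1])
    raise ValueError("no Pss line found")


def total_pss_gb(pids):
    """Sum PSS over the given processes (Linux only) and convert kB to GiB."""
    total_kb = sum(
        parse_pss_kb(Path(f"/proc/{pid}/smaps_rollup").read_text())
        for pid in pids
    )
    return total_kb / 1024**2
```

Calling `total_pss_gb` with the PIDs of the main process and its DataLoader workers gives one number per snapshot that can be compared across dataset classes.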

As long as your dataset uses CPython objects, I expect memory efficiency to improve.
When you test memory, you must take swap memory and the garbage collector into account,
so you need to have enough memory while testing.

This only reduces memory usage; it does not eliminate it. If you don't have enough memory, it may look as if it doesn't help.

Or, as you said, a det_solver.py problem may also exist at the same time.
Please provide example code and test code that can help resolve the det_solver.py problem.

@VladKatsman

VladKatsman commented Aug 26, 2024

I am sorry, I will reply from a high-level point of view, without code.
My training and validation datasets are about the same size (80k .jpg images, 640x640 each). During training I use a single machine with 3 GPUs and a batch size of 24 each (72 total).

This is the command I used, before the training parameters:
CUDA_VISIBLE_DEVICES=0,1,2 torchrun --nproc_per_node=3

Training takes about 21 GB out of 24 GB of each GPU's memory and about 20 GB of RAM. During evaluation, however, RAM usage rises above 128 GB (which is my total RAM). Your updated code did not solve that problem either; there is still a SEGFAULT error.

I also evaluated the model using 1 GPU and 1 process, and evaluation took about 50 GB of RAM, which is a huge number as well. I don't know where to start looking for the problem; it looks like the evaluation code itself is somehow not memory efficient.

If we choose to use your project, I will be happy to debug it and commit fixes and changes.
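
One possible starting point for locating the growth during evaluation is Python's built-in `tracemalloc` module. The sketch below, with the hypothetical helper name `top_allocation_growth`, runs a step and reports which source lines allocated the most memory while it ran. It assumes the evaluation step can be wrapped in a callable, and it only sees allocations made through Python's allocator, so memory allocated directly by native extensions may not show up.

```python
import tracemalloc


def top_allocation_growth(step, limit=5):
    """Run step() and return its result plus the lines whose allocations grew most.

    Hypothetical helper for illustration only.
    """
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    # Keep the result alive so its memory is still allocated at the second snapshot.
    result = step()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to sorts by the absolute size difference, largest first.
    stats = after.compare_to(before, "lineno")
    return result, stats[:limit]
```

Passing a function that performs one evaluation pass as `step` and printing the returned statistics should show whether the growth comes from accumulated predictions, from the dataset, or from somewhere else.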
