fix import error, Copy-on-read Overhead ( called memory leak in repository ) and slightly refactor dist_utils.py for improved readability #418

int11 · 2024-08-15T08:17:37Z

Currently, there is a problem with memory exploding in the coco dataset class.

The cause by Copy-on-read of the Forked CPython object
if you want to explore this problem, check this blog post Demystify-RAM-Usage-in-Multiprocess-DataLoader

The CocoDetection_share_memory class uses less total pss memory than current repository coco dataset class.
This can be found in memory_check.py.

lyuwenyu · 2024-08-16T03:14:27Z

Thanks for your pr for these problems.

For memory leak problem.

If evaluation operation is not performed during the training process, there will be no problem of continuous memory growth. So I wonder if it's a problem with the dataloader? But not coco evaluation ?

int11 · 2024-08-16T15:41:42Z

it's a problem multiprocess when dataloader num_workers > 0
check out this pytorch issue pytorch/pytorch#13246

As a result, Train / evaluation dataset both caus unnecessary memory usage.
evaluation datasets also cause unnecessary memory growth, but not as much as Train datasets.
the important thing is memory stops growing when the workers has accessed the entire data as far as i know

It sounds strange that there is no problem without evaluation operation.
I already checked the huge memory increase in the coco dataset
This has nothing to do with the train and evaluation datasets/operation. Any dataset has this problem with cpython object.

Even the memory_check.py code that I provided doesn't do either of the train/evaluation operations.
I'm only doing fake read the data using pickle.dumps.

VladKatsman · 2024-08-22T12:39:36Z

From what i see, the problem starts at synchronization and accumulation part at det_solver.py
CocoDetection_share_memory doesnt help me

    print("Averaged stats:", metric_logger)
    if coco_evaluator is not None:
        coco_evaluator.synchronize_between_processes()

    # accumulate predictions from all images
    if coco_evaluator is not None:
        coco_evaluator.accumulate()
        coco_evaluator.summarize()

rtdetrv2_pytorch/src/data/dataset/coco_dataset.py

int11 · 2024-08-25T14:28:54Z

From what i see, the problem starts at synchronization and accumulation part at det_solver.py CocoDetection_share_memory doesnt help me

@VladKatsman
Show me two classes of memory usage in your environment.
In my case, improve the overall memory efficiency by 1.5x when I use CocoDetection_share_memory.

main(dataset_class=CocoDetection, range_num=30000)

  time      PID  rss    pss    uss     shared    shared_file
------  -------  -----  -----  ------  --------  -------------
 55122  1993233  3.2G   1.7G   892.9M  2.3G      44.2M
 55122  1993491  2.9G   1.5G   835.0M  2.1G      50.4M
 55122  1993496  3.4G   2.1G   1.4G    2.0G      50.4M
totle pss : 5.367GB
iteration : 920 / 937, time : 10.729
  time      PID  rss    pss    uss     shared    shared_file
------  -------  -----  -----  ------  --------  -------------
 55133  1993233  3.1G   1.6G   763.3M  2.3G      44.2M
 55133  1993491  2.9G   1.5G   833.1M  2.0G      50.4M
 55133  1993496  3.1G   1.8G   1.0G    2.0G      50.4M
totle pss : 4.899GB
iteration : 930 / 937, time : 10.298

main(dataset_class=CocoDetection_share_memory, share_memory=False, range_num=30000)

  time      PID  rss    pss    uss     shared    shared_file
------  -------  -----  -----  ------  --------  -------------
 57902  2003746  3.2G   1.7G   899.0M  2.3G      43.9M
 57902  2004024  2.9G   1.5G   864.9M  2.0G      50.8M
 57902  2004029  3.2G   1.9G   1.2G    2.0G      50.8M
totle pss : 5.138GB
iteration : 910 / 937, time : 11.612
  time      PID  rss    pss    uss     shared    shared_file
------  -------  -----  -----  ------  --------  -------------
 57914  2003746  3.1G   1.6G   792.5M  2.3G      43.9M
 57914  2004024  2.8G   1.5G   852.8M  2.0G      50.8M
 57914  2004029  2.9G   1.6G   864.9M  2.0G      50.8M
totle pss : 4.706GB
iteration : 920 / 937, time : 11.366
  time      PID  rss    pss    uss     shared    shared_file
------  -------  -----  -----  ------  --------  -------------
 57925  2003746  3.2G   1.8G   932.8M  2.3G      43.9M
 57925  2004024  2.9G   1.6G   903.0M  2.0G      50.8M
 57925  2004029  2.8G   1.5G   854.4M  2.0G      50.8M
totle pss : 4.880GB
iteration : 930 / 937, time : 11.577

main(dataset_class=CocoDetection_share_memory, share_memory=True, range_num=30000)

  time      PID  rss    pss      uss     shared    shared_file
------  -------  -----  -------  ------  --------  -------------
 58961  2010117  2.1G   1.6G     1.3G    745.5M    44.9M
 58961  2010422  1.5G   1010.2M  765.1M  764.0M    51.1M
 58961  2010427  1.5G   1010.5M  765.8M  763.3M    51.2M
totle pss : 3.550GB
iteration : 910 / 937, time : 10.558
  time      PID  rss    pss      uss     shared    shared_file
------  -------  -----  -------  ------  --------  -------------
 58972  2010117  1.8G   1.3G     1.1G    745.5M    44.9M
 58972  2010422  1.5G   1010.2M  765.1M  764.0M    51.1M
 58972  2010427  1.5G   1010.5M  765.8M  763.3M    51.2M
totle pss : 3.254GB
iteration : 920 / 937, time : 11.179
  time      PID  rss    pss      uss     shared    shared_file
------  -------  -----  -------  ------  --------  -------------
 58982  2010117  2.1G   1.6G     1.3G    745.5M    44.9M
 58982  2010422  1.6G   1.1G     900.5M  764.0M    51.1M
 58982  2010427  1.5G   1010.5M  765.8M  763.3M    51.2M
totle pss : 3.682GB
iteration : 930 / 937, time : 9.704

Unless you don't use your cython object, I'm guessing that memory efficiency will definitely increase.
When you test memory, you must consider swap memory and garbage collectors.
so you must have enough memory when you testing.

This only reduces memory usage, not erasing memory altogether. If you don't have enough memory, it may seem like it doesn't help you in terms of memory.

Or as you said, det_solver.py problem also exists at the same time.
Please provide an example code and a testing code that can help you resolve the det_solver.py problem.

VladKatsman · 2024-08-26T09:47:13Z

I am sorry, i will reply to you from high level point of view, without code.
My training and val datasets are about the same size (80k .jpg images 640x640 each). During training i use single machine 3 GPU with batch size of 24 each (72 total).

That is command i used before train params:
CUDA_VISIBLE_DEVICES=0,1,2 torchrun --nproc_per_node=3

It takes about 21GB out of 24GB of each GPU memory and about 20 GB RAM. Now during evaluation, the number raises over 128 GB RAM (which my total RAM size). Your updated code did not solve that problem as well. There is still SEGFAULT error.

I've evaluated model using 1 GPU and 1 process so it took about 50 GB RAM for evaluation which is huge number as well. I dont know where to start to look for the problem, it looks like evaluation code itself is not memory efficient somehow.

If we will choose to use your project I will be happy to debug it and commit fixes and changes.

int11 added 3 commits August 12, 2024 11:09

Refactor dist_utils.py for improved readability

c3457d6

rtdetrv2_pytorch/src/zoo/rtdetr/matcher.py import scipy error

d9e4f50

fix Copy-on-read Overhead ( called memory leak in repository )

5f244e4

lyuwenyu force-pushed the main branch 6 times, most recently from c9c9e18 to 55ca98f Compare August 22, 2024 05:30

VladKatsman reviewed Aug 22, 2024

View reviewed changes

rtdetrv2_pytorch/src/data/dataset/coco_dataset.py Show resolved Hide resolved

This comment was marked as off-topic.

Sign in to view

add init

e08ef11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix import error, Copy-on-read Overhead ( called memory leak in repository ) and slightly refactor dist_utils.py for improved readability #418

fix import error, Copy-on-read Overhead ( called memory leak in repository ) and slightly refactor dist_utils.py for improved readability #418

int11 commented Aug 15, 2024

lyuwenyu commented Aug 16, 2024 •

edited

Loading

int11 commented Aug 16, 2024 •

edited

Loading

VladKatsman commented Aug 22, 2024 •

edited

Loading

This comment was marked as off-topic.

int11 commented Aug 25, 2024 •

edited

Loading

VladKatsman commented Aug 26, 2024 •

edited

Loading

fix import error, Copy-on-read Overhead ( called memory leak in repository ) and slightly refactor dist_utils.py for improved readability #418

Are you sure you want to change the base?

fix import error, Copy-on-read Overhead ( called memory leak in repository ) and slightly refactor dist_utils.py for improved readability #418

Conversation

int11 commented Aug 15, 2024

lyuwenyu commented Aug 16, 2024 • edited Loading

int11 commented Aug 16, 2024 • edited Loading

VladKatsman commented Aug 22, 2024 • edited Loading

This comment was marked as off-topic.

int11 commented Aug 25, 2024 • edited Loading

VladKatsman commented Aug 26, 2024 • edited Loading

lyuwenyu commented Aug 16, 2024 •

edited

Loading

int11 commented Aug 16, 2024 •

edited

Loading

VladKatsman commented Aug 22, 2024 •

edited

Loading

int11 commented Aug 25, 2024 •

edited

Loading

VladKatsman commented Aug 26, 2024 •

edited

Loading