Memory Leak when single GPU Test on 2.7+ images #287

Closed
pavanteja295 opened this issue Dec 1, 2020 · 12 comments

Comments

@pavanteja295

pavanteja295 commented Dec 1, 2020

Describe the bug
Testing a pre-trained Cityscapes model on the training images with a single GPU exhausts the available RAM.

Reproduction

  1. What command or script did you run?

     python tools/test.py configs/fcn/fcn_r50-d8_512x1024_80k_cityscapes.py {checkpoint} --data_path {path of the data} --eval mIoU (without distributed data parallel parameters)
    
  2. Did you make any modifications on the code or config? Did you understand what you have modified?
    Added a data_path argument for passing the path to the Cityscapes dataset. Changed the test paths to the train paths so that the model is evaluated on training images.

  3. What dataset did you use?
    Cityscapes; I also observed the same behaviour with a custom dataset.
    Environment

  4. Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here.
    sys.platform: linux
    Python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
    CUDA available: False
    GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
    PyTorch: 1.3.0
    PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.4.1a0+d94043a
OpenCV: 4.4.0
MMCV: 1.2.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMSegmentation: 0.8.0+0d10921

  5. You may add additional information that may be helpful for locating the problem, such as:
    • I tried switching to PyTorch 1.6.0 with the corresponding mmcv version, but the problem persists. I also tried the latest master; it is still the same.
    • I tried using memory_profiler to locate the memory leak, but this did not help.
    • I tried setting num_workers to 0 and LRU_CACHE_CAPACITY=1 to avoid excessive memory usage.
    • I also observed memory exhaustion during training on Cityscapes and my custom dataset. For example, 60 GB of RAM is exhausted after 20k epochs for Cityscapes.
    • Testing the model on the Cityscapes validation set also leads to a continuous increase in memory usage, but since there are only 500 val images and I have 60 GB of RAM allocated, it does not crash.

Error traceback

I'm running my code on a single node of a headless SLURM cluster, so I cannot do any interactive debugging. I have not made any changes to the source code except those mentioned above. I have been trying to debug this for a week with no luck. Please let me know if you can find a solution to my problem.

A placeholder for the traceback.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

@pavanteja295 pavanteja295 changed the title Memory Leak when single GPU Test on 5k+ images Memory Leak when single GPU Test on 2.7+ images Dec 1, 2020
@ndcuong91

I have the same problem when testing with the PubLayNet dataset. Hope someone can help.

@xvjiarui
Collaborator

Hi @pavanteja295
We haven't observed any memory leak on our side, but we are planning a more efficient inference and evaluation process.

@pavanteja295
Author

Hi @xvjiarui. I have tried sticking to your exact code base and the problem still persists when evaluating on large datasets (around 2.5k+ images). I think one reason you are not observing it is that you probably have a large amount of RAM available (around 75 GB). Could you please do me a favour and run it with less RAM, say around 30 GB? Please note that this leak won't be visible if your evaluation uses only 500 images. I can paste my memory usage here; I see a consistent increase in memory as images are loaded.

@tetsu-kikuchi

tetsu-kikuchi commented Dec 26, 2020

Hello @pavanteja295
I do not fully understand the problem, but does memory swapping not help you?


Below is my opinion on the memory increase.

I also observed that RAM usage increases during validation (I used tools/train.py; I have not yet used tools/test.py). I also observed, though I do not know why, that RAM usage decreased at certain points during training. So the amount of used memory fluctuates over the whole training and validation cycle (roughly between 15 GB and 40 GB for my custom dataset).

I am not familiar with semantic segmentation (mmseg is the first repository I have used). But in my opinion, the increase in RAM usage during validation is not an error but is to be expected: semantic segmentation requires pixel-wise predictions, which carry far more information than, for example, box detection, where only five numbers (box coordinates and a score) are predicted per detection.

During test or validation, the whole set of model predictions is kept in order to evaluate the score. I have not done a precise estimate myself, but if your test or validation dataset is large, the predictions can grow to a few tens of GB, and they are held in RAM for the duration of the test or validation.
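
For a rough sense of scale, a back-of-envelope estimate (my numbers, assuming full-resolution 1024x2048 Cityscapes predictions stored as np.int64, one per training image):

# Rough estimate of the RAM needed to hold every raw prediction at once.
height, width = 1024, 2048     # Cityscapes output resolution (assumed)
bytes_per_pixel = 8            # np.int64
num_images = 2975              # Cityscapes train split
total_gib = height * width * bytes_per_pixel * num_images / 1024**3
print(f"~{total_gib:.1f} GiB just for the raw predictions")  # ~46.5 GiB

That is already in the same ballpark as the 60 GB of RAM mentioned above, before counting the model, the dataloader workers, and everything else.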

@pavanteja295
Author

pavanteja295 commented Dec 26, 2020

Hello @tetsu-kikuchi. Thanks a lot for your opinion. Summary of the issue I'm facing: even with 60-75 GB of RAM, when I use the script to validate around 3000 samples (tried with both multi-GPU (nn.DistributedDataParallel) and single-GPU (nn.DataParallel) testing), the entire RAM gets consumed. Basically, what you are suggesting is that with the current version of the evaluation it is not possible to validate more than 3000 samples given only 60-75 GB of RAM? Isn't this undesirable, and can we better optimize the evaluation scripts? Also, do you have a quick solution I can try? Otherwise I'm forced to switch to other code bases (which seem to handle this problem), and I really want to stick with this highly organised one. One solution I can think of is that mIoU does not actually need all the predictions to be available at once to compute the score, so one could compute the score incrementally and discard previous predictions.

My follow-up question: this happens even during training (using tools/train.py), where training sometimes stops after n epochs due to consumption of the entire memory.

@tetsu-kikuchi

tetsu-kikuchi commented Dec 26, 2020

@pavanteja295 Thanks for your reply. I'm a novice and do not have a full understanding of the problem, but I will try to give my opinion on some of your questions.

My validation dataset is around 600 samples, and RAM usage sometimes reaches 40 GB.

Isn't this undesirable, and can we better optimize the evaluation scripts?

I agree that this is undesirable, and we can possibly optimize the evaluation scripts. Due to my limited experience, I do not know whether this problem is specific to the (possibly problematic) implementation in mmsegmentation or a general problem in semantic segmentation. One suggestion is to try other implementations and see whether the same problem occurs.

do you have a quick solution I can try
one could compute the score incrementally and discard previous predictions

As you wrote, it would be good to avoid keeping all the model predictions during test or validation. One solution would be a kind of running accumulation, so that you can discard the prediction for the previous sample when you move to the next. Another solution would be to split the test or validation dataset, save the model predictions for each split to disk (not RAM), and finally unify the predictions and evaluate the score yourself; a rough sketch of this is below.
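
A minimal sketch of the second idea (the helper names here are just illustrative, not part of mmsegmentation's API): write each prediction to disk as it is produced, then stream the files back one at a time when computing the score.

import os
import numpy as np

def dump_predictions(model_outputs, out_dir):
    # model_outputs: iterable of 2-D numpy label maps, one per image
    os.makedirs(out_dir, exist_ok=True)
    for idx, pred in enumerate(model_outputs):
        # uint16 keeps the files small; int64 would be 4x larger
        np.save(os.path.join(out_dir, f"{idx:06d}.npy"), pred.astype(np.uint16))

def iter_predictions(out_dir):
    # yield one prediction at a time so only one lives in RAM
    for name in sorted(os.listdir(out_dir)):
        yield np.load(os.path.join(out_dir, name))

The evaluation loop would then consume iter_predictions(out_dir) instead of an in-memory list of results.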

@tetsu-kikuchi

tetsu-kikuchi commented Dec 29, 2020

I see that in the current implementation:
1. Model prediction is first done for all test samples.
https://github.com/open-mmlab/mmsegmentation/blob/master/mmseg/apis/test.py#L31
2. Then the overall predictions are split back into per-sample predictions, and IoU is accumulated sample by sample.
https://github.com/open-mmlab/mmsegmentation/blob/master/mmseg/core/evaluation/metrics.py#L62

So the current implementation is simple but does not seem RAM-efficient. You can rearrange the code so that steps 1 and 2 are unified in a RAM-efficient way, roughly as sketched below.
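
A minimal sketch of what such a rearrangement could look like (plain NumPy, not mmsegmentation's actual API): run inference and immediately fold each prediction into running per-class intersection and union counts, so nothing is kept once a sample has been scored.

import numpy as np

def evaluate_streaming(predict_fn, samples, num_classes, ignore_index=255):
    # predict_fn(image) -> 2-D label map; samples yields (image, gt) pairs
    total_intersect = np.zeros(num_classes, dtype=np.int64)
    total_union = np.zeros(num_classes, dtype=np.int64)
    bins = np.arange(num_classes + 1)
    for image, gt in samples:
        pred = predict_fn(image)
        mask = gt != ignore_index
        pred, gt = pred[mask], gt[mask]
        intersect = np.histogram(pred[pred == gt], bins=bins)[0]
        area_pred = np.histogram(pred, bins=bins)[0]
        area_gt = np.histogram(gt, bins=bins)[0]
        total_intersect += intersect
        total_union += area_pred + area_gt - intersect
        # pred is discarded here; only two small per-class arrays persist
    # classes that never appear get IoU 0 here (mmseg reports NaN instead)
    iou = total_intersect / np.maximum(total_union, 1)
    return float(iou.mean())  # mIoU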

@pavanteja295
Author

Yes, true. I made the corresponding modifications in my local repository and it works now! Thanks.

@tetsu-kikuchi

@pavanteja295
I'm glad it helped.
It seems I pasted the wrong link for the first URL in my comment. I have edited it.

@tetsu-kikuchi

tetsu-kikuchi commented Jan 13, 2021

I do not know the details, but it seems that mmseg/apis/test.py has recently been modified to a more memory-efficient style.
#330
https://github.com/open-mmlab/mmsegmentation/commits/ce46d70d2080e00d40d739b1033e4a07ed016388/mmseg/apis/test.py

@xvjiarui
Collaborator

Yep. We have supported memory-efficient test in #330.

@MELSunny

MELSunny commented Mar 8, 2022

The reason for the memory blow-up is this line:

result = model(return_loss=False, **data)

result is a list containing one prediction label map of type np.int64.
if pre_eval:
    # TODO: adapt samples_per_gpu > 1.
    # only samples_per_gpu=1 valid now
    result = dataset.pre_eval(result, indices=batch_indices)
    results.extend(result)
else:
    results.extend(result)

It is appended to results on every iteration, which causes a memory explosion when testing on a large dataset.
To solve it, just add a line before line 125:
result = [_.astype(np.uint16) for _ in result]

This reduces RAM usage by 4x. If the dataset's num_classes is less than 254, np.uint8 can be used instead, which reduces RAM usage by 8x.
#189
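
To put rough numbers on the dtype suggestion (illustrative, assuming full-resolution 1024x2048 label maps):

pixels = 1024 * 2048
for dtype_name, nbytes in [("int64", 8), ("uint16", 2), ("uint8", 1)]:
    print(f"{dtype_name}: {pixels * nbytes / 2**20:.0f} MiB per prediction")
# int64: 16 MiB, uint16: 4 MiB, uint8: 2 MiB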
