Memory Leak when single GPU Test on 2.7k+ images #287
Comments
I have the same problem when testing with the PubLayNet dataset. Hope someone can help.
Hi @pavanteja295
Hi @xvjiarui. I have tried to stick to your exact code base, and it looks like the problem still persists when evaluating on large datasets (around 2.5k+ images). I think one of the reasons you are not observing it is probably that you have a huge amount of RAM available (around 75 GB). Could you please do me a favour and run it under less RAM, say around 30 GB? Please note that this leak won't be visible if your evaluation uses only 500 images. I can paste my memory usage here; I see a consistent increase in memory as I load images.
Hello @pavanteja295 Below is my opinion on the memory increase.

I also observed that the used RAM increases during validation (I used tools/train.py and have not yet used tools/test.py). I also observed, though I do not know why, that the used RAM decreased at some points during training. So the amount of used memory fluctuates over the whole training-and-validation cycle (for my custom dataset it fluctuates largely between 15 GB and 40 GB).

I am not familiar with semantic segmentation (mmseg is the first repository I use), but in my opinion the increase of used RAM during validation is not itself an error; it is reasonable. In semantic segmentation we need pixel-wise prediction, which carries much more information than, for example, box detection, where only five numbers (box coordinates and score) are predicted per detection. During test or validation we keep the whole model prediction in order to evaluate the score. I have not done a precise estimation myself, but if your test or validation dataset is large, the model prediction might grow to a few tens of GB, and it is kept in your RAM for the duration of the test or validation.
Hello @tetsu-kikuchi. Thanks a lot for your opinion. A summary of the issue I'm facing: even with 60-75 GB of RAM, when I try to use the script for validating around 3000 samples (tried with both multi-GPU (nn.DistributedDataParallel) and single-GPU (nn.DataParallel) testing), the entire RAM gets consumed.

Basically, what you are suggesting is that with the current version of the validation it is not possible to validate more than 3000 samples given only 60-75 GB of RAM? Is this not undesirable, and can we better optimize the evaluation scripts? Also, do you have a quick solution I can try in my case? I'm otherwise forced to switch to other code bases (which seem to handle this problem), and I really want to stick with this highly organised code base. One solution I can think of is for mIoU, which doesn't necessarily need all the predictions to be available to compute the score: one can compute the scores incrementally and discard the previous predictions.

My follow-up question: this also happens during training (using tools/train.py), where training sometimes stops after n epochs because the entire memory is consumed.
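The incremental mIoU idea suggested above can be sketched as a running confusion-matrix accumulation, so no per-image prediction ever needs to be retained. This is not mmsegmentation's actual implementation, just a minimal illustration; the function names `update_confusion` and `miou_from_confusion` are hypothetical:

```python
import numpy as np

def update_confusion(conf, pred, gt, num_classes):
    """Accumulate one image into a num_classes x num_classes confusion
    matrix (rows = ground truth, cols = prediction); the image's
    prediction can be discarded immediately afterwards."""
    mask = (gt >= 0) & (gt < num_classes)  # skip out-of-range/ignore labels
    idx = num_classes * gt[mask].astype(np.int64) + pred[mask]
    conf += np.bincount(idx, minlength=num_classes ** 2).reshape(
        num_classes, num_classes)
    return conf

def miou_from_confusion(conf):
    """Compute mIoU from the accumulated confusion matrix alone."""
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    return (inter / np.maximum(union, 1)).mean()

# Toy evaluation loop: only the small matrix persists between iterations.
conf = np.zeros((3, 3), dtype=np.int64)
for _ in range(2):
    pred = np.array([[0, 1], [2, 2]])
    gt = np.array([[0, 1], [2, 1]])
    update_confusion(conf, pred.ravel(), gt.ravel(), 3)
print(round(float(miou_from_confusion(conf)), 4))  # 0.6667
```

The per-sample memory cost then stays constant regardless of dataset size, since only a `num_classes x num_classes` integer matrix is carried across iterations.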
@pavanteja295 Thanks for your reply. I'm a novice and do not have a deep enough understanding of the problem, but I will try to answer some of your questions. My validation dataset is around 600 samples, and the RAM sometimes reaches 40 GB.
I agree that this is undesirable, and possibly we can optimize the evaluation scripts. Due to my limited experience, I do not know whether this problem is specific to the (possibly problematic) implementation of mmsegmentation or a general problem for semantic segmentation. One suggestion is to try other implementations and see whether the same problem occurs.
As you wrote, it would be good to avoid keeping the whole model prediction during test or validation. One solution would be to use a kind of running average, so that you can discard the prediction for the previous sample when you move to the next. Another solution would be to divide the test or validation dataset, save the model prediction for each split to a file (that is, on your hard disk, not in your RAM), and finally unify the whole prediction and evaluate the score yourself.
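The second suggestion (spilling predictions to disk and streaming them back) might look roughly like the following sketch. The helper names `dump_prediction` and `iter_predictions` are made up for illustration and are not part of mmsegmentation:

```python
import os
import tempfile

import numpy as np

def dump_prediction(pred, out_dir, idx):
    """Write one per-image prediction to disk and return its path,
    so that only file names (not label maps) stay in RAM."""
    path = os.path.join(out_dir, f"pred_{idx:06d}.npy")
    np.save(path, pred)
    return path

def iter_predictions(paths):
    """Stream the saved predictions back one at a time for evaluation."""
    for path in paths:
        yield np.load(path)

# Hypothetical test loop: each prediction is dumped immediately,
# so the in-memory state is just a list of short strings.
tmp_dir = tempfile.mkdtemp()
paths = [dump_prediction(np.full((4, 4), i, dtype=np.uint8), tmp_dir, i)
         for i in range(3)]
totals = [int(p.sum()) for p in iter_predictions(paths)]
print(totals)  # [0, 16, 32]
```

This trades RAM for disk I/O; the later memory-efficient test support mentioned below takes a broadly similar approach.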
I see that in the current implementation, (1) the whole model prediction is collected first and (2) the score is evaluated afterwards. So, the current implementation is simple but seems not RAM-efficient. You can rearrange the code so that step 1 and step 2 are unified in a RAM-efficient way.
Yes, true. I made the corresponding modifications to my local repository and it works now! Thanks.
@pavanteja295 I do not know the details precisely, but it seems that mmseg/apis/test.py has recently been modified to a more memory-efficient style.
Yep. We have supported memory-efficient test in #330.
The reason for the memory leak is in mmsegmentation/mmseg/apis/test.py, line 91 at commit e8cc322.

`result` is a list containing one prediction label map of np.int64 dtype (mmsegmentation/mmseg/apis/test.py, lines 125 to 131 at commit e8cc322).

It is appended to `results` on every loop iteration, which will cause a memory explosion when testing on a large dataset. To solve it, just add a line before line 125:

result = [_.astype(np.uint16) for _ in result]

This reduces RAM usage by 4x. If the num_classes of the dataset is less than 256, np.uint8 can be considered, which reduces RAM usage by 8x.
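The claimed savings follow directly from the element sizes (8 bytes for int64 vs. 2 for uint16 vs. 1 for uint8). A quick sanity check, assuming a Cityscapes-like 1024x2048 label map (the shape is illustrative, not taken from the repository):

```python
import numpy as np

# Assumed per-image prediction: a 1024x2048 label map, as test.py
# would hold per image before the dtype cast.
pred_int64 = np.zeros((1024, 2048), dtype=np.int64)   # 8 bytes per pixel
pred_uint16 = pred_int64.astype(np.uint16)            # 2 bytes per pixel
pred_uint8 = pred_int64.astype(np.uint8)              # 1 byte per pixel

print(pred_int64.nbytes // pred_uint16.nbytes)  # 4
print(pred_int64.nbytes // pred_uint8.nbytes)   # 8
```

Note that uint16 is safe as long as class indices fit in 0..65535, and uint8 as long as they fit in 0..255, which is why the uint8 variant is conditioned on the number of classes.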
Describe the bug
Testing a pre-trained Cityscapes model on a single GPU over the train images exhausts the RAM.
Reproduction
What command or script did you run?
Did you make any modifications on the code or config? Did you understand what you have modified?
Added data_path parse argument to input the path to cityscapes dataset. Changed test paths to train paths to test the model on training images.
What dataset did you use?
Cityscapes and also observed the same with a custom dataset
Environment
Please run

python mmseg/utils/collect_env.py

to collect the necessary environment information and paste it here.

sys.platform: linux
Python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
CUDA available: False
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
PyTorch: 1.3.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.4.1a0+d94043a
OpenCV: 4.4.0
MMCV: 1.2.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMSegmentation: 0.8.0+0d10921
Error traceback
I'm running my code on a single node of a headless SLURM cluster, so I cannot perform any interactive debugging. I have not made any changes to the source code except those mentioned above. I have been trying to debug this for a week but still no luck. Please let me know if you can find a solution to my problem.
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!