Description
Describe the bug
Testing a pre-trained Cityscapes model on the training images with a single GPU leads to exhaustion of host RAM.
Reproduction
- What command or script did you run?
python tools/test.py configs/fcn/fcn_r50-d8_512x1024_80k_cityscapes.py {checkpoint} --data_path {path of the data} --eval mIoU (without the distributed data parallel launcher/params)
- Did you make any modifications on the code or config? Did you understand what you have modified?
I added a --data_path argparse argument to pass the root of the Cityscapes dataset, and changed the test split paths in the config to the train split paths so that the model is evaluated on the training images (a sketch of this change is shown below).
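For reference, this is roughly what my local modification to tools/test.py looks like (a minimal sketch only; the --data_path flag and the exact config keys I override are my own additions, not upstream mmsegmentation code):

```python
# Sketch of my local modification to tools/test.py (the --data_path flag and
# the train-split override are my additions, not upstream mmsegmentation code).
import argparse
import mmcv

parser = argparse.ArgumentParser()
parser.add_argument('config')
parser.add_argument('checkpoint')
parser.add_argument('--data_path', help='root directory of the Cityscapes dataset')
parser.add_argument('--eval', nargs='+')
args = parser.parse_args()

cfg = mmcv.Config.fromfile(args.config)
# Redirect the "test" split to the training images/annotations so that
# tools/test.py evaluates the checkpoint on the train set instead of val.
cfg.data.test.data_root = args.data_path
cfg.data.test.img_dir = 'leftImg8bit/train'
cfg.data.test.ann_dir = 'gtFine/train'
```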
- What dataset did you use?
Cityscapes; I also observed the same behaviour with a custom dataset.
Environment
- Please run python mmseg/utils/collect_env.py to collect the necessary environment information and paste it here.
sys.platform: linux
Python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
CUDA available: False
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
PyTorch: 1.3.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.4.1a0+d94043a
OpenCV: 4.4.0
MMCV: 1.2.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMSegmentation: 0.8.0+0d10921
- You may add additional information that may be helpful for locating the problem, such as:
- I tried switching to PyTorch 1.6.0 together with the corresponding MMCV version, but the problem persists. I also tried the latest master and it is still the same.
- I tried using memory_profiler to locate the memory leak, but it did not help.
- I tried setting num_workers to 0 and also LRU_CACHE_CAPACITY=1 to avoid excessive memory usage (a sketch of these attempts is shown after this list).
- I also observed memory exhaustion while training the model on Cityscapes and on my custom dataset; for example, 60 GB of RAM is exhausted after about 20k iterations on Cityscapes.
- Testing the model on the Cityscapes validation set also leads to a continuous increase in memory usage, but since there are only 500 val images and I have 60 GB of RAM allocated, it does not crash.
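For completeness, this is roughly how I applied the memory_profiler and num_workers/LRU_CACHE_CAPACITY attempts listed above (a minimal sketch; the wrapper function and the places I patched are my own, not upstream code):

```python
# Sketch of the debugging attempts listed above (my own wrappers around
# mmsegmentation; not upstream code).
import os

from memory_profiler import profile
from mmcv import Config
from mmseg.apis import single_gpu_test

# Env-var workaround suggested in PyTorch DataLoader memory-growth threads.
os.environ['LRU_CACHE_CAPACITY'] = '1'

def build_cfg(config_path, data_root):
    """Load the config and disable DataLoader worker processes."""
    cfg = Config.fromfile(config_path)
    cfg.data.workers_per_gpu = 0  # rule out copy-on-write growth in workers
    cfg.data.test.data_root = data_root
    return cfg

@profile  # line-by-line RSS report from memory_profiler; did not pinpoint the leak
def run_eval(model, data_loader):
    return single_gpu_test(model, data_loader)
```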
Error traceback
I'm running my code on a single node of a headless SLURM cluster, so I cannot do any interactive debugging. I have not made any changes to the source code except those mentioned above. I have been trying to debug this for a week with no luck. Please let me know if you can find a solution to my problem.
A placeholder for the traceback.
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!