
evaluation process issue (died with <signals.SIGKILL>) #135

Closed
MacavityT opened this issue Sep 17, 2020 · 3 comments

@MacavityT

When I run the test command, it works until all of the test data has been processed, but "result.pkl" cannot be saved, and I get an error message like this:

"subprocess.CalledProcessError: Command '['/home/taiyan/anaconda3/envs/mmseg/bin/python', '-u', './tools/test.py', '--local_rank=3', './configs/highway/deeplabv3plus.py', '/home/taiyan/highway-0915/work_dir_0916/iter_40000.pth', '--launcher', 'pytorch', '--out', '/home/taiyan/highway-0915/result.pkl', '--eval', 'mIoU']' died with <Signals.SIGKILL: 9>."

I then have to run the command "kill process-id" to end the process, because the GPUs are still occupied.

The same issue also occurs during training, so I disabled the evaluation process by setting "evaluation = dict(interval=0, metric='mIoU')".
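Concretely, the workaround looks like this in my config (only the evaluation setting is shown; the rest of "./configs/highway/deeplabv3plus.py" is as before):

```python
# Workaround: interval=0 skips the periodic evaluation during training,
# which avoids the SIGKILL, but also means no mIoU is reported while training.
evaluation = dict(interval=0, metric='mIoU')
```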

Please tell me how to solve it.

@xvjiarui
Collaborator

Hi @MacavityT

  1. It may be due to insufficient memory. If you are using Docker, please enlarge the shared memory.
  2. You may disable validation by passing --no-validate.

@MacavityT
Author

Hi @xvjiarui
Thank you for the help. I also found some mistakes.

1. In the file "mmseg/apis/test.py", line 53 is "out_file = osp.join(out_dir, img_meta['ori_filename'])". Since "img_meta['ori_filename']" returns an absolute path, the following "join" does not behave as "given_path + image_name"; line 53 effectively does nothing, so the original training image gets overwritten by the result image (a short demonstration is included after this list).

2. In the file "mmseg/apis/test.py", both "single_gpu_test" and "multi_gpu_test" share the same problem: each "result" is appended to "results" inside the loop, and "results" is only dumped at the end of the test process. So if the test set has 2k images, each with shape (1024, 512), the process runs out of memory (a rough sketch of a possible workaround is at the end of this comment).
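To illustrate point 1 (the paths below are made up for the example): when the second argument to osp.join is already an absolute path, everything before it is discarded, so out_dir has no effect and out_file points straight back at the original image.

```python
import os.path as osp

# Hypothetical paths, only to demonstrate the behaviour of osp.join:
out_dir = '/home/taiyan/highway-0915/show_dir'        # directory intended for the rendered results
ori_filename = '/data/highway/images/val/000001.png'  # absolute path from img_meta['ori_filename']

# Because the second argument is absolute, osp.join throws out_dir away.
out_file = osp.join(out_dir, ori_filename)
print(out_file)  # /data/highway/images/val/000001.png -> the original image gets overwritten

# One possible fix: join only the basename (or a path relative to the dataset root).
out_file = osp.join(out_dir, osp.basename(ori_filename))
print(out_file)  # /home/taiyan/highway-0915/show_dir/000001.png
```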

I hope the two problems described above are helpful.
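For point 2, here is a rough sketch of one possible workaround (not how the current code works): write each prediction to disk as soon as it is produced instead of keeping the whole "results" list in memory until the final dump. The function name, the tmp_dir argument and the file naming are made up for the sketch.

```python
import os.path as osp

import mmcv
import numpy as np
import torch


def single_gpu_test_streaming(model, data_loader, tmp_dir):
    """Memory-friendly variant of the test loop: every segmentation map is
    written to disk immediately instead of being appended to a growing list."""
    model.eval()
    mmcv.mkdir_or_exist(tmp_dir)
    result_files = []
    prog_bar = mmcv.ProgressBar(len(data_loader.dataset))
    for i, data in enumerate(data_loader):
        with torch.no_grad():
            # Same call convention as the existing single_gpu_test in mmseg/apis/test.py.
            result = model(return_loss=False, **data)
        for j, seg_map in enumerate(result):
            out_file = osp.join(tmp_dir, f'{i}_{j}.npy')
            np.save(out_file, np.asarray(seg_map))
            result_files.append(out_file)
            prog_bar.update()
    # Evaluation can later np.load() these files one at a time instead of
    # holding ~2k full-resolution maps in memory at once.
    return result_files
```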

@xvjiarui
Collaborator

Hi @MacavityT

  1. is fixed by "Use img_prefix and seg_prefix for loading" (#153).
  2. You may run a single-GPU, non-distributed test instead.
