errors occur when using “efficient_test= True” #706

MengHao666 · 2021-07-15T01:32:00Z

2021-07-14 21:23:30,745 - mmseg - INFO - Iter [3600/1000000] lr: 3.190e-01, eta: 21 days, 7:10:22, time: 1.815, data_time: 0.069, memory: 9345, decode_0.loss_seg: 0.0155, decode_0.acc_seg: 98.4233, decode_1.loss_seg: 0.0381, decode_1.acc_seg: 98.4257, loss: 0.0536
2021-07-14 21:29:36,107 - mmseg - INFO - Iter [3800/1000000] lr: 3.189e-01, eta: 21 days, 6:46:41, time: 1.827, data_time: 0.070, memory: 9345, decode_0.loss_seg: 0.0155, decode_0.acc_seg: 98.4214, decode_1.loss_seg: 0.0381, decode_1.acc_seg: 98.4251, loss: 0.0537
[>>>>>>>>>>>>>>>>>>> ] 245570/255537, 212.8 task/s, elapsed: 1154s, ETA: 47sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 247168/255537, 212.7 task/s, elapsed: 1162s, ETA: 39sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 250896/255537, 212.7 task/s, elapsed: 1180s, ETA: 22sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 252704/255537, 212.6 task/s, elapsed: 1189s, ETA: 13sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 252720/255537, 212.6 task/s, elapsed: 1189s, ETA: 13sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 254368/255537, 212.4 task/s, elapsed: 1198s, ETA: 6sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 254976/255537, 212.3 task/s, elapsed: 1201s, ETA: 3sefficient_test= True
[>>>>>>>>>>>>>>>>>>>>] 255552/255537, 212.3 task/s, elapsed: 1204s, ETA: 0sefficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True

Traceback (most recent call last):
  File "tools/train.py", line 166, in <module>
    main()
  File "tools/train.py", line 162, in main
    meta=meta)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/apis/train.py", line 116, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/iter_based_runner.py", line 131, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/hooks/evaluation.py", line 172, in after_train_iter
    self._do_evaluate(runner)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/core/evaluation/eval_hooks.py", line 101, in _do_evaluate
    key_score = self.evaluate(runner, results)
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/hooks/evaluation.py", line 269, in evaluate
    results, logger=runner.logger, **self.eval_kwargs)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/datasets/custom.py", line 344, in evaluate
    reduce_zero_label=self.reduce_zero_label)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/core/evaluation/metrics.py", line 293, in eval_metrics
    reduce_zero_label)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/core/evaluation/metrics.py", line 124, in total_intersect_and_union
    label_map, reduce_zero_label)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/core/evaluation/metrics.py", line 55, in intersect_and_union
    pred_label = torch.from_numpy(np.load(pred_label))
  File "/mnt/lustre/share/platform/env/miniconda3.6/envs/pt1.3v1/lib/python3.6/site-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpxcgehf20.npy'
srun: error: SH-IDC1-10-5-36-198: task 0: Exited with exit code 1
srun: Terminating job step 10774941.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: *** STEP 10774941.0 ON SH-IDC1-10-5-36-198 CANCELLED AT 2021-07-15T00:34:16 ***
srun: error: Timed out waiting for job step to complete
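The traceback ends in `np.load` on a path under `/tmp`. A plausible reading, consistent with the maintainer's question about single- vs multi-node training and with the title of PR #707, is that efficient-test mode writes each prediction to a node-local temporary `.npy` file and passes only the path along, so a process on another node cannot open it. The minimal sketch below reproduces that failure mode under this assumption. `save_pred_to_tmp` and `try_load` are hypothetical helpers, not mmsegmentation's API, and deleting a directory merely stands in for a second node that cannot see the first node's `/tmp`.

```python
import shutil
import tempfile

import numpy as np


def save_pred_to_tmp(pred, tmp_dir=None):
    """Write one prediction to a temporary .npy file and return only its path.

    Hypothetical helper: it mimics keeping a file path in memory instead of
    the full prediction array.
    """
    # delete=False keeps the file on disk after the handle is closed.
    with tempfile.NamedTemporaryFile(suffix='.npy', dir=tmp_dir, delete=False) as f:
        path = f.name
    np.save(path, pred)  # path already ends in .npy, so np.save keeps it as-is
    return path


def try_load(path):
    """Load a saved prediction, or return None if this process cannot see it."""
    try:
        return np.load(path)
    except FileNotFoundError:
        return None


if __name__ == '__main__':
    # A fresh directory stands in for node B's local /tmp.
    node_b_tmp = tempfile.mkdtemp()
    path = save_pred_to_tmp(np.zeros((4, 4), dtype=np.uint8), tmp_dir=node_b_tmp)

    # Same "node": the file is readable.
    print(try_load(path).shape)  # prints (4, 4)

    # Removing the directory stands in for node A, which only received the
    # path string and cannot see node B's local /tmp.
    shutil.rmtree(node_b_tmp)
    print(try_load(path))  # prints None
```

Only the path string travels between processes, so whether evaluation succeeds depends entirely on where the file was written.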

xvjiarui · 2021-07-15T05:23:19Z

Hi @MengHao666
May I know whether you are training on single node or multi-node?

MengHao666 · 2021-07-15T05:27:37Z

I use slurm_train.sh, but it should support both multi-node and single-node training.
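If the failure comes from prediction files landing on node-local `/tmp`, one possible mitigation is to create them under a directory that every node mounts; the paths in the traceback suggest a shared `/mnt/lustre` filesystem. This is an assumption-laden illustration, not the change made in PR #707. `make_shared_tmpdir`, `save_pred`, and the `SHARED_TMP` environment variable are hypothetical names.

```python
import os
import shutil
import tempfile

import numpy as np


def make_shared_tmpdir(shared_root):
    """Create a per-run scratch directory under a root that all nodes mount."""
    os.makedirs(shared_root, exist_ok=True)
    return tempfile.mkdtemp(prefix='efficient_test_', dir=shared_root)


def save_pred(pred, tmp_dir):
    """Save one prediction under tmp_dir and return its path."""
    fd, path = tempfile.mkstemp(suffix='.npy', dir=tmp_dir)
    os.close(fd)  # np.save reopens the file by name
    np.save(path, pred)
    return path


if __name__ == '__main__':
    # SHARED_TMP is a hypothetical variable pointing at a shared filesystem;
    # the local temp dir is only a fallback for trying the sketch.
    shared_root = os.environ.get('SHARED_TMP', tempfile.gettempdir())
    run_dir = make_shared_tmpdir(shared_root)
    try:
        path = save_pred(np.ones((2, 2), dtype=np.uint8), run_dir)
        print(np.load(path).sum())  # prints 4
    finally:
        shutil.rmtree(run_dir)  # clean up once evaluation has finished
```

A per-run directory keeps concurrent jobs from colliding and makes cleanup a single `rmtree`.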

MengHao666 · 2021-07-15T05:43:39Z

@xvjiarui Hi, have you tested your code after the bug fix?

MengHao666 · 2021-07-15T07:11:10Z

After many failures, I reduced my validation set to 25k samples and switched to efficient_test=False, and now no problem occurs. I strongly suggest that you debug the efficient_test=True case in mmsegmentation yourselves.
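The workaround above can be expressed in the training config. This is a sketch assuming an mmsegmentation 0.x config in which the `evaluation` dict is forwarded to the evaluation hook and that hook accepts an `efficient_test` argument; verify both against your installed version. The `interval` value is a placeholder, and shrinking the validation set is dataset-specific, so it is not shown.

```python
# Evaluation settings in an mmsegmentation 0.x config file (sketch).
evaluation = dict(
    interval=4000,  # placeholder: evaluate every N training iterations
    metric='mIoU',
    # Keep predictions in memory instead of writing temporary .npy files.
    efficient_test=False,
)
```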

xvjiarui · 2021-07-15T18:56:41Z

Hi @MengHao666
The issue is fixed by #707.
Sorry for the inconvenience, we are planning to refactor the test and evaluation pipelines this month.

MengHao666 · 2021-07-16T01:17:55Z

> Hi @MengHao666
> The issue is fixed by #707.
> Sorry for the inconvenience, we are planning to refactor the test and evaluation pipelines this month.

Thanks for the reply. I will try it and give feedback later after testing on the test set.

xvjiarui mentioned this issue Jul 15, 2021

[Bug fix] Fix efficient test for multi-node #707

Merged

MengHao666 closed this as completed Jul 15, 2021

wjkim81 pushed a commit to wjkim81/mmsegmentation that referenced this issue Dec 3, 2023

Support MIM (open-mmlab#706)

128c1bd

* update setup.py to link or copy files required by mim into mmpose/.mim
* add MIM introduction in README

sibozhang pushed a commit to sibozhang/mmsegmentation that referenced this issue Mar 22, 2024

[Docs] CN faq (open-mmlab#706)

c195307

* cn_faq
* polish
* polish
* polish
* modfy backbone t ranslate
* fix
* polish
* polish
* polish
* polish
* polish
* polish
* polish
