errors occur when using “efficient_test= True” #706

Closed
MengHao666 opened this issue Jul 15, 2021 · 6 comments · Fixed by #707

Comments

@MengHao666

2021-07-14 21:23:30,745 - mmseg - INFO - Iter [3600/1000000] lr: 3.190e-01, eta: 21 days, 7:10:22, time: 1.815, data_time: 0.069, memory: 9345, decode_0.loss_seg: 0.0155, decode_0.acc_seg: 98.4233, decode_1.loss_seg: 0.0381, decode_1.acc_seg: 98.4257, loss: 0.0536
2021-07-14 21:29:36,107 - mmseg - INFO - Iter [3800/1000000] lr: 3.189e-01, eta: 21 days, 6:46:41, time: 1.827, data_time: 0.070, memory: 9345, decode_0.loss_seg: 0.0155, decode_0.acc_seg: 98.4214, decode_1.loss_seg: 0.0381, decode_1.acc_seg: 98.4251, loss: 0.0537
[>>>>>>>>>>>>>>>>>>> ] 245570/255537, 212.8 task/s, elapsed: 1154s, ETA: 47sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 247168/255537, 212.7 task/s, elapsed: 1162s, ETA: 39sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 250896/255537, 212.7 task/s, elapsed: 1180s, ETA: 22sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 252704/255537, 212.6 task/s, elapsed: 1189s, ETA: 13sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 252720/255537, 212.6 task/s, elapsed: 1189s, ETA: 13sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 254368/255537, 212.4 task/s, elapsed: 1198s, ETA: 6sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 254976/255537, 212.3 task/s, elapsed: 1201s, ETA: 3sefficient_test= True
[>>>>>>>>>>>>>>>>>>>>] 255552/255537, 212.3 task/s, elapsed: 1204s, ETA: 0sefficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True

Traceback (most recent call last):
  File "tools/train.py", line 166, in <module>
    main()
  File "tools/train.py", line 162, in main
    meta=meta)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/apis/train.py", line 116, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/iter_based_runner.py", line 131, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/hooks/evaluation.py", line 172, in after_train_iter
    self._do_evaluate(runner)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/core/evaluation/eval_hooks.py", line 101, in _do_evaluate
    key_score = self.evaluate(runner, results)
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/hooks/evaluation.py", line 269, in evaluate
    results, logger=runner.logger, **self.eval_kwargs)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/datasets/custom.py", line 344, in evaluate
    reduce_zero_label=self.reduce_zero_label)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/core/evaluation/metrics.py", line 293, in eval_metrics
    reduce_zero_label)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/core/evaluation/metrics.py", line 124, in total_intersect_and_union
    label_map, reduce_zero_label)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/core/evaluation/metrics.py", line 55, in intersect_and_union
    pred_label = torch.from_numpy(np.load(pred_label))
  File "/mnt/lustre/share/platform/env/miniconda3.6/envs/pt1.3v1/lib/python3.6/site-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpxcgehf20.npy'
srun: error: SH-IDC1-10-5-36-198: task 0: Exited with exit code 1
srun: Terminating job step 10774941.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: *** STEP 10774941.0 ON SH-IDC1-10-5-36-198 CANCELLED AT 2021-07-15T00:34:16 ***
srun: error: Timed out waiting for job step to complete

@xvjiarui
Collaborator

Hi @MengHao666
May I know whether you are training on a single node or on multiple nodes?

@MengHao666
Author

MengHao666 commented Jul 15, 2021

I use slurm_train.sh, which should support both multi-node and single-node training.

@MengHao666
Author

@xvjiarui Hi, have you tested your code after the bug fix?

@MengHao666
Author

MengHao666 commented Jul 15, 2021

After many failures, I reduced my validation set to 25k samples and set efficient_test=False, and the problem no longer occurs. I strongly suggest that the mmsegmentation team debug the efficient_test=True case yourselves.

@xvjiarui
Collaborator

xvjiarui commented Jul 15, 2021

Hi @MengHao666
The issue is fixed by #707.
Sorry for the inconvenience. We are planning to refactor the test and evaluation pipelines this month.

@MengHao666
Author

MengHao666 commented Jul 16, 2021

> Hi @MengHao666
> The issue is fixed by #707.
> Sorry for the inconvenience. We are planning to refactor the test and evaluation pipelines this month.

Thanks for the reply. I will try it and report back after evaluating on the test set.
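
For readers who hit the same FileNotFoundError: the traceback shows a prediction being read back with np.load from a temporary .npy file under /tmp. The sketch below reproduces that save-then-load pattern in isolation. It is not the mmsegmentation code and not the change made in #707. The helper names save_prediction and load_prediction and the tmpdir parameter are hypothetical, and it assumes (the thread does not confirm this) that the file was missing because /tmp is local to one node or was cleaned up before evaluation, so that writing to a directory on shared storage might avoid the problem.

```python
import os
import tempfile

import numpy as np


def save_prediction(pred, tmpdir=None):
    """Write one prediction array to a .npy file and return its path.

    tmpdir=None falls back to the system temp directory (often a
    node-local /tmp). Passing a directory on shared storage is the
    hypothetical alternative illustrated here.
    """
    if tmpdir is not None:
        os.makedirs(tmpdir, exist_ok=True)
    # mkstemp creates the file and returns an open descriptor; close it
    # so np.save can reopen the path by name.
    fd, path = tempfile.mkstemp(suffix=".npy", dir=tmpdir)
    os.close(fd)
    np.save(path, pred)
    return path


def load_prediction(path):
    """Load a saved prediction, with a clearer error if the file is gone."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"prediction file {path!r} is missing; it may have been written "
            "on another node or removed before evaluation"
        )
    return np.load(path)


if __name__ == "__main__":
    pred = np.zeros((4, 4), dtype=np.uint8)
    # Stand-in for a shared directory; on a cluster this would be a path
    # that every node can see.
    shared_dir = os.path.join(tempfile.gettempdir(), "shared_eval_demo")
    path = save_prediction(pred, tmpdir=shared_dir)
    assert np.array_equal(load_prediction(path), pred)
    os.remove(path)
```

Whether this helps depends on the real cause in your setup, so checking that the .npy path exists on the node that runs the evaluation is a reasonable first step.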
