errors occur when using “efficient_test= True” #706
Comments
Hi @MengHao666
I use slurm_train.sh, but it should support both multi-node and single-node training.
@xvjiarui Hi, have you tested your code after the bug fix?
After a lot of failures, I turned to shrinking my validation set to 25k samples in
Hi @MengHao666
Thanks for the reply. I will try it and give feedback later when testing on the test set.
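For reproduction context, efficient_test is normally switched on through the evaluation settings of the training config. The sketch below is an assumption about how it was enabled here; the interval and metric values are illustrative and not taken from this issue.

```python
# Hypothetical excerpt from the training config (values are illustrative).
# efficient_test=True is forwarded to dataset.evaluate(), which makes the
# evaluation pipeline cache per-image predictions as temporary .npy files
# instead of keeping them all in memory.
evaluation = dict(interval=4000, metric='mIoU', efficient_test=True)
```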
2021-07-14 21:23:30,745 - mmseg - INFO - Iter [3600/1000000] lr: 3.190e-01, eta: 21 days, 7:10:22, time: 1.815, data_time: 0.069, memory: 9345, decode_0.loss_seg: 0.0155, decode_0.acc_seg: 98.4233, decode_1.loss_seg: 0.0381, decode_1.acc_seg: 98.4257, loss: 0.0536
2021-07-14 21:29:36,107 - mmseg - INFO - Iter [3800/1000000] lr: 3.189e-01, eta: 21 days, 6:46:41, time: 1.827, data_time: 0.070, memory: 9345, decode_0.loss_seg: 0.0155, decode_0.acc_seg: 98.4214, decode_1.loss_seg: 0.0381, decode_1.acc_seg: 98.4251, loss: 0.0537
[>>>>>>>>>>>>>>>>>>> ] 245570/255537, 212.8 task/s, elapsed: 1154s, ETA: 47sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 247168/255537, 212.7 task/s, elapsed: 1162s, ETA: 39sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 250896/255537, 212.7 task/s, elapsed: 1180s, ETA: 22sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 252704/255537, 212.6 task/s, elapsed: 1189s, ETA: 13sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 252720/255537, 212.6 task/s, elapsed: 1189s, ETA: 13sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 254368/255537, 212.4 task/s, elapsed: 1198s, ETA: 6sefficient_test= True
[>>>>>>>>>>>>>>>>>>> ] 254976/255537, 212.3 task/s, elapsed: 1201s, ETA: 3sefficient_test= True
[>>>>>>>>>>>>>>>>>>>>] 255552/255537, 212.3 task/s, elapsed: 1204s, ETA: 0sefficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
efficient_test= True
Traceback (most recent call last):
  File "tools/train.py", line 166, in <module>
    main()
  File "tools/train.py", line 162, in main
    meta=meta)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/apis/train.py", line 116, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/iter_based_runner.py", line 131, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/hooks/evaluation.py", line 172, in after_train_iter
    self._do_evaluate(runner)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/core/evaluation/eval_hooks.py", line 101, in _do_evaluate
    key_score = self.evaluate(runner, results)
  File "/mnt/lustre/menghao/.local/lib/python3.6/site-packages/mmcv/runner/hooks/evaluation.py", line 269, in evaluate
    results, logger=runner.logger, **self.eval_kwargs)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/datasets/custom.py", line 344, in evaluate
    reduce_zero_label=self.reduce_zero_label)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/core/evaluation/metrics.py", line 293, in eval_metrics
    reduce_zero_label)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/core/evaluation/metrics.py", line 124, in total_intersect_and_union
    label_map, reduce_zero_label)
  File "/mnt/lustre/menghao/projects/mmsegmentation/mmseg/core/evaluation/metrics.py", line 55, in intersect_and_union
    pred_label = torch.from_numpy(np.load(pred_label))
  File "/mnt/lustre/share/platform/env/miniconda3.6/envs/pt1.3v1/lib/python3.6/site-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpxcgehf20.npy'
srun: error: SH-IDC1-10-5-36-198: task 0: Exited with exit code 1
srun: Terminating job step 10774941.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: *** STEP 10774941.0 ON SH-IDC1-10-5-36-198 CANCELLED AT 2021-07-15T00:34:16 ***
srun: error: Timed out waiting for job step to complete
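The traceback points at the failure mode: with efficient_test enabled, each prediction is dumped to a temporary .npy file and only the file path is kept, so eval_metrics later has to np.load it. Below is a minimal sketch of that pattern; the helper name np2tmp and the node-local /tmp reasoning are my assumptions, not confirmed in this thread. It illustrates why a multi-node slurm run can hit FileNotFoundError: the rank doing the evaluation may not share /tmp with the ranks that wrote the files, or the files may already be gone when evaluation runs.

```python
import os
import tempfile

import numpy as np
import torch


def np2tmp(array, tmpdir=None):
    """Hypothetical helper mirroring what efficient_test does: save a
    prediction to a temporary .npy file and return only the file path."""
    fd, path = tempfile.mkstemp(suffix='.npy', dir=tmpdir)
    os.close(fd)
    np.save(path, array)
    return path


# During testing, results hold file paths instead of arrays.
results = [np2tmp(np.random.randint(0, 19, (512, 512))) for _ in range(4)]

# At evaluation time, intersect_and_union reloads each prediction.
for pred_label in results:
    if isinstance(pred_label, str):
        # This np.load is where the reported FileNotFoundError is raised if
        # the .npy file lives in another node's local /tmp (or was removed).
        pred_label = torch.from_numpy(np.load(pred_label))
```

If this reading is right, pointing the temporary files at shared storage, or skipping efficient_test for training-time evaluation, should avoid the missing file; that is a hypothesis based on the traceback, not something verified here.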