A problem occurred when training with the MSRS dataset #11
Comments
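My program runs on Ubuntu and needs two or more GPUs to train. I am not sure whether your problem is caused by Windows.
OK, I think I understand now. Most of the errors have been fixed, thanks.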
I changed the epoch setting in main_train_swinfusion.py as follows: But after training for 1000 iterations, an error occurred: Traceback (most recent call last):
This error comes from loading the dataset. Make sure A_dir and B_dir contain the same number of files and that the file names match.
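The check suggested above can be sketched in Python. This is a minimal sketch, not code from the repository: the helper name `compare_image_dirs` is hypothetical, and it assumes that paired images are expected to have exactly the same file name (including extension) in both folders.

```python
from pathlib import Path


def compare_image_dirs(a_dir, b_dir):
    """Return (only_in_a, only_in_b): sorted file names found in just one folder.

    Hypothetical helper; assumes paired images share the same file name.
    """
    a_names = {p.name for p in Path(a_dir).iterdir() if p.is_file()}
    b_names = {p.name for p in Path(b_dir).iterdir() if p.is_file()}
    # Names present in one folder but missing from the other
    return sorted(a_names - b_names), sorted(b_names - a_names)


# Example call (the paths are placeholders, replace them with your A_dir and B_dir):
# only_in_a, only_in_b = compare_image_dirs("path/to/A_dir", "path/to/B_dir")
# print(only_in_a, only_in_b)
```

If both returned lists are empty, the two folders hold the same set of file names, which also means the file counts match.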
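Hello, I ran into the same problem during training. How exactly did you solve it?
Has this been solved? I ran into the same problem too.
Has this been solved? I ran into the same problem on Windows too.
Hello, what machine are you using? Why does it train so fast?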
(swinfus) PS D:\Python\SwinFusion-master> python -m torch.distributed.launch --nproc_per_node=3 --master_port=1234 main_train_swinfusion.py --opt options/swinir/train_swinfusion_vif.json --dist True
NOTE: Redirects are currently not supported in Windows or MacOs.
C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [nj.baidupcs.com]:1234 (system error: 10049 - 在其上下文中,该 请求的地址无效。).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [nj.baidupcs.com]:1234 (system error: 10049 - 在其上下文中,该 请求的地址无效。).
usage: main_train_swinfusion.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_swinfusion.py: error: unrecognized arguments: --local-rank=0
usage: main_train_swinfusion.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_swinfusion.py: error: unrecognized arguments: --local-rank=2
usage: main_train_swinfusion.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_swinfusion.py: error: unrecognized arguments: --local-rank=1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 23612) of binary: C:\Anaconda3\envs\swinfus\python.exe
Traceback (most recent call last):
File "C:\Anaconda3\envs\swinfus\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Anaconda3\envs\swinfus\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py", line 196, in
main()
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py", line 192, in main
launch(args)
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py", line 177, in launch
run(args)
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launcher\api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
main_train_swinfusion.py FAILED
Failures:
[1]:
time : 2023-04-10_18:38:44
host : LAPTOP-0K7VHP1C
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 21040)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-04-10_18:38:44
host : LAPTOP-0K7VHP1C
rank : 2 (local_rank: 2)
exitcode : 2 (pid: 25444)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-04-10_18:38:44
host : LAPTOP-0K7VHP1C
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 23612)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
The error log is shown above. What is the cause, and how can it be fixed? Thanks.
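One thing the log itself shows: the launcher passes `--local-rank=0/1/2` (with a hyphen) to each process, while the script's usage line only lists `--local_rank` (with an underscore), so argparse exits with code 2 on every rank. A possible workaround, sketched below, is to accept both spellings and fall back to the `LOCAL_RANK` environment variable, as the FutureWarning in the log recommends. This is an assumption-laden sketch, not the repository's actual parser: the function name `parse_args` and all default values are hypothetical, and only the option names are taken from the usage line in the log.

```python
import argparse
import os


def parse_args(argv=None):
    """Hypothetical parser mirroring the options shown in the usage line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--opt", type=str, default=None, help="path to the option JSON file")
    parser.add_argument("--launcher", default="pytorch")  # default is an assumption
    # Accept both the underscore and the hyphen spelling of the same option
    parser.add_argument("--local_rank", "--local-rank", dest="local_rank",
                        type=int, default=None)
    parser.add_argument("--dist", default=False)  # default is an assumption
    args = parser.parse_args(argv)
    # If the launcher did not pass the flag, read the rank from the environment
    if args.local_rank is None:
        args.local_rank = int(os.environ.get("LOCAL_RANK", 0))
    return args
```

With a parser like this, `--local-rank=1` and `--local_rank 1` both yield `args.local_rank == 1`, and when neither flag is given the value comes from `os.environ['LOCAL_RANK']`.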