Problem when training with the MSRS dataset #11

Amano-Hina · 2023-04-10T10:48:34Z

(swinfus) PS D:\Python\SwinFusion-master> python -m torch.distributed.launch --nproc_per_node=3 --master_port=1234 main_train_swinfusion.py --opt options/swinir/train_swinfusion_vif.json --dist True
NOTE: Redirects are currently not supported in Windows or MacOs.
C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [nj.baidupcs.com]:1234 (system error: 10049 - 在其上下文中，该请求的地址无效。).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [nj.baidupcs.com]:1234 (system error: 10049 - 在其上下文中，该请求的地址无效。).
usage: main_train_swinfusion.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_swinfusion.py: error: unrecognized arguments: --local-rank=0
usage: main_train_swinfusion.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_swinfusion.py: error: unrecognized arguments: --local-rank=2
usage: main_train_swinfusion.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_swinfusion.py: error: unrecognized arguments: --local-rank=1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 23612) of binary: C:\Anaconda3\envs\swinfus\python.exe
Traceback (most recent call last):
File "C:\Anaconda3\envs\swinfus\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Anaconda3\envs\swinfus\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py", line 196, in
main()
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py", line 192, in main
launch(args)
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py", line 177, in launch
run(args)
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launcher\api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main_train_swinfusion.py FAILED

Failures:
[1]:
time : 2023-04-10_18:38:44
host : LAPTOP-0K7VHP1C
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 21040)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-04-10_18:38:44
host : LAPTOP-0K7VHP1C
rank : 2 (local_rank: 2)
exitcode : 2 (pid: 25444)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-04-10_18:38:44
host : LAPTOP-0K7VHP1C
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 23612)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

The error log is shown above. What is causing it, and how can it be fixed? Thanks.
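In the log above, the launcher passes `--local-rank=N` (with a hyphen), while the script's usage line only lists `--local_rank` (with an underscore), which is what argparse rejects with exit code 2. A minimal sketch of one possible workaround follows, assuming the script parses its arguments with `argparse`. The helper name `parse_local_rank` is hypothetical and not part of the repository; it accepts both spellings and falls back to the `LOCAL_RANK` environment variable mentioned in the deprecation warning.

```python
import argparse
import os


def parse_local_rank(argv=None):
    """Return the local rank from the CLI or the LOCAL_RANK environment variable.

    Hypothetical helper for illustration; the real script defines further
    arguments (--opt, --launcher, --dist) that are left untouched here.
    """
    parser = argparse.ArgumentParser()
    # Register both spellings so either launcher convention is accepted.
    parser.add_argument(
        "--local_rank",
        "--local-rank",
        dest="local_rank",
        type=int,
        # Fall back to the environment variable, then to rank 0.
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    # parse_known_args ignores the script's other options in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank
```

In the actual script, the same effect could probably be had by adding `"--local-rank"` as a second option string to the existing `--local_rank` argument. This only addresses the argument error; it does not address the socket warnings in the same log.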

Linfeng-Tang · 2023-04-10T11:56:03Z

My code runs on Ubuntu and needs two or more GPUs to train. I am not sure whether your problem is caused by Windows.

Amano-Hina · 2023-04-10T12:02:02Z

OK, I think I roughly understand now. I have fixed most of the errors, thanks.

Amano-Hina · 2023-05-06T07:09:24Z

I changed the epoch loop in main_train_swinfusion.py as follows:
for epoch in range(10000):  # keep running
    for i, train_data in enumerate(train_loader):

But after training for 1000 iterations, the following error occurs:
23-05-06 14:31:05.347 : <epoch: 1, iter: 10,200, lr:2.000e-05> G_loss: 2.721e+00 Text_loss: 7.241e-01 Int_loss: 1.415e-01 SSIM_loss: 1.855e+00
23-05-06 14:34:08.261 : <epoch: 3, iter: 10,400, lr:2.000e-05> G_loss: 3.314e+00 Text_loss: 6.116e-01 Int_loss: 2.091e-01 SSIM_loss: 2.493e+00
23-05-06 14:37:10.746 : <epoch: 5, iter: 10,600, lr:2.000e-05> G_loss: 2.738e+00 Text_loss: 5.933e-01 Int_loss: 2.515e-01 SSIM_loss: 1.893e+00
23-05-06 14:40:13.426 : <epoch: 7, iter: 10,800, lr:2.000e-05> G_loss: 3.229e+00 Text_loss: 7.836e-01 Int_loss: 1.896e-01 SSIM_loss: 2.256e+00
23-05-06 14:43:16.382 : <epoch: 9, iter: 11,000, lr:2.000e-05> G_loss: 2.725e+00 Text_loss: 5.621e-01 Int_loss: 1.276e-01 SSIM_loss: 2.035e+00
23-05-06 14:43:16.382 : Saving the model. Save path is:Model/Infrared_Visible_Fusion/Infrared_Visible_Fusion/models/11000_E.pth

Traceback (most recent call last):
File "main_train_swinfusion.py", line 260, in
main()
File "main_train_swinfusion.py", line 223, in main
for test_data in test_loader:
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 634, in next
data = self._next_data()
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
return self._process_data(data)
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
data.reraise()
File "/root/miniconda3/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/autodl-tmp/SwinFusion/data/dataset_wogt.py", line 42, in getitem
B_path = self.paths_B[index]
IndexError: list index out of range

Linfeng-Tang · 2023-05-06T07:59:00Z

This is a problem with loading the dataset. Make sure that A_dir and B_dir contain the same number of files and that the file names match.
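The check suggested above can be sketched as a small script. This is only an illustration under assumptions: the function names and the extension list are hypothetical, and the two directory arguments stand for whatever A_dir and B_dir point to in your options file. An `IndexError` on `self.paths_B[index]` is consistent with the B list being shorter than the A list, and a comparison like this would show which files are unpaired.

```python
import os

# Assumed set of image extensions; adjust to match your dataset.
IMG_EXTENSIONS = (".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff")


def list_images(directory):
    """Return the sorted image file names found directly inside `directory`."""
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(IMG_EXTENSIONS)
    )


def compare_pair_dirs(a_dir, b_dir):
    """Return (names only in a_dir, names only in b_dir) as sorted lists."""
    names_a = set(list_images(a_dir))
    names_b = set(list_images(b_dir))
    return sorted(names_a - names_b), sorted(names_b - names_a)
```

Calling `compare_pair_dirs(a_dir, b_dir)` on the test-set directories should return two empty lists when every image has a counterpart with the same name. Any names it reports are candidates to rename, add, or remove.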

HaHaFish1014 · 2023-05-29T13:43:56Z

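Quoting Amano-Hina: "OK, I think I roughly understand now. I have fixed most of the errors, thanks."

Hi, I ran into the same problem during training. How exactly did you solve it?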

QiumanZeng · 2023-12-21T05:27:06Z

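Has this been solved? I have run into the same problem too.

Quoting Amano-Hina: "OK, I think I roughly understand now. I have fixed most of the errors, thanks."

Quoting HaHaFish1014: "Hi, I ran into the same problem during training. How exactly did you solve it?"

Has this been solved? I also ran into the same problem on Windows.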

xfsv · 2024-03-11T11:34:21Z

Hi, may I ask what machine you are using? How does it train so fast?
