Problem when training with the MSRS dataset #11

Open
Amano-Hina opened this issue Apr 10, 2023 · 7 comments

Comments

@Amano-Hina

(swinfus) PS D:\Python\SwinFusion-master> python -m torch.distributed.launch --nproc_per_node=3 --master_port=1234 main_train_swinfusion.py --opt options/swinir/train_swinfusion_vif.json --dist True
NOTE: Redirects are currently not supported in Windows or MacOs.
C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [nj.baidupcs.com]:1234 (system error: 10049 - 在其上下文中,该 请求的地址无效。).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [nj.baidupcs.com]:1234 (system error: 10049 - 在其上下文中,该 请求的地址无效。).
usage: main_train_swinfusion.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_swinfusion.py: error: unrecognized arguments: --local-rank=0
usage: main_train_swinfusion.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_swinfusion.py: error: unrecognized arguments: --local-rank=2
usage: main_train_swinfusion.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_swinfusion.py: error: unrecognized arguments: --local-rank=1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 23612) of binary: C:\Anaconda3\envs\swinfus\python.exe
Traceback (most recent call last):
File "C:\Anaconda3\envs\swinfus\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Anaconda3\envs\swinfus\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py", line 196, in <module>
main()
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py", line 192, in main
launch(args)
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py", line 177, in launch
run(args)
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main_train_swinfusion.py FAILED

Failures:
[1]:
time : 2023-04-10_18:38:44
host : LAPTOP-0K7VHP1C
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 21040)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-04-10_18:38:44
host : LAPTOP-0K7VHP1C
rank : 2 (local_rank: 2)
exitcode : 2 (pid: 25444)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-04-10_18:38:44
host : LAPTOP-0K7VHP1C
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 23612)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

The error log is shown above. What is the cause, and how can it be fixed? Thanks.

@Linfeng-Tang
Owner

My program runs on Ubuntu, and training requires at least two GPUs. I'm not sure whether your problem is caused by Windows.
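
The log also points at a more specific cause: every worker exits with `unrecognized arguments: --local-rank=N`, while the script's usage line only lists `--local_rank`. A minimal sketch of an argument parser that accepts both spellings, and falls back to the `LOCAL_RANK` environment variable as the FutureWarning in the log suggests, might look like this. The option names are copied from the usage line in the log; the `build_parser` helper and all default values are assumptions for illustration, not the repository's actual code.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser; option names are copied from the usage line in the log."""
    parser = argparse.ArgumentParser()
    # Defaults below are assumptions, not values taken from the repository.
    parser.add_argument("--opt", type=str, default=None, help="path to the option JSON file")
    parser.add_argument("--launcher", default="pytorch", help="job launcher")
    # Register both spellings on one destination so that either
    # `--local_rank=0` or `--local-rank=0` is accepted.
    # If neither is passed, fall back to the LOCAL_RANK environment variable.
    parser.add_argument(
        "--local_rank", "--local-rank",
        dest="local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    parser.add_argument("--dist", default=False)
    return parser
```

With a parser like this, `build_parser().parse_args(["--local-rank=2"]).local_rank` evaluates to `2`, so the hyphenated flag passed by the launcher in the log would no longer be rejected.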

@Amano-Hina
Author

OK, I think I roughly understand now. I've fixed most of the errors. Thanks.

@Amano-Hina
Author

I changed the epoch loop in main_train_swinfusion.py as follows:
for epoch in range(10000):  # keep running
    for i, train_data in enumerate(train_loader):

But after training for 1000 iterations, the following error occurred:
23-05-06 14:31:05.347 : <epoch: 1, iter: 10,200, lr:2.000e-05> G_loss: 2.721e+00 Text_loss: 7.241e-01 Int_loss: 1.415e-01 SSIM_loss: 1.855e+00
23-05-06 14:34:08.261 : <epoch: 3, iter: 10,400, lr:2.000e-05> G_loss: 3.314e+00 Text_loss: 6.116e-01 Int_loss: 2.091e-01 SSIM_loss: 2.493e+00
23-05-06 14:37:10.746 : <epoch: 5, iter: 10,600, lr:2.000e-05> G_loss: 2.738e+00 Text_loss: 5.933e-01 Int_loss: 2.515e-01 SSIM_loss: 1.893e+00
23-05-06 14:40:13.426 : <epoch: 7, iter: 10,800, lr:2.000e-05> G_loss: 3.229e+00 Text_loss: 7.836e-01 Int_loss: 1.896e-01 SSIM_loss: 2.256e+00
23-05-06 14:43:16.382 : <epoch: 9, iter: 11,000, lr:2.000e-05> G_loss: 2.725e+00 Text_loss: 5.621e-01 Int_loss: 1.276e-01 SSIM_loss: 2.035e+00
23-05-06 14:43:16.382 : Saving the model. Save path is:Model/Infrared_Visible_Fusion/Infrared_Visible_Fusion/models/11000_E.pth

Traceback (most recent call last):
File "main_train_swinfusion.py", line 260, in <module>
main()
File "main_train_swinfusion.py", line 223, in main
for test_data in test_loader:
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
data = self._next_data()
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
return self._process_data(data)
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
data.reraise()
File "/root/miniconda3/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/autodl-tmp/SwinFusion/data/dataset_wogt.py", line 42, in __getitem__
B_path = self.paths_B[index]
IndexError: list index out of range

@Linfeng-Tang
Owner

This error occurs while loading the dataset. Make sure that A_dir and B_dir contain the same number of files and that the file names match.
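
The traceback fails at `self.paths_B[index]`, which is consistent with the B folder holding fewer files than the A folder. One way to run the check described here is sketched below: it compares the file names in two folders and returns the names found on only one side. The helper name `compare_dirs` is hypothetical and not part of the repository; pass it the two image folders configured for the test set.

```python
from pathlib import Path


def compare_dirs(a_dir, b_dir):
    """Return (only_in_a, only_in_b): sorted file names present in just one folder.

    Hypothetical helper for illustration; it only compares names, not image contents.
    """
    names_a = {p.name for p in Path(a_dir).iterdir() if p.is_file()}
    names_b = {p.name for p in Path(b_dir).iterdir() if p.is_file()}
    return sorted(names_a - names_b), sorted(names_b - names_a)
```

If both returned lists are empty, the two folders hold the same set of file names. Any name that is reported marks an image without a partner in the other folder.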

@HaHaFish1014

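OK, I think I roughly understand now. I've fixed most of the errors. Thanks.

Hello, I ran into the same problem during training. How exactly did you solve it?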

@QiumanZeng

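Has this been solved? I ran into the same problem too.

OK, I think I roughly understand now. I've fixed most of the errors. Thanks.

Hello, I ran into the same problem during training. How exactly did you solve it?

Has this been solved? I also ran into the same problem on Windows.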

@xfsv

xfsv commented Mar 11, 2024

Hello, may I ask what machine you are using? How does it train so fast?
