Question about distributed training #15

Closed
mengyuan97 opened this issue May 31, 2021 · 4 comments
Labels: enhancement (New feature or request)

Comments

@mengyuan97

Hello, author!
I see that your code uses script.sh and dist_train.sh to call torch.distributed.launch for distributed training.
I would like to ask: with a single machine and a single GPU, is distributed training still necessary? Can I simply run python train.py directly?

Thank you for your meaningful work!

@yuantn
Owner

yuantn commented May 31, 2021

Hello!
With a single GPU you do not need distributed training. Running python train.py directly should work in theory, but I have only tried it while debugging in PyCharm, so please give it a try. One problem, though: how would you specify the configuration file as configs/MIAOD.py in that case? Some bugs may be reported, so please watch out for that.
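For reference, a minimal sketch of the single-GPU, non-distributed run, assuming train.py takes the config path as its first positional argument (as in standard MMDetection tools/train.py); the exact invocation is an assumption, not taken from this repository:

    # Minimal sketch: single-GPU, non-distributed training.
    # Assumes train.py accepts the config file as a positional argument.
    CUDA_VISIBLE_DEVICES=0 python train.py configs/MIAOD.py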

@wruixue0105

Hello, author! When I used a single card for training, I encountered the following problem. What should I do in this case? I'd appreciate it if you could give me an answer.
2021-06-04 12:21:15,233 - mmdet - INFO - load model from: torchvision://resnet50
2021-06-04 12:21:15,506 - mmdet - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git [...] -- [...]'
2021-06-04 12:21:21,646 - mmdet - INFO - Start running, host: jys@223, work_dir: /home/jys/wfy/MI-AOD-master/work_dirs/MI-AOD/20210604_122113
2021-06-04 12:21:21,646 - mmdet - INFO - workflow: [('train', 1)], max: 3 epochs
Traceback (most recent call last):
  File "train.py", line 269, in <module>
    main()
  File "train.py", line 182, in main
    distributed=distributed, validate=(not args.no_validate), timestamp=timestamp, meta=meta)
  File "/home/jys/wfy/MI-AOD-master/mmdet/apis/train.py", line 120, in train_detector
    runner.run(data_loaders_L, cfg.workflow, cfg.total_epochs)
  File "/home/jys/anaconda3/envs/MI/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/jys/anaconda3/envs/MI/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 32, in train
    **kwargs)
  File "/home/jys/anaconda3/envs/MI/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 31, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/jys/wfy/MI-AOD-master/mmdet/models/detectors/base.py", line 228, in train_step
    losses = self(**data)
  File "/home/jys/anaconda3/envs/MI/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jys/wfy/MI-AOD-master/mmdet/core/fp16/decorators.py", line 51, in new_func
    return old_func(*args, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'

@yuantn
Owner

yuantn commented Jun 4, 2021

Please refer to the resolution in Issue #5.

It would be better if you could open this as a separate issue instead of combining it with other issues.
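If running train.py without a launcher keeps failing, one hedged workaround (a sketch assuming the repository's dist_train.sh follows the usual MMDetection pattern and that train.py supports a --launcher pytorch flag) is to go through torch.distributed.launch with a single process:

    # Sketch: single-GPU run through the distributed launcher.
    # Assumes train.py accepts --launcher pytorch, as in MMDetection.
    CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 \
        train.py configs/MIAOD.py --launcher pytorch

This mirrors what script.sh and dist_train.sh reportedly do, just restricted to one GPU.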

@mengyuan97
Author

Got it. I also tried this while debugging in PyCharm, and I could edit the run configuration to set the config to configs/MIAOD.py. Thank you for your reply!

yuantn added the enhancement (New feature or request) label on Jun 21, 2021
yuantn closed this as completed on Jun 30, 2021