Question about distributed training #15

Closed
mengyuan97 opened this issue May 31, 2021 · 4 comments
Labels: enhancement (New feature or request)

Comments

@mengyuan97

Hello, author!
I see that your code uses script.sh and dist_train.sh to call torch.distributed.launch for distributed training.
I would like to ask: with a single machine and a single GPU, is distributed training still necessary? Can I simply run python train.py directly?

Thank you for your meaningful work!

@yuantn
Owner

yuantn commented May 31, 2021

Hello!
With a single GPU you do not need distributed training. Running python train.py directly should work in theory, but I have only tried it while debugging in PyCharm, so please give it a try. One problem, though: how would you specify the configuration file as configs/MIAOD.py in that case? Some bugs may be reported, so please watch out for that.
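For reference, a minimal sketch of the single-GPU, non-distributed run, assuming train.py takes the config path as its first positional argument (as in standard MMDetection tools/train.py); the exact invocation is an assumption, not taken from this repository:

    # Minimal sketch: single-GPU, non-distributed training.
    # Assumes train.py accepts the config file as a positional argument.
    CUDA_VISIBLE_DEVICES=0 python train.py configs/MIAOD.py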

@wruixue0105

Hello, author! When I used a single card for training, I encountered the following problem. What should I do in this case? I'd appreciate it if you could give me an answer.
2021-06-04 12:21:15,233 - mmdet - INFO - load model from: torchvision://resnet50
2021-06-04 12:21:15,506 - mmdet - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git [...] -- [...]'
2021-06-04 12:21:21,646 - mmdet - INFO - Start running, host: jys@223, work_dir: /home/jys/wfy/MI-AOD-master/work_dirs/MI-AOD/20210604_122113
2021-06-04 12:21:21,646 - mmdet - INFO - workflow: [('train', 1)], max: 3 epochs
Traceback (most recent call last):
  File "train.py", line 269, in <module>
    main()
  File "train.py", line 182, in main
    distributed=distributed, validate=(not args.no_validate), timestamp=timestamp, meta=meta)
  File "/home/jys/wfy/MI-AOD-master/mmdet/apis/train.py", line 120, in train_detector
    runner.run(data_loaders_L, cfg.workflow, cfg.total_epochs)
  File "/home/jys/anaconda3/envs/MI/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/jys/anaconda3/envs/MI/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 32, in train
    **kwargs)
  File "/home/jys/anaconda3/envs/MI/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 31, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/jys/wfy/MI-AOD-master/mmdet/models/detectors/base.py", line 228, in train_step
    losses = self(**data)
  File "/home/jys/anaconda3/envs/MI/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jys/wfy/MI-AOD-master/mmdet/core/fp16/decorators.py", line 51, in new_func
    return old_func(*args, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'

@yuantn
Owner

yuantn commented Jun 4, 2021

Please refer to the resolution in Issue #5.

It would be better if you could open this as a separate issue instead of combining it with other issues.
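If running train.py without a launcher keeps failing, one hedged workaround (a sketch assuming the repository's dist_train.sh follows the usual MMDetection pattern and that train.py supports a --launcher pytorch flag) is to go through torch.distributed.launch with a single process:

    # Sketch: single-GPU run through the distributed launcher.
    # Assumes train.py accepts --launcher pytorch, as in MMDetection.
    CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 \
        train.py configs/MIAOD.py --launcher pytorch

This mirrors what script.sh and dist_train.sh reportedly do, just restricted to one GPU.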

@mengyuan97
Author

Got it. I also tried this while debugging in PyCharm, and I could edit the run configuration to set the config to configs/MIAOD.py. Thank you for your reply!

yuantn added the enhancement (New feature or request) label on Jun 21, 2021
yuantn closed this as completed on Jun 30, 2021