
error: unrecognized arguments: --local_rank=1 #1302

Closed
TAICHIKF opened this issue Feb 18, 2022 · 6 comments

@TAICHIKF

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug
usage: train.py [-h]
train.py: error: unrecognized arguments: --local_rank=1 configs/fpn_crossformer_b_panda_40k.py --work-dir ./seg-output --launcher pytorch
usage: train.py [-h]
train.py: error: unrecognized arguments: --local_rank=2 configs/fpn_crossformer_b_panda_40k.py --work-dir ./seg-output --launcher pytorch
usage: train.py [-h]
train.py: error: unrecognized arguments: --local_rank=0 configs/fpn_crossformer_b_panda_40k.py --work-dir ./seg-output --launcher pytorch
usage: train.py [-h]
train.py: error: unrecognized arguments: --local_rank=3 configs/fpn_crossformer_b_panda_40k.py --work-dir ./seg-output --launcher pytorch
Traceback (most recent call last):
  File "/root/anaconda3/envs/CrossFormer/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/envs/CrossFormer/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/CrossFormer/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/root/anaconda3/envs/CrossFormer/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/envs/CrossFormer/bin/python', '-u', './train.py', '--local_rank=3', 'configs/fpn_crossformer_b_panda_40k.py', '--work-dir', './seg-output', '--launcher', 'pytorch']' returned non-zero exit status 2.
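The command in the CalledProcessError shows how the launcher starts each worker: torch.distributed.launch (torch 1.7 in the environment listed below) inserts --local_rank=<rank> between the script path and the user's own arguments. The sketch below, with hypothetical helper names, rebuilds that argv and passes it to a parser that declares no arguments, which is what "usage: train.py [-h]" suggests. argparse then reports the arguments as unrecognized and exits with status 2, matching the exit status in the traceback.

```python
import argparse
import sys


def build_worker_cmd(script, local_rank, script_args):
    # Mirrors the per-process command in the CalledProcessError: the launcher
    # places --local_rank=<rank> right after the script path (hypothetical helper).
    return [sys.executable, "-u", script, "--local_rank={}".format(local_rank)] + list(script_args)


def parse_with_bare_parser(argv):
    # A parser with no declared arguments, as "usage: train.py [-h]" implies.
    parser = argparse.ArgumentParser(prog="train.py")
    try:
        parser.parse_args(argv)
    except SystemExit as exc:
        # argparse prints "unrecognized arguments: ..." to stderr and exits with 2.
        return exc.code
    return 0


script_args = ["configs/fpn_crossformer_b_panda_40k.py", "--work-dir", "./seg-output",
               "--launcher", "pytorch"]
cmd = build_worker_cmd("./train.py", 3, script_args)
print(cmd[2:])
print(parse_with_bare_parser(cmd[3:]))  # -> 2
```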

Reproduction

  1. What command or script did you run?

./dist_train.sh configs/fpn_crossformer_b_panda_40k.py 4 ckpt/backbone-corssformer-s.pth

  2. Did you make any modifications to the code or config? Did you understand what you have modified?
    No.

  3. What dataset did you use?

PANDA

Environment

Python3.6-based
mmcv-full==1.2.7 mmsegmentation==0.12.0
numpy scipy Pillow pyyaml torch==1.7.0 torchvision==0.8.1 timm==0.3.2

  1. Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here.
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

If applicable, paste the error traceback here.

A placeholder for traceback.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

@MengzhangLI MengzhangLI self-assigned this Feb 18, 2022
@MengzhangLI
Contributor

Can you try the commands in the "Train with multiple GPUs" section of the documentation below?

https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/train.md#train-with-multiple-gpus

I think your error is caused by incorrect usage of train.py.
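For a training script started through torch.distributed.launch, the parser has to accept the --local_rank option that the launcher injects. The sketch below is a minimal, hypothetical parser covering only the arguments visible in the log above; it is not the project's actual train.py. mmsegmentation's own tools/train.py appears to handle this the same way, declaring --local_rank and copying it into the LOCAL_RANK environment variable when that variable is not already set.

```python
import argparse
import os


def parse_args(argv=None):
    # Hypothetical parser modeled on the arguments seen in the error log.
    parser = argparse.ArgumentParser(prog="train.py", description="Train a segmentor")
    parser.add_argument("config", help="train config file path")
    parser.add_argument("--work-dir", help="directory to save logs and checkpoints")
    parser.add_argument(
        "--launcher",
        choices=["none", "pytorch", "slurm", "mpi"],
        default="none",
        help="job launcher",
    )
    # torch.distributed.launch appends --local_rank=<n> to every worker's argv,
    # so the option must be declared even if the rank is read elsewhere.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args(argv)
    # Expose the rank to code that reads it from the environment instead.
    if "LOCAL_RANK" not in os.environ:
        os.environ["LOCAL_RANK"] = str(args.local_rank)
    return args


args = parse_args(["--local_rank=3", "configs/fpn_crossformer_b_panda_40k.py",
                   "--work-dir", "./seg-output", "--launcher", "pytorch"])
print(args.local_rank, args.config, args.work_dir, args.launcher)
```

argparse's parse_known_args would also avoid the crash by returning unknown arguments instead of exiting, but declaring the option explicitly keeps genuinely misspelled flags detectable.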

@TAICHIKF
Copy link
Author

I used the command "./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]", but it gives the same error.

@MengzhangLI
Contributor

Could you try adding '--deterministic' to [optional arguments]?

@TAICHIKF
Author

When I run "python3 -m torch.distributed.launch --nproc_per_node=4 /mnt/code/sicap_test1_dis.py", training starts, but I get a new error: FileNotFoundError: [Errno 2] No such file or directory '/mnt/code/mmsegmentation/run/sicap_pspnet_0211/.eval_hook'
When training with 4 GPUs, this error occurs after the first validation run, and the program then continues training on only one GPU, seemingly chosen at random.
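As far as we understand mmcv/mmseg 1.x-era code, distributed evaluation on the CPU collection path has every rank write its partial results into a shared temporary directory (by default a .eval_hook folder under the work directory), waits at a barrier, and then rank 0 merges the parts and deletes the folder. The single-process simulation below is a hedged sketch of that pattern with hypothetical helper names; it simply concatenates parts rather than reproducing mmcv's exact result ordering. It shows why a rank that touches the folder after rank 0 has removed it, for example because the processes are not properly synchronized, would hit a FileNotFoundError like the one above. We have not confirmed that this is the cause in this issue.

```python
import os
import pickle
import shutil
import tempfile


def dump_part(tmpdir, rank, part):
    # Each rank writes its slice of the evaluation results into the shared folder.
    with open(os.path.join(tmpdir, "part_{}.pkl".format(rank)), "wb") as f:
        pickle.dump(part, f)


def gather_and_cleanup(tmpdir, world_size):
    # Rank 0 reads every part and then deletes the folder. In a real job this
    # runs only after a barrier, so every rank has finished writing first.
    results = []
    for rank in range(world_size):
        with open(os.path.join(tmpdir, "part_{}.pkl".format(rank)), "rb") as f:
            results.extend(pickle.load(f))
    shutil.rmtree(tmpdir)
    return results


work_dir = tempfile.mkdtemp()
tmpdir = os.path.join(work_dir, ".eval_hook")
os.makedirs(tmpdir, exist_ok=True)

# Simulate a 4-process evaluation sequentially inside one process.
for rank in range(4):
    dump_part(tmpdir, rank, [rank * 10, rank * 10 + 1])
merged = gather_and_cleanup(tmpdir, 4)
print(merged)  # -> [0, 1, 10, 11, 20, 21, 30, 31]

# A rank that writes after the folder was removed fails in open().
try:
    dump_part(tmpdir, 1, [99])
    late_error = None
except FileNotFoundError as exc:
    late_error = exc
print(type(late_error).__name__)  # -> FileNotFoundError
shutil.rmtree(work_dir)
```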

@TAICHIKF
Author

[Screenshot]

@MengzhangLI
Contributor

Your mmcv and mmseg versions are too old. Could you upgrade both to the latest versions and try again?

wjkim81 pushed a commit to wjkim81/mmsegmentation that referenced this issue Dec 3, 2023