Training with 8 GPUs #401

Closed
csBob123 opened this issue Mar 6, 2021 · 3 comments

Comments

csBob123 commented Mar 6, 2021

Thank you for your work. Your documentation notes that 'you may also use 8 GPUs and 1 imgs/gpu'. However, when I train with 8 GPUs, I get the following error:

raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))

ValueError: Expected more than 1 value per channel when training, got input size 1
Traceback (most recent call last):
File "/anaconda3/envs/pytorch15/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/anaconda3/envs/pytorch15/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/distributed/launch.py", line 263, in
main()

Thank you for your attention.

Looking forward to hearing from you.

Thanks!

Collaborator

xiexinch commented Mar 6, 2021

Hi @csBob123
Could you post the full error message, along with the command or script you ran?
Issue #272 looks similar to your problem and may help.

Author

csBob123 commented Mar 6, 2021

File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
ValueError: Expected more than 1 value per channel when training, got input size 1
return old_func(*args, **kwargs)
File "/home/mmsegmentation/mmseg/models/segmentors/base.py", line 122, in forward
return self.forward_train(img, img_metas, img_r, depth, xr, yr, **kwargs)
File "/home/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 164, in forward_train
result = self.forward(*input, **kwargs)
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/nn/modules/container.py", line 100, in forward
loss_decode = self._decode_head_forward_train(x, img_metas,
File "/home/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 107, in _decode_head_forward_train
loss_decode = self.decode_head.forward_train(x, img_metas,input = module(input)

File "/home/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 186, in forward_train
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
seg_logits = self.forward(inputs)
File "/home/mmsegmentation/mmseg/models/decode_heads/sep_aspp_head.py", line 83, in forward
self.image_pool(x),
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/mmcv/cnn/bricks/conv_module.py", line 195, in forward
result = self.forward(*input, **kwargs)
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/nn/modules/container.py", line 100, in forward
x = self.norm(x)
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
input = module(input)
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/mmcv/cnn/bricks/conv_module.py", line 195, in forward
result = self.forward(*input, **kwargs)
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 470, in forward
x = self.norm(x)
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 470, in forward
return sync_batch_norm.apply(
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 13, in forward
raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
return sync_batch_norm.apply(
ValueError File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 13, in forward
: Expected more than 1 value per channel when training, got input size 1
raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size 1
Traceback (most recent call last):
File "/home/anaconda3/envs/pytorch15/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/anaconda3/envs/pytorch15/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/distributed/launch.py", line 263, in
main()
File "/home/anaconda3/envs/pytorch15/lib/python3.8/site-packages/torch/distributed/launch.py", line 258, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/anaconda3/envs/pytorch15/bin/python', '-u', './tools/train.py', '--local_rank=7', './configs/deeplabv3plus/deeplabv3plus_r50-d8_512x1024_80k_cityscapes.py', '--launcher', 'pytorch', '--work-dir', './../mmseg_dictWork/test' returned non-zero exit status 1.

I ran the following command: ./tools/dist_train.sh ./configs/deeplabv3plus/deeplabv3plus_r50-d8_512x1024_80k_cityscapes.py 8 --work-dir ./../mmseg_dictWork/test

I set samples_per_gpu=1 and workers_per_gpu=1 to get a total batch size of 8 across the 8 GPUs.

Collaborator

xvjiarui commented Mar 7, 2021

Hi @csBob123
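You got some [1, C, 1, 1] tensors. With 1 image per GPU, the image pooling branch in the traceback (self.image_pool in sep_aspp_head.py) leaves a single value per channel, and batch norm cannot normalize that in training mode. You need to set the per-GPU batch size (samples_per_gpu) to at least 2.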
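
To make the shape argument concrete, here is a minimal pure-Python sketch of the kind of check that produces "Expected more than 1 value per channel when training". It is an illustrative re-implementation, not PyTorch's own code: the helper names `values_per_channel` and `can_normalize_in_training` and the channel count 512 are assumptions chosen for the example. It also assumes, as the per-rank errors in the traceback suggest, that the check runs on each process's local tensor.

```python
from math import prod


def values_per_channel(shape):
    """Count the values batch norm would average over for each channel.

    `shape` is an (N, C, *spatial) tuple; the channel dimension is skipped.
    """
    n, _channels, *spatial = shape
    return n * prod(spatial)


def can_normalize_in_training(shape):
    """Return True when each channel has more than one value to normalize."""
    return values_per_channel(shape) > 1


# Global image pooling reduces the spatial size to 1x1, so only the
# per-GPU batch dimension is left to provide more than one value.
# The channel count 512 is hypothetical.
for samples_per_gpu in (1, 2):
    shape = (samples_per_gpu, 512, 1, 1)
    print(shape, values_per_channel(shape), can_normalize_in_training(shape))
```

With samples_per_gpu=1 the pooled tensor has one value per channel and the check fails. With samples_per_gpu=2 it has two and passes, which matches the advice above.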
