
Question about batch_size? #2

Open
jiangchaokang opened this issue Jul 17, 2022 · 1 comment


@jiangchaokang

Good job, very nice work!

When I train with batch_size=1, the model trains perfectly on a single 3090 GPU.
But setting batch_size to 2 raises the following error:
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [15, 2182] at entry 0 and [15, 5269] at entry 1
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
It seems that with multiprocessing enabled (num_workers != 0) the error reports which worker failed: the samples in a batch have different sizes, so collating them fails and the first worker (worker process 0) crashes.
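
For reference, the failure can be reproduced with plain torch.stack; a minimal sketch (the shapes are the ones from my error message, not code from this repository):

import torch

# DataLoader's default collate stacks the samples of a batch with
# torch.stack, which requires every sample to have the same shape.
sample0 = torch.randn(15, 2182)  # e.g. a point cloud with 2182 points
sample1 = torch.randn(15, 5269)  # e.g. a point cloud with 5269 points

# Raises: RuntimeError: stack expects each tensor to be equal size
batch = torch.stack([sample0, sample1], 0)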

Can you run with a larger batch_size on your side? Why am I hitting this problem with the code in your repository?
The command I ran is:
python -m torch.distributed.launch --nproc_per_node=1 --master_port 1234 main_DCAdapt.py configs/train_adapt_ft3d_kitti.yaml

@JZ-9962 (Collaborator) commented Jul 18, 2022

Hi, thanks for your interest in our work!

Because we use HPLFlowNet as the backbone, whose network inputs are variable-sized, we use distributed data parallel for multi-GPU training, with a batch_size of 1 on each GPU. For details, see: laoreja/HPLFlowNet#11
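
In other words, to get a larger effective batch size, increase the number of processes rather than the per-GPU batch_size. A hypothetical example on a machine with 4 GPUs (only --nproc_per_node differs from your command):

python -m torch.distributed.launch --nproc_per_node=4 --master_port 1234 main_DCAdapt.py configs/train_adapt_ft3d_kitti.yaml

Each process then loads batches of size 1, so the variable-size inputs never have to be stacked together.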
