Question about batch_size? #2
Comments
Hi, thanks for your interest in our work! Because we use HPLFlowNet as the backbone, and its network inputs have variable sizes, we adopt distributed data parallel to support multi-GPU training (with a batch_size of 1 per GPU). See laoreja/HPLFlowNet#11 for details.
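Following that suggestion, scaling up would mean launching one process per GPU rather than raising batch_size inside a single process. A sketch of such a launch command, assuming 4 GPUs (the GPU count and port here are illustrative, not values from the repository):

```shell
# One process per GPU, each with an effective batch_size of 1;
# total effective batch size equals the number of processes.
python -m torch.distributed.launch --nproc_per_node=4 --master_port 1234 \
    main_DCAdapt.py configs/train_adapt_ft3d_kitti.yaml
```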
Great work!
When training the model with batch_size=1 on a single 3090 GPU, everything works perfectly.
But setting batch_size to 2 raises an error:
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [15, 2182] at entry 0 and [15, 5269] at entry 1
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
It seems that with multiprocessing enabled (num_workers != 0), the error is reported from a worker: batch collation fails because the samples have different sizes, so the first worker (worker process 0) crashes.
Can you run with a larger batch_size on your side? Why am I hitting this problem with the code in this repository?
The command I ran is:
python -m torch.distributed.launch --nproc_per_node=1 --master_port 1234 main_DCAdapt.py configs/train_adapt_ft3d_kitti.yaml
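The failure mode can be illustrated without PyTorch. The functions below are pure-Python stand-ins (not the real `torch.utils.data` API): `default_collate` mimics the stacking requirement that produces the error above, and `list_collate` shows the usual workaround of keeping variable-size samples in a list instead of stacking them:

```python
# Illustrative stand-ins for DataLoader collation behavior (plain Python).

def default_collate(batch):
    # Mimics the stacking done by the default collate: every sample
    # must have exactly the same 2-D shape, or stacking fails.
    shapes = [(len(s), len(s[0])) for s in batch]
    if len(set(shapes)) != 1:
        raise RuntimeError(
            f"stack expects each tensor to be equal size, got {shapes}"
        )
    return batch  # a real implementation would return one stacked tensor


def list_collate(batch):
    # Workaround: return the samples as a list instead of stacking,
    # so each one may keep its own number of points.
    return list(batch)


# Two point-cloud samples with different point counts (15 x 2182 vs
# 15 x 5269), matching the shapes in the error message above.
a = [[0.0] * 2182 for _ in range(15)]
b = [[0.0] * 5269 for _ in range(15)]

try:
    default_collate([a, b])
except RuntimeError as e:
    print("default collate failed:", e)

batch = list_collate([a, b])
print(len(batch), len(batch[0][0]), len(batch[1][0]))  # 2 2182 5269
```

A real DataLoader would receive such a function via its `collate_fn` argument; the repository instead sidesteps the issue entirely by keeping one sample per GPU.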