Broken parallelization #16
Comments
I didn't have your problem and I don't know how to fix it either. What I find weird from the machine learning perspective is that your batch_size is very small. It causes the gradient to vary a lot, which can lead to numerical instabilities. So I would try much larger batch_sizes: at least 20, better 50.
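As a rough illustration (not this repository's code), here is a minimal PyTorch sketch of training with a larger batch size; the dataset, model, and hyperparameters are placeholders:

```python
# Hypothetical sketch: increasing the batch size in a typical PyTorch
# training loop. Dataset and model are placeholders, not this repo's code.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the real dataset.
dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 2, (1000,)))

# A batch size of 50 averages the gradient over more samples per step,
# which reduces its variance compared to a very small batch.
loader = DataLoader(dataset, batch_size=50, shuffle=True)

model = torch.nn.Linear(32, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
```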
I have the same problem with a large batch_size of 64. Have you guys found a solution?
I found some problems with parallelization too. When I try to run the model on more than one GPU, the process just freezes in the forward stage, namely at this line in trainer.py:
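For context, this is a minimal sketch of the kind of multi-GPU forward pass that can hang with nn.DataParallel; the model and batch below are hypothetical stand-ins, not the actual trainer.py code:

```python
# Hypothetical sketch of a multi-GPU forward call that can hang with
# nn.DataParallel; placeholders only, not the repository's trainer.py.
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
if torch.cuda.device_count() > 1:
    # DataParallel scatters the batch across GPUs and gathers the outputs;
    # if inter-GPU communication is broken, this forward call never returns.
    model = nn.DataParallel(model)

batch = torch.randn(64, 128).cuda()
output = model(batch)  # the reported freeze happens on a call like this
```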
When I try to run the model on several GPUs I am getting a numerical error:
While running on a single GPU everything works just fine.
This indicates that there is an issue with parallelization.
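One way to confirm and localize this (assuming a standard PyTorch setup; none of the names below come from this repository) is to pin the run to a single GPU and enable autograd anomaly detection:

```python
# Hypothetical debugging sketch: restrict the run to one GPU (the setup that
# reportedly works) and turn on anomaly detection to find the operation that
# produces the numerical error. Nothing here is repo-specific.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # set before CUDA is initialized

import torch
torch.autograd.set_detect_anomaly(True)  # reports the op that produced NaN/Inf

# ... build the model and run training as usual on the single visible GPU ...
```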