
Broken parallelization #16 (Open)

Svito-zar opened this issue Jun 6, 2019 · 4 comments

Comments

Svito-zar commented Jun 6, 2019

When I try to run the model on several GPUs, I get a numerical error:

Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.

When running on a single GPU, everything works just fine.

That indicates there is an issue with the parallelization.
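Not a fix, but a minimal sketch of how the NaN/Inf could be localized, assuming a standard PyTorch training loop (the `x`/`y_onehot` names follow the `trainer.py` call quoted below in this thread):

```python
import torch

# Debug only: makes the backward pass report the op that produced NaN/Inf (slow).
torch.autograd.set_detect_anomaly(True)

def assert_finite(name, t):
    # Fail fast instead of letting NaN/Inf propagate silently through the graph.
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} contains NaN or Inf")

# Inside the training loop, around the forward pass:
# assert_finite("x", x)
# assert_finite("y_onehot", y_onehot)
# z, nll, y_logits = self.graph(x=x, y_onehot=y_onehot)
# assert_finite("nll", nll)
```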

@zain-ul-abedien

Hey @Svito-zar, I am training on a single GPU but I get the same warning (`Warning: NaN or Inf found in input tensor`). Please guide me on how to solve this problem.
[Screenshot from 2019-07-19 14-14-50]

@Svito-zar (Author)

I didn't have your problem and I don't know how to fix it either.
I would be interested to know the solution as well.

What I find weird from the machine learning perspective is that your batch_size is very small. It causes the gradient to vary a lot, which might lead to numerical instabilities. So I would try much larger batch sizes: at least 20, better 50.
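(Purely illustrative, not from this repo: a tiny toy experiment showing that gradient noise shrinks as the batch size grows, which is the instability argument above.)

```python
import torch

torch.manual_seed(0)
X = torch.randn(10_000, 32)
true_w = torch.randn(32, 1)
y = X @ true_w + 0.1 * torch.randn(10_000, 1)

def grad_std(batch_size, n_batches=200):
    # Standard deviation of the loss gradient across random mini-batches.
    w = torch.zeros(32, 1, requires_grad=True)
    grads = []
    for _ in range(n_batches):
        idx = torch.randint(0, X.shape[0], (batch_size,))
        loss = ((X[idx] @ w - y[idx]) ** 2).mean()
        g, = torch.autograd.grad(loss, w)
        grads.append(g.flatten())
    return torch.stack(grads).std(dim=0).mean().item()

print("batch_size=4 :", grad_std(4))    # noisy gradients
print("batch_size=64:", grad_std(64))   # much less variance
```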

@hologerry

I have the same problem with a large batch_size of 64. Have you guys found a solution?
Help, please.


pptrick commented Feb 25, 2021

I found some problems with parallelization too. When I try to run the model on more than one GPU, the process just freezes at the forward stage, namely this line in trainer.py:
z, nll, y_logits = self.graph(x=x, y_onehot=y_onehot)
The program is still running, but I can't see any output after this line. However, one GPU works fine.
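Not the repo's code, just a quick isolation test I would try (assuming the parallelization goes through `torch.nn.DataParallel`): if even this tiny forward hangs on the same machine, the freeze is in the multi-GPU setup (e.g. GPU peer-to-peer copies), not in the Glow model itself.

```python
import torch
import torch.nn as nn

# Minimal multi-GPU forward: replicate a trivial module across all visible GPUs.
model = nn.DataParallel(nn.Linear(16, 16)).cuda()
x = torch.randn(8, 16).cuda()
print(model(x).shape)  # should print torch.Size([8, 16]) almost immediately
```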
