num_tpu_cores=8 does not work on kaggle #1538
Comments
I think this is a kaggle problem?

It prolly needs this on top:
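(The snippet referenced above was not captured in this thread. As a hypothetical illustration only, TPU notebooks on Kaggle/Colab around that time usually began with an XLA environment-setup cell roughly like the following; the script URL and version pin are assumptions, not quoted from this issue.)

```python
# Hypothetical setup cell (run in a notebook) -- not the one from the linked kernel.
VERSION = "20200325"  # assumed nightly build date, adjust as needed
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION
```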
those lines are already at the top:

ah... yes. good catch.
@lezwon want to find an environment variable we can check to know if on kaggle and submit a PR?
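A rough sketch of what such a check could look like; the variable names KAGGLE_URL_BASE and KAGGLE_KERNEL_RUN_TYPE are assumptions about Kaggle's runtime, not something confirmed in this thread:

```python
import os


def running_on_kaggle() -> bool:
    # Best-effort detection of a Kaggle kernel via environment variables.
    # KAGGLE_URL_BASE / KAGGLE_KERNEL_RUN_TYPE are assumed to be set inside
    # Kaggle kernels; verify on a real kernel before relying on them.
    return bool(
        os.environ.get("KAGGLE_URL_BASE") or os.environ.get("KAGGLE_KERNEL_RUN_TYPE")
    )


if __name__ == "__main__":
    # Hypothetical use: pick a multiprocessing start method per environment.
    start_method = "fork" if running_on_kaggle() else "spawn"
    print(f"on kaggle: {running_on_kaggle()}, start_method: {start_method}")
```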
Honestly, pytorch does not like fork.

on GCP it would still be fork?
Fork is an issue with pytorch/CUDA mostly. https://colab.research.google.com/drive/1IvCxIg-Q_DlI7UNJuajpl4UZXNiW5jMg Like, create the model at global scope, and serialize the to(xla_device) calls to avoid all 8 processes rushing into allocating host memory at the same time.
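A minimal sketch of that pattern, assuming the torch_xla multiprocessing API of that era (xmp.spawn, xm.rendezvous, xm.get_ordinal, xm.xrt_world_size); the Linear model and the fixed nprocs=8 are placeholders:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Created once at global scope: with fork-based spawning the child processes
# share this host copy instead of each allocating its own.
MODEL = torch.nn.Linear(128, 10)  # placeholder model


def _mp_fn(index):
    device = xm.xla_device()
    # Serialize the host->device transfer: each ordinal moves the model in
    # turn, so all 8 processes don't allocate host memory at the same time.
    for ordinal in range(xm.xrt_world_size()):
        if ordinal == xm.get_ordinal():
            model = MODEL.to(device)
        xm.rendezvous(f"to_device_{ordinal}")
    # ... training loop using `model` would go here ...


if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8, start_method="fork")
```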
I also have this issue. If I use the GPU, the model trains normally, but when I try the TPU, this happens. EDIT: Having analyzed it, the issue is about the RAM crashing.
Hmm, this is something different:
We have seen that a few times but I keep forgetting what the root cause was.
@dlibenzi it is an interesting issue, I will let you know if I find the bug.
🐛 Bug
When I try to train a model on Kaggle TPUs with num_tpu_cores set to 8, I receive an error: Exception: process 2 terminated with exit code 1. It would be great if this worked on Kaggle.

To Reproduce
Steps to reproduce the behavior:
https://www.kaggle.com/lezwon/pytorch-on-tpu-with-pytorch-lightning
Code sample
import pytorch_lightning as pl

trainer = pl.Trainer(num_tpu_cores=8, precision=16)
Expected behavior
Run the model utilizing all 8 TPU cores.
Environment