This is a general report about what it's like to work with PyTorch Lightning and TPUs on Kaggle. The reason I post it here instead of the XLA repo is that PyTorch Lightning's marketing openly invites us to use Kaggle. The reason I did not follow the bug template is that I'm reporting something more general than one specific bug.
Over the past few days of trying to use the TPUs, I've run into various errors that randomly pop into and out of existence like quantum particles. I can't paste the errors here because of their elusive nature, and I didn't have the foresight to save them, not knowing it would come to this. Here are their descriptions:
- OSError: Undefined Symbol
- Cannot replicate if number of devices (1) is different from 8
- Some "failed to meet rendezvous point" error.
- Execution hangs indefinitely after `INIT TPU local core: 0, global rank: 0 with XLA_USE_BF16=None` (this also happens with 8 cores)
- For 8 cores only, execution hangs indefinitely when the progress bar appears (PyTorch Lightning)
Note that for ALL of these errors I've sometimes been able to fix them (without changing my code at all) with some combination of the following:
- Restart runtime
- Change to standard runtime then change back to TPU
- Refresh the page
- Run the same cell again
Unfortunately, I spend 80% of my time restarting my runtime and reinstalling XLA, and only 20% of my time feeling hopeful that "maybe I've fixed it for good".
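For context, here is a minimal sanity check I could run before training (just a sketch, not code from my notebook; it assumes the torch_xla install from the setup script below succeeded). It lists the devices XLA can actually see, and my assumption is that when the "different from 8" error strikes it would report a single device instead of eight:

```python
# Minimal sketch: list the devices torch_xla can see on this runtime.
# Assumes torch_xla was installed via the env-setup script shown below.
import torch_xla.core.xla_model as xm

devices = xm.get_xla_supported_devices()
print(f"XLA sees {len(devices)} device(s): {devices}")

# Presumably, when the "number of devices (1) is different from 8" error
# appears, this prints one device instead of the expected eight.
```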
By the way, this is how I set up XLA:
```bash
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version nightly --apt-packages libomp5 libopenblas-dev
```
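And this is roughly the shape of the training code (a simplified, self-contained sketch with a placeholder model rather than my real one; `tpu_cores=8` is the Trainer argument used to target the TPU):

```python
# Sketch of the Lightning setup; PlaceholderModel stands in for my real model.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class PlaceholderModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        x = torch.randn(512, 32)
        y = torch.randint(0, 2, (512,))
        return DataLoader(TensorDataset(x, y), batch_size=32)


model = PlaceholderModel()
trainer = pl.Trainer(tpu_cores=8, max_epochs=1)
# The 8-core hang described above shows up right when the progress bar
# from this fit() call first appears.
trainer.fit(model)
```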