This is a general report about what it's like to work with PyTorch Lightning and TPUs on Kaggle. The reason I post it here instead of the XLA repo is that PyTorch Lightning's marketing openly invites us to use Kaggle. The reason I did not follow the bug template is that I'm reporting something more general than one specific bug.
Over the past few days of trying to use the TPUs, I've run into various errors that randomly pop into and out of existence like quantum particles. I can't paste the errors here because of their elusive nature, and I didn't have the foresight to save them, not knowing it would come to this. Here are their descriptions:
- OSError: Undefined Symbol
- Cannot replicate if number of devices (1) is different from 8
- Some "failed to meet rendezvous point" error.
- Execution hangs indefinitely after `INIT TPU local core: 0, global rank: 0 with XLA_USE_BF16=None` (this also happens with 8 cores)
- For 8 cores only, execution hangs indefinitely when the progress bar appears (PyTorch Lightning)
Note that for ALL of these errors I've sometimes been able to fix them (without changing my code at all) with some combination of the following:
- Restart runtime
- Change to standard runtime then change back to TPU
- Refresh the page
- Run the same cell again
Unfortunately, I spend 80% of my time restarting my runtime and reinstalling XLA, and only 20% of my time feeling hopeful that "maybe I've fixed it for good".
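For context, here is a minimal sanity check I could run before training (just a sketch, not code from my notebook; it assumes the torch_xla install from the setup script below succeeded). It lists the devices XLA can actually see, and my assumption is that when the "different from 8" error strikes it would report a single device instead of eight:

```python
# Minimal sketch: list the devices torch_xla can see on this runtime.
# Assumes torch_xla was installed via the env-setup script shown below.
import torch_xla.core.xla_model as xm

devices = xm.get_xla_supported_devices()
print(f"XLA sees {len(devices)} device(s): {devices}")

# Presumably, when the "number of devices (1) is different from 8" error
# appears, this prints one device instead of the expected eight.
```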
By the way, this is how I set up XLA:
```bash
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version nightly --apt-packages libomp5 libopenblas-dev
```
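And this is roughly the shape of the training code (a simplified, self-contained sketch with a placeholder model rather than my real one; `tpu_cores=8` is the Trainer argument used to target the TPU):

```python
# Sketch of the Lightning setup; PlaceholderModel stands in for my real model.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class PlaceholderModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        x = torch.randn(512, 32)
        y = torch.randint(0, 2, (512,))
        return DataLoader(TensorDataset(x, y), batch_size=32)


model = PlaceholderModel()
trainer = pl.Trainer(tpu_cores=8, max_epochs=1)
# The 8-core hang described above shows up right when the progress bar
# from this fit() call first appears.
trainer.fit(model)
```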