
Kaggle and TPUs - Errors popping randomly in and out of existence #5665

@alexander-soare

Description


This is a general report about what it's like to work with PyTorch Lightning and TPUs on Kaggle. I'm posting it here instead of on the XLA repo because PyTorch Lightning's marketing openly invites us to use Kaggle. I did not follow the bug template because I'm reporting something more general than one specific bug.

Over the past few days of trying to use the TPUs I've run into various errors that randomly pop into and out of existence like quantum particles. I can't paste the errors here because of their elusive nature, and I didn't have the foresight to capture them, not knowing it would come to this. Here are their descriptions:

Note that for ALL of these examples I've sometimes been able to do some combination of the following to fix it (without changing my code at all):

  • Restart runtime
  • Change to standard runtime then change back to TPU
  • Refresh the page
  • Run the same cell again

Unfortunately, I spend 80% of my time restarting my runtime and reinstalling XLA, and only 20% of my time feeling hopeful that "maybe I've fixed it for good".

By the way, this is how I set up XLA:

!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version nightly --apt-packages libomp5 libopenblas-dev
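
To give a sense of the workflow, here is a minimal sketch of the kind of Lightning script I then try to run on the TPU (this is an illustrative placeholder, not the actual competition code; it assumes the nightly XLA wheels above installed cleanly and a 1.x-era Lightning API):

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)  # placeholder model

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

# random placeholder data just to exercise the TPU
train_ds = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
trainer = pl.Trainer(tpu_cores=8, max_epochs=1)  # one process per TPU core
trainer.fit(BoringModel(), DataLoader(train_ds, batch_size=16))

Passing tpu_cores=8 tells the Trainer to launch one process per TPU core through torch_xla.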


Labels

3rd party (Related to a 3rd-party), accelerator: tpu (Tensor Processing Unit), bug (Something isn't working), help wanted (Open to be worked on), priority: 1 (Medium priority task)
