num_tpu_cores=8 does not work on kaggle #1538

Closed
lezwon opened this issue Apr 20, 2020 · 12 comments · Fixed by #1568
Labels
bug, help wanted, priority: 0

Comments

@lezwon
Contributor

lezwon commented Apr 20, 2020

🐛 Bug

When I try to train a model on Kaggle TPUs with num_tpu_cores set to 8, I receive the error Exception: process 2 terminated with exit code 1. It would be great if this worked on Kaggle.

To Reproduce

Steps to reproduce the behavior:

  1. Run this notebook:
    https://www.kaggle.com/lezwon/pytorch-on-tpu-with-pytorch-lightning
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-9-9251330963d1> in <module>
      3 # most basic trainer, uses good defaults (1 TPU)
      4 trainer = pl.Trainer(num_tpu_cores=8)
----> 5 trainer.fit(mnist_model)

/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, test_dataloaders)
    714 
    715             # train
--> 716             xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
    717 
    718             # load weights if not interrupted

/opt/conda/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    180         join=join,
    181         daemon=daemon,
--> 182         start_method=start_method)

/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    156 
    157     # Loop on join until it returns True or raises an exception.
--> 158     while not context.join():
    159         pass
    160 

/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    111                 raise Exception(
    112                     "process %d terminated with exit code %d" %
--> 113                     (error_index, exitcode)
    114                 )
    115 

Exception: process 3 terminated with exit code 1

Code sample

trainer = pl.Trainer(num_tpu_cores=8, precision=16)

Expected behavior

Run the model utilizing all 8 TPU cores.

Environment

cuda:
	GPU:
	available:           False
	version:             None
packages:
	numpy:               1.18.2
	pyTorch_debug:       False
	pyTorch_version:     1.6.0a0+30e7055
	pytorch-lightning:   0.7.3
	tensorboard:         2.1.1
	tqdm:                4.42.0
system:
	OS:                  Linux
	architecture:
		64bit
		
	processor:           
	python:              3.6.6
	version:             #1 SMP Sat Apr 4 00:12:45 PDT 2020
@lezwon added the bug and help wanted labels on Apr 20, 2020
@williamFalcon
Contributor

I think this is a kaggle problem?
@dlibenzi any ideas?

@williamFalcon added the priority: 0 label on Apr 20, 2020
@dlibenzi

It prolly needs this on top:

!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version nightly --apt-packages libomp5 libopenblas-dev

@williamFalcon
Contributor

those lines are already at the top:
https://www.kaggle.com/pytorchlightning/pytorch-on-tpu-with-pytorch-lightning

(screenshot of the env-setup cell at the top of the notebook)

@williamFalcon
Contributor

ah... yes. good catch.
know of something more general that we can check? i assume the only two options are kaggle and colab?

@williamFalcon
Contributor

@lezwon want to find an environment variable we can check to know if on kaggle and submit a PR?
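
A minimal sketch of such a check, assuming Kaggle kernels export a KAGGLE_* environment variable such as KAGGLE_URL_BASE (worth verifying on an actual kernel before relying on it):

import os

def on_kaggle() -> bool:
    # Kaggle kernels appear to set KAGGLE_URL_BASE (among other KAGGLE_* variables);
    # treat its presence as a signal that we are running on Kaggle.
    return 'KAGGLE_URL_BASE' in os.environ

start_method = 'fork' if on_kaggle() else 'spawn'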

@dlibenzi

Honestly, PyTorch does not like fork because of CUDA, but I would make that the default, with the ability to change it via some environment variable in case someone has issues.
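
A sketch of that idea, using a made-up PL_TPU_START_METHOD variable name and assuming a tpu_train_fn target and model are already defined:

import os
import torch_xla.distributed.xla_multiprocessing as xmp

# Default to fork, but let users override it through an environment variable
# (PL_TPU_START_METHOD is a hypothetical name used only for illustration).
start_method = os.environ.get('PL_TPU_START_METHOD', 'fork')
xmp.spawn(tpu_train_fn, args=(model,), nprocs=8, start_method=start_method)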

@williamFalcon
Contributor

on GCP it would still be fork?
when would it not be fork with TPUs?

@dlibenzi

Fork is an issue with PyTorch/CUDA mostly.
But for safety, I would just add a Kaggle check as well in your code, and leave spawn as the default.
Fork also helps Colab and Kaggle because, since they are low-memory VMs, one can reduce memory consumption by creating the model (on the default pytorch/cpu device) at global scope, and then doing to(xla_device) from within the xmp.spawn() target functions.
This avoids creating pytorch/cpu models in each of the processes (one per core).
You can see a few tricks to fit models on Colab here:

https://colab.research.google.com/drive/1IvCxIg-Q_DlI7UNJuajpl4UZXNiW5jMg

For example, create the model at global scope, and serialize the to(xla_device) calls to avoid all 8 processes rushing to allocate host memory at the same time.
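
A minimal sketch of the global-scope pattern (MyModel is a stand-in for whatever nn.Module/LightningModule is being trained; the serialization of the device moves is shown in the Colab notebook above):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Build the model once, on CPU, at global scope, so that with fork every
# spawned process shares this single copy instead of constructing its own.
MODEL = MyModel()  # hypothetical module

def _mp_fn(index):
    device = xm.xla_device()
    # Move the shared CPU weights onto this process's TPU core.
    model = MODEL.to(device)
    # ... per-core training loop ...

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')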

@engmubarak48

engmubarak48 commented Jun 12, 2020

I also have this issue. If I use a GPU, the model trains normally, but when I try a TPU, this happens.

EDIT: Having analyzed it, the issue is about the RAM crashing.

I believe this has to do with XLA using up RAM. I constantly use up all my RAM, which causes the SIGKILL error. Take a look at this: pytorch/xla#1280 (also referenced in Kaggle discussions).

INIT TPU local core: 0, global rank: 0
INIT TPU local core: 4, global rank: 4
INIT TPU local core: 6, global rank: 6
INIT TPU local core: 3, global rank: 3
INIT TPU local core: 7, global rank: 7
INIT TPU local core: 5, global rank: 5
INIT TPU local core: 2, global rank: 2
INIT TPU local core: 1, global rank: 1

 
Validation sanity check:
0/? [00:00<?, ?it/s]
Exception in device=TPU:6: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
  (1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
	 [[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 535, in tpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
    False)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 31, in __next__
    return self.next()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 37, in next
    xm.mark_step()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 536, in mark_step
    wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
  (1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
	 [[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Exception in device=TPU:1: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
  (1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
	 [[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 535, in tpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
    False)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 31, in __next__
    return self.next()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 37, in next
    xm.mark_step()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 536, in mark_step
    wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
  (1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
	 [[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-29-f6eba0e942ef> in <module>()
      1 model = hatefull_memesCL()
      2 if __name__ == '__main__':
----> 3     trainer.fit(model)

3 frames
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    111                 raise Exception(
    112                     "process %d terminated with exit code %d" %
--> 113                     (error_index, exitcode)
    114                 )
    115 

Exception: process 6 terminated with exit code 17 

@dlibenzi

Hmm, this is something different:

Invalid argument: Computation requires more parameters (732) than supported (limit 236).

We have seen that a few times, but I keep forgetting what the root cause was.
It's a misconfiguration of the TPU service, but I do not remember how it can get in that state.

@engmubarak48

@dlibenzi it is an interesting issue, I will let you know if I find the bug.
