num_tpu_cores=8 does not work on kaggle #1538

Closed
lezwon opened this issue Apr 20, 2020 · 12 comments · Fixed by #1568
Labels
bug, help wanted, priority: 0

Comments

@lezwon
Contributor

lezwon commented Apr 20, 2020

🐛 Bug

When I try to train a model on Kaggle TPUs with num_tpu_cores set to 8, I receive the error Exception: process 2 terminated with exit code 1. It would be great if this worked on Kaggle.

To Reproduce

Steps to reproduce the behavior:

  1. Run this notebook:
    https://www.kaggle.com/lezwon/pytorch-on-tpu-with-pytorch-lightning
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-9-9251330963d1> in <module>
      3 # most basic trainer, uses good defaults (1 TPU)
      4 trainer = pl.Trainer(num_tpu_cores=8)
----> 5 trainer.fit(mnist_model)

/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, test_dataloaders)
    714 
    715             # train
--> 716             xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
    717 
    718             # load weights if not interrupted

/opt/conda/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    180         join=join,
    181         daemon=daemon,
--> 182         start_method=start_method)

/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    156 
    157     # Loop on join until it returns True or raises an exception.
--> 158     while not context.join():
    159         pass
    160 

/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    111                 raise Exception(
    112                     "process %d terminated with exit code %d" %
--> 113                     (error_index, exitcode)
    114                 )
    115 

Exception: process 3 terminated with exit code 1

Code sample

trainer = pl.Trainer(num_tpu_cores=8, precision=16)

Expected behavior

Run the model utilizing all 8 TPU cores.

Environment

cuda:
	GPU:
	available:           False
	version:             None
packages:
	numpy:               1.18.2
	pyTorch_debug:       False
	pyTorch_version:     1.6.0a0+30e7055
	pytorch-lightning:   0.7.3
	tensorboard:         2.1.1
	tqdm:                4.42.0
system:
	OS:                  Linux
	architecture:
		64bit
		
	processor:           
	python:              3.6.6
	version:             #1 SMP Sat Apr 4 00:12:45 PDT 2020
@lezwon added the bug and help wanted labels on Apr 20, 2020
@williamFalcon
Contributor

I think this is a kaggle problem?
@dlibenzi any ideas?

@williamFalcon added the priority: 0 label on Apr 20, 2020
@dlibenzi

It prolly needs this on top:

!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version nightly --apt-packages libomp5 libopenblas-dev

@williamFalcon
Contributor

those lines are already at the top:
https://www.kaggle.com/pytorchlightning/pytorch-on-tpu-with-pytorch-lightning

(screenshot of the env-setup cell at the top of the notebook)

@williamFalcon
Contributor

ah... yes. good catch.
know of something more general that we can check? i assume the only two options are kaggle and colab?

@williamFalcon
Contributor

@lezwon want to find an environment variable we can check to know if on kaggle and submit a PR?
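
A minimal sketch of such a check, assuming Kaggle kernels export a KAGGLE_* environment variable such as KAGGLE_URL_BASE (worth verifying on an actual kernel before relying on it):

import os

def on_kaggle() -> bool:
    # Kaggle kernels appear to set KAGGLE_URL_BASE (among other KAGGLE_* variables);
    # treat its presence as a signal that we are running on Kaggle.
    return 'KAGGLE_URL_BASE' in os.environ

start_method = 'fork' if on_kaggle() else 'spawn'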

@dlibenzi

Honestly, PyTorch does not like fork because of CUDA, but I would make that the default, with the ability to change it via some environment variable in case someone has issues.
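
A sketch of that idea, using a made-up PL_TPU_START_METHOD variable name and assuming a tpu_train_fn target and model are already defined:

import os
import torch_xla.distributed.xla_multiprocessing as xmp

# Default to fork, but let users override it through an environment variable
# (PL_TPU_START_METHOD is a hypothetical name used only for illustration).
start_method = os.environ.get('PL_TPU_START_METHOD', 'fork')
xmp.spawn(tpu_train_fn, args=(model,), nprocs=8, start_method=start_method)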

@williamFalcon
Contributor

on GCP it would still be fork?
when would it not be fork with TPUs?

@dlibenzi

Fork is an issue with PyTorch/CUDA mostly.
But for safety, I would just add a Kaggle check as well in your code, and leave spawn as the default.
Fork also helps Colab and Kaggle because, since they are low-memory VMs, one can reduce memory consumption by creating the model (on the default pytorch/cpu device) at global scope, and then doing to(xla_device) from within the xmp.spawn() target functions.
This avoids creating pytorch/cpu models in each of the processes (one per core).
You can see a few tricks to fit models on Colab here:

https://colab.research.google.com/drive/1IvCxIg-Q_DlI7UNJuajpl4UZXNiW5jMg

For example, create the model at global scope, and serialize the to(xla_device) calls to avoid all 8 processes rushing to allocate host memory at the same time.
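
A minimal sketch of the global-scope pattern (MyModel is a stand-in for whatever nn.Module/LightningModule is being trained; the serialization of the device moves is shown in the Colab notebook above):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Build the model once, on CPU, at global scope, so that with fork every
# spawned process shares this single copy instead of constructing its own.
MODEL = MyModel()  # hypothetical module

def _mp_fn(index):
    device = xm.xla_device()
    # Move the shared CPU weights onto this process's TPU core.
    model = MODEL.to(device)
    # ... per-core training loop ...

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')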

@engmubarak48

engmubarak48 commented Jun 12, 2020

I also have this issue. If I use a GPU, the model trains normally, but when I try a TPU, this happens.

EDIT: Having analyzed it, the issue is about the RAM crashing.

I believe this has to do with XLA using up RAM. I constantly use up all my RAM, which causes the SIGKILL error. Take a look at this: pytorch/xla#1280 (also referenced in Kaggle discussions).

INIT TPU local core: 0, global rank: 0
INIT TPU local core: 4, global rank: 4
INIT TPU local core: 6, global rank: 6
INIT TPU local core: 3, global rank: 3
INIT TPU local core: 7, global rank: 7
INIT TPU local core: 5, global rank: 5
INIT TPU local core: 2, global rank: 2
INIT TPU local core: 1, global rank: 1

 
Validation sanity check:
0/? [00:00<?, ?it/s]
Exception in device=TPU:6: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
  (1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
	 [[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 535, in tpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
    False)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 31, in __next__
    return self.next()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 37, in next
    xm.mark_step()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 536, in mark_step
    wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
  (1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
	 [[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Exception in device=TPU:1: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
  (1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
	 [[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 535, in tpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
    False)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 31, in __next__
    return self.next()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 37, in next
    xm.mark_step()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 536, in mark_step
    wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
  (1) Invalid argument: Computation requires more parameters (732) than supported (limit 236).
	 [[{{node XRTCompile}}]]
	 [[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-29-f6eba0e942ef> in <module>()
      1 model = hatefull_memesCL()
      2 if __name__ == '__main__':
----> 3     trainer.fit(model)

3 frames
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    111                 raise Exception(
    112                     "process %d terminated with exit code %d" %
--> 113                     (error_index, exitcode)
    114                 )
    115 

Exception: process 6 terminated with exit code 17 

@dlibenzi

Hmm, this is something different:

Invalid argument: Computation requires more parameters (732) than supported (limit 236).

We have seen that a few times, but I keep forgetting what the root cause was.
It's a misconfiguration of the TPU service, but I do not remember how it can get in that state.

@engmubarak48

@dlibenzi it is an interesting issue, I will let you know if I find the bug.
