Runs on GPU, error on TPU: Computation requires more parameters (546) than supported (limit 236) #1963
Comments
@jysohn23 this stuff keeps cropping up. Seems like continuations are enabled in the next-gen executor.
Hey @hrbigelow I don't have access to the datasets you're using on your colab gdrive, but it looks like you're using an old colab template. Use this in your first setup cell instead, which should lead you to our new runtime:
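For context, a hedged sketch of the kind of setup cell in common use at the time (the exact cell referenced above is not reproduced in this thread; the script path and the version string are assumptions):

```python
# Hypothetical Colab setup cell, not the exact one referenced above.
# It fetches the pytorch/xla env-setup helper and installs a matching
# torch_xla build; the VERSION value here is an assumption.
VERSION = "nightly"  # e.g. "nightly" or a pinned release
!curl -s https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION
```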
And yes @dlibenzi, similar to this case, the command to update the runtime was targeting the old runtime, but as long as we use this setup script it should correctly update to the new runtime.
Thanks very much - it works with that new preamble. By the way, for future reference, if I wanted to make the colab runnable for you, what else would I need to do? One thing is that I have to re-mount my gdrive each time I reconnect, so I'm not sure that part would be reproducible for you. Is there a better place to host and store data files for use with Colab, so that I can allow others to run it? Thanks again, Henry
I think if you could reproduce the error with a fake data generator, that'd be ideal; if not, putting the data up temporarily somewhere like a GCS bucket would work for us too. Davide may have some other opinions.
Ahh good idea. And thanks for the sleek preamble, much cleaner.
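For reference, a hedged sketch (not from this thread) of what staging data in a GCS bucket could look like, so a Colab can be shared without mounting a personal Drive; the bucket name and paths are placeholders:

```python
# Hypothetical Colab cell: authenticate, then copy data to/from a GCS bucket.
# Bucket name and local paths are placeholders, not anything from this issue.
from google.colab import auth
auth.authenticate_user()  # grants the notebook access to your GCP project

BUCKET = "gs://my-scratch-bucket"        # placeholder bucket you own
!gsutil -m cp -r ./data {BUCKET}/data    # upload once
!gsutil -m cp -r {BUCKET}/data ./data    # anyone re-running the Colab pulls it here
```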
I was running on GCP with a Jupyter Notebook and faced the exact same problem.
I encountered a similar problem when running a simple CIFAR classification task; it raises the error after about 2000 iterations:
Exception in device=TPU:7: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Invalid argument: Computation requires more parameters (6096) than supported (limit 3306).
[[{{node XRTCompile}}]]
(1) Invalid argument: Computation requires more parameters (6096) than supported (limit 3306).
[[{{node XRTCompile}}]]
[[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
fn(gindex, *args)
File "/content/jam/jamtorch/xla/utils.py", line 78, in new_fn
value = func(config)
File "/content/jam/example/jamtorch/tpuddp/main.py", line 32, in run
trainer.train()
File "/content/jam/jamtorch/trainer/genetic_trainer.py", line 212, in train
self.train_step(batch)
File "/content/jam/jamtorch/trainer/genetic_trainer.py", line 240, in train_step
if self.loss_backward(loss):
File "/content/jam/jamtorch/trainer/genetic_trainer.py", line 258, in loss_backward
loss.backward()
File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Invalid argument: Computation requires more parameters (6096) than supported (limit 3306).
[[{{node XRTCompile}}]]
(1) Invalid argument: Computation requires more parameters (6096) than supported (limit 3306).
[[{{node XRTCompile}}]]
[[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
I used the provided environment of
Yeah, it is the hard limit on the number of parameters right now. I will work on a change on the PT/XLA side to pass the parameters as a tuple, and that should solve this error.
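To relate the numbers in the error to a model, here is a rough, hedged heuristic (not an official torch_xla diagnostic): the compiled computation's parameters roughly correspond to the distinct tensors fed into the graph (model weights, buffers, optimizer state, batch inputs), so counting a model's parameter and buffer tensors gives only a lower bound:

```python
# Rough heuristic only: counts distinct parameter/buffer tensors of a model.
# The XRT "parameters" in the error also include optimizer state and batch
# inputs, so the true count seen by the compiled computation is larger.
import torchvision

model = torchvision.models.resnet50()
n_params = sum(1 for _ in model.parameters())
n_buffers = sum(1 for _ in model.buffers())
print(f"parameter tensors: {n_params}, buffer tensors: {n_buffers}")
```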
@JackCaoG I just ran into this error, with the same limit as the last poster.
I dropped this project to work on something else last year. @rwightman My guess is that on v4 we have a larger
@JackCaoG I'm running through some larger candidate vision models for medium-to-large-scale CLIP / LiT / etc. image-text model pretraining. I hope to include a script with working hparams for reproducing such training on TPU, GPU, (maybe IPU) with PyTorch... so the models are fairly large, and I hope to go a bit larger still... but so far I've kept within what I thought would be reasonable to test on a single v3-8. I can resume training this one on a 4x GPU machine, so no urgency there. Once I sort out the rest of the setup and get further along with the runs on the larger dataset, I will likely run into this limit. I'm not sure how long all that will take, but I can probably work around this for a bit. It does appear that it'd be easy to hit in any scenario where pod use is needed (models too large to fit decent batch sizes on a single accelerator), so I'm surprised more people haven't hit it.
Sounds good, I will keep you updated regarding this issue.
❓ Questions and Help
Hi all,
Could anyone give me a clue about what might be going wrong? I have run this commit, from this colab,
which has produced this output: debug run
Some lines from it are:
The same code has run successfully on my GTX 1070 Max-Q laptop environment with PyTorch version 1.3.1.
I've never seen this error before (but it has been several months since I've used torch_xla).
Thanks in advance!