[Bug] torch_xla._XLAC._xla_get_devices() hangs at the second call #3939
Comments
This is one of those APIs we don't expect users to call directly (it is under `_XLAC`); any reason you are calling it directly?
The following usage hangs at the second call:

```python
>>> import torch_xla
>>> import torch_xla.core.xla_model as xm
>>> xm.xla_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/xla/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 244, in xla_device
    devices = get_xla_supported_devices(devkind=devkind)
  File "/home/ubuntu/anaconda3/envs/xla/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 138, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/home/ubuntu/anaconda3/envs/xla/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/home/ubuntu/anaconda3/envs/xla/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 20, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:280 : Missing XLA configuration
>>> xm.xla_device()
# hangs here
```

I located the hang to `_XLAC` using py-spy, so I just reported that one.
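For reference, here is a rough sketch of what the `LazyProperty` wrapper in the traceback appears to do, reconstructed only from the frames shown above (this is not the actual `torch_xla` source, and the details may differ):

```python
class LazyProperty:
    # Caches the result of gen_fn the first time .value is read.
    def __init__(self, gen_fn):
        self._gen_fn = gen_fn
        self._value = None

    @property
    def value(self):
        if self._value is None:
            # If gen_fn() raises (e.g. "Missing XLA configuration"),
            # nothing is cached, so the next access runs gen_fn() again,
            # i.e. torch_xla._XLAC._xla_get_devices() is invoked a second
            # time -- which is where the hang is observed.
            self._value = self._gen_fn()
        return self._value
```

If it works roughly like this, it would explain why the second `xm.xla_device()` call ends up back inside `_XLAC._xla_get_devices()` instead of reusing a cached result.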
By the way, this is the output from py-spy when it was hanging in HuggingFace:
On second thought, shouldn't the Python program just exit when it hits the first RuntimeError? Or is this only an issue when you are running an interactive Python interpreter from bash?
I agree with you, but unfortunately this is not how HuggingFace uses it. Meanwhile, I filed a PR to HuggingFace that should also work around this issue: huggingface/transformers#18777
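For context, here is a minimal sketch of the call pattern that triggers the hang. This is an illustration only, not the actual transformers implementation, and the helper name `tpu_seems_available` is made up:

```python
import torch_xla.core.xla_model as xm


def tpu_seems_available():
    # An availability probe that swallows the configuration error,
    # roughly what an "is a TPU usable?" check would do.
    try:
        xm.xla_device()
        return True
    except RuntimeError:
        # First call: "Missing XLA configuration" lands here.
        return False


tpu_seems_available()  # returns False after the RuntimeError
tpu_seems_available()  # never returns: hangs inside _XLAC._xla_get_devices()
```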
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Note that I intentionally did not set any XLA configuration.
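Condensed from the interactive session quoted in the comments above:

```python
import torch_xla
import torch_xla.core.xla_model as xm

xm.xla_device()  # first call raises RuntimeError: Missing XLA configuration
xm.xla_device()  # second call never returns: it hangs in
                 # torch_xla._XLAC._xla_get_devices()
```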
Expected behavior
No matter how many times this is called, it should consistently raise the RuntimeError.

I encountered this issue while trying to use the HuggingFace Trainer with native PyTorch. However, the HuggingFace Trainer detects `torch_xla` in my Python environment, so its function `is_torch_tpu_available()` calls `xla_device()` to check whether a TPU is available, and it hung at the second call.

Environment
Additional context