
[Bug] torch_xla._XLAC._xla_get_devices() hangs at the second call #3939

Open
comaniac opened this issue Aug 26, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@comaniac

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Note that I intentionally didn't set XLA configuration.

>>> import torch_xla
>>> torch_xla._XLAC._xla_get_devices()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:280 : Missing XLA configuration
>>> torch_xla._XLAC._xla_get_devices()
# Hanging

Expected behavior

No matter how many times this is called, it should consistently raise the RuntimeError. I encountered this issue when trying to use the HuggingFace Trainer with native PyTorch: the Trainer detects torch_xla in my Python environment, so its function is_torch_tpu_available() calls xla_device() to check whether a TPU is available, and it hung at the second call.
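For context, the availability check in question has roughly this shape (a simplified sketch, not the exact transformers code): the probe treats any failure to obtain an XLA device as "no TPU", which only works if the call fails the same way every time.

```python
def is_torch_tpu_available():
    """Simplified sketch of a transformers-style availability probe
    (not the exact upstream code): any failure to obtain an XLA device
    is treated as 'no TPU'."""
    try:
        import torch_xla.core.xla_model as xm
        xm.xla_device()  # raises RuntimeError when XLA is unconfigured
        return True
    except (ImportError, RuntimeError):
        return False

print(is_torch_tpu_available())
```

Because callers assume this probe is cheap and repeatable, a hang on the second invocation breaks any code path that checks availability more than once.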

Environment

  • Reproducible on XLA backend [CPU/TPU]: No backend
  • torch_xla version: d8db50a

Additional context

@JackCaoG
Collaborator

This is one of those APIs we don't expect users to call directly (it is under _XLAC). Any reason xm.xla_device() doesn't work for you?

@comaniac
Author

comaniac commented Aug 26, 2022

The following usage with xla_device() results in the same issue:

>>> import torch_xla
>>> import torch_xla.core.xla_model as xm
>>> xm.xla_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/xla/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 244, in xla_device
    devices = get_xla_supported_devices(devkind=devkind)
  File "/home/ubuntu/anaconda3/envs/xla/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 138, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/home/ubuntu/anaconda3/envs/xla/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/home/ubuntu/anaconda3/envs/xla/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 20, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:280 : Missing XLA configuration
>>> xm.xla_device()
# Hanging

I traced the hang down to _XLAC using py-spy, so I reported that call directly.

@comaniac
Author

btw, this is the output from py-spy when it was hanging in HuggingFace:

Thread 118054 (idle): "MainThread"
    <lambda> (torch_xla/core/xla_model.py:20)
    value (torch_xla/utils/utils.py:32)
    get_xla_supported_devices (torch_xla/core/xla_model.py:138)
    xla_device (torch_xla/core/xla_model.py:244)
    is_torch_tpu_available (transformers/utils/import_utils.py:409)
    _setup_devices (transformers/training_args.py:1328)
    wrapper (transformers/utils/import_utils.py:926)
    __get__ (transformers/utils/generic.py:49)
    device (transformers/training_args.py:1389)
    wrapper (transformers/utils/import_utils.py:926)
    __post_init__ (transformers/training_args.py:1086)
    __init__ (<string>:102)
    <module> (fine_tune.py:34)
Thread 118156 (idle): "Thread-1"
    wait (threading.py:300)
    wait (threading.py:552)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)

@JackCaoG
Collaborator

On second thought, shouldn't the Python program just exit when it hits the first runtime error? Is this only an issue when you are in an interactive Python interpreter?

@comaniac
Author

I agree with you, but unfortunately this is not how HuggingFace uses it...
Specifically, they fall back to other backends when is_torch_tpu_available() (which calls xla_device()) returns False, and this function may be called more than once.
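One caller-side mitigation is to make sure the native probe runs at most once, so a hang on the second call can never be reached. A sketch under that assumption (make_cached_probe is a hypothetical helper, not transformers code):

```python
import functools

def make_cached_probe(probe):
    """Wrap a device probe so it runs at most once: later calls reuse
    the first result instead of re-entering the (possibly hanging)
    native call."""
    @functools.lru_cache(maxsize=None)
    def cached():
        try:
            probe()
            return True
        except RuntimeError:
            return False
    return cached

calls = []
def fake_xla_device():  # stand-in for xm.xla_device()
    calls.append(1)
    raise RuntimeError("Missing XLA configuration")

is_tpu_available = make_cached_probe(fake_xla_device)
print(is_tpu_available())  # False
print(is_tpu_available())  # False again, cached
print(len(calls))          # 1: the probe ran only once
```

Caching only the boolean result keeps repeated availability checks cheap and sidesteps the broken-retry behavior in the underlying call.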

Meanwhile, I filed a PR to HuggingFace that should also work around this issue: huggingface/transformers#18777
