
[Bug] torch_xla._XLAC._xla_get_devices() hangs at the second call #3939

Open
comaniac opened this issue Aug 26, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@comaniac

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Note that I intentionally didn't set XLA configuration.

>>> import torch_xla
>>> torch_xla._XLAC._xla_get_devices()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:280 : Missing XLA configuration
>>> torch_xla._XLAC._xla_get_devices()
# Hanging

Expected behavior

No matter how many times this is called, it should consistently raise the RuntimeError. I encountered this issue when trying to use the HuggingFace Trainer with native PyTorch: the Trainer detects torch_xla in my Python environment, so its function is_torch_tpu_available() calls xla_device() to check whether a TPU is available, and it hung at the second call.
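For context, the availability check in question has roughly this shape (a simplified sketch, not the exact transformers code): the probe treats any failure to obtain an XLA device as "no TPU", which only works if the call fails the same way every time.

```python
def is_torch_tpu_available():
    """Simplified sketch of a transformers-style availability probe
    (not the exact upstream code): any failure to obtain an XLA device
    is treated as 'no TPU'."""
    try:
        import torch_xla.core.xla_model as xm
        xm.xla_device()  # raises RuntimeError when XLA is unconfigured
        return True
    except (ImportError, RuntimeError):
        return False

print(is_torch_tpu_available())
```

Because callers assume this probe is cheap and repeatable, a hang on the second invocation breaks any code path that checks availability more than once.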

Environment

  • Reproducible on XLA backend [CPU/TPU]: No backend
  • torch_xla version: d8db50a

Additional context

@JackCaoG
Collaborator

This is one of those APIs we don't expect users to call directly (it is under _XLAC). Any reason xm.xla_device() doesn't work for you?

@comaniac
Author

comaniac commented Aug 26, 2022

The following usage with xla_device() results in the same issue:

>>> import torch_xla
>>> import torch_xla.core.xla_model as xm
>>> xm.xla_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/xla/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 244, in xla_device
    devices = get_xla_supported_devices(devkind=devkind)
  File "/home/ubuntu/anaconda3/envs/xla/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 138, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/home/ubuntu/anaconda3/envs/xla/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/home/ubuntu/anaconda3/envs/xla/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 20, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:280 : Missing XLA configuration
>>> xm.xla_device()
# Hanging

I traced the hang down to _XLAC using py-spy, so I reported that call directly.

@comaniac
Author

btw, this is the output from py-spy when it was hanging in HuggingFace:

Thread 118054 (idle): "MainThread"
    <lambda> (torch_xla/core/xla_model.py:20)
    value (torch_xla/utils/utils.py:32)
    get_xla_supported_devices (torch_xla/core/xla_model.py:138)
    xla_device (torch_xla/core/xla_model.py:244)
    is_torch_tpu_available (transformers/utils/import_utils.py:409)
    _setup_devices (transformers/training_args.py:1328)
    wrapper (transformers/utils/import_utils.py:926)
    __get__ (transformers/utils/generic.py:49)
    device (transformers/training_args.py:1389)
    wrapper (transformers/utils/import_utils.py:926)
    __post_init__ (transformers/training_args.py:1086)
    __init__ (<string>:102)
    <module> (fine_tune.py:34)
Thread 118156 (idle): "Thread-1"
    wait (threading.py:300)
    wait (threading.py:552)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)

@JackCaoG
Collaborator

On second thought, shouldn't the Python program just exit when it hits the first runtime error? Is this only an issue when you are in an interactive Python interpreter?

@comaniac
Author

I agree with you, but unfortunately this is not how HuggingFace uses it...
Specifically, they fall back to other backends when is_torch_tpu_available() (which calls xla_device()) returns False, and this function may be called more than once.
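One caller-side mitigation is to make sure the native probe runs at most once, so a hang on the second call can never be reached. A sketch under that assumption (make_cached_probe is a hypothetical helper, not transformers code):

```python
import functools

def make_cached_probe(probe):
    """Wrap a device probe so it runs at most once: later calls reuse
    the first result instead of re-entering the (possibly hanging)
    native call."""
    @functools.lru_cache(maxsize=None)
    def cached():
        try:
            probe()
            return True
        except RuntimeError:
            return False
    return cached

calls = []
def fake_xla_device():  # stand-in for xm.xla_device()
    calls.append(1)
    raise RuntimeError("Missing XLA configuration")

is_tpu_available = make_cached_probe(fake_xla_device)
print(is_tpu_available())  # False
print(is_tpu_available())  # False again, cached
print(len(calls))          # 1: the probe ran only once
```

Caching only the boolean result keeps repeated availability checks cheap and sidesteps the broken-retry behavior in the underlying call.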

Meanwhile, I filed a PR to HuggingFace that should also work around this issue: huggingface/transformers#18777
