Have you reproduced the bug with TensorFlow Nightly?
No
Source
binary
TensorFlow version
tensorflow-rocm-2.13.0.570
Custom code
Yes
OS platform and distribution
Fedora 39
Mobile device
No response
Python version
3.11.7
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
I am trying to install tensorflow-rocm on Fedora 39 and I ran into issues. Fedora 39 provides ROCm 5.7 packages in their repo, so I picked tensorflow-rocm-2.13.0.570 which was marked as the version supporting ROCm 5.7 in the compatibility table.
Since Fedora comes with Python 3.12.1 the setup had to be done in a Python 3.11.7 virtual environment; for the record - PyTorch 2.1.2+rocm5.6 does work in the same venv and does detect the GPU, so I think the ROCm libraries shipped by Fedora do seem to be OK.
After pip installing tensorflow_rocm-2.13.0.570 the following happens upon module import:
Python 3.11.7 (main, Dec 18 2023, 00:00:00) [GCC 13.2.1 20231205 (Red Hat 13.2.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/__init__.py", line 38, in <module>
from tensorflow.python.tools import module_util as _module_util
File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/__init__.py", line 36, in <module>
from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/pywrap_tensorflow.py", line 26, in <module>
self_check.preload_check()
File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/platform/self_check.py", line 63, in preload_check
from tensorflow.python.platform import _pywrap_cpu_feature_guard
ImportError: librccl.so.1: cannot open shared object file: No such file or directory
There is indeed no librccl.so.1 on the system and apparently no librccl.so.1 provider.
The tensorflow wheel does ship a venvs/311_generic/lib/python3.11/site-packages/tensorflow/include/external/local_config_rocm/rocm/rocm/lib/librccl.so (incidentally, torch also comes with its own librccl.so), but for some reason tensorflow-rocm is trying to load librccl.so.1.
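For anyone trying to confirm the same symptom, here is a minimal sketch that asks the dynamic loader directly whether a given soname can be opened, independent of TensorFlow. The helper name `can_load` is made up for illustration; it relies only on `ctypes.CDLL`, which raises `OSError` when the loader cannot find or open the library.

```python
import ctypes


def can_load(soname: str) -> bool:
    """Return True if the dynamic loader can open `soname`, else False."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the file is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    # librccl.so.1 is the name from the ImportError above;
    # librccl.so is the unversioned name shipped inside the wheels.
    for name in ("librccl.so.1", "librccl.so"):
        status = "loadable" if can_load(name) else "NOT loadable"
        print(f"{name}: {status}")
```

If `librccl.so.1` shows up as not loadable here as well, the problem is in the loader search path or the installed packages rather than in TensorFlow's Python code.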
As a workaround I tried to create a symlink librccl.so.1 -> librccl.so, but this did not help. Finally, I exported the directory via LD_LIBRARY_PATH before starting Python, which allowed me to import the module:
>>> import tensorflow
2024-01-21 19:18:41.965936: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
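The LD_LIBRARY_PATH workaround can be sketched roughly as follows. The function names and the `lib_dir` argument are hypothetical placeholders, not part of TensorFlow or ROCm. The import runs in a child process because the loader reads LD_LIBRARY_PATH when a process starts, so changing `os.environ` inside an already running interpreter would not affect library lookup.

```python
import os
import subprocess
import sys


def env_with_library_dir(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    if existing:
        env["LD_LIBRARY_PATH"] = lib_dir + os.pathsep + existing
    else:
        env["LD_LIBRARY_PATH"] = lib_dir
    return env


def try_import_tensorflow(lib_dir):
    """Run 'import tensorflow' in a fresh interpreter that sees lib_dir."""
    return subprocess.run(
        [sys.executable, "-c", "import tensorflow"],
        env=env_with_library_dir(lib_dir),
        capture_output=True,
        text=True,
    )
```

Calling `try_import_tensorflow(...)` with the directory that holds the `librccl.so.1` symlink and checking `returncode` and `stderr` on the result shows whether the import succeeds with that directory on the search path.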
However, attempting to run any TF related command results in an std::bad_alloc error:
>>> print(tensorflow.reduce_sum(tensorflow.random.normal([1000, 1000])))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/eager/context.py", line 1535, in _initialize_physical_devices
devs = pywrap_tfe.TF_ListPhysicalDevices()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError: std::bad_alloc
I tried downgrading to tensorflow_rocm-2.12.0.560, but it behaves exactly the same as 2.13.
For the sake of completeness I tried 2.14 as a "negative" test and as expected it did not find libamdhip64.so.6, because Fedora 39 does not have ROCm 6.0.0 yet.
Standalone code to reproduce the issue
# this triggers "ImportError: librccl.so.1: cannot open shared object file: No such file or directory"
>>> import tensorflow
# then, once I create the library symlink and add LD_LIBRARY_PATH
>>> print(tensorflow.reduce_sum(tensorflow.random.normal([1000, 1000])))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/eager/context.py", line 1451, in _initialize_physical_devices
devs = pywrap_tfe.TF_ListPhysicalDevices()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError: std::bad_alloc
### Relevant log output
_No response_
jin-eld changed the title from "tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to laod wrong librccl.so, std::bad_alloc failure)" to "tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to load wrong librccl.so, std::bad_alloc failure)" on Jan 22, 2024
Hi @jin-eld, librccl.so.1 should in theory be located in the ROCm directory after installing ROCm; on Ubuntu, it's under /opt/rocm/lib. We don't officially support Fedora, so component availability and location might differ there, and it sounds like the Fedora installation doesn't provide this library by default. I can see there is an rccl Fedora package; have you tried installing that?
Hi @schung-amd, to be honest I have not rechecked for quite a while now (I gave up on TensorFlow because ROCm support did not work on Fedora and used PyTorch instead). I will retest with Fedora 41. A lot has changed since F39; F41 basically comes with a fully packaged ROCm stack. However, the libraries will be in their default system locations, i.e. /usr/lib64/ and not /opt/rocm/lib.
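To see where a given ROCm library actually lives on a particular system, a small sketch like the one below can list matching files in both locations mentioned in this thread. The function name and the default directory tuple are illustrative assumptions; other distributions may use different paths.

```python
from pathlib import Path

# The two locations mentioned in this thread; adjust for other systems.
CANDIDATE_DIRS = ("/opt/rocm/lib", "/usr/lib64")


def find_library_files(stem, dirs=CANDIDATE_DIRS):
    """Return sorted paths matching '<stem>.so*' in each existing directory."""
    matches = []
    for directory in dirs:
        path = Path(directory)
        if path.is_dir():
            matches.extend(sorted(path.glob(f"{stem}.so*")))
    return matches


if __name__ == "__main__":
    for found in find_library_files("librccl"):
        print(found)
```

An empty result for `librccl` in both directories would suggest that the rccl package is not installed or puts its files somewhere else.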
It would be really nice to get tensorflow-rocm to work on Fedora, because it is almost impossible for TensorFlow to be packaged the way PyTorch is. There are two issues with that. First, as I understand Fedora's policy, all packages and their dependencies must be built from scratch. Second, TensorFlow uses "bazel" as its build system. According to the Fedora folks, building "bazel" from scratch would be an insane amount of work, so it is unlikely that this will ever be done, meaning that TensorFlow will probably never be packaged as an RPM.
So, if we could get a working pip install of tensorflow-rocm - that'd be really awesome.
I'll get back to you after I have retested tensorflow-rocm with latest Fedora.
Issue type
Build/Install