tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to load wrong librccl.so, std::bad_alloc failure) #2374

jin-eld · 2024-01-21T18:49:08Z

Issue type

Build/Install

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

tensorflow-rocm-2.13.0.570

Custom code

Yes

OS platform and distribution

Fedora 39

Mobile device

No response

Python version

3.11.7

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I am trying to install tensorflow-rocm on Fedora 39 and ran into issues. Fedora 39 provides ROCm 5.7 packages in its repo, so I picked tensorflow-rocm-2.13.0.570, which is marked in the compatibility table as the version supporting ROCm 5.7.

Since Fedora comes with Python 3.12.1, the setup had to be done in a Python 3.11.7 virtual environment. For the record, PyTorch 2.1.2+rocm5.6 works in the same venv and detects the GPU, so the ROCm libraries shipped by Fedora seem to be OK.

After pip installing tensorflow_rocm-2.13.0.570 the following happens upon module import:

Python 3.11.7 (main, Dec 18 2023, 00:00:00) [GCC 13.2.1 20231205 (Red Hat 13.2.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/__init__.py", line 38, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/__init__.py", line 36, in <module>
    from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/pywrap_tensorflow.py", line 26, in <module>
    self_check.preload_check()
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/platform/self_check.py", line 63, in preload_check
    from tensorflow.python.platform import _pywrap_cpu_feature_guard
ImportError: librccl.so.1: cannot open shared object file: No such file or directory

There is indeed no librccl.so.1 on the system, and apparently no package provides it.
The tensorflow wheel does ship venvs/311_generic/lib/python3.11/site-packages/tensorflow/include/external/local_config_rocm/rocm/rocm/lib/librccl.so (torch, by the way, also comes with its own librccl.so), but for some reason tensorflow-rocm tries to load librccl.so.1.
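The observation above (a bundled librccl.so exists, but the loader cannot resolve librccl.so.1) can be checked without importing TensorFlow. The following is a minimal sketch, not part of the original report: the helper names `find_library_files` and `can_load` are hypothetical, and it assumes a Linux system where `ctypes.CDLL` hands the name to the dynamic loader.

```python
import ctypes
import os


def find_library_files(root, prefix="librccl.so"):
    """Return sorted paths under `root` whose file name starts with `prefix`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith(prefix):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def can_load(soname):
    """Return True if the dynamic loader can resolve `soname` in this process."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the loader cannot find or open the shared object.
        return False
    return True
```

Calling `find_library_files` on the venv's site-packages directory would list the bundled copies, while `can_load("librccl.so.1")` shows whether the name the import asks for is resolvable from the loader's search path.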

As a workaround I tried to create a symlink librccl.so.1 -> librccl.so, but this did not help. Finally, I exported the directory via LD_LIBRARY_PATH before starting Python, which allowed me to import the module:

>>> import tensorflow
2024-01-21 19:18:41.965936: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
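The LD_LIBRARY_PATH workaround described above can also be scripted. This is a sketch under assumptions: the helper names are hypothetical, and the directory passed in is whichever one holds the librccl.so.1 symlink. The child process is started with the variable already set, because the loader reads LD_LIBRARY_PATH when a process starts and changing it inside a running interpreter does not affect later library lookups.

```python
import os
import subprocess
import sys


def env_with_library_dir(lib_dir, base_env=None):
    """Return a copy of the environment with `lib_dir` prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env


def run_python_with_library_dir(lib_dir, code):
    """Run `code` in a fresh interpreter that starts with the adjusted search path."""
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env_with_library_dir(lib_dir),
        capture_output=True,
        text=True,
    )
```

For example, `run_python_with_library_dir(lib_dir, "import tensorflow")` would attempt the import in a child process and return its exit code and captured output.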

However, attempting to run any TF-related command results in a std::bad_alloc error:

>>> print(tensorflow.reduce_sum(tensorflow.random.normal([1000, 1000])))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/eager/context.py", line 1535, in _initialize_physical_devices
    devs = pywrap_tfe.TF_ListPhysicalDevices()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError: std::bad_alloc

I tried downgrading to tensorflow_rocm-2.12.0.560, but it behaves exactly the same as 2.13.

For the sake of completeness, I tried 2.14 as a "negative" test and, as expected, it did not find libamdhip64.so.6, because Fedora 39 does not have ROCm 6.0.0 yet.
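Judging only by the wheels named in this report, the fourth component of the version string appears to encode the ROCm release the wheel targets (570 for ROCm 5.7). The sketch below decodes it on that assumption. The helper name is hypothetical, and the three-digit convention is inferred from these examples, not taken from any documented rule.

```python
def split_wheel_version(version):
    """Split e.g. '2.13.0.570' into ('2.13.0', '5.7.0').

    Assumes the last component is a three-digit ROCm version
    (major, minor, patch); this is inferred, not documented.
    """
    parts = version.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected version format: {version!r}")
    rocm_digits = parts[3]
    if len(rocm_digits) != 3:
        raise ValueError(f"unexpected ROCm suffix: {rocm_digits!r}")
    tf_version = ".".join(parts[:3])
    rocm_version = ".".join(rocm_digits)  # '570' becomes '5.7.0'
    return tf_version, rocm_version
```

Comparing the decoded ROCm version against the one the distribution provides is a quick sanity check before installing a wheel.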

Standalone code to reproduce the issue

# this triggers "ImportError: librccl.so.1: cannot open shared object file: No such file or directory"
>>> import tensorflow

# then, once I create the library symlink and add LD_LIBRARY_PATH
>>> print(tensorflow.reduce_sum(tensorflow.random.normal([1000, 1000])))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/eager/context.py", line 1451, in _initialize_physical_devices
    devs = pywrap_tfe.TF_ListPhysicalDevices()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError: std::bad_alloc



Relevant log output

No response


schung-amd · 2024-11-22T19:36:30Z

Hi @jin-eld, librccl.so.1 should in theory be located in the ROCm directory after installing ROCm; on Ubuntu, it's there under /opt/rocm/lib. We don't officially support Fedora, so component availability and location might differ there, and it sounds like the Fedora installation doesn't provide this by default. I can see there is a rccl Fedora package; have you tried installing that?

jin-eld · 2024-11-22T19:56:42Z

Hi @schung-amd, to be honest, I have not rechecked for quite a while (I gave up on TensorFlow because ROCm support did not work on Fedora, and used PyTorch instead). I will retest with Fedora 41; a lot has changed since F39, and F41 basically comes with a fully packaged ROCm stack. However, the libraries will be in their default system locations, i.e. /usr/lib64/, and not in /opt/rocm/lib.
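Since the library location differs between the two setups mentioned so far (/opt/rocm/lib on Ubuntu, /usr/lib64 on Fedora), a small check like the following could report where a given soname actually lives. It is a sketch with a hypothetical helper name; the directory list is passed in by the caller and nothing is assumed about which packages are installed.

```python
import os


def locate_soname(soname, directories):
    """Return (directory, resolved_path) pairs for each directory containing `soname`.

    Symlinks are resolved so the real file behind e.g. librccl.so.1 is visible.
    Broken symlinks are skipped because os.path.exists follows links.
    """
    hits = []
    for directory in directories:
        candidate = os.path.join(directory, soname)
        if os.path.exists(candidate):
            hits.append((directory, os.path.realpath(candidate)))
    return hits
```

For this thread, the call would be `locate_soname("librccl.so.1", ["/opt/rocm/lib", "/usr/lib64"])`; an empty result means neither location provides the file.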

It would be really nice to get tensorflow-rocm to work on Fedora, because it is almost impossible for TensorFlow to get packaged the way PyTorch is. There are two issues with that. First, as I understand Fedora's policy, all packages and their dependencies must be built from scratch. Second, TensorFlow uses "bazel" as its build system. According to Fedora folks, building "bazel" from scratch would be an insane amount of work, so it's not likely that this will ever be done, meaning that TensorFlow will never get packaged as an RPM.

So, if we could get a working pip install of tensorflow-rocm, that'd be really awesome.

I'll get back to you after I have retested tensorflow-rocm with the latest Fedora.

jin-eld changed the title ~~tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to laod wrong librccl.so, std::bad_alloc failure)~~ tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to load wrong librccl.so, std::bad_alloc failure) Jan 22, 2024

ppanchad-amd added the Under Investigation label Nov 22, 2024
