tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to load wrong librccl.so, std::bad_alloc failure) #2374

Open
jin-eld opened this issue Jan 21, 2024 · 2 comments


jin-eld commented Jan 21, 2024

Issue type

Build/Install

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

tensorflow-rocm-2.13.0.570

Custom code

Yes

OS platform and distribution

Fedora 39

Mobile device

No response

Python version

3.11.7

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I am trying to install tensorflow-rocm on Fedora 39 and ran into issues. Fedora 39 provides ROCm 5.7 packages in its repo, so I picked tensorflow-rocm-2.13.0.570, which is marked in the compatibility table as the version supporting ROCm 5.7.
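As a rough illustration of the version matching described above, here is a minimal sketch that splits a tensorflow-rocm wheel version into its TensorFlow and ROCm parts. It assumes the fourth component encodes the ROCm release with single-digit minor and patch numbers (570 for ROCm 5.7, 560 for ROCm 5.6), which is what the versions tried in this report suggest; the function name is hypothetical and this is not an official parsing rule.

```python
def split_tensorflow_rocm_version(version: str) -> tuple[str, str]:
    """Split e.g. '2.13.0.570' into a TensorFlow version and a ROCm version.

    Assumption (not confirmed by any official spec): the last component is
    the ROCm release written as digits, with one digit each for minor and
    patch, so '570' is read as ROCm 5.7.0.
    """
    parts = version.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected tensorflow-rocm version: {version!r}")

    tf_version = ".".join(parts[:3])
    rocm_digits = parts[3]
    if len(rocm_digits) < 3:
        raise ValueError(f"unexpected ROCm suffix: {rocm_digits!r}")

    # Everything before the last two digits is the major version.
    major, minor, patch = rocm_digits[:-2], rocm_digits[-2], rocm_digits[-1]
    return tf_version, f"{major}.{minor}.{patch}"


if __name__ == "__main__":
    for wheel in ("2.13.0.570", "2.12.0.560"):
        tf_version, rocm_version = split_tensorflow_rocm_version(wheel)
        print(f"tensorflow-rocm {wheel}: TensorFlow {tf_version}, ROCm {rocm_version}")
```

Comparing the ROCm part against what the distribution ships is a quick sanity check before installing a wheel.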

Since Fedora 39 comes with Python 3.12.1, the setup had to be done in a Python 3.11.7 virtual environment. For the record, PyTorch 2.1.2+rocm5.6 does work in the same venv and does detect the GPU, so the ROCm libraries shipped by Fedora appear to be OK.

After pip installing tensorflow_rocm-2.13.0.570 the following happens upon module import:

Python 3.11.7 (main, Dec 18 2023, 00:00:00) [GCC 13.2.1 20231205 (Red Hat 13.2.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/__init__.py", line 38, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/__init__.py", line 36, in <module>
    from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/pywrap_tensorflow.py", line 26, in <module>
    self_check.preload_check()
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/platform/self_check.py", line 63, in preload_check
    from tensorflow.python.platform import _pywrap_cpu_feature_guard
ImportError: librccl.so.1: cannot open shared object file: No such file or directory

There is indeed no librccl.so.1 on the system, and apparently no package provides it.
The tensorflow wheel does ship venvs/311_generic/lib/python3.11/site-packages/tensorflow/include/external/local_config_rocm/rocm/rocm/lib/librccl.so, and torch also comes with its own librccl.so, but for some reason tensorflow-rocm tries to load librccl.so.1.
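The loader error can be checked without importing TensorFlow at all. The following is a minimal sketch using ctypes, which asks the dynamic loader for a library by name; the helper name can_load is hypothetical. The loader looks for a file with exactly the requested name, so a file called librccl.so does not satisfy a request for librccl.so.1.

```python
import ctypes


def can_load(soname: str) -> tuple[bool, str]:
    """Try to load a shared library by name via the dynamic loader.

    Returns (True, "") on success, or (False, error message) if the loader
    cannot find or open the library.
    """
    try:
        ctypes.CDLL(soname)
    except OSError as exc:
        return False, str(exc)
    return True, ""


if __name__ == "__main__":
    # The two names discussed in this report.
    for name in ("librccl.so.1", "librccl.so"):
        ok, error = can_load(name)
        print(f"{name}: {'loadable' if ok else 'not loadable'} {error}")
```

Running this inside the same venv and shell environment as the failing import shows whether the loader can see either name from its default search path.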

As a workaround I tried to create a symlink librccl.so.1 -> librccl.so, but this did not help. Finally, I exported the directory via LD_LIBRARY_PATH before starting Python, which allowed me to import the module:

>>> import tensorflow
2024-01-21 19:18:41.965936: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
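The workaround described above (symlink plus LD_LIBRARY_PATH) could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the library directory has to be supplied by the caller, and the import is run in a child process because LD_LIBRARY_PATH has to be set before the Python interpreter starts.

```python
import os
import subprocess
import sys
from pathlib import Path


def ensure_soname_symlink(lib_dir: Path,
                          real_name: str = "librccl.so",
                          soname: str = "librccl.so.1") -> Path:
    """Create lib_dir/soname -> real_name unless something is already there."""
    link = lib_dir / soname
    if not link.exists() and not link.is_symlink():
        # Relative target, so the link stays valid if the directory moves.
        link.symlink_to(real_name)
    return link


def env_with_library_path(lib_dir: Path, base_env=None) -> dict:
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = (
        f"{lib_dir}{os.pathsep}{existing}" if existing else str(lib_dir)
    )
    return env


def import_tensorflow_in_child(lib_dir: Path) -> int:
    """Run 'import tensorflow' in a fresh interpreter with the adjusted environment."""
    ensure_soname_symlink(lib_dir)
    result = subprocess.run(
        [sys.executable, "-c", "import tensorflow"],
        env=env_with_library_path(lib_dir),
    )
    return result.returncode
```

A return code of 0 from import_tensorflow_in_child only means the import succeeded; it says nothing about whether device enumeration works afterwards.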

However, attempting to run any TF-related command results in a std::bad_alloc error:

>>> print(tensorflow.reduce_sum(tensorflow.random.normal([1000, 1000])))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/eager/context.py", line 1535, in _initialize_physical_devices
    devs = pywrap_tfe.TF_ListPhysicalDevices()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError: std::bad_alloc

I tried downgrading to tensorflow_rocm-2.12.0.560, but it behaves exactly the same as 2.13.

For the sake of completeness I tried 2.14 as a "negative" test and as expected it did not find libamdhip64.so.6, because Fedora 39 does not have ROCm 6.0.0 yet.

Standalone code to reproduce the issue

# this triggers "ImportError: librccl.so.1: cannot open shared object file: No such file or directory"
>>> import tensorflow

# then, once I create the library symlink and add LD_LIBRARY_PATH
>>> print(tensorflow.reduce_sum(tensorflow.random.normal([1000, 1000])))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/eager/context.py", line 1451, in _initialize_physical_devices
    devs = pywrap_tfe.TF_ListPhysicalDevices()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError: std::bad_alloc


Relevant log output

No response
@jin-eld jin-eld changed the title tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to laod wrong librccl.so, std::bad_alloc failure) tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to load wrong librccl.so, std::bad_alloc failure) Jan 22, 2024
@schung-amd

Hi @jin-eld, librccl.so.1 should in theory be located in the ROCm directory after installing ROCm; on Ubuntu, it's there under /opt/rocm/lib. We don't officially support Fedora, so component availability and location might differ there, and it sounds like the Fedora installation doesn't provide this by default. I can see there is an rccl Fedora package; have you tried installing that?


jin-eld commented Nov 22, 2024

Hi @schung-amd, to be honest I have not rechecked for quite a while now (I gave up on TensorFlow because its ROCm support did not work on Fedora and used PyTorch instead). I will retest with Fedora 41 - a lot has changed since F39; basically, F41 comes with a fully packaged ROCm stack. However, the libraries will be in their default system locations, i.e. /usr/lib64/ and not /opt/rocm/lib.
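To illustrate the location difference, here is a minimal sketch that checks the two directories mentioned in this thread for a given library file. The function name and the default directory list are assumptions for illustration; other distributions or ROCm installs may use different paths.

```python
from pathlib import Path

# The two locations mentioned in this thread; other setups may differ.
CANDIDATE_DIRS = ("/opt/rocm/lib", "/usr/lib64")


def find_library(soname: str, search_dirs=CANDIDATE_DIRS) -> list[Path]:
    """Return every search_dirs/soname path that exists on this machine."""
    hits = []
    for directory in search_dirs:
        candidate = Path(directory) / soname
        if candidate.exists():
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for name in ("librccl.so.1", "librccl.so"):
        found = find_library(name)
        print(name, "->", [str(p) for p in found] or "not found")
```

An empty result for librccl.so.1 in both directories would match the situation reported for Fedora 39.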

It would be really nice to get tensorflow-rocm to work on Fedora, because it is almost impossible for TensorFlow to be packaged the way PyTorch is. There are two issues with that. First, as I understand Fedora's policy, all packages and their dependencies must be built from scratch. Second, TensorFlow uses "bazel" as its build system. According to the Fedora folks, building "bazel" from scratch would be an insane amount of work, so it's not likely that this will ever be done, meaning that TensorFlow will never get packaged as an RPM.

So, if we could get a working pip install of tensorflow-rocm - that'd be really awesome.

I'll get back to you after I have retested tensorflow-rocm with latest Fedora.
