pytorch-triton-rocm==2.3.1 uses ROCM 6's libamdhip64.so instead of ROCM 5.7 #626

Open
etiennemlb opened this issue Aug 9, 2024 · 0 comments


Problem Description

On my platform, I'm using PyTorch 2.3.1 for ROCm 5.7.1 (pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7). This pulls in triton-rocm-2.3.1, and the installation itself completes without problems.

Take any hello world triton program and run it.

I get the following error:

Traceback (most recent call last):
  File "python_environment/lib/python3.11/site-packages/triton/common/backend.py", line 96, in get_backend
    importlib.import_module(device_backend_package_name, package=__spec__.name)
  File "/opt/cray/pe/python/3.11.5/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "python_environment/lib/python3.11/site-packages/triton/third_party/hip/__init__.py", line 5, in <module>
    register_backend("hip", HIPBackend)
  File "python_environment/lib/python3.11/site-packages/triton/common/backend.py", line 88, in register_backend
    _backends[device_type] = backend_cls.create_backend(device_type)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python_environment/lib/python3.11/site-packages/triton/common/backend.py", line 80, in create_backend
    return cls(device_type)
           ^^^^^^^^^^^^^^^^
  File "python_environment/lib/python3.11/site-packages/triton/third_party/hip/hip_backend.py", line 389, in __init__
    self.driver = HIPDriver()
                  ^^^^^^^^^^^
  File "python_environment/lib/python3.11/site-packages/triton/runtime/driver.py", line 119, in __init__
    self.utils = HIPUtils()
                 ^^^^^^^^^^
  File "python_environment/lib/python3.11/site-packages/triton/runtime/driver.py", line 105, in __init__
    mod = importlib.util.module_from_spec(spec)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 573, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1233, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: libamdhip64.so.6: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "test.py", line 90, in <module>
    output_triton = add(x, y)
                    ^^^^^^^^^
  File "test.py", line 76, in add
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
  File "<string>", line 25, in add_kernel
ValueError: Cannot find backend for hip

Looking into the code, we get to rocm_path_dir:

@functools.lru_cache()
def rocm_path_dir():
    default_path = os.path.join(os.path.dirname(__file__), "..", "third_party", "hip")
    # Check if include files have been populated locally.  If so, then we are 
    # most likely in a whl installation and he rest of our libraries should be here
    if (os.path.exists(default_path+"/include/hip/hip_runtime.h")):
        return default_path
    else:
        return os.getenv("ROCM_PATH", default="/opt/rocm")

This function locates the runtime and headers that Triton builds and links against. In my situation it detects default_path as valid and uses it, so the libraries are taken from .../python_environment/lib/python3.11/site-packages/triton/common/../third_party/hip/lib.

This brings us to the first issue:
Triton links against a ROCm 6 runtime and not a ROCm 5 one, even though I pulled Triton in through a ROCm 5.7 build of torch.

$ readelf -a -W .../python_environment/lib/python3.11/site-packages/triton/common/../third_party/hip/lib/libamdhip64.so | grep soname
 0x000000000000000e (SONAME)             Library soname: [libamdhip64.so.6]
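
For anyone who wants to script this check, here is a minimal sketch that extracts the SONAME from readelf output. The helper names parse_soname and read_soname are my own for illustration, not part of Triton, and read_soname assumes readelf (binutils) is on the PATH.

```python
import re
import subprocess

# Matches the SONAME entry of `readelf -d`, e.g.
#  0x000000000000000e (SONAME)  Library soname: [libamdhip64.so.6]
_SONAME_RE = re.compile(r"\(SONAME\)\s+Library soname: \[(?P<soname>[^\]]+)\]")


def parse_soname(readelf_output):
    """Return the SONAME found in readelf output, or None if there is none."""
    match = _SONAME_RE.search(readelf_output)
    return match.group("soname") if match else None


def read_soname(library_path):
    """Run `readelf -d -W` on a shared library and return its SONAME.

    Assumes the readelf binary is installed; raises CalledProcessError
    if readelf fails on the given file.
    """
    result = subprocess.run(
        ["readelf", "-d", "-W", library_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_soname(result.stdout)
```

Calling read_soname on the bundled libamdhip64.so should report the same soname as the readelf command above.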

The second issue is that right after building triton/runtime/backends/hip.c, we try to load it into Python (spec = importlib.util.spec_from_file_location("hip_utils", cache_path)).
This fails: the libamdhip64.so we linked against is not in LD_LIBRARY_PATH, and it is not even guaranteed to be installed on the system or available through the ldconfig cache. Hence the error shown at the top of this issue: ImportError: libamdhip64.so.6: cannot open shared object file: No such file or directory.
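
A quick way to confirm whether the dynamic loader can resolve a given soname in the current environment is to try loading it with ctypes. This is only a sketch; the helper name can_load is hypothetical and not part of Triton.

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can find and load `soname`.

    The lookup follows the loader's normal search rules for the current
    process (LD_LIBRARY_PATH, the ldconfig cache, default directories).
    """
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the shared object cannot be found or loaded.
        return False
    return True


if __name__ == "__main__":
    for name in ("libamdhip64.so.5", "libamdhip64.so.6"):
        print(name, "loadable:", can_load(name))
```

On a system with only ROCm 5.7 installed, I would expect the libamdhip64.so.6 lookup to fail, which matches the ImportError above.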

Third issue: I think rocm_path_dir works against the user, because it prevents them from overriding the library selection described above. I would advocate reversing the priority so that ROCM_PATH, if set, is used before falling back to Triton's own HIP runtime. Failing that, provide a way to disable Triton's use of its bundled HIP runtime.
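
To make the proposal concrete, here is a rough sketch of the lookup order I have in mind. It is not Triton's code: the function name resolve_rocm_path and the opt-out variable TRITON_USE_BUNDLED_HIP are hypothetical, and the package directory is passed in explicitly so the logic can be tried outside of Triton.

```python
import os


def resolve_rocm_path(package_dir, env=None):
    """Pick a ROCm root, preferring the user's ROCM_PATH over bundled files.

    package_dir plays the role of os.path.dirname(__file__) in the
    original rocm_path_dir. TRITON_USE_BUNDLED_HIP is a hypothetical
    opt-out switch, not an existing Triton setting.
    """
    env = os.environ if env is None else env

    # 1. An explicit ROCM_PATH always wins.
    rocm_path = env.get("ROCM_PATH")
    if rocm_path:
        return rocm_path

    # 2. Otherwise use the runtime shipped in the wheel, unless disabled.
    bundled = os.path.normpath(
        os.path.join(package_dir, "..", "third_party", "hip")
    )
    header = os.path.join(bundled, "include", "hip", "hip_runtime.h")
    use_bundled = env.get("TRITON_USE_BUNDLED_HIP", "1") != "0"
    if use_bundled and os.path.exists(header):
        return bundled

    # 3. Last resort: the conventional system location.
    return "/opt/rocm"
```

With this ordering, exporting ROCM_PATH to the system ROCm 5.7.1 installation would be enough to avoid the bundled ROCm 6 libraries, and the wheel's runtime would still be picked up when the user sets nothing.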

TL;DR: A Triton pulled in by a torch built for ROCm 5 in fact uses ROCm 6 binaries, while the user expects ROCm 5. If the user has not installed ROCm 6, we end up with errors of the form 'ImportError: libamdhip64.so.6: cannot open shared object file: No such file or directory'.

A workaround is to edit rocm_path_dir so that it does not use Triton's libraries:
[image]
Note that we can't simply change the environment so that it points to Triton's HIP runtime. That could break much of the other HIP machinery (mixing ROCm 5 and 6?).

Operating System

RHEL 8.8

CPU

AMD EPYC 7A53 64-Core Processor

GPU

AMD Instinct MI250X

ROCm Version

ROCm 5.7.1

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
