AMDGPU.jl doesn't seem to work with 7900 series GPUs #371
Comments
ROCm 5.5 was recently released, and it is the first version that officially supports the 7900 XT and 7900 XTX. |
Does it require LLVM 16, or is 15 enough? |
Just tried ROCm 5.5; it complains about opaque pointers: producer LLVM 16, consumer LLVM 14 |
Julia master is now on LLVM 15 I think, does that help here? |
Not completely sure, but it seems #508 (ROCm mixed mode) does not yet provide 7900 XTX support (at least on Arch). The failure is:
PyTorch with ROCm 5.6 works fine. Here is what I tried (on the nightly from Oct 14th 2023, and on 1.10-beta3):
|
I've seen this error, but I don't think it is related to Julia. I'm using Ubuntu 22.04 with ROCm 5.6, and we do support the 7900 XTX; I'm actually using it with Julia 1.10-beta3.
julia> using AMDGPU
julia> AMDGPU.versioninfo()
ROCm provided by: system
[+] HSA Runtime v1.1.0
@ /opt/rocm-5.6.1/lib/libhsa-runtime64.so
[+] ld.lld
@ /opt/rocm/llvm/bin/ld.lld
[+] ROCm-Device-Libs
@ /home/pxl-th/.julia/artifacts/5ad5ecb46e3c334821f54c1feecc6c152b7b6a45/amdgcn/bitcode
[+] HIP Runtime v5.6.31062
@ /opt/rocm-5.6.1/lib/libamdhip64.so
[+] rocBLAS v3.0.0
@ /opt/rocm-5.6.1/lib/librocblas.so
[+] rocSOLVER v3.22.0
@ /opt/rocm-5.6.1/lib/librocsolver.so
[+] rocALUTION
@ /opt/rocm-5.6.1/lib/librocalution.so
[+] rocSPARSE
@ /opt/rocm-5.6.1/lib/librocsparse.so.0
[+] rocRAND v2.10.5
@ /opt/rocm-5.6.1/lib/librocrand.so
[+] rocFFT v1.0.21
@ /opt/rocm-5.6.1/lib/librocfft.so
[+] MIOpen v2.20.0
@ /opt/rocm-5.6.1/lib/libMIOpen.so
HIP Devices [1]
1. HIPDevice(name="Radeon RX 7900 XTX", id=1)
julia> sum(AMDGPU.ones(Float32, 16))
16.0f0
You also don't need |
However there are other issues with Navi 3: #518. |
Weirdly, whether I use
|
Do |
Yes, and PyTorch works fine in a normal Python virtualenv (directly on the same host, no containers or VMs).
|
This particular issue comes from a debug HIP build, but can be ignored (for now). |
As for 7900 series, we already support them. |
I'm having some trouble with an error upon using AMDGPU. I first tried today with ROCm version 5.7.1, and have since downgraded to version 5.6.1. I'm on Arch Linux attempting to use a 7900 XTX. The leading line of the error is as follows:
Full error: Details
rocminfo: Details
clinfo: Details
If instead I use a LocalPreferences.toml with use_artifacts = true, then on using AMDGPU I see:
Details
Lastly, here is a list which I think contains the relevant packages. I've tested with everything uniformly either version 5.7 or version 5.6 except composable-kernel which I could only find as either version 5.5 or version 5.7. Details
Apologies if the solution is documented above and I have not understood how to put it together. Thank you for all of your hard work. |
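For readers following along, the use_artifacts preference mentioned above goes in a LocalPreferences.toml next to the active Project.toml. Below is a minimal shell sketch for writing that file. It assumes the Preferences.jl convention of a table named after the package ([AMDGPU]); the helper name write_amdgpu_prefs is hypothetical, and the function overwrites any existing LocalPreferences.toml in the target directory.

```shell
# Hypothetical helper: write a LocalPreferences.toml into a project directory.
# Assumption: the preference lives under an [AMDGPU] table, following the
# Preferences.jl convention of one table per package.
write_amdgpu_prefs() {
  project_dir="$1"      # directory that contains Project.toml
  use_artifacts="$2"    # must be the literal string "true" or "false"
  case "$use_artifacts" in
    true|false) ;;
    *) echo "use_artifacts must be true or false" >&2; return 1 ;;
  esac
  # Overwrites any existing LocalPreferences.toml in project_dir.
  printf '[AMDGPU]\nuse_artifacts = %s\n' "$use_artifacts" \
    > "$project_dir/LocalPreferences.toml"
}
```

Calling write_amdgpu_prefs . false from the project directory would switch back to the system ROCm; Julia has to be restarted for the change to take effect.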
Can you get ROCm in release mode? HIP on your system is in debug. |
Also, there are other issues with the debug build that do not occur in release mode. |
Gotcha, thank you for your time. I am not sure how easy/difficult it is for me to get ROCm in release mode, but I will report back with any progress. |
Not sure if Arch has HIP in release mode in its packages, but as an alternative, you can build it from source, which is quite easy. But you then need to make it visible to |
@fraksuh @pxl-th What I'm seeing is that:
This all seems like you don't have the AMDGPU DKMS modules installed on your base system. An easy way to check is to see if you have [...]
AMDGPU DKMS drivers don't seem to be provided in extra by Arch, and there's an AUR package named amdgpu-dkms that is not maintained anymore. There's this that might provide it, but I'm not sure if it works with ROCm. I'm curious to know what the correct way to get ROCm AMDGPU drivers is. Ubuntu and CentOS both have a dedicated package repo for this that matches the ROCm version you want to run, so it's easier for them. You're just paying the Arch penalty.
If you get the AMDGPU DKMS part working, you can just use containers instead of building ROCm yourself. AMD provides Ubuntu 18.04/20.04/22.04 images you can pull from Docker Hub. You can run one with the following command, although you might be able to reduce the amount of permissions you need to run it:
|
Hey thank you so much for taking the time to help.
I do at least have /dev/kfd on my system, but I don't at the moment know where it came from; I was able to "make sure your user is in the same group as /dev/kfd"
Earlier today I reached out to the maintainer of the relevant Arch packages, and they are kindly offering their brainpower; I plan to follow this lead first before attempting to either build ROCm myself or use containers (thank you for the suggestion!). |
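The /dev/kfd group check discussed above can be sketched in shell as follows. This is only an illustration under stated assumptions: the function name check_device_group is hypothetical, it relies on GNU stat (the -c option), and it tests group membership only, not whether the kernel driver itself is working.

```shell
# Hypothetical helper: report whether the current user belongs to the group
# that owns a device node. Defaults to /dev/kfd, the node named in the thread.
check_device_group() {
  dev="${1:-/dev/kfd}"
  if [ ! -e "$dev" ]; then
    echo "missing"
    return 1
  fi
  dev_gid=$(stat -c '%g' "$dev")   # numeric owning group (GNU stat)
  for gid in $(id -G); do          # numeric group IDs of the current user
    if [ "$gid" = "$dev_gid" ]; then
      echo "ok"
      return 0
    fi
  done
  echo "not-in-group"
  return 2
}
```

Running check_device_group with no arguments inspects /dev/kfd; a result of not-in-group suggests adding the user to the owning group and logging in again.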
I'm having the exact same problem, did you figure anything out? |
I've had some further correspondence with the package maintainer, but no resolution. Ideally I'd open an issue on their GitLab, but they seem to have suspended account creation. Here are the salient aspects of our conversation so far; quoted lines are the maintainer's words, non-quoted lines are mine:
"Packages are not built in debug mode but with Arch specific flags. Release mode is discouraged for all binary packages in the Arch Linux repositories."
"The /usr/src/debug doesn't mean that the package is compiled in debug mode. It's the default prefix for debug symbols,
"You can try to run the simple test.sh in the hip-runtime-amd repo on gitlab or run the HIP-Examples repo from AMD on github. If both pass, it's likely an issue with julia."
hip-runtime-amd-main> chmod +x test.sh
And I get the same passing output if I extract the HIP-Examples depot from AMD into /opt/rocm/hip and run their test.sh there. |
There are two failures here. The first is that it doesn't recognize the new core, but setting
HSA_OVERRIDE_GFX_VERSION=10.3.0
seems to fix that. I do get a separate error when trying to create a ROCArray.
I imagine this has to do with needing more recent versions of some libraries, because I can run HIP programs locally.
I'm specifically using an RX 7900 XTX (gfx1100).
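As a sketch of how the override above can be applied to a single command rather than the whole session: the wrapper name run_with_gfx_override is hypothetical, and the value 10.3.0 is simply the one reported in this issue, not a recommendation for any particular GPU.

```shell
# Hypothetical wrapper: run one command with HSA_OVERRIDE_GFX_VERSION set,
# without exporting the variable into the rest of the shell session.
run_with_gfx_override() {
  gfx_version="$1"   # e.g. the 10.3.0 value reported in this issue
  shift
  env HSA_OVERRIDE_GFX_VERSION="$gfx_version" "$@"
}

# Example (not run here; needs Julia, AMDGPU.jl and a ROCm install):
#   run_with_gfx_override 10.3.0 julia -e 'using AMDGPU; AMDGPU.versioninfo()'
```

Scoping the variable to one command makes it easier to compare behaviour with and without the override.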