Support rusticl runtime #276
I'm having a play trying to get this to work, but as I don't know OpenCL or why sum64 is used, this is a crapshoot. My initial read was that the extension was needed for 64-bit ints to work at all, but it appears to be needed only for atom_add. And atom_add is just used to return the sum value to the host; seemingly, atomic in this context ensures host and device are in sync. So I tried replacing it with the 32-bit version, which should always be present:
Which ran, straight into a rusticl panic. Probably this is some deep syntax crime anyway; at the least, splitting into multiple atomic ops seems bad form:
But why do we need atomic at all right?
And just to be safe, delete the return value in sum64 in case the syntax messes anything up. That should completely bypass sum64 execution, unless it's being called somehow from other kernels by a name that I can't grep. However, still a panic:
That leads me to think that the runtime panics are entirely unrelated to the sum64 hackery? They've only been exposed now because the hackery let the kernel compile. Sorry for being verbose; it seemed sensible to let you know how I'm stabbing in the dark, as there's a good chance I'm misunderstanding how sum64 is used and crimes are being committed. |
Thanks, attempting to compile with rusticl is a useful exercise. And it's a good approach to simplify/remove the initial trouble bits (sum64) like you did, just to get it to compile. It seems what we hit now may be a rusticl bug. We don't have the rusticl stack-trace symbols, but there is a line number, and we know it tries to unwrap a None. |
Probably there is a bug in the fp64 implementation; it's still experimental after all. I'll look into it, ability permitting, and add an issue to rusticl's tracker if appropriate. I probably need to learn how to compile and use mesa first; if they reply I should be in a position to respond and test.

Actually, the more I look at it, the more I think it was premature to test fp64. It's available behind a flag, but it seems that's for implementers to be able to test it as they implement; it's described as "in-progress" on their tracker: https://mesamatrix.net/

I've been peeking at rusticl's code to see how hard it would be to implement the extension, but I'm struggling to even find atomic_add right now. I'm thinking that core CL functionality might not be done via rust/rusticl but directly in SPIR-V or something (unclear). All I can find in the Rust codebase is code exposing the 32-bit atomic support, not an implementation. int64 atomics is one of the few extensions that clover supports that rusticl doesn't. Extensions have been implemented in rusticl on a priority basis; maybe very few programs make use of it and it's just low priority. But maybe we'll get lucky and it'll have higher priority just because they want to fully supersede clover. |
Which rusticl version are you using? I'm using rusticl 23.2.1-1ubuntu3.1~22.04.2 and it does not produce the nice error messages you're seeing :) (it fails, but I can't see where) |
Mesa 23.3.5, the latest Arch packages. But I'm not even on an Arch host; this is through a temporary Arch environment with distrobox, which you could do too. It's as simple as |
This is with the latest sources:
which corresponds to this in vtn_variables.c:
Possibly I didn't use the correct compiler settings to build mesa; possibly this is just where rusticl is at. Here's the gist of how I compiled mesa
To do this yourself you'll need recent toolchains, at least meson 1.3.1, so either compile it yourself or take the easy route and build in a Fedora 39 environment, which has a recent enough meson. In an F39 environment you can also get most of the build dependencies easily by installing the builddep plugin for dnf |
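For anyone following along, a rusticl-enabled mesa build roughly follows this shape. This is a hedged sketch, not the commenter's exact recipe: the option names come from mesa's meson_options.txt and may drift between releases, and the prefix path is arbitrary.

```shell
# Build mesa with the rusticl OpenCL frontend enabled (sketch; verify
# option names against your mesa checkout's meson_options.txt).
git clone --depth 1 https://gitlab.freedesktop.org/mesa/mesa.git
cd mesa
meson setup build --prefix="$HOME/mesa-rusticl" \
  -Dgallium-drivers=radeonsi \
  -Dgallium-rusticl=true \
  -Dllvm=enabled \
  -Dbuildtype=release
ninja -C build install

# Point the ICD loader at the fresh install; rusticl only exposes
# devices for drivers listed in RUSTICL_ENABLE.
export OCL_ICD_VENDORS="$HOME/mesa-rusticl/etc/OpenCL/vendors"
export RUSTICL_ENABLE=radeonsi
clinfo | grep -i rusticl   # sanity check that the platform shows up
```

The RUSTICL_ENABLE step is easy to miss: without it, rusticl builds fine but reports no devices.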
If I'm interpreting things correctly, rusticl doesn't yet support the OpenCL 2.0 extension that adds things like work_group_reduce_add(): https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/workGroupFunctions.html This corresponds to this feature on the tracker: https://mesamatrix.net/#RusticlOpenCL2.0_Extension__Workgroup_Collective_Functions |
This is in standby from my POV, maybe in 6 months we'll have another go. |
Tried this again with the latest mesa. Now meson 1.4.0+ is required to build mesa, which I got by upgrading to Fedora 40 (alternatively, build meson from source). Also needed to add PyYAML (
Same blocker as before. Will try again in a few months. |
That's what I'm getting now with the latest rusticl from the Mesa 25.0 master branch.
Note the strange mention of "openclon12":
full log:
|
@oscarbg Could you please try with "-use NO_ASM" ? that should disable our use of __asm() in case that's producing the above error. |
@preda
|
@oscarbg that's impressive, it almost runs! :) The problem we hit now manifests in a consistent mismatch between the checksum computed GPU-side and CPU-side. Given that this happens on an iGPU, and with consistent values (that don't change across read attempts), it seems this is not a data transfer error. More likely, the problem is with the kernel that computes the checksum GPU-side, sum64() in etc.cl: Line 19 in 7a0ba6d
The 'tricky' elements in this kernel are: work_group_reduce_add() and the global atomic_add(). Maybe there's an error around there. The small kernel sum64() could be re-written with various experiments to try to isolate the error. At this point, I think it's worth opening an issue with rusticl, letting them know of this unexpected behavior (so they may become aware of a potential error there). The argument that it's unlikely to be "on our side" is that the code runs correctly across multiple GPUs, both AMD and Nvidia. |
Just did a quick test with an intel A310 GPU using the iris driver in mesa. gpuowl failed with similar issues to my last radeonsi attempt:
PRPLL failed because it looks like rusticl checks for and intentionally fails when fp64 is emulated:
I'll try with an intel B580 tomorrow which has native fp64. I expect the same results as @oscarbg as the iris and radeonsi drivers are in similar states, iris has a few more extensions implemented but they look irrelevant to us. |
I also have a Zen 4 iGPU and tried to reproduce @oscarbg's results. Got the same ASM result, so using
edit:
|
Attempted to fix the __builtin_s_sleep(). (s_sleep is present on RDNA2; it's the compiler that does not have the builtin.)
The GPU read check should be disabled in source (i.e. always have it pass) to see what fails then. This line: Line 674 in 002f686
should be changed to if (true || hostSum == gpuSum) { |
The sleep fix worked. Modifying that line in Gpu.cpp did this:
|
Well something's broken (the read check failure was probably not spurious). If it's not a problem with the check, then maybe the read is not working correctly. |
Tuned with existing fft as suggested:
Same issue:
Go up an FFT size:
The tune timings being 100000.0 looks suspicious. |
Tune timing "100000.0" is a huge value that indicates that it failed (so that config, which did not pass, should have basically "infinite" cost as it's broken). |
It is possible to run with "-use DEBUG" which enables asserts (in OpenCL) at runtime -- which is normally rather slow. But maybe we'd see some earlier failure. |
FWIW, bigger FFTs take longer to tune, as expected.
Funnily, testing the currently erroneous B580 gives numbers higher than 100000.0 at the larger FFTs: https://www.mersenneforum.org/node/1062411?p=1064299#post1064299 |
Would it help if I compiled a debug build? |
Commenting out the offending assert in trig.cl with Gpu.cpp untouched goes back to erroring in the same place as before:
Doing the Gpu.cpp line change as well goes back to this:
So -use DEBUG might have helped to find that undeclared (?) variable n, but that doesn't seem to have been the blocker. edit: Of course the assert wasn't the issue; without DEBUG the asserts are all removed. |
I opened a RustiCL issue to let them know of the situation: |
We have feedback from Karol Herbst that the workgroup_reduce() family is not implemented by RustiCL. So I just removed all use of workgroup_reduce() in the latest commit. Please retry. I have a strong suspicion not everything is fixed by this, though. Let's see how the read check behaves now. |
Clean
Remove assert in trig.cl
This is all with Gpu.cpp untouched. So that's progress? The checksums at least aren't mismatching. edit: |
On the B580 the LHS is zeroed instead of garbage:
It runs with ASM enabled, but should it be disabled because you've made ASM specific to AMD? edit: BTW, a gotcha between rusticl and other runtimes is that they may enumerate the GPUs differently. Unless it's random, the B580 is device 1 on rusticl and device 0 on Intel's runtime. edit edit: No change with a mesa build from 2024-12-24 |
This is the current status with B580:
Removing that assert and recompiling we don't trip any more asserts and are back at zeroed data. |
RDNA3 780M: Not sure why it's doing different things, but here's something new on the latest mesa:
Unknown intrinsics and now the data is zeroed where previously it was garbage. |
It may be that the "unknown intrinsic" is caused by the __builtin_nontemporal_load/store, which is recent (since the latest commit, 1582454). Although, looking more carefully, I see that not all of the load_global() calls that produce the "unknown intrinsic" above have access=non-temporal, so it's not that. |
Will install ROCm later to compare. |
@chocolate42 could you please try with "-carry long" on RustiCL and see whether it fixes the
|
780M, no change:
|
As long as we have those "Unknown intrinsic" for global read, it's broken. |
I think that's done it (tested with the PR). The 780M needs -carry long:
Succeeds with -carry long
Hash matches at iteration 2000 on a second run
Now I need to install rocm on this thing and see if there's any difference in timings. And test with the B580. A tomorrow job. edit: So when that PR gets merged, which may not be soon as mesa has a fancy way of merging PRs, prpll with AMD cards should work on rusticl? Intel cards use a different driver, so all bets are off there. |
Trying rocm to compare to the runs in previous post (Ubuntu 22.04 via distrobox, rocm 6.3.1, "make amd")
Resumed from a rusticl save fine. Doesn't need -use NO_ASM or -carry long (why is that?). ~10% faster than rusticl. But the GPU keeps hanging; not sure what the deal is there. Whenever the GPU hangs, the screen goes blank and comes back 5-10 seconds later. dmesg has this to say
So is that some rocm bug, or have I messed up installing rocm somehow? edit: I'm running the rusticl build some more to try to get it to crash. It's well past the point that the rocm build hung 3 times, and so far no problem. edit edit: After looking more closely, both builds average around 52W of total laptop draw from the wall in current conditions, but the rusticl build is very consistent at 49-53W, while the rocm build stays there for a bit before crashing to ~30W then spiking to ~65W, at which point it often but not always hangs the GPU. Seems like something in the stack isn't configured properly for the RDNA3 iGPU. |
[44436.803321] amdgpu: sdma_engine_id 1 exceeds maximum id of 0
...
That looks like an AMDGPU driver bug to me. That is, the AMDGPU driver
that's part of the Linux kernel (or that is installed with amdgpu-dkms
on older Linux kernels).
Unfortunately, there's a multitude of such AMDGPU bugs. If possible,
they should be reported "upstream". The workaround is to use a
different version of the Linux kernel, e.g. from
https://kernel.ubuntu.com/mainline/ . I'm using Linux kernel 6.8.x
(which is quite a bit behind the most recent 6.12.x) and AMDGPU seems
fine there.
|
Thanks. That's tricky to test because of the way distrobox works: the container shares the host kernel, which I don't want to mess with. I guess a live image (24.04, to match your 6.8 kernel) is the way (which will take hours to download; why is the Ubuntu image so massive compared to Fedora, 5.8GB vs 2.3GB, my god). In the meantime I managed to crash while running the rusticl build, but it was a herculean effort. It ran for about 300k iterations while a PS2 emulator was taxing the iGPU via Vulkan, and crashed after ~30 minutes of sustained power draw from the wall of ~80W. This is a thin-and-light laptop, and I'm pretty sure it overheated (or soft-crashed because it was overheating), so this might not be a bug bug, but the rocm crashes surely are. This logged me out of the session and closed everything but didn't fully crash the PC.
|
From a live image of Ubuntu 24.04, kernel 6.8, rocm 6.3.1: still getting GPU hangs, just like in distrobox. The screen goes blank; 10 seconds later it comes back after a successful reset. I'm guessing RDNA3 iGPU support is poor in rocm; AMD has always been terrible at supporting newer (and older, and current) hardware with rocm. Unless you have a rocm/kernel/prpll combo suggestion, or someone demonstrates they get it working, I'm going to write off rocm as broken with the 780M. This is a persistent live image, so rocm etc. is still installed on it and I can test further if need be.
|
I'm trying to run gpuowl on rusticl instead of rocm, it fails with this:
gpuowl.cl says this:
and the sum64 function looks simple enough:
If implementing sum64 with 32-bit atomics and sorting out the atom_add ambiguity (which appears to be related) is all it takes to get gpuowl working on rusticl, then cool. Rusticl is cross-platform and built into mesa (should be there OOTB in Ubuntu 24.04) and should be the way forward for OpenCL on Linux (also, mfakto is quicker under rusticl, for my hardware). I have no idea if atomic 64-bit int will ever be supported with rusticl, or if there are other hidden or vendor issues to be uncovered (I can only test with an RDNA3 iGPU, the 780M).