Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support rusticl runtime #276

Open
chocolate42 opened this issue Feb 18, 2024 · 41 comments
Open

Support rusticl runtime #276

chocolate42 opened this issue Feb 18, 2024 · 41 comments

Comments

@chocolate42
Copy link

I'm trying to run gpuowl on rusticl instead of rocm, it fails with this:

[f40@p95 gpuowl]$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 ./build-debug/gpuowl -prp 77936867
20240218 14:10:06  GpuOwl VERSION v7.5-2-gba227ce
20240218 14:10:06  GpuOwl VERSION v7.5-2-gba227ce
20240218 14:10:06  config: -prp 77936867 
20240218 14:10:06  device 0, unique id ''
20240218 14:10:06 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20240218 14:10:06 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,}  -cl-std=CL2.0 -cl-finite-math-only "
20240218 14:10:06 77936867 ASM compilation failed, retrying compilation using NO_ASM
20240218 14:10:06 77936867 OpenCL compilation error -11 (args -DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,}  -cl-std=CL2.0 -cl-finite-math-only  -DNO_ASM=1)
20240218 14:10:06 77936867 input.cl:44:26: warning: unsupported OpenCL extension 'cl_khr_int64_base_atomics' - ignoring [-Wignored-pragmas]
input.cl:1494:29: error: call to 'atom_add' is ambiguous
Error executing LLVM compilation action.

20240218 14:10:06  Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at src/clwrap.cpp:245 build
20240218 14:10:06  Bye

gpuowl.cl says this:

// 64-bit atomics used in kernel sum64
// If 64-bit atomics aren't available, sum64() can be implemented with 32-bit
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

and the sum64 function looks simple enough:

KERNEL(256) sum64(global ulong* out, u32 sizeBytes, global ulong* in) {
  if (get_global_id(0) == 0) { out[0] = 0; }
  
  ulong sum = 0;
  for (i32 p = get_global_id(0); p < sizeBytes / sizeof(u64); p += get_global_size(0)) {
    sum += in[p];
  }
  sum = work_group_reduce_add(sum);
  if (get_local_id(0) == 0) { atom_add(&out[0], sum); }
}

If implementing sum64 with 32 bit atomics and sorting the atom_add ambiguity (which appears to be related) is all it takes to get gpuowl working on rusticl then cool. Rusticl is cross-platform and built into mesa (should be in OOTB for Ubuntu 24.04) and should be the way forwards for opencl on Linux (also mfakto is quicker under rusticl, for my hardware). I have no idea if atomic 64 bit int will ever be supported with rusticl, or if there are other hidden or vendor issues to be uncovered (can only test with RDNA3 iGPU 780M).

@chocolate42
Copy link
Author

I'm having a play trying to get this to work, but as I don't know OpenCL or why sum64 is used this is a crapshoot. My initial read was that the extension was needed for 64 bit ints to work at all but it appears to just be needed for atom_add. And atom_add is just used to return the sum value to the host, seemingly atomic in this context ensures host and device are in sync. So I tried replacing with the 32 bit version which should be present always?:

KERNEL(256) sum64(global ulong* out, u32 sizeBytes, global ulong* in) {
  if (get_global_id(0) == 0) { out[0] = 0; }
  
  ulong sum = 0;
  for (i32 p = get_global_id(0); p < sizeBytes / sizeof(u64); p += get_global_size(0)) {
    sum += in[p];
  }
  sum = work_group_reduce_add(sum);
  if (get_local_id(0) == 0) {
    uint hi = (sum>>32)&0xFFFFFFFF;
    uint lo = sum&0xFFFFFFFF;
    global uint* recast = (global uint*) (&out[0]);
    atomic_add(recast, lo);
    atomic_add(recast+1, hi);
  }
}

Which ran, directly into a rusticl panic. Probably this is some deep syntax crime anyway, at least splitting into multiple atomic ops seems bad form:

[f40@p95 gpuowl]$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 RUST_BACKTRACE=full ./build-debug/gpuowl -prp 77936867
20240219 11:33:26  GpuOwl VERSION ba227ce-dirty
20240219 11:33:26  GpuOwl VERSION ba227ce-dirty
20240219 11:33:26  config: -prp 77936867 
20240219 11:33:26  device 0, unique id ''
20240219 11:33:26 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20240219 11:33:26 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,}  -cl-std=CL2.0 -cl-finite-math-only "
20240219 11:33:26 77936867 ASM compilation failed, retrying compilation using NO_ASM
thread '<unnamed>' panicked at ../mesa-23.3.5/src/gallium/frontends/rusticl/core/program.rs:260:13:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0:     0x7f54dc3cf69c - <unknown>
   1:     0x7f54dc3f0230 - <unknown>
   2:     0x7f54dc3cd0dd - <unknown>
   3:     0x7f54dc3cf485 - <unknown>
   4:     0x7f54dc3d0a93 - <unknown>
   5:     0x7f54dc3d07ca - <unknown>
   6:     0x7f54dc3d0fc5 - <unknown>
   7:     0x7f54dc3d0e79 - <unknown>
   8:     0x7f54dc3cfb56 - <unknown>
   9:     0x7f54dc3d0c32 - <unknown>
  10:     0x7f54db46beb5 - <unknown>
  11:     0x7f54db46bf53 - <unknown>
  12:     0x7f54dc2d2ab0 - <unknown>
  13:     0x7f54dc35590b - <unknown>
  14:     0x7f54dc2f0e88 - <unknown>
  15:     0x7f54dc354cfe - <unknown>
  16:     0x7f54dc358f97 - <unknown>
  17:     0x7f54dc30d8f3 - <unknown>
  18:     0x7f54dc30d326 - <unknown>
  19:     0x7f54dd4d1bf3 - clBuildProgram
  20:     0x557c74db6b34 - build
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/clwrap.cpp:227:25
  21:     0x557c74db7e38 - _Z7compileP11_cl_contextP13_cl_device_idRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_RKSt6vectorIS8_SaIS8_EE
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/clwrap.cpp:275:10
  22:     0x557c74dac717 - compile
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:317:22
  23:     0x557c74dacb8d - _ZN3GpuC2ERK4ArgsjjjjjjP13_cl_device_idbbO7Weights
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:350:3
  24:     0x557c74daedf0 - _ZN3GpuC2ERK4ArgsjjjjjjP13_cl_device_idbb
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:330:107
  25:     0x557c74db5ada - _ZSt11make_uniqueI3GpuJRK4ArgsRjS4_jS4_S4_S4_P13_cl_device_idRbS7_EENSt8__detail9_MakeUniqIT_E15__single_objectEDpOT0_
                               at /usr/include/c++/13.2.1/bits/unique_ptr.h:1070:30
  26:     0x557c74daf113 - _ZN3Gpu4makeEjRK4Args
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:595:76
  27:     0x557c74dbaaec - _ZN4Task7executeERK4Args
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Task.cpp:174:38
  28:     0x557c74da3ce0 - main
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/main.cpp:56:51
  29:     0x7f54dcf76cd0 - <unknown>
  30:     0x7f54dcf76d8a - __libc_start_main
  31:     0x557c74d985a5 - _start
  32:                0x0 - <unknown>
20240219 11:33:26  Unexpected exception
fatal runtime error: Rust panics must be rethrown
Aborted (core dumped)

But why do we need atomic at all right? if (get_local_id(0) == 0) { out[0] = sum; }. Well it still panics. So how about just ignoring the result of sum64. It's only used in vector<u32> Gpu::readAndCompress(ConstBuffer<int>& buf) from Gpu.cpp, and it's immediately checked against the CPU doing the sum. So sod it by removing the sum64 call and faking the result in readAndCompress() with:

    expectedSum = sum;
    if (sum != expectedSum || (allZero && nRetry == 0)) {

And just to be safe delete the return value in sum64 in case syntax messes anything up. That should completely bypass sum64 execution, unless it's being called somehow from other kernels not by a name that I can grep. However still a panic:

[f40@p95 gpuowl]$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 RUST_BACKTRACE=full ./build-debug/gpuowl -prp 77936867 -use NO_ASM
20240219 11:44:35  GpuOwl VERSION ba227ce-dirty
20240219 11:44:35  GpuOwl VERSION ba227ce-dirty
20240219 11:44:35  config: -prp 77936867 -use NO_ASM 
20240219 11:44:35  device 0, unique id ''
20240219 11:44:35 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20240219 11:44:35 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,} -DNO_ASM=1  -cl-std=CL2.0 -cl-finite-math-only "
thread '<unnamed>' panicked at ../mesa-23.3.5/src/gallium/frontends/rusticl/core/program.rs:260:13:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0:     0x7fc80997a69c - <unknown>
   1:     0x7fc80999b230 - <unknown>
   2:     0x7fc8099780dd - <unknown>
   3:     0x7fc80997a485 - <unknown>
   4:     0x7fc80997ba93 - <unknown>
   5:     0x7fc80997b7ca - <unknown>
   6:     0x7fc80997bfc5 - <unknown>
   7:     0x7fc80997be79 - <unknown>
   8:     0x7fc80997ab56 - <unknown>
   9:     0x7fc80997bc32 - <unknown>
  10:     0x7fc808a16eb5 - <unknown>
  11:     0x7fc808a16f53 - <unknown>
  12:     0x7fc80987dab0 - <unknown>
  13:     0x7fc80990090b - <unknown>
  14:     0x7fc80989be88 - <unknown>
  15:     0x7fc8098ffcfe - <unknown>
  16:     0x7fc809903f97 - <unknown>
  17:     0x7fc8098b88f3 - <unknown>
  18:     0x7fc8098b8326 - <unknown>
  19:     0x7fc80aa7cbf3 - clBuildProgram
  20:     0x55972f405a5f - build
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/clwrap.cpp:222:27
  21:     0x55972f406e2a - _Z7compileP11_cl_contextP13_cl_device_idRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_RKSt6vectorIS8_SaIS8_EE
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/clwrap.cpp:275:10
  22:     0x55972f3fb709 - compile
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:317:22
  23:     0x55972f3fbb7f - _ZN3GpuC2ERK4ArgsjjjjjjP13_cl_device_idbbO7Weights
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:350:3
  24:     0x55972f3fdde2 - _ZN3GpuC2ERK4ArgsjjjjjjP13_cl_device_idbb
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:330:107
  25:     0x55972f404acc - _ZSt11make_uniqueI3GpuJRK4ArgsRjS4_jS4_S4_S4_P13_cl_device_idRbS7_EENSt8__detail9_MakeUniqIT_E15__single_objectEDpOT0_
                               at /usr/include/c++/13.2.1/bits/unique_ptr.h:1070:30
  26:     0x55972f3fe105 - _ZN3Gpu4makeEjRK4Args
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:595:76
  27:     0x55972f409ade - _ZN4Task7executeERK4Args
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Task.cpp:174:38
  28:     0x55972f3f2ce0 - main
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/main.cpp:56:51
  29:     0x7fc80a521cd0 - <unknown>
  30:     0x7fc80a521d8a - __libc_start_main
  31:     0x55972f3e75a5 - _start
  32:                0x0 - <unknown>
20240219 11:44:35  Unexpected exception
fatal runtime error: Rust panics must be rethrown
Aborted (core dumped)

That leads me to think that the runtime panics are entirely unrelated to sum64 hackery? They've only been exposed now because the hackery let the kernel compile.

Sorry for being verbose, seemed sensible to let you know how I'm stabbing in the dark as there's a good chance I'm misunderstanding how sum64 is used and crimes are being committed.

@preda
Copy link
Owner

preda commented Feb 19, 2024

Thanks, attempting to compile with rusticl is an useful exercise. And it's a good approach to simplify/remove the initial trouble bits (sum64) like you did just to get it to compile.

It seems what we hit now may be a rusticl bug. We don't have the rusticl stack-trace symbols, but there is a line# and we know it tries to unwrap a None.
./mesa-23.3.5/src/gallium/frontends/rusticl/core/program.rs:260:13:
called Option::unwrap() on a None value

@chocolate42
Copy link
Author

Probably there is a bug in the fp64 implementation, it's still experimental after all. I'll look into it ability permitting and add an issue to rusticl's tracker if appropriate. Probably need to learn how to compile and use mesa first, if they reply I should be in a position to respond and test.

Actually the more I look at it the more I think it was premature to test fp64. It's available behind a flag but it seems that's for implementers to be able to test it as they implement, it's described as "in-progress" on their tracker: https://mesamatrix.net/

Have been peeking at rusticl's code to see how hard it would be to implement the extension, but I'm struggling to even find atomic_add right now. I'm thinking that Core CL functionality might not be done via rust/rusticl but directly in spirv or something (unclear). All I can find in the rust codebase is exposing the 32 bit atomic support not an implementation.

int64 atomics is one of the few extensions that clover supports that rusticl doesn't. Extensions have been implemented on rusticl on a priority basis, maybe very few programs make use of it and it's just low priority. But maybe we'll get lucky and it'll have higher priority just because they want to fully succeed clover.

@preda
Copy link
Owner

preda commented Feb 20, 2024

Which rusticl version are you using?
if under Ubuntu, could be obtained as the version of the package mesa-opencl-icl.
It might also be obtained with "gpuowl -h" or clinfo as the driver version.

I'm using rusticl 23.2.1-1ubuntu3.1~22.04.2 and it does not produce the nice error messages you're seeing :) (it fails, but I can't see where)

@chocolate42
Copy link
Author

Mesa 23.3.5 the latest arch packages. But I'm not even on arch host this is through a temp arch environment with distrobox which you could do too. It's as simple as distrobox create --name whatever --image archlinux, distrobox enter whatever, then you're in a shell of a base arch environment that you update install rusticl etc as normal. Much less painful than installing all these different often conflicting toolchains direct to host or in a vm.

@chocolate42
Copy link
Author

This is with the latest sources:

📦[f40@f39 f39]$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ../mesabuild ../f39/gpuowl/build-debug/gpuowl -prp 77936867
20240222 12:49:34  GpuOwl VERSION v7.5-6-gd522826
20240222 12:49:34  GpuOwl VERSION v7.5-6-gd522826
20240222 12:49:34  config: -prp 77936867 
20240222 12:49:34  device 0, unique id '', driver '24.1.0-devel'
20240222 12:49:34 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20240222 12:49:34 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,}  -cl-std=CL2.0 -cl-finite-math-only "
20240222 12:49:34 77936867 ASM compilation failed, retrying compilation using NO_ASM
SPIR-V parsing FAILED:
    In file ../mesa-main-2024-02-22/src/compiler/spirv/vtn_variables.c:2341
    Initializer for CrossWorkgroup variable 3 not yet supported in Mesa.
    164196 bytes into the SPIR-V binary

which corresponds to this in vtn_variables.c:

      case SpvStorageClassCrossWorkgroup:
         vtn_assert(b->options->environment == NIR_SPIRV_OPENCL);
         vtn_fail("Initializer for CrossWorkgroup variable %u "
                  "not yet supported in Mesa.",
                  vtn_id_for_value(b, val));
         break;

Possibly I didn't use correct compiler settings to build mesa, possibly this is just where rusticl is at.

Here's the gist of how I compiled mesa

# mesa source in mesa-main-2024-02-22
mkdir mesabuild
cd mesabuild
# specific to amd
meson setup ../mesa-main-2024-02-22 --libdir lib64 -Dgallium-rusticl=true -Dllvm=enabled -Drust_std=2021 -Dgallium-drivers=radeonsi,swrast
meson compile -C ./
# mesa built to mesabuild

# to use mesabuild
RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C wherever/mesabuild relative/to/mesabuild/dir/gpuowl/build-debug/gpuowl -prp 77936867

To do this yourself you'll need recent toolchains, at least meson 1.3.1 so either compile youself or take the easy route and build in a Fedora 39 environment which has a recent enough meson. If in a F39 environment you can also get most of the build dependencies easily by installing the builddep plugin to dnf sudo dnf install "dnf-command(builddep)" followed by sudo dnf builddep mesa which installs all dependencies that were needed to build the mesa in the fedora repo.

@chocolate42
Copy link
Author

If I'm interpreting things correctly, rusticl doesn't yet support the opencl 2.0 extension that adds things like work_group_reduce_add()

https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/workGroupFunctions.html

Corresponding to this feature on the tracker:

https://mesamatrix.net/#RusticlOpenCL2.0_Extension__Workgroup_Collective_Functions

@preda
Copy link
Owner

preda commented Apr 26, 2024

This is in standby from my POV, maybe in 6 months we'll have another go.

@chocolate42
Copy link
Author

chocolate42 commented Jul 21, 2024

Tried this again with latest mesa. Now meson 1.4.0+ is required to build mesa, which I got by upgrading to Fedora 40 (alternatively build meson from source). Also needed to add PyYAML (pip install PyYAML) and cbindgen (dnf install cbindgen) to the environment I used before to get mesa to successfully compile.

20240721 11:50:22  GpuOwl VERSION v7.5-8-gb400b88
20240721 11:50:22  GpuOwl VERSION v7.5-8-gb400b88
20240721 11:50:22  config: -prp 77936867 
20240721 11:50:23  device 0, unique id '', driver '24.3.0-devel'
20240721 11:50:23 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20240721 11:50:23 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,}  -cl-std=CL2.0 -cl-finite-math-only "
20240721 11:50:23 77936867 ASM compilation failed, retrying compilation using NO_ASM
SPIR-V parsing FAILED:
    In file ../mesa-main-2024-07-21/mesa-main/src/compiler/spirv/vtn_variables.c:2354
    Initializer for CrossWorkgroup variable 3 not yet supported in Mesa.
    164196 bytes into the SPIR-V binary

Same blocker as before. Will try again in a few months.

@oscarbg
Copy link

oscarbg commented Dec 5, 2024

that's what I'm getting now with latest rusticl from Mesa 25.0 master branch..

env |grep RUSTICL
RUSTICL_FEATURES=fp16,fp64
RUSTICL_ENABLE=radeonsi

note strange mention of "openclon12":

In file included from /tmp/openclon12/fftp.cl:6:
/tmp/openclon12/weight.cl:40:38: error: invalid output constraint '=v' in asm
Error executing LLVM compilation action.

full log:

 ./prpll -iters 1000 -prp 57885161 
20241205 20:57:42  PRPLL 0.15-64-gc542395 starting
20241205 20:57:42  config: -iters 1000 -prp 57885161 
20241205 20:57:42  device 0, OpenCL 25.0~git2412050600.e26a38~oibaf~o (git-e26a383 2024-12-05 oracular-oibaf-ppa), unique id ''
20241205 20:57:42 57885161 No FFTs found in tune.txt that can handle 57885161. Consider tuning with -tune
20241205 20:57:42 57885161 BPW info for 512:16:512 not found, defaults={18.02, 18.12, 18.12, 18.22}
20241205 20:57:42 57885161 BPW info for 1K:16:512 not found, defaults={17.74, 17.84, 17.84, 17.94}
20241205 20:57:42 57885161 BPW info for 4K:16:512 not found, defaults={17.19, 17.29, 17.29, 17.39}
20241205 20:57:42 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241205 20:57:42 57885161 config: 
20241205 20:57:42 57885161 FFT: 3M 1K:6:256:0 (18.40 bpw)
20241205 20:57:42 57885161 In file included from input.cl:1:
In file included from /tmp/openclon12/fftp.cl:6:
/tmp/openclon12/weight.cl:40:38: error: invalid output constraint '=v' in asm
Error executing LLVM compilation action.

20241205 20:57:42 57885161 Compiling 'fftp.cl' error COMPILE_PROGRAM_FAILURE (-15) (args -cl-finite-math-only -cl-std=CL2.0  -DEXP=57885161u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=6u -DCARRY_LEN=8u -DNW=4u -DNH=4u -DAMDGPU=1 -DWEIGHT_STEP=0.51445938099070077 -DIWEIGHT_STEP=-0.33969836857173508 -DFFT_VARIANT=0u -DTAILT=U2(-7.5298160855459062e-05,0.012271538285719925) -DTRIG_SCALE=3 -DTRIG_SIN={1.331580545039619e-06,7.4418738109161865e-23,-3.9350574163656679e-19,3.4886384292170233e-32,-1.472792461108135e-45,3.6269691064102593e-59,-5.8460626541768332e-73,6.5788131861971713e-87,} -DTRIG_COS={1,-8.8655337396400457e-13,1.3099614748116047e-25,-7.7423384350554752e-39,2.451427236886687e-52,-4.8296023716962189e-66,6.4872905043230845e-80,-6.2756334651618034e-94,} -DFRAC_BPW_HI=1723128490u -DFRAC_BPW_LO=2863311529u -DFRAC_BITS_BIGSTEP=1073741823u )
20241205 20:57:42 57885161 Can't compile fftp.cl
20241205 20:57:42  Exception "Can't compile fftp.cl"
20241205 20:57:42  Bye

@preda
Copy link
Owner

preda commented Dec 6, 2024

@oscarbg Could you please try with "-use NO_ASM" ?
./prpll -iters 1000 -prp 57885161 -use NO_ASM

that should disable our use of __asm() in case that's producing the above error.

@oscarbg
Copy link

oscarbg commented Dec 7, 2024

@preda
seems better.. but still get fails.. note I'm on a Zen4 iGPU (RDNA2)..

 ./prpll -iters 1000 -prp 57885161 -use NO_ASM
20241207 17:17:48  PRPLL 0.15-64-gc542395 starting
20241207 17:17:48  config: -iters 1000 -prp 57885161 -use NO_ASM 
20241207 17:17:50  device 0, OpenCL 25.0~git2412070600.2aae00~oibaf~o (git-2aae000 2024-12-07 oracular-oibaf-ppa), unique id ''
20241207 17:17:50 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241207 17:17:50 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241207 17:17:50 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241207 17:17:50 57885161 config:  -DNO_ASM=1
20241207 17:17:50 57885161 FFT: 3M 512:12:256:0:0 (18.40 bpw)
20241207 17:18:11 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241207 17:18:11 57885161 Proof of power 9 requires about 3.6GB of disk space
20241207 17:18:26 57885161 Stopping, please wait..
20241207 17:18:48 57885161 EE      1000 81a1b2c933d40475 18404 ETA 12d 07:55; Z=27 (avg 26.6) 1 errors
20241207 17:18:48  Exception "stop requested"
20241207 17:18:48  Bye

 ./prpll -iters 100000 -prp 57885161 -use NO_ASM
20241207 17:19:03  PRPLL 0.15-64-gc542395 starting
20241207 17:19:03  config: -iters 100000 -prp 57885161 -use NO_ASM 
20241207 17:19:04  device 0, OpenCL 25.0~git2412070600.2aae00~oibaf~o (git-2aae000 2024-12-07 oracular-oibaf-ppa), unique id ''
20241207 17:19:04 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241207 17:19:04 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241207 17:19:04 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241207 17:19:04 57885161 config:  -DNO_ASM=1
20241207 17:19:04 57885161 FFT: 3M 512:12:256:0:0 (18.40 bpw)
20241207 17:19:25 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241207 17:19:25 57885161 Proof of power 9 requires about 3.6GB of disk space
20241207 17:20:02 57885161 GPU read failed: 080855e301fe0722 (gpu) != 1b007c21189f52d8 (host)
20241207 17:20:02 57885161 GPU read failed: 080855e301fe0722 (gpu) != 1b007c21189f52d8 (host)
20241207 17:20:02 57885161 GPU read failed: 080855e301fe0722 (gpu) != 1b007c21189f52d8 (host)
20241207 17:20:02  Exception "GPU persistent read errors"
20241207 17:20:02  Bye

./prpll -iters 100000 -prp 57885161 -use NO_ASM
20241207 17:20:51  PRPLL 0.15-64-gc542395 starting
20241207 17:20:51  config: -iters 100000 -prp 57885161 -use NO_ASM 
20241207 17:20:52  device 0, OpenCL 25.0~git2412070600.2aae00~oibaf~o (git-2aae000 2024-12-07 oracular-oibaf-ppa), unique id ''
20241207 17:20:52 57885161 No FFTs found in tune.txt that can handle 57885161. Consider tuning with -tune
20241207 17:20:52 57885161 BPW info for 512:16:512 not found, defaults={18.02, 18.12, 18.12, 18.22}
20241207 17:20:52 57885161 BPW info for 1K:16:512 not found, defaults={17.74, 17.84, 17.84, 17.94}
20241207 17:20:52 57885161 BPW info for 4K:16:512 not found, defaults={17.19, 17.29, 17.29, 17.39}
20241207 17:20:52 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241207 17:20:52 57885161 config:  -DNO_ASM=1
20241207 17:20:52 57885161 FFT: 3M 1K:6:256:0 (18.40 bpw)
20241207 17:21:15 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241207 17:21:15 57885161 Proof of power 9 requires about 3.6GB of disk space
20241207 17:21:54 57885161 GPU read failed: fe829510fd920153 (gpu) != 10cbf1c7f8360e50 (host)
20241207 17:21:54 57885161 GPU read failed: fe829510fd920153 (gpu) != 10cbf1c7f8360e50 (host)
20241207 17:21:54 57885161 GPU read failed: fe829510fd920153 (gpu) != 10cbf1c7f8360e50 (host)
20241207 17:21:54  Exception "GPU persistent read errors"
20241207 17:21:54  Bye

@preda
Copy link
Owner

preda commented Dec 9, 2024

@oscarbg that's impressive, it almost runs! :)

The problem we hit now manifests in a consistent mismatch between the checksum computed GPU-side and CPU-side. Given that this happens on an iGPU, and with the consistent values (that don't change across read attemts), it seems this is not a data transfer error.

More likely, the problem is with the kernel that computes the checksum GPU-side, sum64() in etc.cl:

KERNEL(256) sum64(global ulong* out, u32 sizeBytes, global ulong* in) {

The 'tricky' elements in this kernel are: work_group_reduce_add() and global atomic_add(). Maybe there's an error around there. The small kernel sum64() could be re-written with various experiments to try to isolate the error.

At this point, I think it's worth openinig an issues with rusticl, letting them know of this unexpected behavior (so they may become aware of a potential error there). The argument that it's unlikey "on-our-side" is that the code runs correctly across multiple GPUs both AMD and Nvidia.

@chocolate42
Copy link
Author

Just did a quick test with an intel A310 GPU using the iris driver in mesa. gpuowl failed with similar issues to my last radeonsi attempt:

celery@fedora:~/Documents/gpu_a310/gpuowl_run$ OverrideDefaultFP64Settings=1 IGC_EnableDPEmulation=1 RUSTICL_ENABLE=iris RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/git/mesa-main-2024-12-23/mesabuild ../../../gpu_a310/gpuowl/build-release/gpuowl -iters 10000 -prp 5117313
20241223 15:34:14  GpuOwl VERSION v7.5-10-g61a08e2
20241223 15:34:14  GpuOwl VERSION v7.5-10-g61a08e2
20241223 15:34:14  config: -iters 10000 -prp 5117313 
WARNING: OpenCL support via iris driver is incomplete.
For a complete and conformant OpenCL implementation, use
https://github.com/intel/compute-runtime instead
20241223 15:34:14  device 0, unique id '', driver '25.0.0-devel'
20241223 15:34:14 5117313 FFT: 256K 256:2:256 (19.52 bpw)
20241223 15:34:14 5117313 OpenCL args "-DEXP=5117313u -DWIDTH=256u -DSMALL_HEIGHT=256u -DMIDDLE=2u -DWEIGHT_STEP=0.39377732713590624 -DIWEIGHT_STEP=-0.28252527822725104 -DIWEIGHTS={0,-0.48523002361711648,-0.47002374282953113,-0.45436826922573748,-0.43825033367116206,-0.42165627506162284,-0.40457202874456544,-0.3869831145982634,} -DFWEIGHTS={0,0.94261523763811095,0.88687698075188715,0.83273798717860426,0.78015237024567874,0.72907555987842787,0.67946426482374089,0.63127643595764338,}  -cl-std=CL2.0 -cl-finite-math-only "
SPIR-V parsing FAILED:
    In file ../mesa-main/src/compiler/spirv/vtn_variables.c:2363
    Initializer for CrossWorkgroup variable 3 not yet supported in Mesa.
    163372 bytes into the SPIR-V binary

PRPLL failed because it looks like rusticl checks for and intentionally fails when fp64 is emulated:

celery@fedora:~/Documents/gpu_a310/prpll_run$ OverrideDefaultFP64Settings=1 IGC_EnableDPEmulation=1 RUSTICL_ENABLE=iris RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/git/mesa-main-2024-12-23/mesabuild ../../../gpu_a310/prpll/build-release/prpll -iters 10000 -prp 5117313
20241223 15:36:36  PRPLL 0.15-90-gb947294 starting
20241223 15:36:36  config: -iters 10000 -prp 5117313 
WARNING: OpenCL support via iris driver is incomplete.
For a complete and conformant OpenCL implementation, use
https://github.com/intel/compute-runtime instead
20241223 15:36:36  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 15:36:36 5117313 No FFTs found in tune.txt that can handle 5117313. Consider tuning with -tune
20241223 15:36:36 5117313 BPW info for 512:16:512 not found, defaults={18.02, 18.12, 18.12, 18.22}
20241223 15:36:36 5117313 BPW info for 1K:16:512 not found, defaults={17.74, 17.84, 17.84, 17.94}
20241223 15:36:36 5117313 BPW info for 4K:16:512 not found, defaults={17.19, 17.29, 17.29, 17.39}
20241223 15:36:36 5117313 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241223 15:36:36 5117313 config: 
20241223 15:36:36 5117313 FFT: 256K 256:2:256:0 (19.52 bpw)
prpll: ../mesa-main/src/compiler/nir/nir_lower_double_ops.c:681: lower_doubles_instr_to_soft: Assertion `softfp64 != NULL' failed.

I'll try with an intel B580 tomorrow which has native fp64. I expect the same results as @oscarbg as the iris and radeonsi drivers are in similar states, iris has a few more extensions implemented but they look irrelevant to us.

@chocolate42
Copy link
Author

chocolate42 commented Dec 23, 2024

I also have a zen4 igpu and tried to repoduce @oscarbg 's results. Got the same ASM result so using -use NO_ASM, however then it failed because the latest commit (b947294) uses __builtin_amdgcn_s_sleep, which presumably doesn't exist on rdna2:

f40@fedora:~/Documents/gpu/prpll_run$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll/build-release/prpll -prp 57885161 -use NO_ASM
20241223 16:58:46  PRPLL 0.15-90-gb947294 starting
20241223 16:58:46  config: -prp 57885161 -use NO_ASM 
20241223 16:58:46  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 16:58:46 57885161 No FFTs found in tune.txt that can handle 57885161. Consider tuning with -tune
20241223 16:58:46 57885161 BPW info for 512:16:512 not found, defaults={18.02, 18.12, 18.12, 18.22}
20241223 16:58:46 57885161 BPW info for 1K:16:512 not found, defaults={17.74, 17.84, 17.84, 17.94}
20241223 16:58:46 57885161 BPW info for 4K:16:512 not found, defaults={17.19, 17.29, 17.29, 17.39}
20241223 16:58:46 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241223 16:58:46 57885161 config:  -DNO_ASM=1
20241223 16:58:46 57885161 FFT: 3M 1K:6:256:0 (18.40 bpw)
20241223 16:58:47 57885161 In file included from input.cl:1:
/tmp/openclon12/carryfused.cl:11:3: error: use of undeclared identifier '__builtin_amdgcn_s_sleep'
Error executing LLVM compilation action.

20241223 16:58:47 57885161 Compiling 'carryfused.cl' error COMPILE_PROGRAM_FAILURE (-15) (args -cl-finite-math-only -cl-std=CL2.0  -DNO_ASM=1 -DEXP=57885161u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=6u -DCARRY_LEN=8u -DNW=4u -DNH=4u -DAMDGPU=1 -DWEIGHT_STEP=0.51445938099070077 -DIWEIGHT_STEP=-0.33969836857173508 -DFFT_VARIANT=0u -DTAILT=U2(-7.5298160855459062e-05,0.012271538285719925) -DTRIG_SCALE=3 -DTRIG_SIN={1.331580545039619e-06,7.4418738109161865e-23,-3.9350574163656679e-19,3.4886384292170233e-32,-1.472792461108135e-45,3.6269691064102593e-59,-5.8460626541768332e-73,6.5788131861971713e-87,} -DTRIG_COS={1,-8.8655337396400457e-13,1.3099614748116047e-25,-7.7423384350554752e-39,2.451427236886687e-52,-4.8296023716962189e-66,6.4872905043230845e-80,-6.2756334651618034e-94,} -DFRAC_BPW_HI=1723128490u -DFRAC_BPW_LO=2863311529u -DFRAC_BITS_BIGSTEP=1073741823u )
20241223 16:58:47 57885161 Can't compile carryfused.cl
20241223 16:58:47  Exception "Can't compile carryfused.cl"
20241223 16:58:47  Bye

edit:
Using the previous commit I can confirm the checksum errors, although they're different every run:

f40@fedora:~/Documents/gpu/prpll2$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll2/build-release/prpll -prp 57885161 -use NO_ASM
20241223 17:22:01  PRPLL 0.15-89-g2c38c27 starting
20241223 17:22:01  config: -prp 57885161 -use NO_ASM 
20241223 17:22:02  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 17:22:02 57885161 No FFTs found in tune.txt that can handle 57885161. Consider tuning with -tune
20241223 17:22:02 57885161 BPW info for 512:16:512 not found, defaults={18.02, 18.12, 18.12, 18.22}
20241223 17:22:02 57885161 BPW info for 1K:16:512 not found, defaults={17.74, 17.84, 17.84, 17.94}
20241223 17:22:02 57885161 BPW info for 4K:16:512 not found, defaults={17.19, 17.29, 17.29, 17.39}
20241223 17:22:02 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241223 17:22:02 57885161 config:  -DNO_ASM=1
20241223 17:22:02 57885161 FFT: 3M 1K:6:256:0 (18.40 bpw)
20241223 17:22:09 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 17:22:09 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 17:22:20 57885161 GPU read failed: 00b5708d015ae9e3 (gpu) != f2a5059c02275608 (host)
20241223 17:22:20 57885161 GPU read failed: 00b5708d015ae9e3 (gpu) != f2a5059c02275608 (host)
20241223 17:22:20 57885161 GPU read failed: 00b5708d015ae9e3 (gpu) != f2a5059c02275608 (host)
20241223 17:22:20  Exception "GPU persistent read errors"
20241223 17:22:20  Bye
f40@fedora:~/Documents/gpu/prpll2$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll2/build-release/prpll -prp 57885161 -use NO_ASM
20241223 17:22:51  PRPLL 0.15-89-g2c38c27 starting
20241223 17:22:51  config: -prp 57885161 -use NO_ASM 
20241223 17:22:51  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 17:22:51 57885161 No FFTs found in tune.txt that can handle 57885161. Consider tuning with -tune
20241223 17:22:51 57885161 BPW info for 512:16:512 not found, defaults={18.02, 18.12, 18.12, 18.22}
20241223 17:22:51 57885161 BPW info for 1K:16:512 not found, defaults={17.74, 17.84, 17.84, 17.94}
20241223 17:22:51 57885161 BPW info for 4K:16:512 not found, defaults={17.19, 17.29, 17.29, 17.39}
20241223 17:22:51 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241223 17:22:51 57885161 config:  -DNO_ASM=1
20241223 17:22:51 57885161 FFT: 3M 1K:6:256:0 (18.40 bpw)
20241223 17:22:58 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 17:22:58 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 17:23:09 57885161 GPU read failed: 034d25e2035184c6 (gpu) != 0c03023cf9be1d0c (host)
20241223 17:23:09 57885161 GPU read failed: 034d25e2035184c6 (gpu) != 0c03023cf9be1d0c (host)
20241223 17:23:09 57885161 GPU read failed: 034d25e2035184c6 (gpu) != 0c03023cf9be1d0c (host)
20241223 17:23:09  Exception "GPU persistent read errors"
20241223 17:23:09  Bye

@preda
Copy link
Owner

preda commented Dec 23, 2024

Attempted to fix the __builtin_s_sleep(). (s_sleep is present on RDNA2, it's the compiler that does not have the builtin).

@preda
Copy link
Owner

preda commented Dec 23, 2024

The GPU read check should be disabled in source (i.e. always have it pass) to see what fails then.

This line:

if (hostSum == gpuSum) {

should be changed to
if (true || hostSum == gpuSum) {

@chocolate42
Copy link
Author

The sleep fix worked. Modifying that line in Gpu.cpp did this:

f40@fedora:~/Documents/gpu/prpll4$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll4/build-release/prpll -prp 57885161 -use NO_ASM
20241223 17:52:32  PRPLL 0.15-92-g002f686-dirty starting
20241223 17:52:32  config: -prp 57885161 -use NO_ASM 
20241223 17:52:32  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 17:52:32 57885161 No FFTs found in tune.txt that can handle 57885161. Consider tuning with -tune
20241223 17:52:32 57885161 BPW info for 512:16:512 not found, defaults={18.02, 18.12, 18.12, 18.22}
20241223 17:52:32 57885161 BPW info for 1K:16:512 not found, defaults={17.74, 17.84, 17.84, 17.94}
20241223 17:52:32 57885161 BPW info for 4K:16:512 not found, defaults={17.19, 17.29, 17.29, 17.39}
20241223 17:52:32 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241223 17:52:32 57885161 config:  -DNO_ASM=1
20241223 17:52:32 57885161 FFT: 3M 1K:6:256:0 (18.40 bpw)
20241223 17:52:39 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 17:52:39 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 17:52:56 57885161 EE      2000 0231a69f87ccd053 5499 ETA 3d 16:25; Z=37 (avg 37.3) 1 errors
20241223 17:53:02 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 17:53:02 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 17:53:18 57885161 EE      2000 75b4c4dba744745a 5396 ETA 3d 14:46; Z=7 (avg 22.3) 2 errors
20241223 17:53:18 57885161 Danger ROE! Z=6.9 is too small, increase precision or FFT size!
20241223 17:53:25 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 17:53:25 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 17:53:41 57885161 EE      2000 e15882fa014c45d1 5396 ETA 3d 14:46; Z=7 (avg 17.2) 3 errors
20241223 17:53:41 57885161 Danger ROE! Z=6.9 is too small, increase precision or FFT size!
20241223 17:53:41 57885161 3 sequential errors, will stop.
20241223 17:53:41  Exception "too many errors"
20241223 17:53:41  Bye

@preda
Copy link
Owner

preda commented Dec 23, 2024

Well something's broken (the read check failure was probably not spurious). If it's not a problem with the check, then maybe the read is not working correctly.

@chocolate42
Copy link
Author

Tuned with existing fft as suggested:

f40@fedora:~/Documents/gpu/prpll4$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll4/build-release/prpll -use NO_ASM -tune -fft 1K:6:256
20241223 17:56:25  PRPLL 0.15-92-g002f686-dirty starting
20241223 17:56:25  config: -use NO_ASM -tune -fft 1K:6:256 
20241223 17:56:25  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 17:56:34  Error aa46d511be8939fc
20241223 17:56:34  * 100000.0 1K:6:256:0:0  58243153
20241223 17:56:43  Error 3db9b130cd7787ff
20241223 17:56:43  * 100000.0 1K:6:256:1:0  58409877
20241223 17:56:43  Bye

Same issue:

f40@fedora:~/Documents/gpu/prpll4$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll4/build-release/prpll -prp 57885161 -use NO_ASM
20241223 17:57:24  PRPLL 0.15-92-g002f686-dirty starting
20241223 17:57:24  config: -prp 57885161 -use NO_ASM 
20241223 17:57:24  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 17:57:24 57885161 config:  -DNO_ASM=1
20241223 17:57:24 57885161 FFT: 3M 1K:6:256:0:0 (18.40 bpw)
20241223 17:57:31 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 17:57:31 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 17:57:47 57885161 EE      2000 f91e5665023487a3 5474 ETA 3d 16:01; Z=38 (avg 37.7) 1 errors
20241223 17:57:53 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 17:57:53 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 17:58:10 57885161 EE      2000 9e0adffd483d73e5 5398 ETA 3d 14:47; Z=7 (avg 22.5) 2 errors
20241223 17:58:10 57885161 Danger ROE! Z=6.9 is too small, increase precision or FFT size!
20241223 17:58:16 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 17:58:16 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 17:58:32 57885161 EE      2000 30fa528ffd0fa058 5420 ETA 3d 15:09; Z=7 (avg 17.4) 3 errors
20241223 17:58:32 57885161 Danger ROE! Z=6.9 is too small, increase precision or FFT size!
20241223 17:58:32 57885161 3 sequential errors, will stop.
20241223 17:58:32  Exception "too many errors"
20241223 17:58:32  Bye

Go up an FFT size:

f40@fedora:~/Documents/gpu/prpll4$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll4/build-release/prpll -prp 57885161 -use NO_ASM -fft 256:13:512
20241223 17:59:30  PRPLL 0.15-92-g002f686-dirty starting
20241223 17:59:30  config: -prp 57885161 -use NO_ASM -fft 256:13:512 
20241223 17:59:30  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 17:59:30 57885161 config:  -DNO_ASM=1
20241223 17:59:30 57885161 FFT: 3.25M 256:13:512:3 (16.99 bpw)
20241223 17:59:39 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 17:59:39 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 17:59:58 57885161 EE      2000 7c3aba03b874b0ab 6542 ETA 4d 09:11; Z=415 (avg 414.9) 1 errors
20241223 18:00:06 57885161 EE         0 on-load: ba93dd3b845c2527 vs. 0000000000000003
20241223 18:00:06  Exception "Error on load"
20241223 18:00:06  Bye
f40@fedora:~/Documents/gpu/prpll4$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll4/build-release/prpll -prp 57885161 -use NO_ASM -fft 256:13:512
20241223 18:00:24  PRPLL 0.15-92-g002f686-dirty starting
20241223 18:00:24  config: -prp 57885161 -use NO_ASM -fft 256:13:512 
20241223 18:00:24  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 18:00:24 57885161 config:  -DNO_ASM=1
20241223 18:00:24 57885161 FFT: 3.25M 256:13:512:3 (16.99 bpw)
20241223 18:00:32 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 18:00:32 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 18:00:51 57885161 EE      2000 78bd224a319adf72 6380 ETA 4d 06:35; Z=418 (avg 417.6) 1 errors
20241223 18:00:59 57885161 EE         0 on-load: b04da776be912cd4 vs. 0000000000000003
20241223 18:00:59  Exception "Error on load"
20241223 18:00:59  Bye

The tune timings being 100000.0 looks suspicious.

@preda
Copy link
Owner

preda commented Dec 23, 2024

Tune timing "100000.0" is a huge value that indicates that it failed (so that config, which did not pass, should have basically "infinite" cost as it's broken).

@preda
Copy link
Owner

preda commented Dec 23, 2024

It is possible to run with "-use DEBUG" which enables asserts (in OpenCL) at runtime -- which is normally rather slow. But maybe we'd see some earlier failure.

@chocolate42
Copy link
Author

FWIW bigger FFT's take longer to tune as expected.

f40@fedora:~/Documents/gpu/prpll4$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll4/build-release/prpll -use NO_ASM -tune
20241223 18:04:33  PRPLL 0.15-92-g002f686-dirty starting
20241223 18:04:33  config: -use NO_ASM -tune 
20241223 18:04:33  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 18:04:33  BPW info for 512:16:512 not found, defaults={18.02, 18.12, 18.12, 18.22}
20241223 18:04:33  BPW info for 1K:16:512 not found, defaults={17.74, 17.84, 17.84, 17.94}
20241223 18:04:33  BPW info for 4K:16:512 not found, defaults={17.19, 17.29, 17.29, 17.39}
20241223 18:04:33  BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241223 18:04:35  Error 4c126f095ec8cfc1
20241223 18:04:35    100000.0 256:2:256:0:0   5117313
20241223 18:04:37  Error 770b9e94120731d0
20241223 18:04:37    100000.0 256:3:256:0:0   7613448
20241223 18:04:40  Error 970576b507e74f18
20241223 18:04:40    100000.0 256:4:256:0:0  10086252
20241223 18:04:43  Error 4f0e244055062ae7
20241223 18:04:43    100000.0 256:2:512:0:0  10040115
20241223 18:04:45  Error 56d0223a73324940

...

20241223 18:05:47    100000.0 1K:3:256:0:0  29617029
20241223 18:05:52  Error a6eb3b8c0933a327
20241223 18:05:52    100000.0 256:12:256:0:0  29458169
20241223 18:05:57  Error 3ec6aef2e65a4096
20241223 18:05:57    100000.0 256:6:512:0:0  29477044
20241223 18:06:02  Error 4deaa398a237e398
20241223 18:06:02    100000.0 512:6:256:0:0  29500637
20241223 18:06:07  Error 4912c1b11ed6a926
20241223 18:06:07    100000.0 512:3:512:0:0  29437722
20241223 18:06:13  Error 839f6716fdbd62c0
20241223 18:06:13    100000.0 256:13:256:0:0  31815892
f40@fedora:~/Documents/gpu/prpll4$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll4/build-release/prpll -use NO_ASM -tune -fft 4M
20241223 18:06:26  PRPLL 0.15-92-g002f686-dirty starting
20241223 18:06:26  config: -use NO_ASM -tune -fft 4M 
20241223 18:06:26  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 18:06:37  Error 3c0a0b68939c4563
20241223 18:06:37  * 100000.0 1K:8:256:0:0  77007421
20241223 18:06:50  Error 3903e204419f9a7b
20241223 18:06:50  * 100000.0 1K:8:256:1:0  77334577
f40@fedora:~/Documents/gpu/prpll4$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll4/build-release/prpll -use NO_ASM -tune -fft 18M
20241223 18:07:03  PRPLL 0.15-92-g002f686-dirty starting
20241223 18:07:03  config: -use NO_ASM -tune -fft 18M 
20241223 18:07:03  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 18:07:56  Error 79c9ad6e6280ad40
20241223 18:07:56  * 100000.0  1K:9:1K:0:0 332476849

Tune timing "100000.0" is a huge value that indicates that it failed (so that config, which did not pass, should have basically "infinite" cost as it's broken).

Funnily testing the currently erroneous B580 gives numbers higher than 100000.0 at the higher FFT's: https://www.mersenneforum.org/node/1062411?p=1064299#post1064299

@chocolate42
Copy link
Author

It is possible to run with "-use DEBUG" which enables asserts (in OpenCL) at runtime -- which is normally rather slow. But maybe we'd see some earlier failure.

40@fedora:~/Documents/gpu/prpll4$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll4/build-release/prpll -prp 57885161 -use NO_ASM,DEBUG -fft 256:13:512
20241223 18:24:11  PRPLL 0.15-92-g002f686-dirty starting
20241223 18:24:11  config: -prp 57885161 -use NO_ASM,DEBUG -fft 256:13:512 
20241223 18:24:11  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 18:24:11 57885161 config:  -DDEBUG=1 -DNO_ASM=1
20241223 18:24:11 57885161 FFT: 3.25M 256:13:512:3 (16.99 bpw)
20241223 18:24:11 57885161 In file included from input.cl:1:
In file included from /tmp/openclon12/fftp.cl:7:
In file included from /tmp/openclon12/fftwidth.cl:4:
In file included from /tmp/openclon12/fftbase.cl:6:
/tmp/openclon12/trig.cl:41:15: error: use of undeclared identifier 'n'
Error executing LLVM compilation action.

20241223 18:24:11 57885161 Compiling 'fftp.cl' error COMPILE_PROGRAM_FAILURE (-15) (args -cl-finite-math-only -cl-std=CL2.0  -DDEBUG=1 -DNO_ASM=1 -DEXP=57885161u -DWIDTH=256u -DSMALL_HEIGHT=512u -DMIDDLE=13u -DCARRY_LEN=8u -DNW=4u -DNH=8u -DAMDGPU=1 -DWEIGHT_STEP=0.0099470002848833783 -DIWEIGHT_STEP=-0.0098490319611598265 -DFFT_VARIANT=3u -DTAILT=U2(-1.8824717398857355e-05,0.0061358846491544753) -DTRIG_SCALE=11 -DTRIG_SIN={3.3522307427570831e-07,-1.2563750003736832e-24,-6.2784214256363043e-21,3.5276726415399086e-35,-9.4385829203836991e-50,1.4731332310453353e-64,-1.5048548786217372e-79,1.0732889266281846e-94,} -DTRIG_COS={1,-5.6187254763428522e-14,5.2616793297505444e-28,-1.97092877987088e-42,3.95503846604331e-57,-4.9382785316349927e-72,4.2032744514730063e-87,-2.5377621968447027e-102,} -DFRAC_BPW_HI=4233636942u -DFRAC_BPW_LO=3303820995u -DFRAC_BITS_BIGSTEP=1073741823u )
20241223 18:24:11 57885161 Can't compile fftp.cl
20241223 18:24:11  Exception "Can't compile fftp.cl"
20241223 18:24:11  Bye

Would it help if I compiled a debug build?

@chocolate42
Copy link
Author

chocolate42 commented Dec 23, 2024

Commenting out the offending assert in trig.cl with Gpu.cpp untouched goes back to erroring in the same place as before:

f40@fedora:~/Documents/gpu/prpll_debug_clean$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll_debug_clean/build-debug/prpll -prp 57885161 -use NO_ASM,DEBUG -fft 256:13:512
20241223 18:45:14  PRPLL 0.15-92-g002f686-dirty starting
20241223 18:45:14  config: -prp 57885161 -use NO_ASM,DEBUG -fft 256:13:512 
20241223 18:45:14  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 18:45:14 57885161 config:  -DDEBUG=1 -DNO_ASM=1
20241223 18:45:14 57885161 FFT: 3.25M 256:13:512:3 (16.99 bpw)
20241223 18:45:27 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 18:45:27 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 18:45:48 57885161 GPU read failed: ff27b0e9001ba8b3 (gpu) != 01a268f60455b72a (host)
20241223 18:45:48 57885161 GPU read failed: ff27b0e9001ba8b3 (gpu) != 01a268f60455b72a (host)
20241223 18:45:48 57885161 GPU read failed: ff27b0e9001ba8b3 (gpu) != 01a268f60455b72a (host)
20241223 18:45:48  Exception "GPU persistent read errors"
20241223 18:45:48  Bye

Doing the Gpu.cpp line change as well goes back to this:

f40@fedora:~/Documents/gpu/prpll_debug_dirty$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll_debug_dirty/build-debug/prpll -prp 57885161 -use NO_ASM,DEBUG -fft 256:13:512
20241223 18:47:42  PRPLL 0.15-92-g002f686-dirty starting
20241223 18:47:42  config: -prp 57885161 -use NO_ASM,DEBUG -fft 256:13:512 
20241223 18:47:43  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 18:47:43 57885161 config:  -DDEBUG=1 -DNO_ASM=1
20241223 18:47:43 57885161 FFT: 3.25M 256:13:512:3 (16.99 bpw)
20241223 18:47:55 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 18:47:55 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 18:48:24 57885161 EE      2000 69e18d84791da96c 9504 ETA 6d 08:48; Z=411 (avg 410.7) 1 errors
20241223 18:48:37 57885161 EE         0 on-load: 02b776e9980ca31d vs. 0000000000000003
20241223 18:48:37  Exception "Error on load"
20241223 18:48:37  Bye

So -use DEBUG might have helped to find that undeclared (?) variable n, but that doesn't seem to have been the blocker.

edit: Of course the assert wasn't the issue, without DEBUG the asserts are all removed.

@preda
Copy link
Owner

preda commented Dec 23, 2024

I opened a RustiCL issue to let them know of the situation:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12361

@preda
Copy link
Owner

preda commented Dec 23, 2024

We have feedback from Karol Herbst that workgroup_reduce() family is not implemented by RustiCL.

So I just removed all use of workgrup_reduce() in the latest commit. Please retry.

I have a strong suspicion not everything is fixed by this though. Let's see how the read check behaves now.

@chocolate42
Copy link
Author

chocolate42 commented Dec 23, 2024

Clean

f40@fedora:~/Documents/gpu/prpll6$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll6/build-release/prpll -prp 57885161 -use DEBUG,NO_ASM
20241223 22:05:23  PRPLL 0.15-93-g3737df5 starting
20241223 22:05:23  config: -prp 57885161 -use DEBUG,NO_ASM 
20241223 22:05:23  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 22:05:23 57885161 config:  -DDEBUG=1 -DNO_ASM=1
20241223 22:05:23 57885161 FFT: 3M 1K:6:256:0:0 (18.40 bpw)
20241223 22:05:23 57885161 In file included from input.cl:1:
In file included from /tmp/openclon12/fftp.cl:7:
In file included from /tmp/openclon12/fftwidth.cl:4:
In file included from /tmp/openclon12/fftbase.cl:6:
/tmp/openclon12/trig.cl:41:15: error: use of undeclared identifier 'n'
Error executing LLVM compilation action.

20241223 22:05:23 57885161 Compiling 'fftp.cl' error COMPILE_PROGRAM_FAILURE (-15) (args -cl-finite-math-only -cl-std=CL2.0  -DDEBUG=1 -DNO_ASM=1 -DEXP=57885161u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=6u -DCARRY_LEN=8u -DNW=4u -DNH=4u -DAMDGPU=1 -DWEIGHT_STEP=0.51445938099070077 -DIWEIGHT_STEP=-0.33969836857173508 -DFFT_VARIANT=0u -DTAILT=U2(-7.5298160855459062e-05,0.012271538285719925) -DTRIG_SCALE=3 -DTRIG_SIN={1.331580545039619e-06,7.4418738109161865e-23,-3.9350574163656679e-19,3.4886384292170233e-32,-1.472792461108135e-45,3.6269691064102593e-59,-5.8460626541768332e-73,6.5788131861971713e-87,} -DTRIG_COS={1,-8.8655337396400457e-13,1.3099614748116047e-25,-7.7423384350554752e-39,2.451427236886687e-52,-4.8296023716962189e-66,6.4872905043230845e-80,-6.2756334651618034e-94,} -DFRAC_BPW_HI=1723128490u -DFRAC_BPW_LO=2863311529u -DFRAC_BITS_BIGSTEP=1073741823u )
20241223 22:05:23 57885161 Can't compile fftp.cl
20241223 22:05:23  Exception "Can't compile fftp.cl"
20241223 22:05:23  Bye

f40@fedora:~/Documents/gpu/prpll6$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll6/build-release/prpll -prp 57885161 -use NO_ASM
20241223 22:05:54  PRPLL 0.15-93-g3737df5 starting
20241223 22:05:54  config: -prp 57885161 -use NO_ASM 
20241223 22:05:55  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 22:05:55 57885161 config:  -DNO_ASM=1
20241223 22:05:55 57885161 FFT: 3M 1K:6:256:0:0 (18.40 bpw)
20241223 22:06:01 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 22:06:01 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 22:06:18 57885161 EE      2000 470397a1d5627a8b 5694 ETA 3d 19:33; Z=38 (avg 38.0) 1 errors
20241223 22:06:25 57885161 EE         0 on-load: ff75baaa869a9607 vs. 0000000000000003
20241223 22:06:25  Exception "Error on load"
20241223 22:06:25  Bye

Remove assert in trig.cl

f40@fedora:~/Documents/gpu/prpll6$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/gpu/mesa/mesa-main-2024-12-23/mesabuild ../../../prpll6/build-release/prpll -prp 57885161 -use DEBUG,NO_ASM
20241223 22:12:49  PRPLL 0.15-93-g3737df5-dirty starting
20241223 22:12:49  config: -prp 57885161 -use DEBUG,NO_ASM 
20241223 22:12:49  device 0, OpenCL 25.0.0-devel, unique id ''
20241223 22:12:49 57885161 config:  -DDEBUG=1 -DNO_ASM=1
20241223 22:12:49 57885161 FFT: 3M 1K:6:256:0:0 (18.40 bpw)
20241223 22:12:58 57885161 OK         0 on-load: blockSize 1000, 0000000000000003
20241223 22:12:58 57885161 Proof of power 9 requires about 3.6GB of disk space
20241223 22:13:17 57885161 EE      2000 6823b16f1c4564c4 6477 ETA 4d 08:09; Z=36 (avg 36.4) 1 errors
20241223 22:13:25 57885161 EE         0 on-load: 129abe57c2e7ad31 vs. 0000000000000003
20241223 22:13:25  Exception "Error on load"
20241223 22:13:25  Bye

This is all with Gpu.cpp untouched. So that's progress? The checksums at least aren't mismatching.

edit: 20241223 22:13:25 57885161 EE 0 on-load: 129abe57c2e7ad31 vs. 0000000000000003
At this line it tried to reload the previous good checkpoint after errors were detected? So it read garbage instead of what we know should be all zeroes and a 3? It's different every time.

@chocolate42
Copy link
Author

chocolate42 commented Dec 24, 2024

On the B580 the LHS is zeroed instead of garbage:

celery@fedora:~/Documents/gpu_b580/prpll$ RUSTICL_ENABLE=iris RUSTICL_FEATURES=fp64 meson devenv -C /home/celery/Documents/git/mesa-main-2024-12-23/mesabuild ../../../gpu_b580/prpll/build-debug/prpll -device 1 -prp 57885161
20241224 09:37:55  PRPLL 0.15-94-g6ddbb31 starting
20241224 09:37:55  config: -device 1 -prp 57885161 
WARNING: OpenCL support via iris driver is incomplete.
For a complete and conformant OpenCL implementation, use
https://github.com/intel/compute-runtime instead
MESA: warning: INTEL_HWCONFIG_TOTAL_GS_THREADS (336) != devinfo->max_gs_threads (312)
20241224 09:37:55  device 1, OpenCL 25.0.0-devel, unique id ''
20241224 09:37:55 57885161 No FFTs found in tune.txt that can handle 57885161. Consider tuning with -tune
20241224 09:37:55 57885161 BPW info for 512:16:512 not found, defaults={18.02, 18.12, 18.12, 18.22}
20241224 09:37:55 57885161 BPW info for 1K:16:512 not found, defaults={17.74, 17.84, 17.84, 17.94}
20241224 09:37:55 57885161 BPW info for 4K:16:512 not found, defaults={17.19, 17.29, 17.29, 17.39}
20241224 09:37:55 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241224 09:37:55 57885161 config: 
20241224 09:37:55 57885161 FFT: 3M 1K:6:256:0 (18.40 bpw)
20241224 09:38:02 57885161 EE         0 on-load: 0000000000000000 vs. 0000000000000003
20241224 09:38:03  Exception "Error on load"
20241224 09:38:03  Bye

celery@fedora:~/Documents/gpu_b580/prpll$ RUSTICL_ENABLE=iris RUSTICL_FEATURES=fp64 meson devenv -C /home/celery/Documents/git/mesa-main-2024-12-23/mesabuild ../../../gpu_b580/prpll/build-debug/prpll -device 1 -prp 57885161 -use NO_ASM
20241224 09:38:14  PRPLL 0.15-94-g6ddbb31 starting
20241224 09:38:14  config: -device 1 -prp 57885161 -use NO_ASM 
WARNING: OpenCL support via iris driver is incomplete.
For a complete and conformant OpenCL implementation, use
https://github.com/intel/compute-runtime instead
MESA: warning: INTEL_HWCONFIG_TOTAL_GS_THREADS (336) != devinfo->max_gs_threads (312)
20241224 09:38:14  device 1, OpenCL 25.0.0-devel, unique id ''
20241224 09:38:14 57885161 No FFTs found in tune.txt that can handle 57885161. Consider tuning with -tune
20241224 09:38:14 57885161 BPW info for 512:16:512 not found, defaults={18.02, 18.12, 18.12, 18.22}
20241224 09:38:14 57885161 BPW info for 1K:16:512 not found, defaults={17.74, 17.84, 17.84, 17.94}
20241224 09:38:14 57885161 BPW info for 4K:16:512 not found, defaults={17.19, 17.29, 17.29, 17.39}
20241224 09:38:14 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241224 09:38:14 57885161 config:  -DNO_ASM=1
20241224 09:38:15 57885161 FFT: 3M 1K:6:256:0 (18.40 bpw)
20241224 09:38:22 57885161 EE         0 on-load: 0000000000000000 vs. 0000000000000003
20241224 09:38:22  Exception "Error on load"
20241224 09:38:22  Bye

celery@fedora:~/Documents/gpu_b580/prpll$ RUSTICL_ENABLE=iris RUSTICL_FEATURES=fp64 meson devenv -C /home/celery/Documents/git/mesa-main-2024-12-23/mesabuild ../../../gpu_b580/prpll/build-debug/prpll -device 1 -prp 57885161 -use NO_ASM,DEBUG
20241224 09:38:27  PRPLL 0.15-94-g6ddbb31 starting
20241224 09:38:27  config: -device 1 -prp 57885161 -use NO_ASM,DEBUG 
WARNING: OpenCL support via iris driver is incomplete.
For a complete and conformant OpenCL implementation, use
https://github.com/intel/compute-runtime instead
MESA: warning: INTEL_HWCONFIG_TOTAL_GS_THREADS (336) != devinfo->max_gs_threads (312)
20241224 09:38:27  device 1, OpenCL 25.0.0-devel, unique id ''
20241224 09:38:27 57885161 No FFTs found in tune.txt that can handle 57885161. Consider tuning with -tune
20241224 09:38:27 57885161 BPW info for 512:16:512 not found, defaults={18.02, 18.12, 18.12, 18.22}
20241224 09:38:27 57885161 BPW info for 1K:16:512 not found, defaults={17.74, 17.84, 17.84, 17.94}
20241224 09:38:27 57885161 BPW info for 4K:16:512 not found, defaults={17.19, 17.29, 17.29, 17.39}
20241224 09:38:27 57885161 BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
20241224 09:38:27 57885161 config:  -DDEBUG=1 -DNO_ASM=1
20241224 09:38:27 57885161 FFT: 3M 1K:6:256:0 (18.40 bpw)
20241224 09:38:47 57885161 EE         0 on-load: 0000000000000000 vs. 0000000000000003
20241224 09:38:47  Exception "Error on load"
20241224 09:38:47  Bye

It runs with ASM enabled, but should it be disabled because you've made ASM specific to AMD?

edit: BTW a gotcha between rusticl/other is that they may enumerate the GPU's differently. Unless it's random the b580 is device 1 on rusticl, device 0 on intel's runtime.

edit edit: No change with mesa build from 2024-12-24

@chocolate42
Copy link
Author

This is the current status with B580:

celery@fedora:~/Documents/gpu_b580/prpll$ RUSTICL_ENABLE=iris RUSTICL_FEATURES=fp64 meson devenv -C ~/Documents/git/mesa-main-2024-12-24/build --workdir ./ ./build-debug/prpll -tune
20241226 14:21:43  PRPLL 0.15-94-g6ddbb31 starting
20241226 14:21:43  config: -tune 
MESA: warning: INTEL_HWCONFIG_TOTAL_GS_THREADS (336) != devinfo->max_gs_threads (312)
WARNING: OpenCL support via iris driver is incomplete.
For a complete and conformant OpenCL implementation, use
https://github.com/intel/compute-runtime instead
20241226 14:21:43  device 0, OpenCL 25.0.0-devel, unique id ''
20241226 14:21:43  BPW info for 512:16:512 not found, defaults={18.02, 18.12, 18.12, 18.22}
20241226 14:21:43  BPW info for 1K:16:512 not found, defaults={17.74, 17.84, 17.84, 17.94}
20241226 14:21:43  BPW info for 4K:16:512 not found, defaults={17.19, 17.29, 17.29, 17.39}
20241226 14:21:43  BPW info for 4K:16:1K not found, defaults={16.92, 17.02, 17.02, 17.12}
prpll: src/Gpu.cpp:1338: double Gpu::timePRP(): Assertion `dataResidue() == state.res64' failed.

Removing that assert and recompiling we don't trip any more asserts and are back at zeroed data.

@chocolate42
Copy link
Author

RDNA3 780M:

Not sure why it's doing different things, but here's something new on the latest mesa:

f40@fedora:~/Documents/gpu/builds/preda/gpuowl$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp16,fp64 meson devenv -C /home/f40/Documents/gpu/builds/mesa/mesa-main-2024-12-27/build --workdir ./ ./build-release/prpll -prp 77936867 -use NO_ASM
20241227 13:50:14  PRPLL 0.15-94-g6ddbb31 starting
20241227 13:50:14  config: -prp 77936867 -use NO_ASM 
20241227 13:50:14  device 0, OpenCL 25.0.0-devel, unique id ''
20241227 13:50:14 77936867 config:  -DNO_ASM=1
20241227 13:50:14 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
Unknown intrinsic: div 32x2  %39 = @load_global (%38) (access=none, align_mul=8, align_offset=0)
Unknown intrinsic: div 32x2  %41 = @load_global (%40) (access=none, align_mul=8, align_offset=0)
Unknown intrinsic: div 64x2  %23 = @load_global_constant (%22) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %39 = @load_global (%38) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %48 = @load_global (%47) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %56 = @load_global (%55) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %48 = @load_global (%47) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %56 = @load_global (%55) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %34 = @load_global (%33) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %43 = @load_global (%42) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 32    %30 = @load_global (%29) (access=none, align_mul=4, align_offset=0)
Unknown intrinsic: div 32    %28 = @load_global (%27) (access=none, align_mul=4, align_offset=0)
Unknown intrinsic: div 64x2  %56 = @load_global (%55) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %55 = @load_global (%54) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 32    %30 = @load_global (%29) (access=none, align_mul=4, align_offset=0)
Unknown intrinsic: div 32x2  %30 = @load_global (%29) (access=none, align_mul=8, align_offset=0)
20241227 13:50:15 77936867 EE         0 on-load: 0000000000000000 vs. 0000000000000003
20241227 13:50:15  Exception "Error on load"
20241227 13:50:15  Bye
f40@fedora:~/Documents/gpu/builds/preda/gpuowl$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp16,fp64 meson devenv -C /home/f40/Documents/gpu/builds/mesa/mesa-main-2024-12-27/build --workdir ./ ./build-release/prpll -prp 77936867 -use NO_ASM,DEBUG
20241227 13:51:47  PRPLL 0.15-94-g6ddbb31 starting
20241227 13:51:47  config: -prp 77936867 -use NO_ASM,DEBUG 
20241227 13:51:47  device 0, OpenCL 25.0.0-devel, unique id ''
20241227 13:51:47 77936867 config:  -DDEBUG=1 -DNO_ASM=1
20241227 13:51:47 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
Unknown intrinsic: div 32x2  %39 = @load_global (%38) (access=none, align_mul=8, align_offset=0)
Unknown intrinsic: div 32x2  %41 = @load_global (%40) (access=none, align_mul=8, align_offset=0)
Unknown intrinsic: div 64x2  %23 = @load_global_constant (%22) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %39 = @load_global (%38) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 32    %19 = @global_atomic (%18, %7 (0x10)) (atomic_op=iadd)
Unknown intrinsic: div 64x2  %56 = @load_global (%55) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %56 = @load_global (%55) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 32    %19 = @global_atomic (%18, %7 (0x10)) (atomic_op=iadd)
Unknown intrinsic: div 64x2  %34 = @load_global (%33) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %43 = @load_global (%42) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 32    %30 = @load_global (%29) (access=none, align_mul=4, align_offset=0)
Unknown intrinsic: div 32    %28 = @load_global (%27) (access=none, align_mul=4, align_offset=0)
Unknown intrinsic: div 64x2  %56 = @load_global (%55) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %55 = @load_global (%54) (access=non-temporal, align_mul=16, align_offset=0)
Unknown intrinsic: div 32    %30 = @load_global (%29) (access=none, align_mul=4, align_offset=0)
Unknown intrinsic: div 32x2  %30 = @load_global (%29) (access=none, align_mul=8, align_offset=0)
20241227 13:51:48 77936867 EE         0 on-load: 0000000000000000 vs. 0000000000000003
20241227 13:51:48  Exception "Error on load"
20241227 13:51:48  Bye

Unknown intrinsics and now the data is zeroed where previously it was garbage.

@preda
Copy link
Owner

preda commented Dec 28, 2024

It may be that the "unknown intrinsic" is caused by the __builtin_nontemporal_load/store which is recent.

Since the latest commit ( 1582454 )
this is disabled by default (should be back to the previous behavior), and it can be controlled with -use NONTEMPORAL

Although, looking more carefully, I see that not all of the load_global() that produce the "unknown intrinsic" above have access=non-temporal, so it's not that.

@chocolate42
Copy link
Author

f40@fedora:~/Documents/gpu/builds/preda/gpuowl$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp16,fp64 meson devenv -C /home/f40/Documents/gpu/builds/mesa/mesa-main-2024-12-28/build --workdir ./ ./build-release/prpll -prp 77936867 -use NO_ASM
20241228 10:20:53  PRPLL 0.15-95-g1582454 starting
20241228 10:20:53  config: -prp 77936867 -use NO_ASM 
20241228 10:20:53  device 0, OpenCL 25.0.0-devel, unique id ''
20241228 10:20:53 77936867 config:  -DNO_ASM=1
20241228 10:20:53 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
Unknown intrinsic: div 32x2  %39 = @load_global (%38) (access=none, align_mul=8, align_offset=0)
Unknown intrinsic: div 32x2  %41 = @load_global (%40) (access=none, align_mul=8, align_offset=0)
Unknown intrinsic: div 64x2  %23 = @load_global_constant (%22) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %39 = @load_global (%38) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %48 = @load_global (%47) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %56 = @load_global (%55) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %56 = @load_global (%55) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %48 = @load_global (%47) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %34 = @load_global (%33) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %43 = @load_global (%42) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 32    %30 = @load_global (%29) (access=none, align_mul=4, align_offset=0)
Unknown intrinsic: div 32    %28 = @load_global (%27) (access=none, align_mul=4, align_offset=0)
Unknown intrinsic: div 64x2  %56 = @load_global (%55) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %55 = @load_global (%54) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 32    %30 = @load_global (%29) (access=none, align_mul=4, align_offset=0)
Unknown intrinsic: div 32x2  %30 = @load_global (%29) (access=none, align_mul=8, align_offset=0)
20241228 10:20:54 77936867 EE         0 on-load: 0000000000000000 vs. 0000000000000003
20241228 10:20:54  Exception "Error on load"
20241228 10:20:54  Bye

Will install ROCm later to compare.

@preda
Copy link
Owner

preda commented Dec 28, 2024

@chocolate42 could you please try with "-carry long" on RustiCL and see whether it fixes the

EE         0 on-load: 0000000000000000 vs. 0000000000000003

@chocolate42
Copy link
Author

780M, no change:

f40@fedora:~/Documents/gpu/builds/preda/gpuowl$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp16,fp64 meson devenv -C /home/f40/Documents/gpu/builds/mesa/mesa-main-2024-12-28/build --workdir ./ ./build-release/prpll -prp 77936867 -use NO_ASM -carry long
20241228 13:18:03  PRPLL 0.15-95-g1582454 starting
20241228 13:18:03  config: -prp 77936867 -use NO_ASM -carry long 
20241228 13:18:03  device 0, OpenCL 25.0.0-devel, unique id ''
20241228 13:18:03 77936867 config:  -DNO_ASM=1
20241228 13:18:03 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241228 13:18:03 77936867 Using long carry!
Unknown intrinsic: div 32x2  %39 = @load_global (%38) (access=none, align_mul=8, align_offset=0)
Unknown intrinsic: div 32x2  %41 = @load_global (%40) (access=none, align_mul=8, align_offset=0)
Unknown intrinsic: div 64x2  %23 = @load_global_constant (%22) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %39 = @load_global (%38) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %48 = @load_global (%47) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %56 = @load_global (%55) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %56 = @load_global (%55) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %48 = @load_global (%47) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %34 = @load_global (%33) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 64x2  %43 = @load_global (%42) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 32    %30 = @load_global (%29) (access=none, align_mul=4, align_offset=0)
Unknown intrinsic: div 32    %28 = @load_global (%27) (access=none, align_mul=4, align_offset=0)
Unknown intrinsic: div 64x2  %56 = @load_global (%55) (access=none, align_mul=16, align_offset=0)
Unknown intrinsic: div 32    %30 = @load_global (%29) (access=none, align_mul=4, align_offset=0)
Unknown intrinsic: div 32x2  %30 = @load_global (%29) (access=none, align_mul=8, align_offset=0)
20241228 13:18:03 77936867 EE         0 on-load: 0000000000000000 vs. 0000000000000003
20241228 13:18:03  Exception "Error on load"
20241228 13:18:03  Bye

@preda
Copy link
Owner

preda commented Dec 28, 2024

As long as we have those "Unknown intrinsic" for global read, it's broken.
Those will be fixed in Mesa as per https://gitlab.freedesktop.org/mesa/mesa/-/issues/12361#note_2717423
We need to check again with the fix, with "-carry long", and see what we hit then.

@chocolate42
Copy link
Author

chocolate42 commented Dec 28, 2024

I think that's done it (tested with the PR). The 780M needs -carry long:

f40@fedora:~/Documents/gpu/builds/preda/gpuowl$ rm -r 7*
f40@fedora:~/Documents/gpu/builds/preda/gpuowl$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp16,fp64 meson devenv -C /home/f40/Documents/gpu/builds/mesa/mesa-main--fix-global-access/build --workdir ./ ./build-release/prpll -prp 77936867 -use NO_ASM -iters 10000
20241228 18:14:58  PRPLL 0.15-95-g1582454 starting
20241228 18:14:58  config: -prp 77936867 -use NO_ASM -iters 10000 
20241228 18:14:58  device 0, OpenCL 25.0.0-devel, unique id ''
20241228 18:14:58 77936867 config:  -DNO_ASM=1
20241228 18:14:58 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241228 18:15:10 77936867 OK         0 on-load: blockSize 1000, 0000000000000003
20241228 18:15:10 77936867 Proof of power 10 requires about 9.8GB of disk space
20241228 18:15:37 77936867 EE      2000 3f1730589b53d3f0 9255 ETA 8d 08:21; Z=732 (avg 732.1) 1 errors
20241228 18:15:48 77936867 EE         0 on-load: 931e2dfb51a198ef vs. 0000000000000003
20241228 18:15:48  Exception "Error on load"
20241228 18:15:48  Bye

Succeeds with -carry long

f40@fedora:~/Documents/gpu/builds/preda/gpuowl$ rm -r 7*
f40@fedora:~/Documents/gpu/builds/preda/gpuowl$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp16,fp64 meson devenv -C /home/f40/Documents/gpu/builds/mesa/mesa-main--fix-global-access/build --workdir ./ ./build-release/prpll -prp 77936867 -use NO_ASM -iters 10000 -carry long
20241228 18:12:28  PRPLL 0.15-95-g1582454 starting
20241228 18:12:28  config: -prp 77936867 -use NO_ASM -iters 10000 -carry long 
20241228 18:12:28  device 0, OpenCL 25.0.0-devel, unique id ''
20241228 18:12:28 77936867 config:  -DNO_ASM=1
20241228 18:12:28 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241228 18:12:28 77936867 Using long carry!
20241228 18:12:40 77936867 OK         0 on-load: blockSize 1000, 0000000000000003
20241228 18:12:40 77936867 Proof of power 10 requires about 9.8GB of disk space
20241228 18:13:11 77936867 OK      2000 f54bf64335a7266e 10269 ETA 9d 06:18; Z=713 (avg 713.0)
20241228 18:14:29 77936867 Stopping, please wait..
20241228 18:14:43 77936867 OK     10000 fc4f135f7cf4ad29 10252 ETA 9d 05:56; Z=701 (avg 711.0)
20241228 18:14:43  Exception "stop requested"
20241228 18:14:43  Bye

Hash matches at iteration 2000 on a second run

f40@fedora:~/Documents/gpu/builds/preda/gpuowl$ rm -r 7*
f40@fedora:~/Documents/gpu/builds/preda/gpuowl$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp16,fp64 meson devenv -C /home/f40/Documents/gpu/builds/mesa/mesa-main--fix-global-access/build --workdir ./ ./build-release/prpll -prp 77936867 -use NO_ASM -iters 100000 -carry long
20241228 18:16:31  PRPLL 0.15-95-g1582454 starting
20241228 18:16:31  config: -prp 77936867 -use NO_ASM -iters 100000 -carry long 
20241228 18:16:31  device 0, OpenCL 25.0.0-devel, unique id ''
20241228 18:16:31 77936867 config:  -DNO_ASM=1
20241228 18:16:31 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241228 18:16:31 77936867 Using long carry!
20241228 18:16:43 77936867 OK         0 on-load: blockSize 1000, 0000000000000003
20241228 18:16:43 77936867 Proof of power 10 requires about 9.8GB of disk space
20241228 18:17:14 77936867 OK      2000 f54bf64335a7266e 10272 ETA 9d 06:22; Z=713 (avg 713.0)
20241228 18:20:19 77936867        20000 3cd1bd9d5e09cbc5 10255
20241228 18:23:44 77936867        40000 dffe1b1b0d748128 10258
20241228 18:27:09 77936867        60000 0945da4dc08bdd95 10258
20241228 18:30:35 77936867        80000 8d76071d27ee4221 10267
20241228 18:33:56 77936867 Stopping, please wait..
20241228 18:34:10 77936867 OK    100000 6d7296b9e2830f50 10258 ETA 9d 05:48; Z=701 (avg 711.0)
20241228 18:34:10  Exception "stop requested"
20241228 18:34:10  Bye

Now I need to install rocm on this thing and see if there's any difference in timings. And test with B580. Tomorrow job.

edit: So when that PR gets merged, which may not be soon as mesa has a fancy way of merging PR's, prpll with AMD cards should work on rusticl? intel cards use a different driver so all bets are off there.

@chocolate42
Copy link
Author

chocolate42 commented Dec 28, 2024

Trying rocm to compare to the runs in previous post (Ubuntu 22.04 via distrobox, rocm 6.3.1, "make amd")

📦[f40@u22.04 gpuowl_rocm]$ ./build-release/prpll-amd -prp 77936867 -iters 10000
20241228 21:52:06  PRPLL 0.15-95-g1582454 starting
20241228 21:52:06  config: -prp 77936867 -iters 10000 
20241228 21:52:06  device 0, OpenCL 3635.0 (HSA1.1,LC), unique id ''
20241228 21:52:06 77936867 config: 
20241228 21:52:06 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241228 21:52:18 77936867 OK     10000 on-load: blockSize 1000, fc4f135f7cf4ad29
20241228 21:52:18 77936867 Proof of power 10 requires about 9.8GB of disk space
20241228 21:52:45 77936867 OK     12000 e637ca5a70dab6ec 8927 ETA 8d 01:14; Z=731 (avg 731.1)
20241228 21:53:51 77936867 Stopping, please wait..
20241228 21:54:03 77936867 OK     20000 3cd1bd9d5e09cbc5 8636 ETA 7d 18:54; Z=654 (avg 718.3)
20241228 21:54:03  Exception "stop requested"
20241228 21:54:03  Bye

📦[f40@u22.04 gpuowl_rocm]$ ./build-release/prpll-amd -prp 77936867 -iters 10000 -use NO_ASM
20241228 21:54:12  PRPLL 0.15-95-g1582454 starting
20241228 21:54:12  config: -prp 77936867 -iters 10000 -use NO_ASM 
20241228 21:54:12  device 0, OpenCL 3635.0 (HSA1.1,LC), unique id ''
20241228 21:54:12 77936867 config:  -DNO_ASM=1
20241228 21:54:12 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241228 21:54:24 77936867 OK     20000 on-load: blockSize 1000, 3cd1bd9d5e09cbc5
20241228 21:54:24 77936867 Proof of power 10 requires about 9.8GB of disk space
20241228 21:54:51 77936867 OK     22000 09bbce6a70f555f2 8901 ETA 8d 00:39; Z=712 (avg 712.3)
HW Exception by GPU node-1 (Agent handle: 0x55b0d6e2e030) reason :GPU Hang
Aborted (core dumped)

📦[f40@u22.04 gpuowl_rocm]$ ./build-release/prpll-amd -prp 77936867 -iters 10000
20241228 21:56:15  PRPLL 0.15-95-g1582454 starting
20241228 21:56:15  config: -prp 77936867 -iters 10000 
20241228 21:56:15  device 0, OpenCL 3635.0 (HSA1.1,LC), unique id ''
20241228 21:56:16 77936867 config: 
20241228 21:56:16 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241228 21:56:33 77936867 OK     22000 on-load: blockSize 1000, 09bbce6a70f555f2
20241228 21:56:33 77936867 Proof of power 10 requires about 9.8GB of disk space
20241228 21:57:00 77936867 OK     24000 8c0a4153f473625c 8935 ETA 8d 01:22; Z=704 (avg 703.9)
20241228 21:58:05 77936867 Stopping, please wait..
20241228 21:58:17 77936867 OK     32000 a9c98e7681c11c56 8607 ETA 7d 18:16; Z=712 (avg 705.3)
20241228 21:58:17  Exception "stop requested"
20241228 21:58:17  Bye

📦[f40@u22.04 gpuowl_rocm]$ ./build-release/prpll-amd -prp 77936867 -iters 10000
20241228 21:58:28  PRPLL 0.15-95-g1582454 starting
20241228 21:58:28  config: -prp 77936867 -iters 10000 
20241228 21:58:28  device 0, OpenCL 3635.0 (HSA1.1,LC), unique id ''
20241228 21:58:28 77936867 config: 
20241228 21:58:28 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241228 21:58:40 77936867 OK     32000 on-load: blockSize 1000, a9c98e7681c11c56
20241228 21:58:40 77936867 Proof of power 10 requires about 9.8GB of disk space
20241228 21:59:07 77936867 OK     34000 a135af9f5701e25f 8874 ETA 8d 00:01; Z=717 (avg 717.4)
20241228 21:59:59 77936867        40000 dffe1b1b0d748128 8620
HW Exception by GPU node-1 (Agent handle: 0x556bfd7a2fa0) reason :GPU Hang
Aborted (core dumped)

📦[f40@u22.04 gpuowl_rocm]$ ./build-release/prpll-amd -prp 77936867 -iters 10000 -carry long
20241228 22:01:56  PRPLL 0.15-95-g1582454 starting
20241228 22:01:56  config: -prp 77936867 -iters 10000 -carry long 
20241228 22:01:56  device 0, OpenCL 3635.0 (HSA1.1,LC), unique id ''
20241228 22:01:57 77936867 config: 
20241228 22:01:57 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241228 22:01:57 77936867 Using long carry!
20241228 22:02:10 77936867 OK     40000 on-load: blockSize 1000, dffe1b1b0d748128
20241228 22:02:10 77936867 Proof of power 10 requires about 9.8GB of disk space
20241228 22:02:38 77936867 OK     42000 485a511150854576 9383 ETA 8d 11:01; Z=723 (avg 722.8)
20241228 22:03:50 77936867 Stopping, please wait..
20241228 22:04:03 77936867 OK     50000 52e286945371ed29 9412 ETA 8d 11:38; Z=763 (avg 729.4)
20241228 22:04:03  Exception "stop requested"
20241228 22:04:03  Bye

📦[f40@u22.04 gpuowl_rocm]$ ./build-release/prpll-amd -prp 77936867 -iters 10000
20241228 22:04:19  PRPLL 0.15-95-g1582454 starting
20241228 22:04:19  config: -prp 77936867 -iters 10000 
20241228 22:04:19  device 0, OpenCL 3635.0 (HSA1.1,LC), unique id ''
20241228 22:04:20 77936867 config: 
20241228 22:04:20 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241228 22:04:32 77936867 OK     50000 on-load: blockSize 1000, 52e286945371ed29
20241228 22:04:32 77936867 Proof of power 10 requires about 9.8GB of disk space
20241228 22:04:59 77936867 OK     52000 b679553a3e73b42d 8933 ETA 8d 01:16; Z=728 (avg 727.8)
20241228 22:06:05 77936867 Stopping, please wait..
20241228 22:06:17 77936867 OK     60000 0945da4dc08bdd95 8645 ETA 7d 19:01; Z=733 (avg 728.6)
20241228 22:06:17  Exception "stop requested"
20241228 22:06:17  Bye

📦[f40@u22.04 gpuowl_rocm]$ ./build-release/prpll-amd -prp 77936867 -iters 10000
20241228 22:06:51  PRPLL 0.15-95-g1582454 starting
20241228 22:06:51  config: -prp 77936867 -iters 10000 
20241228 22:06:51  device 0, OpenCL 3635.0 (HSA1.1,LC), unique id ''
20241228 22:06:52 77936867 config: 
20241228 22:06:52 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241228 22:07:04 77936867 OK     60000 on-load: blockSize 1000, 0945da4dc08bdd95
20241228 22:07:04 77936867 Proof of power 10 requires about 9.8GB of disk space
20241228 22:07:31 77936867 OK     62000 0e41a14eed9c1be1 8919 ETA 8d 00:56; Z=717 (avg 716.9)
20241228 22:08:36 77936867 Stopping, please wait..
20241228 22:08:49 77936867 OK     70000 7131fa4eb77f4bb2 8638 ETA 7d 18:51; Z=697 (avg 713.6)
20241228 22:08:49  Exception "stop requested"
20241228 22:08:49  Bye

📦[f40@u22.04 gpuowl_rocm]$ ./build-release/prpll-amd -prp 77936867 -iters 100000
20241228 22:08:52  PRPLL 0.15-95-g1582454 starting
20241228 22:08:52  config: -prp 77936867 -iters 100000 
20241228 22:08:52  device 0, OpenCL 3635.0 (HSA1.1,LC), unique id ''
20241228 22:08:52 77936867 config: 
20241228 22:08:52 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241228 22:09:04 77936867 OK     70000 on-load: blockSize 1000, 7131fa4eb77f4bb2
20241228 22:09:04 77936867 Proof of power 10 requires about 9.8GB of disk space
20241228 22:09:31 77936867 OK     72000 65b24e3cdc2e7ad2 8934 ETA 8d 01:15; Z=705 (avg 705.2)
HW Exception by GPU node-1 (Agent handle: 0x558e5dc63fa0) reason :GPU Hang
Aborted (core dumped)

Resumed from a rusticl save fine. Doesn't need -use NO_ASM or -carry long (why is that?). ~10% faster than rusticl. The GPU keeps hanging, not sure what the deal is there. Whenever the GPU hangs the screen goes blank and comes back 5-10 seconds later. dmesg has this to say

[44436.803321] amdgpu: sdma_engine_id 1 exceeds maximum id of 0
[44517.472904] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[44517.472913] amdgpu 0000:03:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
[44517.472917] amdgpu 0000:03:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[44517.472921] amdgpu 0000:03:00.0: amdgpu: Failed to evict queue 1
[44517.472924] amdgpu 0000:03:00.0: amdgpu: Failed to evict process queues
[44517.473034] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[44517.473169] amdgpu 0000:03:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 50085
[44517.473203] amdgpu 0000:03:00.0: amdgpu: Dumping IP State
[44517.475115] amdgpu 0000:03:00.0: amdgpu: Dumping IP State Completed
[44520.475810] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=SUSPEND
[44520.475819] [drm:amdgpu_mes_suspend [amdgpu]] *ERROR* failed to suspend all gangs
[44520.476189] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <mes_v11_0> failed -110
[44523.410736] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[44523.410744] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[44526.343137] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[44526.343145] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[44530.574802] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[44530.574810] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[44530.576603] amdgpu 0000:03:00.0: amdgpu: MODE2 reset
[44530.616791] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
[44530.617337] [drm] PCIE GART of 512M enabled (table at 0x000000817FD00000).
[44530.617431] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[44530.620183] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[44530.622673] [drm] DMUB hardware initialized: version=0x08004800
[44530.627902] amdgpu 0000:03:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:225
[44530.630566] amdgpu 0000:03:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:233
[44530.633256] amdgpu 0000:03:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:241
[44530.636019] amdgpu 0000:03:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:249
[44530.918354] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[44530.918366] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[44530.918369] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[44530.918372] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[44530.918374] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[44530.918376] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[44530.918379] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[44530.918382] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[44530.918384] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[44530.918387] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[44530.918389] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[44530.918392] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[44530.918394] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[44530.920622] amdgpu 0000:03:00.0: amdgpu: GPU reset(2) succeeded!

So is that some rocm bug, or I've messed up installing rocm somehow?

edit: I'm running the rusticl build some more to try and get it to crash. It's well past the point that the rocm build hung 3 times and so far no problem.

edit edit: After looking more closely both builds average around 52W of total laptop draw from the wall in current conditions, but the rusticl build is very consistent at 49-53W, the rocm build stays there for a bit before crashing to ~30W then spiking to ~65W, at which point it often but not always hangs the GPU. Seems like something in the stack isn't configured properly for the rdna3 iGPU.

@preda
Copy link
Owner

preda commented Dec 29, 2024 via email

@chocolate42
Copy link
Author

Thanks. That's tricky to test because the way distrobox works the container shares the host kernel which I don't want to mess with. Guess a live image (24.04 to match your 6.8 kernel) is the way (which will take hours to download, why is the Ubuntu image so massive compared to Fedora, 5.8GB vs 2.3GB my god).

In the meantime I managed to crash while running the rusticl build but it was a herculean effort. It ran for about 300k iterations while a PS2 emulator was taxing the iGPU via vulkan, crashed after ~30 minutes of sustained power draw from the wall of ~80W. This is a thin and light laptop I'm pretty sure it overheated (or soft-crashed because it was overheating), this might not be a bug bug but the rocm crashes surely are. This logged me out of the session and closed everything but didn't fully crash the PC.

[ 4133.736680] amdgpu 0000:03:00.0: amdgpu: Dumping IP State
[ 4133.739046] amdgpu 0000:03:00.0: amdgpu: Dumping IP State Completed
[ 4133.749134] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=570395, emitted seq=570397
[ 4133.749140] amdgpu 0000:03:00.0: amdgpu: Process information: process Xwayland pid 9031 thread Xwayland:cs0 pid 9032
[ 4135.753018] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=RESET
[ 4135.753029] [drm:amdgpu_mes_reset_legacy_queue [amdgpu]] *ERROR* failed to reset legacy queue
[ 4135.753562] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[ 4137.847234] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 4137.847243] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 4138.078188] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[ 4138.079912] amdgpu 0000:03:00.0: amdgpu: MODE2 reset
[ 4138.120568] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 4138.121194] [drm] PCIE GART of 512M enabled (table at 0x000000817FD00000).
[ 4138.121270] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[ 4138.123458] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[ 4138.125672] [drm] DMUB hardware initialized: version=0x08004800
[ 4138.130757] amdgpu 0000:03:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:225
[ 4138.133378] amdgpu 0000:03:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:233
[ 4138.136001] amdgpu 0000:03:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:241
[ 4138.138650] amdgpu 0000:03:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:249
[ 4138.417024] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 4138.417035] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 4138.417039] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 4138.417041] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 4138.417043] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 4138.417046] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 4138.417048] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 4138.417051] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 4138.417056] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 4138.417059] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 4138.417061] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 4138.417064] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[ 4138.417067] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[ 4138.419238] amdgpu 0000:03:00.0: amdgpu: GPU reset(2) succeeded!
[ 4138.422286] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 4140.193419] rfkill: input handler enabled
[ 4141.851645] rfkill: input handler disabled
[ 4157.317890] rfkill: input handler enabled
[ 4158.690249] rfkill: input handler disabled

@chocolate42
Copy link
Author

From a live image of Ubuntu 24.04, kernel 6.8, rocm 6.3.1, still getting GPU hangs just like in distrobox. Screen goes blank, 10 seconds later it comes back after a successful reset. I'm guessing RDNA3 iGPU support is poor in rocm, AMD always have been terrible at supporting newer (and older, and current) hardware with rocm. Unless you have a rocm/kernel/prpll combo suggestion or someone demonstrates they get it working, I'm going to write off rocm as broken with the 780M. This is a persistent live image so rocm etc is installed on it still so I can test further if need be.

ubuntu@ubuntu:~/Documents/gpu/builds/preda/gpuowl$ ./build-release/prpll-amd -prp 77936867 -iters 10000
20241229 16:34:30  PRPLL 0.15-95-g1582454 starting
20241229 16:34:30  config: -prp 77936867 -iters 10000 
20241229 16:34:30  device 0, OpenCL 3635.0 (HSA1.1,LC), unique id ''
20241229 16:34:30 77936867 config: 
20241229 16:34:30 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241229 16:34:42 77936867 OK         0 on-load: blockSize 1000, 0000000000000003
20241229 16:34:42 77936867 Proof of power 10 requires about 9.8GB of disk space
20241229 16:35:09 77936867 OK      2000 f54bf64335a7266e 8921 ETA 8d 01:08; Z=700 (avg 700.3)
20241229 16:36:15 77936867 Stopping, please wait..
20241229 16:36:27 77936867 OK     10000 fc4f135f7cf4ad29 8649 ETA 7d 19:13; Z=748 (avg 708.3)
20241229 16:36:28  Exception "stop requested"
20241229 16:36:28  Bye
ubuntu@ubuntu:~/Documents/gpu/builds/preda/gpuowl$ ./build-release/prpll-amd -prp 77936867 -iters 200000
20241229 16:36:35  PRPLL 0.15-95-g1582454 starting
20241229 16:36:35  config: -prp 77936867 -iters 200000 
20241229 16:36:35  device 0, OpenCL 3635.0 (HSA1.1,LC), unique id ''
20241229 16:36:35 77936867 config: 
20241229 16:36:35 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241229 16:36:47 77936867 OK     10000 on-load: blockSize 1000, fc4f135f7cf4ad29
20241229 16:36:47 77936867 Proof of power 10 requires about 9.8GB of disk space
20241229 16:37:14 77936867 OK     12000 e637ca5a70dab6ec 8942 ETA 8d 01:34; Z=731 (avg 731.1)
HW Exception by GPU node-1 (Agent handle: 0x57ae98eb08b0) reason :GPU Hang
Aborted (core dumped)
ubuntu@ubuntu:~/Documents/gpu/builds/preda/gpuowl$ ./build-release/prpll-amd -prp 77936867 -iters 200000
20241229 16:37:57  PRPLL 0.15-95-g1582454 starting
20241229 16:37:57  config: -prp 77936867 -iters 200000 
20241229 16:37:57  device 0, OpenCL 3635.0 (HSA1.1,LC), unique id ''
20241229 16:37:57 77936867 config: 
20241229 16:37:57 77936867 FFT: 4.50M 1K:9:256:1:0 (16.52 bpw)
20241229 16:38:09 77936867 OK     12000 on-load: blockSize 1000, e637ca5a70dab6ec
20241229 16:38:09 77936867 Proof of power 10 requires about 9.8GB of disk space
20241229 16:38:36 77936867 OK     14000 b75984c8c4ab3869 8882 ETA 8d 00:16; Z=706 (avg 705.9)
20241229 16:39:28 77936867        20000 3cd1bd9d5e09cbc5 8599
HW Exception by GPU node-1 (Agent handle: 0x6535665c28b0) reason :GPU Hang
Aborted (core dumped)
ubuntu@ubuntu:~/Documents/gpu/builds/preda/gpuowl$ uname -a
Linux ubuntu 6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug  2 20:41:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

780M_u24.04_clinfo_dmesg.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants