I frequently see that a manual bump of LWS makes for a boost you wouldn't want to miss. I'm not sure why, but the current autotune somehow fails to find it. I'm currently working on a kernel that does 4373 c/s with the autotuned worksize of only 32, while it achieves 5150 c/s using 256. I'd like to have those 17%, please!
Interestingly enough, the current autotune seems to do the right thing given its own measurements:
```
Calculating best LWS for GWS=4096
Testing LWS=32 GWS=4096 ... 199.775 ms+
Testing LWS=64 GWS=4096 ... 200.720 ms
Testing LWS=128 GWS=4096 ... 205.342 ms
Testing LWS=256 GWS=4096 ... 383.430 ms
```
No wonder it picks 32. Yet, specifying LWS=256 manually ends up much faster. How come?
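For context, per-candidate timings like the ones above are typically taken with OpenCL event profiling; here is a minimal sketch of such a measurement (time_lws is a hypothetical helper, not our actual autotune code; it assumes a queue created with CL_QUEUE_PROFILING_ENABLE and a kernel whose arguments are already set, and omits error checking):

```c
#include <CL/cl.h>

/* Hypothetical helper: run the kernel once at the given GWS/LWS and
 * return the device-side execution time in milliseconds. */
static double time_lws(cl_command_queue queue, cl_kernel kernel,
                       size_t gws, size_t lws)
{
	cl_event ev;
	cl_ulong start, end;

	clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, &lws,
	                       0, NULL, &ev);
	clWaitForEvents(1, &ev);
	clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
	                        sizeof(start), &start, NULL);
	clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
	                        sizeof(end), &end, NULL);
	clReleaseEvent(ev);

	return (double)(end - start) / 1e6; /* ns -> ms */
}
```

A single enqueue at a fixed GWS may not reflect how the kernel behaves in a full run, which could be one source of the discrepancy.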
I tried adding __attribute__((work_group_size_hint(256, 1, 1))) to the kernels but it doesn't help: a query of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE still returns 32 on NVIDIA and 64 on AMD, which is probably intended (it does say "multiple").
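For reference, that query is per-kernel but driven by the device; a minimal sketch (preferred_multiple is a hypothetical helper name; error checking omitted):

```c
#include <CL/cl.h>

/* Returns the preferred work-group size multiple for a built kernel --
 * 32 on the NVIDIA device and 64 on the AMD one here, regardless of
 * any work_group_size_hint attribute on the kernel. */
static size_t preferred_multiple(cl_kernel kernel, cl_device_id device)
{
	size_t mult = 0;

	clGetKernelWorkGroupInfo(kernel, device,
	                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
	                         sizeof(mult), &mult, NULL);
	return mult;
}
```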
Using __attribute__((reqd_work_group_size(256, 1, 1))) would feel like overkill even if it worked. More importantly, it fails miserably with our self-tests (which can be worked around with --skip-self-test) and then with our autotune. The latter might actually be fixable: CL_KERNEL_COMPILE_WORK_GROUP_SIZE should report the triple required by the attribute (and zeros when there is none). Short of that, we'd have to hard-code that size into the host code as well... and then we'd not need to query it 🙄. The runtime optimizer can benefit from the attribute though, so there's that.
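For completeness, a minimal sketch of that route (the toy kernel and helper name are hypothetical; error checking omitted): the attribute goes on the kernel, and the host reads it back with CL_KERNEL_COMPILE_WORK_GROUP_SIZE, which per the spec returns (0, 0, 0) when the attribute is absent:

```c
#include <CL/cl.h>

/* Hypothetical toy kernel carrying the attribute; build it with
 * clCreateProgramWithSource()/clBuildProgram() as usual. */
static const char *src =
	"__attribute__((reqd_work_group_size(256, 1, 1)))\n"
	"__kernel void toy(__global uint *d) { d[get_global_id(0)]++; }\n";

/* Read the required work-group size back from a built kernel; the spec
 * says this returns (0, 0, 0) if no reqd_work_group_size was given. */
static void reqd_wg_size(cl_kernel kernel, cl_device_id device,
                         size_t reqd[3])
{
	clGetKernelWorkGroupInfo(kernel, device,
	                         CL_KERNEL_COMPILE_WORK_GROUP_SIZE,
	                         3 * sizeof(size_t), reqd, NULL);
}
```

If that query behaves consistently across drivers, the autotune could pin itself to the reported triple instead of needing the size hard-coded in two places.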