I frequently see that a manual bump of LWS makes for a boost you wouldn't want to miss. I'm not sure why, but the current autotune somehow fails to find it. I'm currently working on a kernel that does 4373 c/s with the autotuned worksize of only 32, while it achieves 5150 c/s using 256. I'd like to have those 17%, please!
Interestingly enough, the current autotune seems to do the right thing given its own measurements:
```
Calculating best LWS for GWS=4096
Testing LWS=32 GWS=4096 ... 199.775 ms+
Testing LWS=64 GWS=4096 ... 200.720 ms
Testing LWS=128 GWS=4096 ... 205.342 ms
Testing LWS=256 GWS=4096 ... 383.430 ms
```
No wonder it picks 32. Yet, specifying LWS=256 manually ends up much faster. How come?
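For context, per-candidate timings like the ones above are typically taken with OpenCL event profiling; here is a minimal sketch of such a measurement (time_lws is a hypothetical helper, not our actual autotune code; it assumes a queue created with CL_QUEUE_PROFILING_ENABLE and a kernel whose arguments are already set, and omits error checking):

```c
#include <CL/cl.h>

/* Hypothetical helper: run the kernel once at the given GWS/LWS and
 * return the device-side execution time in milliseconds. */
static double time_lws(cl_command_queue queue, cl_kernel kernel,
                       size_t gws, size_t lws)
{
	cl_event ev;
	cl_ulong start, end;

	clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, &lws,
	                       0, NULL, &ev);
	clWaitForEvents(1, &ev);
	clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
	                        sizeof(start), &start, NULL);
	clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
	                        sizeof(end), &end, NULL);
	clReleaseEvent(ev);

	return (double)(end - start) / 1e6; /* ns -> ms */
}
```

A single enqueue at a fixed GWS may not reflect how the kernel behaves in a full run, which could be one source of the discrepancy.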
I tried adding __attribute__((work_group_size_hint(256, 1, 1))) to the kernels but it doesn't help: a query of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE still returns 32 on NVIDIA and 64 on AMD, which is probably intended (it does say "multiple").
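For reference, that query is per-kernel but driven by the device; a minimal sketch (preferred_multiple is a hypothetical helper name; error checking omitted):

```c
#include <CL/cl.h>

/* Returns the preferred work-group size multiple for a built kernel --
 * 32 on the NVIDIA device and 64 on the AMD one here, regardless of
 * any work_group_size_hint attribute on the kernel. */
static size_t preferred_multiple(cl_kernel kernel, cl_device_id device)
{
	size_t mult = 0;

	clGetKernelWorkGroupInfo(kernel, device,
	                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
	                         sizeof(mult), &mult, NULL);
	return mult;
}
```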
Using __attribute__((reqd_work_group_size(256, 1, 1))) would feel like overkill even if it worked. More importantly, it fails miserably with our self-tests (which can be worked around with --skip-self-test) and then with our autotune. The latter might actually be fixable: CL_KERNEL_COMPILE_WORK_GROUP_SIZE should report the triple required by the attribute (and zeros when there is none). Short of that, we'd have to hard-code that size into the host code as well... and then we'd not need to query it 🙄. The runtime optimizer can benefit from the attribute though, so there's that.
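For completeness, a minimal sketch of that route (the toy kernel and helper name are hypothetical; error checking omitted): the attribute goes on the kernel, and the host reads it back with CL_KERNEL_COMPILE_WORK_GROUP_SIZE, which per the spec returns (0, 0, 0) when the attribute is absent:

```c
#include <CL/cl.h>

/* Hypothetical toy kernel carrying the attribute; build it with
 * clCreateProgramWithSource()/clBuildProgram() as usual. */
static const char *src =
	"__attribute__((reqd_work_group_size(256, 1, 1)))\n"
	"__kernel void toy(__global uint *d) { d[get_global_id(0)]++; }\n";

/* Read the required work-group size back from a built kernel; the spec
 * says this returns (0, 0, 0) if no reqd_work_group_size was given. */
static void reqd_wg_size(cl_kernel kernel, cl_device_id device,
                         size_t reqd[3])
{
	clGetKernelWorkGroupInfo(kernel, device,
	                         CL_KERNEL_COMPILE_WORK_GROUP_SIZE,
	                         3 * sizeof(size_t), reqd, NULL);
}
```

If that query behaves consistently across drivers, the autotune could pin itself to the reported triple instead of needing the size hard-coded in two places.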