[SYCL] float atomic_ref performance fix #2714
Conversation
Signed-off-by: Chris Perkins <chris.perkins@intel.com>
Do you see the same improvement just from moving the load inside of the loop, or do you need the `reinterpret_cast` changes as well? I'm pretty sure that the `reinterpret_cast` usage here is undefined behavior, and we shouldn't rely on it.
Tagging @rolandschulz to confirm.
@Pennycook - The most significant performance improvement definitely comes from the `.load` change. I am not sure I understand why the `reinterpret_cast` would be UB in this case. I certainly know it's not good practice in general and can result in UB if you aren't careful, but we have `static_assert`s guaranteeing the two types are the same size. We are forcibly moving between float and int (and vice versa). Before this PR, `bit_cast` was calling `detail::memcpy`, which loops through the individual bytes as if they were chars. To me that does not seem preferable from any standpoint, least of all performance.
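To make the `.load` change concrete, here is a minimal sketch, illustrative only and not the SYCL runtime implementation (the name `fetch_add_float` and the plain `std::atomic` storage are assumptions for the example), of a float fetch_add built on an integer compare-exchange loop with the load taken inside the loop:

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Sketch of a float fetch_add built on an integer CAS, with the load taken
// inside the retry loop so 'expected' is refreshed on every iteration.
float fetch_add_float(std::atomic<std::uint32_t> &storage, float operand) {
  std::uint32_t expected, desired;
  do {
    // Re-load each time around instead of relying on a single load taken
    // before the loop.
    expected = storage.load(std::memory_order_relaxed);
    float old_val;
    std::memcpy(&old_val, &expected, sizeof(old_val));
    const float new_val = old_val + operand;
    std::memcpy(&desired, &new_val, sizeof(desired));
  } while (!storage.compare_exchange_weak(expected, desired,
                                          std::memory_order_relaxed));
  // On success, 'expected' still holds the bits of the pre-add value.
  float result;
  std::memcpy(&result, &expected, sizeof(result));
  return result;
}
```

As the thread itself notes, it is not fully understood why refreshing the expected value this way helps as much as it does, particularly on GPU.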
Great, thanks!
The pointer that you get back from `reinterpret_cast` can't legally be used to access the object as an unrelated type; that is a strict aliasing violation. Even when the two types have the same size and alignment, the access is still undefined behavior.
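For illustration (generic C++, not code from this PR; `byte_copy` and `bits_of` are made-up names), the reinterpretation strategies under discussion look roughly like this: a per-byte copy loop, a strict-aliasing-violating `reinterpret_cast` dereference, and a well-defined `std::memcpy` that optimizers usually fold into a single move:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Per-byte copy loop, similar in spirit to a hand-rolled memcpy that walks
// the object one char at a time.
inline void byte_copy(char *dst, const char *src, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = src[i];
}

inline std::uint32_t bits_of(float f) {
  static_assert(sizeof(float) == sizeof(std::uint32_t),
                "float and uint32_t must have the same size");

  // Undefined behavior: dereferencing a reinterpret_cast'ed pointer of an
  // unrelated type violates strict aliasing, even though the sizes match.
  //   std::uint32_t bad = *reinterpret_cast<std::uint32_t *>(&f);

  // Well defined: copy the object representation; compilers typically fold
  // this std::memcpy into a single register move rather than a byte loop.
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return u;
}
```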
… 16 million iterations is 20% for CPU and 500% for GPU. (no impact for HOST, of course). Signed-off-by: Chris Perkins <chris.perkins@intel.com>
Signed-off-by: Chris Perkins <chris.perkins@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still concerned that we don't understand why this helps, but since the changes I requested have been implemented I'll mark this as approved.
/summary:run
Pinging the runtime owners. In a separate discussion it appears everyone is OK with this change; should be ready to merge.
The refactoring is to simplify the vectorization of generated functions. Signed-off-by: Cui, Dele <dele.cui@intel.com> Original commit: KhronosGroup/SPIRV-LLVM-Translator@a5952614c12594c
Moving the load into the CAS loop greatly improves performance, especially on GPU. It isn't entirely clear to me why this should produce such a dramatic improvement.
Signed-off-by: Chris Perkins <chris.perkins@intel.com>
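For reference, a sketch of the user-facing operation this PR affects, assuming the SYCL 2020 spelling `sycl::atomic_ref` (contemporaneous DPC++ builds may expose it under a vendor namespace) and a unified-shared-memory allocation:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  float *sum = sycl::malloc_shared<float>(1, q);
  *sum = 0.0f;

  q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1>) {
     // Relaxed float atomic on global memory; on devices without native
     // float atomics, fetch_add is lowered to a compare-exchange loop like
     // the one discussed above.
     sycl::atomic_ref<float, sycl::memory_order::relaxed,
                      sycl::memory_scope::device,
                      sycl::access::address_space::global_space>
         ref(*sum);
     ref.fetch_add(1.0f);
   }).wait();

  sycl::free(sum, q);
  return 0;
}
```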