
[SYCL] float atomic_ref performance fix #2714

Merged

Conversation

cperkinsintel (Contributor)

Moving the load into the CAS loop greatly improves performance, especially on GPU. It isn't entirely clear to me why this should produce such a dramatic improvement.

Signed-off-by: Chris Perkins <chris.perkins@intel.com>
@Pennycook (Contributor) left a comment

Do you see the same improvement just from moving the load inside the loop, or do you need the reinterpret_cast changes as well? I'm pretty sure that the reinterpret_cast usage here is undefined behavior, and we shouldn't rely on it.

Tagging @rolandschulz to confirm.

Review threads (outdated; resolved):
sycl/include/CL/sycl/atomic.hpp
sycl/include/CL/sycl/detail/helpers.hpp
@cperkinsintel (Contributor, Author)

@Pennycook - The most significant performance improvement definitely comes from the .load change.
On Tuesday I will run some tests to get a better picture of how much the memcpy is costing us, so we can decide whether to keep them or not.

I am not sure I understand why the reinterpret_cast would be UB in this case. I know it is not good practice in general and can result in UB if you aren't careful, but we have static_asserts guaranteeing the two types are the same size, and we are deliberately converting between float and int (and vice versa). Before this PR, bit_cast was calling detail::memcpy, which loops through the individual bytes as if they were chars. That does not seem preferable from any standpoint, especially performance.

@Pennycook (Contributor)

> On Tuesday I will run some tests to get a better picture of how much the memcpy is costing us, so we can decide whether to keep them or not.

Great, thanks!

> I am not sure I understand why the reinterpret_cast would be UB in this case. I know it is not good practice in general and can result in UB if you aren't careful, but we have static_asserts guaranteeing the two types are the same size, and we are deliberately converting between float and int (and vice versa).

The pointer that you get back from reinterpret_cast is still bound by C++ strict aliasing rules. The compiler is free to assume that the float* and int* point to different memory because they are different types. reinterpret_cast is safe for conversions to char* because of aliasing rules, but isn't safe for type-punning in general. That's why C++20 gives us bit_cast.

> Before this PR, bit_cast was calling detail::memcpy, which loops through the individual bytes as if they were chars. That does not seem preferable from any standpoint, especially performance.

The memcpy in atomic.hpp should be replaced by a bit_cast. But the one in bit_cast itself is there as a fallback, and should only be called if the compiler doesn't have native support for bit_cast. Are you seeing memcpy generated? If so, that sounds like a bug.

Even when the memcpy is there, the intent is that the compiler recognizes it as type-punning and optimizes it away.
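The fallback scheme described above might look roughly like the following sketch (the names `bit_cast_sketch` and `SKETCH_HAVE_BUILTIN_BIT_CAST` are hypothetical, not the actual SYCL headers): prefer the compiler's native `__builtin_bit_cast` when available, otherwise fall back to a `memcpy` that the optimizer is expected to fold away:

```cpp
#include <cstdint>
#include <cstring>

// Detect native bit_cast support (Clang/GCC-style feature test).
#if defined(__has_builtin)
#  if __has_builtin(__builtin_bit_cast)
#    define SKETCH_HAVE_BUILTIN_BIT_CAST 1
#  endif
#endif

template <typename To, typename From>
To bit_cast_sketch(const From &from) {
  static_assert(sizeof(To) == sizeof(From), "types must be the same size");
#if defined(SKETCH_HAVE_BUILTIN_BIT_CAST)
  // Native path: no memcpy call is ever emitted.
  return __builtin_bit_cast(To, from);
#else
  // Fallback path: well-defined type-punning via memcpy; the intent is that
  // the compiler recognizes it and optimizes the copy away.
  To to;
  std::memcpy(&to, &from, sizeof(To));
  return to;
#endif
}
```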

… 16 million iterations is 20% for CPU and 500% for GPU. (no impact for HOST, of course).

Signed-off-by: Chris Perkins <chris.perkins@intel.com>
@Pennycook (Contributor) left a comment

I'm still concerned that we don't understand why this helps, but since the changes I requested have been implemented, I'll mark this as approved.

@againull (Contributor) commented Nov 4, 2020

/summary:run

@cperkinsintel (Contributor, Author)

Pinging the runtime owners. In a separate discussion it appears everyone is OK with this change; it should be ready to merge.

@romanovvlad romanovvlad merged commit 0b7dacf into intel:sycl Nov 6, 2020
jsji pushed a commit that referenced this pull request on Sep 21, 2024. The refactoring simplifies the vectorization of generated functions. (Original commit: KhronosGroup/SPIRV-LLVM-Translator@a5952614c12594c; Signed-off-by: Cui, Dele <dele.cui@intel.com>)