
[SYCL] float atomic_ref performance fix #2714

Merged

Conversation

cperkinsintel (Contributor)

Moving the load into the CAS loop greatly improves performance, especially on GPU. It isn't entirely clear to me why this should produce such a dramatic improvement.

Signed-off-by: Chris Perkins <chris.perkins@intel.com>
@Pennycook (Contributor) left a comment

Do you see the same improvement just from moving the load inside the loop, or do you need the reinterpret_cast changes as well? I'm pretty sure that the reinterpret_cast usage here is undefined behavior, and we shouldn't rely on it.

Tagging @rolandschulz to confirm.

Review threads (outdated; resolved):
sycl/include/CL/sycl/atomic.hpp
sycl/include/CL/sycl/detail/helpers.hpp
@cperkinsintel (Contributor, Author)

@Pennycook - The most significant performance improvement definitely comes from the .load change.
On Tuesday I will run some tests to get a better picture of how much the memcpy is costing us, so we can decide whether to keep them or not.

I am not sure I understand why the reinterpret_cast would be UB in this case. I know it is not good practice in general and can result in UB if you aren't careful, but we have static_asserts guaranteeing the two types are the same size, and we are deliberately converting between float and int (and vice versa). Before this PR, bit_cast was calling detail::memcpy, which loops through the individual bytes as if they were chars. That does not seem preferable from any standpoint, especially performance.

@Pennycook (Contributor)

> On Tuesday I will run some tests to get a better picture of how much the memcpy is costing us, so we can decide whether to keep them or not.

Great, thanks!

> I am not sure I understand why the reinterpret_cast would be UB in this case. I know it is not good practice in general and can result in UB if you aren't careful, but we have static_asserts guaranteeing the two types are the same size, and we are deliberately converting between float and int (and vice versa).

The pointer that you get back from reinterpret_cast is still bound by C++ strict aliasing rules. The compiler is free to assume that the float* and int* point to different memory because they are different types. reinterpret_cast is safe for conversions to char* because of aliasing rules, but isn't safe for type-punning in general. That's why C++20 gives us bit_cast.

> Before this PR, bit_cast was calling detail::memcpy, which loops through the individual bytes as if they were chars. That does not seem preferable from any standpoint, especially performance.

The memcpy in atomic.hpp should be replaced by a bit_cast. But the one in bit_cast itself is there as a fallback, and should only be called if the compiler doesn't have native support for bit_cast. Are you seeing memcpy generated? If so, that sounds like a bug.

Even when the memcpy is there, the intent is that the compiler recognizes it as type-punning and optimizes it away.
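The fallback scheme described above might look roughly like the following sketch (the names `bit_cast_sketch` and `SKETCH_HAVE_BUILTIN_BIT_CAST` are hypothetical, not the actual SYCL headers): prefer the compiler's native `__builtin_bit_cast` when available, otherwise fall back to a `memcpy` that the optimizer is expected to fold away:

```cpp
#include <cstdint>
#include <cstring>

// Detect native bit_cast support (Clang/GCC-style feature test).
#if defined(__has_builtin)
#  if __has_builtin(__builtin_bit_cast)
#    define SKETCH_HAVE_BUILTIN_BIT_CAST 1
#  endif
#endif

template <typename To, typename From>
To bit_cast_sketch(const From &from) {
  static_assert(sizeof(To) == sizeof(From), "types must be the same size");
#if defined(SKETCH_HAVE_BUILTIN_BIT_CAST)
  // Native path: no memcpy call is ever emitted.
  return __builtin_bit_cast(To, from);
#else
  // Fallback path: well-defined type-punning via memcpy; the intent is that
  // the compiler recognizes it and optimizes the copy away.
  To to;
  std::memcpy(&to, &from, sizeof(To));
  return to;
#endif
}
```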

… 16 million iterations is 20% for CPU and 500% for GPU. (no impact for HOST, of course).

Signed-off-by: Chris Perkins <chris.perkins@intel.com>
@Pennycook (Contributor) left a comment

I'm still concerned that we don't understand why this helps, but since the changes I requested have been implemented, I'll mark this as approved.

@againull (Contributor) commented Nov 4, 2020

/summary:run

@cperkinsintel (Contributor, Author)

Pinging the runtime owners. In a separate discussion it appears everyone is OK with this change; it should be ready to merge.

@romanovvlad romanovvlad merged commit 0b7dacf into intel:sycl Nov 6, 2020
jsji pushed a commit that referenced this pull request on Sep 21, 2024. The refactoring simplifies the vectorization of generated functions. (Original commit: KhronosGroup/SPIRV-LLVM-Translator@a5952614c12594c; Signed-off-by: Cui, Dele <dele.cui@intel.com>)