[SYCL][CUDA] Remove unnecessary memfence #1935

bjoernknafla · 2020-06-19T16:48:04Z

Remove unnecessary memory fence after a CUDA memory barrier
(`__syncthreads`).

The emitted `bar.sync 0` PTX instruction ensures that all prior memory
accesses of the threads participating in barrier `0` have been performed,
and that those threads issue no new memory accesses until the barrier
completes.

The removed memory fence reduced performance without strengthening the
memory-ordering guarantees the barrier already provides.

Signed-off-by: Bjoern Knafla <bjoern@codeplay.com>
Co-authored-by: Victor Lomuller <victor@codeplay.com>
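The barrier-only pattern can be illustrated with a host-side C++ analogy. This is a sketch under stated assumptions, not CUDA or SYCL code and not part of this PR: `HostBarrier` and `exchange_with_neighbour` are hypothetical names, and a mutex/condition-variable barrier stands in for `bar.sync 0`. Each thread writes its own slot, waits at the barrier, then reads a neighbour's slot. Because the barrier itself already orders those accesses, no extra fence follows it, mirroring the removed memory fence.

```cpp
// Host-side analogy (NOT CUDA): a barrier whose arrive_and_wait() already
// orders memory, so no extra fence is needed after it.
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: a reusable, generation-counting barrier built on
// std::mutex and std::condition_variable. The mutex unlock/lock pairs give
// every thread's pre-barrier writes a happens-before edge to every thread's
// post-barrier reads, loosely like bar.sync does for threads in a CTA.
class HostBarrier {
public:
    explicit HostBarrier(std::size_t count)
        : count_(count), waiting_(0), generation_(0) {
        assert(count > 0);  // a zero-thread barrier would never open
    }

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t gen = generation_;
        if (++waiting_ == count_) {
            // Last thread to arrive: open the barrier for this generation.
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            // Sleep until the last arriving thread advances the generation.
            cv_.wait(lock, [&] { return generation_ != gen; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const std::size_t count_;
    std::size_t waiting_;
    std::size_t generation_;
};

// Hypothetical example: each "work-item" writes its own slot, hits the
// barrier, then reads its neighbour's slot. There is deliberately no
// std::atomic_thread_fence after the barrier; like the membar removed by
// this PR, it would add cost without adding guarantees.
std::vector<int> exchange_with_neighbour(std::size_t n) {
    std::vector<int> shared(n, -1);  // stands in for work-group local memory
    std::vector<int> out(n, -1);
    HostBarrier barrier(n);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < n; ++i) {
        threads.emplace_back([&, i] {
            shared[i] = static_cast<int>(i) * 10;  // pre-barrier write
            barrier.arrive_and_wait();             // barrier only, no fence
            out[i] = shared[(i + 1) % n];          // read another thread's slot
        });
    }
    for (auto& t : threads) t.join();
    return out;
}
```

For example, `exchange_with_neighbour(4)` should return `{10, 20, 30, 0}`: every thread observes the value its neighbour wrote before the barrier.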

bader · 2020-06-19T16:51:25Z

> Remove unnecessary memory fence after a CUDA memory barrier
> (__syncthreads).

I have doubts that it's unnecessary.
Please, take a look at #1258.

Tagging @againull, @Naghasan.

againull · 2020-06-19T19:04:01Z

@bader, thanks for double-checking.
I agree that it is unnecessary. Keeping or removing it neither affects nor resolves #1258.

#1258 is about the following:
`sync_threads()` is lowered to `llvm.nvvm.barrier0` and `__spirv_MemoryBarrier()` is lowered to `llvm.nvvm.membar`. The LLVM middle-end does not know that `llvm.nvvm.barrier0` and `llvm.nvvm.membar` are CUDA barriers; it treats them as regular LLVM intrinsics and can move memory accesses across them, which breaks the barrier logic. See the workaround: https://github.com/intel/llvm/pull/1334/files

Naghasan · 2020-06-22T07:22:36Z

> Remove unnecessary memory fence after a CUDA memory barrier
> (__syncthreads).

> I have doubts that it's unnecessary.

It is. From the PTX documentation: "prior memory accesses requested by this thread are performed relative to all threads participating in the barrier".

> Please, take a look at #1258.

As @againull pointed out, that issue is about LLVM not understanding the semantics of these instructions.

bader · 2020-06-22T11:56:38Z

Sorry, I forgot about the workaround implemented by #1334.

The LIT failure seems to be related to #1919. I'll re-run the tests to verify.

* upstream/sycl: [SYCL] Implement braced-init-list or a number as range for queue::parallel_for (intel#1931) [SYCL][Doc] Add SYCL_INTEL_accessor_properties extension specification (intel#1925) [SYCL-PTX] Add builtins for the relational category (intel#1831) [SYCL][CUDA] Remove unnecessary memfence (intel#1935) [SYCL] Add handling for wrapped sampler (intel#1942) [SYCL] Release notes for June'20 DPCPP implementation update (intel#1948) [SYCL] Fix assert when calling get_binaries() on host (intel#1944) [SYCL] Fix check for reqd_sub_group_size attribute mismatches (intel#1905)

bjoernknafla requested a review from bader as a code owner June 19, 2020 16:48

againull self-requested a review June 20, 2020 07:29

againull approved these changes Jun 20, 2020

bader added the cuda CUDA back-end label Jun 23, 2020

bader approved these changes Jun 23, 2020

bader merged commit e2fc1b8 into intel:sycl Jun 23, 2020

bjoernknafla deleted the bjoern/remove-unnecessary-fence-after-barrier branch June 25, 2020 16:14
