Skip to content

[SYCL][CUDA] Remove unnecessary memfence #1935

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

bjoernknafla
Copy link
Contributor

Remove unnecessary memory fence after a CUDA memory barrier
(__syncthreads).

The emitted bar.sync 0 PTX instruction ensures that all memory
accesses of threads involved in the barrier 0 have been performed and
that no new memory accesses happen before the barrier completes.

The removed memory fence reduced performance without adding any
functionality to the barrier memory behavior.

Signed-off-by: Bjoern Knafla bjoern@codeplay.com
Co-authored-be: Victor Lomuller victor@codeplay.com

Remove unnecessary memory fence after a CUDA memory barrier
(__syncthreads).

The emitted `bar.sync 0` PTX instruction ensures that all memory
accesses of threads involved in the barrier `0` have been performed and
that no new memory accesses happen before the barrier completes.

The removed memory fence reduced performance without adding any
functionality to the barrier memory behavior.

Signed-off-by: Bjoern Knafla <bjoern@codeplay.com>
Co-authored-be: Victor Lomuller <victor@codeplay.com>
@bjoernknafla bjoernknafla requested a review from bader as a code owner June 19, 2020 16:48
@bader
Copy link
Contributor

bader commented Jun 19, 2020

Remove unnecessary memory fence after a CUDA memory barrier
(__syncthreads).

I have doubts that it's unnecessary.
Please, take a look at #1258.

Tagging @againull, @Naghasan.

@againull
Copy link
Contributor

againull commented Jun 19, 2020

@bader, thanks for double checking.
I agree that it is unnecessary. And keeping or removing it doesn't affect/resolve #1258

#1258 is about the following:
sync_threads() is lowered to llvm.nvvm.barrier0 and __spirv_MemoryBarrier() is lowered to llvm.nvvm.membar. LLVM middle-end doesn't know that llvm.nvvm.barrier0 and llvm.nvvm.membar are cuda barriers, it treats them as a regular llvm intrinsic and can move memory accesses across any of these intrinsics -> barrier logic will be broken. See workaround: https://github.com/intel/llvm/pull/1334/files

@againull againull self-requested a review June 20, 2020 07:29
@Naghasan
Copy link
Contributor

Remove unnecessary memory fence after a CUDA memory barrier
(__syncthreads).

I have doubts that it's unnecessary.

it is, from the ptx doc prior memory accesses requested by this thread are performed relative to all threads participating in the barrier

Please, take a look at #1258.

as @againull pointed out, the issue is about llvm not understanding the semantic of the these instructions.

@bader
Copy link
Contributor

bader commented Jun 22, 2020

Sorry, I forgot about the workaround implemented by #1334.

LIT failure seems to be related to #1919. I'll re-run it to verify.

@bader bader added the cuda CUDA back-end label Jun 23, 2020
@bader bader merged commit e2fc1b8 into intel:sycl Jun 23, 2020
alexbatashev pushed a commit to alexbatashev/llvm that referenced this pull request Jun 24, 2020
* upstream/sycl:
  [SYCL] Implement braced-init-list or a number as range for queue::parallel_for (intel#1931)
  [SYCL][Doc] Add SYCL_INTEL_accessor_properties extension specification (intel#1925)
  [SYCL-PTX] Add builtins for the relational category (intel#1831)
  [SYCL][CUDA] Remove unnecessary memfence (intel#1935)
  [SYCL] Add handling for wrapped sampler (intel#1942)
  [SYCL] Release notes for June'20 DPCPP implementation update (intel#1948)
  [SYCL] Fix assert when calling get_binaries() on host (intel#1944)
  [SYCL] Fix check for reqd_sub_group_size attribute mismatches (intel#1905)
@bjoernknafla bjoernknafla deleted the bjoern/remove-unnecessary-fence-after-barrier branch June 25, 2020 16:14
againull pushed a commit to againull/llvm that referenced this pull request Apr 11, 2023
The patch adds TypeJointMatrixINTELv2 which maps to new type OpCode
6184. Under new OpCode matrix type no longer has Layout parameter. The patch also moved 'scope' to optional matrix muladd instruction.

The changes are done only in the consumer part to prepare the switch and make E2E switch backward compatible by preparing consumers ahead of time.

Unfortunately there is no way to add a test foe this unless it's binary test, but it seems to be a bit unsafe to add this, so the patch was tested locally.

Spec change:
intel#8175

Signed-off-by: Sidorov, Dmitry <dmitry.sidorov@intel.com>

Original commit:
KhronosGroup/SPIRV-LLVM-Translator@a6fcade
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda CUDA back-end
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants