[SYCL][CUDA] Remove unnecessary memfence #1935
Remove unnecessary memory fence after a CUDA memory barrier (`__syncthreads`).

The emitted `bar.sync 0` PTX instruction ensures that all memory accesses of the threads participating in barrier `0` have been performed, and that no new memory accesses happen before the barrier completes. The removed memory fence reduced performance without adding any functionality to the barrier's memory behavior.

Signed-off-by: Bjoern Knafla <bjoern@codeplay.com>
Co-authored-by: Victor Lomuller <victor@codeplay.com>
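For readers who want to see the barrier behavior described above in isolation, here is a minimal CUDA sketch. It is not code from this patch. The names `reverse_block`, `reverse_on_device`, and `kBlockSize` are hypothetical and chosen only for illustration. The kernel writes to shared memory, calls `__syncthreads()` with no additional `__threadfence_block()`, and then reads values written by other threads in the same block.

```cuda
#include <cuda_runtime.h>

#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical block size for this illustration.
constexpr int kBlockSize = 256;

// Hypothetical kernel: reverses the first n ints of data (n <= kBlockSize)
// within a single thread block, staging the values in shared memory.
__global__ void reverse_block(int *data, int n) {
    __shared__ int tile[kBlockSize];
    int t = threadIdx.x;

    if (t < n) {
        tile[t] = data[t];
    }

    // Block-wide barrier. The CUDA programming guide states that memory
    // accesses made before __syncthreads() are visible to all threads in
    // the block afterwards, so no extra __threadfence_block() is placed here.
    __syncthreads();

    if (t < n) {
        data[t] = tile[n - 1 - t];  // reads a value written by another thread
    }
}

// Hypothetical host helper: copies the input to the device, runs the kernel
// on one block, and returns the reversed values.
std::vector<int> reverse_on_device(const std::vector<int> &host) {
    int n = static_cast<int>(host.size());
    if (n == 0) {
        return {};
    }
    if (n > kBlockSize) {
        throw std::invalid_argument("input larger than one block");
    }

    std::size_t bytes = static_cast<std::size_t>(n) * sizeof(int);
    int *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        throw std::runtime_error("cudaMalloc failed");
    }

    std::vector<int> out(n);
    cudaError_t err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        reverse_block<<<1, kBlockSize>>>(dev, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(out.data(), dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);

    if (err != cudaSuccess) {
        throw std::runtime_error(cudaGetErrorString(err));
    }
    return out;
}
```

Calling `reverse_on_device` on a short vector should return its elements in reverse order. The sketch only exercises the visibility guarantee of the barrier; it does not measure the performance effect mentioned in the description.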
@bader, thanks for double checking. #1258 is about the following:
it is, from the ptx doc
as @againull pointed out, the issue is about llvm not understanding the semantics of these instructions.
* upstream/sycl:
  [SYCL] Implement braced-init-list or a number as range for queue::parallel_for (intel#1931)
  [SYCL][Doc] Add SYCL_INTEL_accessor_properties extension specification (intel#1925)
  [SYCL-PTX] Add builtins for the relational category (intel#1831)
  [SYCL][CUDA] Remove unnecessary memfence (intel#1935)
  [SYCL] Add handling for wrapped sampler (intel#1942)
  [SYCL] Release notes for June'20 DPCPP implementation update (intel#1948)
  [SYCL] Fix assert when calling get_binaries() on host (intel#1944)
  [SYCL] Fix check for reqd_sub_group_size attribute mismatches (intel#1905)
The patch adds TypeJointMatrixINTELv2, which maps to the new type OpCode 6184. Under the new OpCode, the matrix type no longer has a Layout parameter. The patch also moves 'scope' to be an optional parameter of the matrix muladd instruction. The changes are made only in the consumer part, to prepare the switch and make the E2E switch backward compatible by preparing consumers ahead of time. Unfortunately there is no way to add a test for this unless it's a binary test, and that seems a bit unsafe to add, so the patch was tested locally. Spec change: intel#8175 Signed-off-by: Sidorov, Dmitry <dmitry.sidorov@intel.com> Original commit: KhronosGroup/SPIRV-LLVM-Translator@a6fcade