[SYCL][CUDA] Add sub-group barrier #2606
Uses __nvvm_bar_warp_sync, which is equivalent to CUDA __syncwarp(). Because sub-group functions must always be called in converged control flow, the membermask is always set to represent all active work-items in the warp.
Enabling this functionality requires that we switch to PTX 6.4, which is consistent with the existing requirement to use CUDA 10.1.
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Thanks to @Naghasan and @bader for their help in getting this working.
Also, a note to reviewers: I had some trouble getting CMake to handle the additional PTX flags correctly. I'm not a CMake expert, and would welcome any suggestions regarding how to improve what I've committed here.
The issue as I understand it is that the list of compilation options constructed in libclc/CMakeLists.txt is passed to two functions in AddLibclc.cmake, but each function consumes those options differently. One passes the options to
    FILES generic/libspirv/sycldevice-binding.cpp)
endif()

add_libclc_builtin_set(libspirv-${arch_suffix}
  TRIPLE ${t}
  TARGET_ENV libspirv
  COMPILE_OPT ${mcpu}
  COMPILE_OPT ${flags}
COMPILE_OPT is a multi-value option, so you should be able to add the extra flags directly.
A longer-term solution would perhaps be to define flags per arch_suffix (they can then be accessed later), but that should be for later, I guess.
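To illustrate the multi-value point, here is a minimal sketch that can be run with `cmake -P`. It assumes the option is parsed with `cmake_parse_arguments`; `demo_builtin_set`, the `-mcpu=sm_50` value, and the triple are hypothetical stand-ins, not the real `add_libclc_builtin_set` implementation.

```cmake
# Hypothetical stand-in for add_libclc_builtin_set (NOT the real function).
# COMPILE_OPT and FILES are declared as multi-value keywords, so each one
# collects every argument up to the next keyword.
function(demo_builtin_set name)
  cmake_parse_arguments(ARG "" "TRIPLE;TARGET_ENV" "COMPILE_OPT;FILES" ${ARGN})
  list(LENGTH ARG_COMPILE_OPT n)
  message(STATUS "${name}: ${n} compile options: ${ARG_COMPILE_OPT}")
endfunction()

# Illustrative value only; in the diff, mcpu starts out empty.
set(mcpu "-mcpu=sm_50")

# The extra flags follow ${mcpu} under a single COMPILE_OPT keyword.
demo_builtin_set(libspirv-demo
  TRIPLE nvptx64--nvidiacl
  TARGET_ENV libspirv
  COMPILE_OPT ${mcpu} "SHELL:-Xclang -target-feature" "SHELL:-Xclang +ptx64")
```

Each quoted `SHELL:` entry stays a single list element here, so the function receives three compile options.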
set( mcpu )
# FIXME: Ideally we would not be tied to a specific PTX ISA version
if( ${ARCH} STREQUAL nvptx OR ${ARCH} STREQUAL nvptx64 )
  set( flags "SHELL:-Xclang -target-feature" "SHELL:-Xclang +ptx64")
Why is it necessary to use "SHELL:" here and then string( REGEX REPLACE "SHELL:" later?
add_target_options only works if the SHELL: is there, but add_custom_command only works if the SHELL: is not there.
This is definitely a bit of a hack, but it seemed less error-prone than defining the same set of flags twice. If there's a more standard way to do this, please let me know and I'll fix it.
Makes sense. I'm no CMake expert so I'm not quite sure how to make it better.
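For readers following this thread, the SHELL: workaround can be sketched as a standalone script (runnable with `cmake -P`). This is an illustration under assumptions, not the code committed in the PR: `my_target` and `cmd_flags` are hypothetical names, and the sketch uses `string(REPLACE)` plus `separate_arguments` rather than the regex replacement mentioned above. The background is that CMake de-duplicates compile options on a target, so a repeated `-Xclang` would be dropped unless the entries carry the `SHELL:` prefix, whereas a custom command needs each flag as a plain, separate argument.

```cmake
# Flag list as it appears in the diff: one definition, shared by both consumers.
set(flags "SHELL:-Xclang -target-feature" "SHELL:-Xclang +ptx64")

# Consumer 1 (shown as a comment, since targets do not exist in script mode):
# the SHELL: prefix keeps each "-Xclang <arg>" pair together and protects the
# repeated -Xclang from option de-duplication.
#   target_compile_options(my_target PRIVATE ${flags})   # my_target is hypothetical

# Consumer 2: a custom command wants plain arguments, so strip the prefix
# and split every entry into its individual words.
set(cmd_flags)
foreach(f IN LISTS flags)
  string(REPLACE "SHELL:" "" f "${f}")
  separate_arguments(f UNIX_COMMAND "${f}")
  list(APPEND cmd_flags ${f})
endforeach()

# cmd_flags now holds four separate arguments, ready to be placed after the
# compiler in an add_custom_command(... COMMAND <compiler> ${cmd_flags} ...).
message(STATUS "custom-command flags: ${cmd_flags}")
```

Keeping a single `flags` definition and deriving the second form from it avoids maintaining the same flag set twice, which is the trade-off described in the reply above.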