-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace SYCL backend reduce_by_segment
implementation with reduce-then-scan call
#1915
base: main
Are you sure you want to change the base?
Conversation
66ead80
to
53adeb8
Compare
I have made some design changes based on offline discussion. There is quite a bit of code movement that has happened, so here is a summary of the recently made changes:
|
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
This reverts commit 0e0d50e.
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
…e testing Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
…write operations Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
…binary size Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
…ce-then-scan Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
* An iterator based __pattern_reduce_by_segment is added * Due to compiler issues prior to icpx 2025.0, the reduce-then-scan path is disabled and the previous handcrafted SYCL implementation is restored to prevent performance regressions with older compilers * The previous range-based fallback implementation has been moved to the SYCL backend along with the handcrafted SYCL version Signed-off-by: Matthew Michel <matthew.michel@intel.com>
…N macro" This reverts commit a4c7835.
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
5643e6c
to
fdf6a39
Compare
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this implementation and thing it is basically good to go.
I think more comments are necessary, and there is some potential for future gains.
I probably want to look a bit further in to minor details before approving with another pass but at a high level I think this is in good shape.
namespace __par_backend_hetero | ||
{ | ||
|
||
template <typename... Name> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed in an outside editor that the only changes to this from the previous location are only cosmetic. I'm not looking deeply into this files changes otherwise since it has already been reviewed and it was just moved.
{ | ||
const _KeyType& __next_key = __in_keys[__id + 1]; | ||
return oneapi::dpl::__internal::make_tuple( | ||
oneapi::dpl::__internal::make_tuple(std::size_t{0}, _ValueType{__in_vals[__id]}), | ||
!__binary_pred(__current_key, __next_key), __next_key, __current_key); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be possible to avoid the __id == 0
case, in a similar way to unique. It is a little more complicated because we would need to set up the carry-in appropriately, but I think its possible and could provide some branch avoiding (and tuple shrinking) gains in the helpers.
If you think its possible to do this, lets leave it as an issue to be explored in a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass of comments. I have looked at primarily the fallback algorithms and intend to focus on the reduce-then-scan implementation next.
|
||
#include "sycl_defs.h" | ||
#include "parallel_backend_sycl_utils.h" | ||
#include "parallel_backend_sycl_reduce.h" | ||
#include "parallel_backend_sycl_merge.h" | ||
#include "parallel_backend_sycl_merge_sort.h" | ||
#include "parallel_backend_sycl_reduce_by_segment.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell, there are now three implementations of reduce-by-segment: a reduce-then-scan version, a fallback version with known identities and a fallback version with no known identities. It is only the latter that is in this new header file.
I think since we are introducing a new header file for reduce-by-segment, it might make sense to move all implementations to the new header file instead of just one of the fallback algorithms.
Edit: I just saw your comment about not wanting to move these because of forward declarations. In that case, I can see why it is structured the way it is and I'm fine with the current layout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will reevaluate to see if consolidating all the implementations in the same header and having the forward declarations is cleaner than what is currently done.
sycl::nd_item<1> __item) { | ||
auto __group = __item.get_group(); | ||
std::size_t __group_id = __item.get_group(0); | ||
std::size_t __local_id = __item.get_local_id(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could probably safely be a uint32_t
. Or, you can remove this variable and replace its only usage with if (__group.leader()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this type to be std::uint32_t
|
||
// 2c. Count the number of prior work segments cooperatively over group | ||
std::size_t __prior_segs_in_wg = __dpl_sycl::__exclusive_scan_over_group( | ||
__group, __item_segments, __dpl_sycl::__plus<decltype(__item_segments)>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since __item_segments
is defined with a type just before the loop above, I think it's okay to use the type here for more clarity.
__group, __item_segments, __dpl_sycl::__plus<decltype(__item_segments)>()); | |
__group, __item_segments, __dpl_sycl::__plus<std::size_t>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this occurrence and a few lines below in the inclusive scan call as well which was a similar case.
if (__local_id == 0) | ||
{ | ||
__apply_aggs = false; | ||
if (__global_id == 0 && __n > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the __n > 0
condition necessary here? I believe that is handled in a function higher in the stack (__pattern_reduce_by_segment
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, it is not needed and I have removed it.
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high level I agree with the changes in this PR, but there are still a few remaining nit picks outstanding.
I have run out of time before my time off to get into the small details like sizes of types and forwarding of references, things like that. The clang format suggestions can be ignored as of now.
So, I wont hit approve officially but I think this is very close and trust @adamfidel / others to be able to get it across the finish line and have no objections to merging with another approval.
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Summary
This PR implements a SYCL backend
reduce_by_segment
by using higher level calls to reduce-then-scan along with new specialty functors to achieve a segmented reduction. This PR is an initial step of porting the implementation to reduce-then-scan with optimization likely to follow. Future efforts may include additional modification to reduce-then-scan kernels.Performance improves for all input sizes. For small inputs, we see 3-5x improvements and for very large sizes ~1.25x on GPU Series Max 1550. Please contact me if you would like to see performance data.
Description of changes
reduce_by_segment
implementation that was previously handwritten is replaced by a higher level call to our reduce-then-scan kernels. Several new callback functors for the reduce-then-scan kernel have been made to achieve this operation.reduce_by_segment.pass
was encountering linker crashes due to the large number of test cases being compiled growing past the maximum size of the binary's data region. SYCL testing has been trimmed down with regards to USM device and shared testing which resolves this issue. Instead of running each test with a device and shared USM allocation, every other test switches the USM type.ONEDPL_WORKAROUND_FOR_IGPU_64BIT_REDUCTION
has been removed as the SYCL implementation has been replaced, and we are no longer impacted by this issue.reduce_by_segment
implementation is used as a fallback for when the sub-group size, device, and trivial copyability constraints cannot be satisfied.Future work
Future efforts on
reduce_by_segment
may built on top of this implementation and the reduce-then-scan kernels to better handle first and last element cases.