
Replace SYCL backend reduce_by_segment implementation with reduce-then-scan call #1915

Open · wants to merge 31 commits into main from dev/mmichel11/rts_reduce_by_segment

Conversation

@mmichel11 (Contributor)

Summary

This PR implements the SYCL backend reduce_by_segment using higher-level calls to reduce-then-scan along with new specialty functors to achieve a segmented reduction. This PR is an initial step of porting the implementation to reduce-then-scan, with optimization likely to follow. Future efforts may include additional modifications to the reduce-then-scan kernels.

Performance improves for all input sizes: for small inputs we see 3-5x improvements, and for very large sizes ~1.25x, on a GPU Max 1550. Please contact me if you would like to see performance data.
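
To make the approach concrete, here is a minimal host-side sketch (not the PR's actual kernels or functors) of how a segmented reduction can be phrased as an ordinary inclusive scan over (value, new-segment flag) pairs with an associative but non-commutative combiner. The real implementation expresses the same idea through the new callback functors passed to the reduce-then-scan SYCL kernels.

#include <cstddef>
#include <iostream>
#include <vector>

struct Seg
{
    int value;  // running reduction within the current segment
    bool reset; // true if this element starts a new segment
};

// Associative (but non-commutative) combiner: a right-hand side that starts
// a new segment discards the carry from the left, which is what lets a plain
// scan respect segment boundaries.
Seg
combine(const Seg& lhs, const Seg& rhs)
{
    return rhs.reset ? rhs : Seg{lhs.value + rhs.value, lhs.reset};
}

int
main()
{
    std::vector<int> keys{0, 0, 1, 1, 1, 2};
    std::vector<int> vals{1, 2, 3, 4, 5, 6};

    std::vector<Seg> scanned(vals.size());
    Seg carry{0, true};
    for (std::size_t i = 0; i < vals.size(); ++i)
    {
        const bool new_segment = (i == 0) || keys[i] != keys[i - 1];
        carry = combine(carry, Seg{vals[i], new_segment});
        scanned[i] = carry;
    }
    // An element ends its segment when the next element starts a new one;
    // only those positions are written to the output.
    for (std::size_t i = 0; i < vals.size(); ++i)
        if (i + 1 == vals.size() || keys[i + 1] != keys[i])
            std::cout << keys[i] << ": " << scanned[i].value << '\n'; // 0: 3, 1: 12, 2: 6
}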

Description of changes

  • The SYCL reduce_by_segment implementation that was previously handwritten is replaced by a higher-level call to our reduce-then-scan kernels. Several new callback functors for the reduce-then-scan kernel have been added to achieve this operation.
  • reduce_by_segment.pass was encountering linker crashes because the large number of compiled test cases grew past the maximum size of the binary's data region. SYCL testing of USM device and shared allocations has been trimmed to resolve this: instead of running each test with both a device and a shared USM allocation, every other test alternates the USM type (see the sketch after this list).
  • ONEDPL_WORKAROUND_FOR_IGPU_64BIT_REDUCTION has been removed as the SYCL implementation has been replaced, and we are no longer impacted by this issue.
  • The legacy reduce_by_segment implementation is used as a fallback when the sub-group size, device, and trivial-copyability constraints cannot be satisfied.
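
As a sketch of the USM test trimming mentioned above (the helper name is hypothetical, not the test suite's actual code), alternating the allocation kind by test index keeps coverage of both USM kinds while each test case is compiled only once:

#include <cstddef>
#include <sycl/sycl.hpp>

// Hypothetical helper: even-numbered tests allocate device USM and
// odd-numbered tests allocate shared USM, so both kinds are still exercised
// across the suite without instantiating every test for both.
inline sycl::usm::alloc
usm_kind_for_test(std::size_t test_index)
{
    return test_index % 2 == 0 ? sycl::usm::alloc::device : sycl::usm::alloc::shared;
}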

Future work

Future efforts on reduce_by_segment may build on top of this implementation and the reduce-then-scan kernels to better handle first- and last-element cases.

@mmichel11 (Author)

I have made some design changes based on offline discussion. Quite a bit of code has moved, so here is a summary of the recent changes:

  • An iterator-based __pattern_reduce_by_segment has been added to algorithm_impl_hetero.h. Previously, we just had a range-based version. This resolves the issue of calling range-based patterns from iterator-based algorithms.
  • The fallback reduce_by_segment implementation based on high-level copy_if and parallel_for calls has been moved down a level from algorithm_ranges_impl_hetero.h to dpcpp/parallel_backend_sycl.h so that we can fall back on it when reduce-then-scan cannot be used in __parallel_reduce_by_segment. This pattern is implemented synchronously, as parallel pattern calls cannot currently depend on each other.
  • Due to observed performance issues with compilers prior to icpx 2025.0, the reduce-then-scan path is disabled for those compilers. The known-identity-based implementation has been added back to avoid introducing a performance regression for icpx 2024.2.1 and earlier. It has been moved into dpcpp/parallel_backend_sycl_reduce_by_segment.h so the implementations are not split across several directories. (A simplified dispatch sketch follows this list.)
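
Below is a compilable, heavily simplified sketch of the resulting dispatch. All identifiers are hypothetical stand-ins: the real code keys the preprocessor guard off the icpx version and checks the device's sub-group sizes and the types' trivial copyability at runtime.

#include <cstddef>

struct execution_policy_stub { bool device_ok; }; // stand-in for a SYCL policy

// Stand-ins for the real implementations.
std::size_t reduce_then_scan_path(execution_policy_stub&) { return 0; }
std::size_t fallback_path(execution_policy_stub&) { return 0; }

#ifndef HYPOTHETICAL_ICPX_2025_OR_NEWER
#define HYPOTHETICAL_ICPX_2025_OR_NEWER 1 // real guard: compiler is icpx 2025.0+
#endif

std::size_t
parallel_reduce_by_segment(execution_policy_stub& exec)
{
#if HYPOTHETICAL_ICPX_2025_OR_NEWER
    // Runtime constraints: supported sub-group size, trivially copyable types.
    if (exec.device_ok)
        return reduce_then_scan_path(exec);
#endif
    // One of the fallbacks described above: the known-identity SYCL version,
    // or the copy_if + parallel_for version, which runs synchronously because
    // parallel pattern calls cannot currently depend on one another.
    return fallback_path(exec);
}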

@mmichel11 force-pushed the dev/mmichel11/rts_reduce_by_segment branch from 5643e6c to fdf6a39 on November 6, 2024.
@danhoeflinger (Contributor) left a comment:

I like this implementation and think it is basically good to go. I think more comments are necessary, and there is some potential for future gains.

I probably want to look a bit further into minor details before approving with another pass, but at a high level I think this is in good shape.

namespace __par_backend_hetero
{

template <typename... Name>
Contributor:

I confirmed in an outside editor that the only changes from the previous location are cosmetic. I'm not looking deeply into this file's changes otherwise, since it has already been reviewed and was just moved.

Comment on lines +849 to +854
{
const _KeyType& __next_key = __in_keys[__id + 1];
return oneapi::dpl::__internal::make_tuple(
oneapi::dpl::__internal::make_tuple(std::size_t{0}, _ValueType{__in_vals[__id]}),
!__binary_pred(__current_key, __next_key), __next_key, __current_key);
}
Contributor:

It may be possible to avoid the __id == 0 case, in a similar way to unique. It is a little more complicated because we would need to set up the carry-in appropriately, but I think it's possible and could provide some branch-avoiding (and tuple-shrinking) gains in the helpers.
If you think it's possible, let's leave it as an issue to be explored in a follow-up.

@adamfidel (Contributor) left a comment:

First pass of comments. I have looked primarily at the fallback algorithms and intend to focus on the reduce-then-scan implementation next.


#include "sycl_defs.h"
#include "parallel_backend_sycl_utils.h"
#include "parallel_backend_sycl_reduce.h"
#include "parallel_backend_sycl_merge.h"
#include "parallel_backend_sycl_merge_sort.h"
#include "parallel_backend_sycl_reduce_by_segment.h"
@adamfidel (Contributor) commented on Nov 20, 2024:

From what I can tell, there are now three implementations of reduce-by-segment: a reduce-then-scan version, a fallback version with known identities, and a fallback version with no known identities. Only the latter is in this new header file.

I think since we are introducing a new header file for reduce-by-segment, it might make sense to move all implementations to the new header file instead of just one of the fallback algorithms.


Edit: I just saw your comment about not wanting to move these because of forward declarations. In that case, I can see why it is structured the way it is and I'm fine with the current layout.

@mmichel11 (Author):

I will reevaluate whether consolidating all the implementations in the same header with forward declarations is cleaner than what is currently done.

sycl::nd_item<1> __item) {
auto __group = __item.get_group();
std::size_t __group_id = __item.get_group(0);
std::size_t __local_id = __item.get_local_id(0);
Contributor:

This could probably safely be a uint32_t. Or, you can remove this variable and replace its only usage with if (__group.leader()).

@mmichel11 (Author) commented on Nov 22, 2024:

I changed this type to std::uint32_t.


// 2c. Count the number of prior work segments cooperatively over group
std::size_t __prior_segs_in_wg = __dpl_sycl::__exclusive_scan_over_group(
__group, __item_segments, __dpl_sycl::__plus<decltype(__item_segments)>());
Contributor:

Since __item_segments is defined with a type just before the loop above, I think it's okay to use the type here for more clarity.

Suggested change:
-    __group, __item_segments, __dpl_sycl::__plus<decltype(__item_segments)>());
+    __group, __item_segments, __dpl_sycl::__plus<std::size_t>());

@mmichel11 (Author):

I changed this occurrence, as well as a similar case a few lines below in the inclusive scan call.

if (__local_id == 0)
{
__apply_aggs = false;
if (__global_id == 0 && __n > 0)
Contributor:

Is the __n > 0 condition necessary here? I believe that is handled in a function higher in the stack (__pattern_reduce_by_segment).

@mmichel11 (Author):

Good point; it is not needed, and I have removed it.

@danhoeflinger (Contributor) left a comment:

At a high level I agree with the changes in this PR, but there are still a few remaining nitpicks outstanding.

I have run out of time before my time off to get into small details like sizes of types and forwarding of references. The clang-format suggestions can be ignored for now.

So I won't officially hit approve, but I think this is very close; I trust @adamfidel and others to get it across the finish line, and I have no objections to merging with another approval.
