Improve SYCL backend __parallel_for performance for large input sizes #1870
Conversation
@mmichel11 I have taken a look at the history of this branch; it probably makes sense to rebase your branch onto the current state of main, or to merge the main branch into your PR: there are a lot of new commits in main now.
There are some things not directly related to the main part of the PR; the main part itself looks good to me.
```cpp
const bool __is_full_sub_group =
    __sub_group_start_idx + __iters_per_work_item * __sub_group_size <= __count;
const std::size_t __work_item_idx = __sub_group_start_idx + __sub_group_local_id;
return std::make_tuple(__work_item_idx, __sub_group_size, __is_full_sub_group);
```
Could you include `<tuple>`?
Done
include/oneapi/dpl/pstl/utils.h
```cpp
template <typename _Tuple>
class __min_tuple_type_size;

template <typename _T>
class __min_tuple_type_size<std::tuple<_T>>
{
  public:
    static constexpr std::size_t value = sizeof(_T);
};

template <typename _T, typename... _Ts>
class __min_tuple_type_size<std::tuple<_T, _Ts...>>
{
    static constexpr std::size_t __min_type_value_ts = __min_tuple_type_size<std::tuple<_Ts...>>::value;

  public:
    static constexpr std::size_t value = std::min(sizeof(_T), __min_type_value_ts);
};

template <typename _Tuple>
inline constexpr std::size_t __min_tuple_type_size_v = __min_tuple_type_size<_Tuple>::value;
```
This can be simplified:

```cpp
template <typename _Tuple>
struct __min_tuple_type_size;

template <typename... Ts>
struct __min_tuple_type_size<std::tuple<Ts...>>
{
    static constexpr std::size_t value = std::min({sizeof(Ts)...});
};
```

The `_v` alias is not necessary as it is used only once.
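For illustration, a quick compile-time check of the simplified trait (the sample tuple below is an assumption; the block adds the headers the fold needs):

```cpp
#include <algorithm> // std::min
#include <cstddef>   // std::size_t
#include <tuple>

template <typename _Tuple>
struct __min_tuple_type_size;

template <typename... Ts>
struct __min_tuple_type_size<std::tuple<Ts...>>
{
    // std::min over an initializer list picks the smallest sizeof among all
    // element types in one step, replacing the recursive specializations.
    static constexpr std::size_t value = std::min({sizeof(Ts)...});
};

// char (1 byte) is the smallest element type in this sample tuple.
static_assert(__min_tuple_type_size<std::tuple<double, char, int>>::value == 1);
```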
Thanks, I have replaced `__min_tuple_type_size` with `__min_nested_type_size`, which avoids having to flatten the tuple first, and applied these ideas there.
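A minimal sketch of how such a non-flattening trait could look, recursing into nested std::tuple types directly (an illustration based on the comment above, not the PR's exact code):

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>

// Base case: for a non-tuple type, the minimum size is simply sizeof(_T).
template <typename _T>
struct __min_nested_type_size
{
    static constexpr std::size_t value = sizeof(_T);
};

// Recursive case: for a tuple, take the minimum over the recursively
// computed minima of each element type. Assumes non-empty tuples.
template <typename... _Ts>
struct __min_nested_type_size<std::tuple<_Ts...>>
{
    static constexpr std::size_t value = std::min({__min_nested_type_size<_Ts>::value...});
};

// Nested tuples are handled without flattening: char (1 byte) wins here.
static_assert(__min_nested_type_size<std::tuple<double, std::tuple<char, int>>>::value == 1);
```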
include/oneapi/dpl/pstl/tuple_impl.h
```cpp
template <typename... _Args>
using __decay_with_tuple_specialization_t = typename __decay_with_tuple_specialization<_Args...>::type;

// Flatten nested std::tuple or oneapi::dpl::__internal::tuple types into a single std::tuple.
template <typename _T>
struct __flatten_std_or_internal_tuple
```
Optional suggestion: `__flatten_std_or_internal_tuple` -> `__flatten_tuple`.
This utility has been removed.
I'd recommend moving `__flatten_std_or_internal_tuple` into `utils.h`. It is a niche utility not related to the core part of the class.
This was originally done since `tuple_impl.h` includes `utils.h`, so we would otherwise have to forward declare our internal tuple to avoid a circular dependency. The new utility works on an arbitrary type and doesn't require any specializations for our internal tuple, so it can be easily placed in `utils.h`.
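As background on the include-cycle concern: since `utils.h` cannot include `tuple_impl.h`, any trait in `utils.h` that had to name the internal tuple would need a forward declaration, roughly like this sketch (the trait name is hypothetical; only the forward-declaration technique is the point):

```cpp
#include <type_traits>

namespace oneapi { namespace dpl { namespace __internal
{
// Forward declaration only; the definition lives in tuple_impl.h.
// This breaks the include cycle when utils.h must mention the type.
template <typename... _Ts>
struct tuple;

// Hypothetical trait that recognizes the internal tuple without
// requiring its full definition.
template <typename _T>
struct __is_internal_tuple : std::false_type {};

template <typename... _Ts>
struct __is_internal_tuple<tuple<_Ts...>> : std::true_type {};
}}} // namespace oneapi::dpl::__internal
```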
```cpp
__stride_recommender(const sycl::nd_item<1>& __item, std::size_t __count, std::size_t __iters_per_work_item,
                     std::size_t __work_group_size)
{
    if constexpr (oneapi::dpl::__internal::__is_spirv_target_v)
```
Could you include `utils.h`, where `__is_spirv_target_v` is defined?
Done
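For context, the stride recommendation feeds a strided processing loop. Here is a hedged sketch of the sub-group-strided variant (simplified, with assumed names; not the PR's exact kernel):

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Each work item handles __iters_per_work_item elements, strided by the
// sub-group size, so that on every iteration consecutive work items in a
// sub-group touch consecutive addresses -- a coalesced access pattern.
template <typename _F>
void
__process_sub_group_strided(const sycl::nd_item<1>& __item, std::size_t __count,
                            std::size_t __iters_per_work_item, _F __f)
{
    auto __sub_group = __item.get_sub_group();
    const std::size_t __sg_size = __sub_group.get_local_linear_range();
    // Start of this sub-group's contiguous chunk: the sub-group leader's
    // global id times the number of elements each work item processes.
    const std::size_t __sg_start =
        (__item.get_global_linear_id() - __sub_group.get_local_linear_id()) * __iters_per_work_item;
    std::size_t __idx = __sg_start + __sub_group.get_local_linear_id();
    for (std::size_t __i = 0; __i < __iters_per_work_item; ++__i, __idx += __sg_size)
        if (__idx < __count)
            __f(__idx); // apply the brick at a bounds-checked index
}
```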
Commit messages:
- 128-byte memory operations are performed instead of 512 after inspecting the assembly. Processing 512 bytes per sub-group still seems to be the best value after experimentation.
- …ute work for small inputs
- This reverts commit e4cbceb. Small sizes are slightly slower, and for horizontal vectorization no "real" benefit is observed.
- Small but measurable overheads can be observed for small inputs where runtime dispatch in the kernel is present to check for the correct path to take. Letting the compiler handle the small input case in the original kernel shows the best performance.
- We now flatten the user-provided ranges and find the minimum sized type to estimate the best __iters_per_work_item. This benefits performance in calls that wrap multiple buffers in a single input / output through a zip_iterator (e.g. dpct::scatter_if in SYCLomatic compatibility headers).
- Move __stride_recommender into __parallel_for_large_submitter; use {} to invoke the constructor; simplify if-else statements in the for dispatch.
Thanks for the reviews, everyone. I have addressed all current comments. As discussed offline, we will use the current state of the PR as a starting point to introduce vectorized load / store paths where they are performant by rewriting our bricks for parallel for. I will likely open a second PR into this branch with those changes once they are complete.
Summary

This PR improves `__parallel_for` performance for large input sizes by switching to an nd-range kernel that processes multiple inputs per work item, which enables us to use the full hardware bandwidth.

Details
On some target architectures, we are currently not hitting roofline memory bandwidth in our `__parallel_for` pattern. The cause is that our SYCL basic kernel implementation processes only a single element per work item, which is insufficient to fully utilize memory bandwidth on these architectures. Processing multiple inputs per work item enables us to issue enough loads / stores to saturate the hardware bandwidth, and explicitly using a coalesced pattern through either a sub-group or work-group stride ensures a good access pattern.

An nd-range kernel has been added for large input sizes; it uses a heuristic based upon the smallest sized type in the set of provided ranges to determine the number of iterations to process per work item. This drastically improves performance on target architectures for large inputs across nearly all for-based algorithms.
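As a rough illustration of such a heuristic (the constants are assumptions for the example, apart from the 512 bytes per sub-group figure mentioned in the commit notes above):

```cpp
#include <cstddef>

// Hypothetical sketch: target a fixed number of bytes per sub-group and
// divide by the smallest element type, so narrower types get more
// iterations per work item.
constexpr std::size_t __bytes_per_sub_group = 512; // per the commit notes
constexpr std::size_t __sub_group_size = 32;       // assumed for illustration

template <std::size_t _MinTypeSize>
constexpr std::size_t
__iters_per_work_item()
{
    return __bytes_per_sub_group / (__sub_group_size * _MinTypeSize);
}

static_assert(__iters_per_work_item<4>() == 4);  // e.g. int: 512 / (32 * 4)
static_assert(__iters_per_work_item<1>() == 16); // e.g. char: 512 / (32 * 1)
```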
A second kernel has been added, rather than merging both paths into a single kernel, to avoid extra runtime dispatch within the kernel, which hurt performance for small inputs. There is a smaller runtime overhead for selecting the best path on the host and compiling two kernels. For small-to-medium inputs, the SYCL basic kernel performs best.
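A hedged sketch of that host-side selection (the threshold and names are assumptions, not the PR's values):

```cpp
#include <cstddef>

// Hypothetical dispatch: the basic kernel handles small-to-medium inputs,
// while the strided nd-range kernel handles large ones. Keeping the branch
// on the host avoids per-work-item dispatch inside the kernel.
constexpr std::size_t __large_input_threshold = 1 << 21; // assumed value

template <typename _BasicSubmit, typename _LargeSubmit>
void
__submit_parallel_for(std::size_t __count, _BasicSubmit __basic, _LargeSubmit __large)
{
    if (__count < __large_input_threshold)
        __basic(__count); // one element per work item
    else
        __large(__count); // multiple elements per work item, strided
}
```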