Improve SYCL backend __parallel_for performance for large input sizes #1870
Conversation
@mmichel11 I have taken a look at the history of this branch; it probably makes sense to rebase your branch onto the current state of main, or to merge the main branch into your PR: there are a lot of new commits in main now.
There are some things not directly related to the main part of the PR; the main part itself looks good to me.
```cpp
const bool __is_full_sub_group =
    __sub_group_start_idx + __iters_per_work_item * __sub_group_size <= __count;
const std::size_t __work_item_idx = __sub_group_start_idx + __sub_group_local_id;
return std::make_tuple(__work_item_idx, __sub_group_size, __is_full_sub_group);
```
Could you include `<tuple>`?
Done
include/oneapi/dpl/pstl/utils.h
```cpp
template <typename _Tuple>
class __min_tuple_type_size;

template <typename _T>
class __min_tuple_type_size<std::tuple<_T>>
{
  public:
    static constexpr std::size_t value = sizeof(_T);
};

template <typename _T, typename... _Ts>
class __min_tuple_type_size<std::tuple<_T, _Ts...>>
{
    static constexpr std::size_t __min_type_value_ts = __min_tuple_type_size<std::tuple<_Ts...>>::value;

  public:
    static constexpr std::size_t value = std::min(sizeof(_T), __min_type_value_ts);
};

template <typename _Tuple>
inline constexpr std::size_t __min_tuple_type_size_v = __min_tuple_type_size<_Tuple>::value;
```
This can be simplified:

```cpp
template <typename _Tuple>
struct __min_tuple_type_size;

template <typename... Ts>
struct __min_tuple_type_size<std::tuple<Ts...>>
{
    static constexpr std::size_t value = std::min({sizeof(Ts)...});
};
```

The `_v` alias is not necessary as it is used only once.
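For illustration, a quick compile-time check of the simplified trait (the sample tuple below is an assumption; the block adds the headers the fold needs):

```cpp
#include <algorithm> // std::min
#include <cstddef>   // std::size_t
#include <tuple>

template <typename _Tuple>
struct __min_tuple_type_size;

template <typename... Ts>
struct __min_tuple_type_size<std::tuple<Ts...>>
{
    // std::min over an initializer list picks the smallest sizeof among all
    // element types in one step, replacing the recursive specializations.
    static constexpr std::size_t value = std::min({sizeof(Ts)...});
};

// char (1 byte) is the smallest element type in this sample tuple.
static_assert(__min_tuple_type_size<std::tuple<double, char, int>>::value == 1);
```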
Thanks, I have replaced `__min_tuple_type_size` with `__min_nested_type_size`, which avoids having to flatten the tuple first, and applied these ideas there.
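A minimal sketch of how such a non-flattening trait could look, recursing into nested std::tuple types directly (an illustration based on the comment above, not the PR's exact code):

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>

// Base case: for a non-tuple type, the minimum size is simply sizeof(_T).
template <typename _T>
struct __min_nested_type_size
{
    static constexpr std::size_t value = sizeof(_T);
};

// Recursive case: for a tuple, take the minimum over the recursively
// computed minima of each element type. Assumes non-empty tuples.
template <typename... _Ts>
struct __min_nested_type_size<std::tuple<_Ts...>>
{
    static constexpr std::size_t value = std::min({__min_nested_type_size<_Ts>::value...});
};

// Nested tuples are handled without flattening: char (1 byte) wins here.
static_assert(__min_nested_type_size<std::tuple<double, std::tuple<char, int>>>::value == 1);
```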
include/oneapi/dpl/pstl/tuple_impl.h
```cpp
template <typename... _Args>
using __decay_with_tuple_specialization_t = typename __decay_with_tuple_specialization<_Args...>::type;

// Flatten nested std::tuple or oneapi::dpl::__internal::tuple types into a single std::tuple.
template <typename _T>
struct __flatten_std_or_internal_tuple
```
Optional suggestion: `__flatten_std_or_internal_tuple` -> `__flatten_tuple`.
This utility has been removed.
I'd recommend moving `__flatten_std_or_internal_tuple` into `utils.h`. It is a niche utility not related to the core part of the class.
This was originally done since `tuple_impl.h` includes `utils.h`, so we would otherwise have to forward declare our internal tuple to avoid a circular dependency. The new utility works on an arbitrary type and doesn't require any specializations for our internal tuple, so it can be easily placed in `utils.h`.
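As background on the include-cycle concern: since `utils.h` cannot include `tuple_impl.h`, any trait in `utils.h` that had to name the internal tuple would need a forward declaration, roughly like this sketch (the trait name is hypothetical; only the forward-declaration technique is the point):

```cpp
#include <type_traits>

namespace oneapi { namespace dpl { namespace __internal
{
// Forward declaration only; the definition lives in tuple_impl.h.
// This breaks the include cycle when utils.h must mention the type.
template <typename... _Ts>
struct tuple;

// Hypothetical trait that recognizes the internal tuple without
// requiring its full definition.
template <typename _T>
struct __is_internal_tuple : std::false_type {};

template <typename... _Ts>
struct __is_internal_tuple<tuple<_Ts...>> : std::true_type {};
}}} // namespace oneapi::dpl::__internal
```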
```cpp
__stride_recommender(const sycl::nd_item<1>& __item, std::size_t __count, std::size_t __iters_per_work_item,
                     std::size_t __work_group_size)
{
    if constexpr (oneapi::dpl::__internal::__is_spirv_target_v)
```
Could you include `utils.h`, where `__is_spirv_target_v` is defined?
Done
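For context, the stride recommendation feeds a strided processing loop. Here is a hedged sketch of the sub-group-strided variant (simplified, with assumed names; not the PR's exact kernel):

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Each work item handles __iters_per_work_item elements, strided by the
// sub-group size, so that on every iteration consecutive work items in a
// sub-group touch consecutive addresses -- a coalesced access pattern.
template <typename _F>
void
__process_sub_group_strided(const sycl::nd_item<1>& __item, std::size_t __count,
                            std::size_t __iters_per_work_item, _F __f)
{
    auto __sub_group = __item.get_sub_group();
    const std::size_t __sg_size = __sub_group.get_local_linear_range();
    // Start of this sub-group's contiguous chunk: the sub-group leader's
    // global id times the number of elements each work item processes.
    const std::size_t __sg_start =
        (__item.get_global_linear_id() - __sub_group.get_local_linear_id()) * __iters_per_work_item;
    std::size_t __idx = __sg_start + __sub_group.get_local_linear_id();
    for (std::size_t __i = 0; __i < __iters_per_work_item; ++__i, __idx += __sg_size)
        if (__idx < __count)
            __f(__idx); // apply the brick at a bounds-checked index
}
```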
Commit messages:
- 128-byte memory operations are performed instead of 512 after inspecting the assembly. Processing 512 bytes per sub-group still seems to be the best value after experimentation.
- …ute work for small inputs
- This reverts commit e4cbceb. Small sizes are slightly slower, and for horizontal vectorization no "real" benefit is observed.
- Small but measurable overheads can be observed for small inputs where runtime dispatch in the kernel is present to check for the correct path to take. Letting the compiler handle the small input case in the original kernel shows the best performance.
- We now flatten the user-provided ranges and find the minimum sized type to estimate the best __iters_per_work_item. This benefits performance in calls that wrap multiple buffers in a single input / output through a zip_iterator (e.g. dpct::scatter_if in SYCLomatic compatibility headers).
- Move __stride_recommender into __parallel_for_large_submitter; use {} to invoke the constructor; simplify if-else statements in the for dispatch.
Thanks for the reviews, everyone. I have addressed all current comments. As discussed offline, we will use the current state of the PR as a starting point to introduce vectorized load / store paths where they are performant by rewriting our bricks for parallel for. I will likely open a second PR into this branch with those changes once they are complete.
Summary

This PR improves `__parallel_for` performance for large input sizes by switching to an nd-range kernel that processes multiple inputs per work item, which enables us to use the full hardware bandwidth.

Details
On some target architectures, we are currently not hitting roofline memory bandwidth in our `__parallel_for` pattern. The cause is that our SYCL basic kernel implementation processes only a single element per work item, which is insufficient to fully utilize memory bandwidth on these architectures. Processing multiple inputs per work item enables us to issue enough loads / stores to saturate the hardware bandwidth, and explicitly using a coalesced pattern through either a sub-group or work-group stride ensures a good access pattern.

An nd-range kernel has been added for large input sizes; it uses a heuristic based upon the smallest sized type in the set of provided ranges to determine the number of iterations to process per work item. This drastically improves performance on target architectures for large inputs across nearly all for-based algorithms.
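As a rough illustration of such a heuristic (the constants are assumptions for the example, apart from the 512 bytes per sub-group figure mentioned in the commit notes above):

```cpp
#include <cstddef>

// Hypothetical sketch: target a fixed number of bytes per sub-group and
// divide by the smallest element type, so narrower types get more
// iterations per work item.
constexpr std::size_t __bytes_per_sub_group = 512; // per the commit notes
constexpr std::size_t __sub_group_size = 32;       // assumed for illustration

template <std::size_t _MinTypeSize>
constexpr std::size_t
__iters_per_work_item()
{
    return __bytes_per_sub_group / (__sub_group_size * _MinTypeSize);
}

static_assert(__iters_per_work_item<4>() == 4);  // e.g. int: 512 / (32 * 4)
static_assert(__iters_per_work_item<1>() == 16); // e.g. char: 512 / (32 * 1)
```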
A second kernel has been added, rather than merging both paths into a single kernel, to avoid extra runtime dispatch within the kernel, which hurt performance for small inputs. There is a smaller runtime overhead for selecting the best path on the host and compiling two kernels. For small-to-medium inputs, the SYCL basic kernel performs best.
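A hedged sketch of that host-side selection (the threshold and names are assumptions, not the PR's values):

```cpp
#include <cstddef>

// Hypothetical dispatch: the basic kernel handles small-to-medium inputs,
// while the strided nd-range kernel handles large ones. Keeping the branch
// on the host avoids per-work-item dispatch inside the kernel.
constexpr std::size_t __large_input_threshold = 1 << 21; // assumed value

template <typename _BasicSubmit, typename _LargeSubmit>
void
__submit_parallel_for(std::size_t __count, _BasicSubmit __basic, _LargeSubmit __large)
{
    if (__count < __large_input_threshold)
        __basic(__count); // one element per work item
    else
        __large(__count); // multiple elements per work item, strided
}
```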