Improve performance for SYCL parallel_reduce #3732
Conversation
Force-pushed from 7c46954 to d238a47
Force-pushed from d238a47 to b9844de
Benchmarks showed better performance for the recursive implementation, so I dropped the commit that lets the last block do the final reduction. A backup of that implementation can be found at https://github.com/masterleinad/kokkos/tree/sycl_reduce_performance_backup.
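For context, the "recursive" strategy mentioned above reduces the input to one partial result per work-group, then repeats the same reduction on the partials until a single value remains. A minimal host-side sketch of that idea (hypothetical helper, not the actual SYCL kernel code, which launches one kernel per pass):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side sketch of a recursive tree reduction: each pass
// collapses wgroup_size-element chunks into one partial sum, then the same
// pass runs again on the partials until one value is left.
double recursive_reduce(std::vector<double> data, std::size_t wgroup_size) {
  while (data.size() > 1) {
    std::vector<double> partials;
    for (std::size_t g = 0; g < data.size(); g += wgroup_size) {
      const std::size_t end = std::min(g + wgroup_size, data.size());
      partials.push_back(
          std::accumulate(data.begin() + g, data.begin() + end, 0.0));
    }
    data = std::move(partials);
  }
  return data.empty() ? 0.0 : data.front();
}
```

The alternative that was dropped (letting the last work-group finish the reduction in a single kernel launch) avoids the extra passes but needs an atomic counter to detect the last group.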
Force-pushed from b9844de to 6fc5a11
Force-pushed from bce6cc5 to 19eda78
Force-pushed from 19e1c92 to 1f32a62
It is hard to parse what is going on in this reduction algorithm :/
```cpp
const typename Policy::index_type upper_bound =
    (global_id + values_per_thread * wgroup_size) < size
        ? global_id + values_per_thread * wgroup_size
        : size;
```
`std::min` might be more readable than this ternary op.
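A sketch of that suggestion, pulled out into a hypothetical free function for illustration (`index_type` stands in for `Policy::index_type`; the real code computes this inline in the kernel):

```cpp
#include <algorithm>
#include <cstddef>

// Stand-in for Policy::index_type from the patch.
using index_type = std::size_t;

// Clamp the end of this thread's chunk to the total problem size.
// Equivalent to the ternary in the diff, but arguably easier to read.
index_type chunk_upper_bound(index_type global_id, index_type values_per_thread,
                             index_type wgroup_size, index_type size) {
  return std::min(global_id + values_per_thread * wgroup_size, size);
}
```

Whether `std::min` is usable in SYCL device code is exactly the open question raised in the reply below; it takes arguments by reference, which some device compilers have historically been picky about.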
Yes, I wasn't quite sure if that's supported on the device; I can check.
I thought I hadn't changed it too much, but I'm happy to add some more comments.
```cpp
const auto init_size = std::max<std::size_t>(
    ((size + values_per_thread - 1) / values_per_thread + wgroup_size - 1) /
        wgroup_size,
    1);
const auto value_count =
    FunctorValueTraits<ReducerTypeFwd, WorkTagFwd>::value_count(
        selected_reducer);
const auto results_ptr =
    static_cast<pointer_type>(Experimental::SYCLSharedUSMSpace().allocate(
        "SYCL parallel_reduce result storage",
        sizeof(*m_result_ptr) * std::max(value_count, 1u) * init_size));
```
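The `init_size` expression above is a nested ceiling division: first the number of per-thread chunks, then the number of work-groups needed to cover them, floored at 1 so the result buffer always has at least one slot. A hypothetical helper spelling that out (names are mine, not from the patch):

```cpp
#include <algorithm>
#include <cstddef>

// Mirrors the allocation-size computation in the diff: how many work-groups
// are needed when each thread serially consumes values_per_thread elements,
// clamped to at least 1 so the result buffer is never empty.
std::size_t reduction_init_size(std::size_t size, std::size_t values_per_thread,
                                std::size_t wgroup_size) {
  const std::size_t chunks =
      (size + values_per_thread - 1) / values_per_thread;  // ceil division
  return std::max<std::size_t>((chunks + wgroup_size - 1) / wgroup_size, 1);
}
```

Each of those work-groups then writes `value_count` values into the shared-USM result buffer, which is why the allocation multiplies the two.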
Yes, `sizeof(value_type)` should be the same. Let me try that.
I think the values-per-thread approach is good for dot products but bad for physics codes with reductions.
```cpp
const auto init_size =
    std::max<std::size_t>((size + wgroup_size - 1) / wgroup_size, 1);
constexpr size_t wgroup_size      = 128;
constexpr size_t values_per_thread = 2;
```
This `values_per_thread` is probably right for something like a dot product, but probably wrong for LAMMPS, where the reduction is over rather beefy kernels that often don't have enough parallelism (think 30k atoms, each looping serially over its neighbors and then reducing over all of them). Not sure if you care right now.
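To make the concern concrete: raising `values_per_thread` shrinks the launched grid by that factor, which is harmless when the input is huge but starves the device when there are only a few tens of thousands of heavy iterations. A hypothetical helper (my naming, mirroring the `init_size` arithmetic in the diff) showing the shrinkage:

```cpp
#include <cstddef>

// How many work-items a values_per_thread-chunked reduction actually
// launches: ceil(ceil(size / values_per_thread) / wgroup_size) groups
// (at least 1), times the work-group size.
std::size_t work_items_launched(std::size_t size, std::size_t values_per_thread,
                                std::size_t wgroup_size) {
  const std::size_t groups =
      ((size + values_per_thread - 1) / values_per_thread + wgroup_size - 1) /
      wgroup_size;
  return (groups == 0 ? 1 : groups) * wgroup_size;
}
```

For a 30k-element reduction with `wgroup_size = 128`, `values_per_thread = 2` roughly halves the occupied work-items compared with `values_per_thread = 1`, which is the LAMMPS-style scenario described above.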
- let the last block do the final reduction