-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BLAS IAMAX support #438
BLAS IAMAX support #438
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BLAS IxAMAX
functions only support the 1-D input array case. If we don't have a use case for 2-D input arrays, let's get rid of that case and thereby simplify the code.
src/blas/KokkosBlas1_iamax.hpp
Outdated
/// corresponding entry in X. | ||
/// | ||
/// \tparam RMV 1-D or 2-D Kokkos::View specialization. | ||
/// \tparam XMV 1-D or 2-D Kokkos::View specialization. It must have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation here doesn't match the "brief" description. R can't possibly have the same rank as X.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mhoemmen . It is fixed.
const int tid = teamMember.team_rank(); // threadId | ||
|
||
maxloc_type col_maxloc; | ||
Kokkos::parallel_reduce( Kokkos::TeamThreadRange(teamMember, m_x.extent(0)), [&] (const int& i, maxloc_type& thread_lmaxloc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to pass int
by reference. int i
is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhoemmen It is fixed.
typename RV::array_layout, | ||
typename XMV::device_type> RV_D; | ||
typedef MV_Iamax_FunctorVector<RV_D, XMV, mag_type, SizeType> functor_type; | ||
RV_D r_d("r_d", r.extent(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to allocate temporary storage here? Is the issue that r
must be host storage?
BLAS IxAMAX
functions only work with 1-D input arrays, so unless we have a use case for the 2-D case, why not just get rid of it and thereby reduce code complexity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhoemmen yes, r
is in host space. But, actually, we do not have a use case for the 2-D. I just added this for the sake of completeness. Of course, I can remove this 2-D case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 2-D case could be useful for panel factorizations, but in that case, r
would always be on device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhoemmen So, I need to change the multi-vector template such that r
is on device. I wasn't sure if r
should be on device or on host.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as these functions never allocate Views, I'm happy. You have a View, so you can check its memory space accessibility. The 1-D input case can either return a value (on host) or write to a 0-D View (on device).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mhoemmen. The code is fixed accordingly. Basically, there is no view allocation.
…ice for the 1-D input
…ice for the 1-D input
The code was modified such that results can be on device for TPL cuBLAS and it can either return a value on host or write to a 0-D View on device for the 1-D input case. |
|
||
maxloc_type col_maxloc; | ||
Kokkos::parallel_reduce( Kokkos::TeamThreadRange(teamMember, m_x.extent(0)), [&] (const int i, maxloc_type& thread_lmaxloc) { | ||
mag_type val = IPT::norm (m_x(i,lid)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vqd8a What you are doing is that parallel reduce along a vector, which is typically large compared to the number of columns (extent(1)). Logically this can do the job but it will be extremely slow comparable to a serial version. You should put the most outer parallel loop for the biggest work chunk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kyungjoo-kim Thanks. I have fixed the implementation as per your suggestion. Please take a look. Basically, it follows the idea of other existing BLAS functions.
src/blas/KokkosBlas1_iamax.hpp
Outdated
/// \tparam RMV 0-D or 1-D Kokkos::View specialization. | ||
/// \tparam XMV 1-D or 2-D Kokkos::View specialization. | ||
/// | ||
/// Special note for TPL cuBLAS: RMV must be 0-D view and XMV must be 1-D view, and the index returned in RMV is 1-based since cuBLAS uses 1-based indexing for compatibility with Fortran |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The index needs to be the same, regardless of the implementation. If that means subtracting 1 to convert from 1-based to 0-based indexing, then please do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhoemmen Understand. I already did the subtraction when the cuBLAS functions return a value on host.
But when the cuBLAS functions return directly the value on device memory, I do not know how to the subtraction efficiently. I do not want to call a kernel with only one thread for just doing the change for a single value so I leave the subtraction for user's kernel.
What would be the good way to do it? Can I just return the value to host and copy it to device if RMV is 0-D view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a kokkos-kernels design decision. I would say, if the BLAS returns a 1-based index, then kokkos-kernels should return a 1-based index. Kokkos users who want a 0-based index could easily implement their own, using Kokkos::MaxLoc
. I'm almost certain that a single kernel that kokkos-kernels provide would be faster than calling cuBLAS and then invoking another kernel just to decrement a single value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srajama1 What is your opinion on this issue? Should we use 0-based index or 1-based index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no good answer for this in the decades we have dealt with it. I recommend to do whatever is most efficient for our users, we can jump through hoops to help them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srajama1 @mhoemmen For now, I think I would use 0-based index for all cases, except when TPL cuBLAS iamax is used and returns result to a 0-D view on device memory. I would leave users doing the 1-based to 0-based conversion in their kernels if they need to. I added note to specify this case in the KokkosBlas1_iamax.hpp
However, I am open to changing to 0-based indexing for cuBLAS or using 1-based indexing for all cases. Please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a note in the wiki, explaining this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -140,7 +144,7 @@ struct MV_Iamax_FunctorVector | |||
const int tid = teamMember.team_rank(); // threadId | |||
|
|||
maxloc_type col_maxloc; | |||
Kokkos::parallel_reduce( Kokkos::TeamThreadRange(teamMember, m_x.extent(0)), [&] (const int& i, maxloc_type& thread_lmaxloc) { | |||
Kokkos::parallel_reduce( Kokkos::TeamThreadRange(teamMember, m_x.extent(0)), [&] (const int i, maxloc_type& thread_lmaxloc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please label all kernels; thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhoemmen I just did.
unit_test/blas/Test_Blas1_iamax.hpp
Outdated
|
||
#ifdef KOKKOSKERNELS_ENABLE_TPL_CUBLAS | ||
if(std::is_same<typename Device::memory_space,Kokkos::CudaSpace>::value) | ||
const_max_loc = h_r()-1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comment below.
src/blas/KokkosBlas1_iamax.hpp
Outdated
@@ -136,7 +136,10 @@ iamax (const RV& R, const XMV& X, | |||
typename RV::non_const_value_type, | |||
typename RV::non_const_value_type* >::type, | |||
typename KokkosKernels::Impl::GetUnifiedLayout<RV>::array_layout, | |||
typename RV::device_type, | |||
typename Kokkos::Impl::if_c< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use std::conditional
(lives in <type_traits>
). It works just like if_c
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhoemmen I am curious. why don't we just use if_c
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vqd8a I'm pretty sure if_c
is Kokkos' pre-c++11 implementation of std::conditional
, and some future cleanup in Kokkos will likely remove if_c
so if nothing else it saves work later making the change; in general it is better to use standard library implementations when feasible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if_c
is in theImpl
namespace, and therefore should not be used outside of Kokkos.- Prefer Standard Library features to Kokkos features that do the same thing.
- The feature
if_c
has thatstd::conditional
lacks is theselect
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Thanks @ndellingwood and @mhoemmen for clarifying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhoemmen std::conditional
is used now.
@@ -63,7 +63,7 @@ struct V_Iamax_Functor | |||
typedef MagType mag_type; | |||
typedef typename XV::non_const_value_type xvalue_type; | |||
typedef Kokkos::Details::InnerProductSpaceTraits<xvalue_type> IPT; | |||
typedef typename Kokkos::MaxLoc<mag_type,size_type>::value_type maxloc_type; | |||
typedef typename RV::value_type value_type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer using
alias syntax for new code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhoemmen using
is used instead of typedef
now.
This won't let me merge unless @mhoemmen @kyungjoo-kim and @ndellingwood approve. |
Spotchecks passed
|
Merging is blocked because of changes requested. At least in my browser it says changes requested from @mhoemmen |
@mhoemmen As said above, for now, I opt to use 0-based index for all cases, except when TPL cuBLAS iamax is used and returns result to a 0-D view on device memory (1-based index). @mhoemmen If you are okay with this, could you please approve? Thanks. |
You should be able to dismiss my review, but if you don't know how to do that, I'll do it. |
@vqd8a I didn't quite appreciate that this is just for IAMAX. That seems weird and I am worried now. Why is this exception needed ? |
@srajama1 The issue right now is BLAS and cuBLAS As suggested by @mhoemmen, we can just use 1-based index for all cases. Modifying the KokkosKernels implementation is not difficult. I was just afraid that this can complicate the custom reducer a bit in the KokkosKernels implementation with many subtractions. However, to avoid confusions, I think I should change to 1-based indexing for for all cases. |
Updated the code such that the returned value is 1-based indexing.
|
Added Iamax support (KK implementation, TPL BLAS, TPL cuBLAS).