BLAS IAMAX support #438

Merged: 13 commits merged from iamax into develop on Jul 30, 2019
Conversation

@vqd8a (Contributor) commented Jun 23, 2019:

Added Iamax support (KK implementation, TPL BLAS, TPL cuBLAS).

@mhoemmen (Contributor) left a comment:

BLAS IxAMAX functions only support the 1-D input array case. If we don't have a use case for 2-D input arrays, let's get rid of that case and thereby simplify the code.

/// corresponding entry in X.
///
/// \tparam RMV 1-D or 2-D Kokkos::View specialization.
/// \tparam XMV 1-D or 2-D Kokkos::View specialization. It must have
Contributor:

The documentation here doesn't match the "brief" description. R can't possibly have the same rank as X.

Contributor Author:

Thanks @mhoemmen. It is fixed.

const int tid = teamMember.team_rank(); // threadId

maxloc_type col_maxloc;
Kokkos::parallel_reduce( Kokkos::TeamThreadRange(teamMember, m_x.extent(0)), [&] (const int& i, maxloc_type& thread_lmaxloc) {
Contributor:

We don't need to pass int by reference. int i is fine.

Contributor Author:

@mhoemmen It is fixed.

typename RV::array_layout,
typename XMV::device_type> RV_D;
typedef MV_Iamax_FunctorVector<RV_D, XMV, mag_type, SizeType> functor_type;
RV_D r_d("r_d", r.extent(0));
Contributor:

Why do we need to allocate temporary storage here? Is the issue that r must be host storage?

BLAS IxAMAX functions only work with 1-D input arrays, so unless we have a use case for the 2-D case, why not just get rid of it and thereby reduce code complexity?

Contributor Author:

@mhoemmen Yes, r is in host space. But actually, we do not have a use case for the 2-D case; I just added it for the sake of completeness. Of course, I can remove the 2-D case.

Contributor:

The 2-D case could be useful for panel factorizations, but in that case, r would always be on device.

Contributor Author:

@mhoemmen So, I need to change the multi-vector template such that r is on device. I wasn't sure if r should be on device or on host.

Contributor:

As long as these functions never allocate Views, I'm happy. You have a View, so you can check its memory space accessibility. The 1-D input case can either return a value (on host) or write to a 0-D View (on device).
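For reference, a minimal sketch of the two 1-D usage modes described here, assuming the interface this PR adds in KokkosBlas1_iamax.hpp; the result View's value type, the variable names, and n are illustrative assumptions, not code from the PR:

#include <KokkosBlas1_iamax.hpp>

Kokkos::View<double*> x("x", n);  // 1-D input vector; n assumed defined

// Mode 1: return the index by value, on the host.
auto loc_host = KokkosBlas::iamax(x);

// Mode 2: write the index into a 0-D View, which lives in the default device
// memory space here, so no host-device round trip (and no allocation inside
// iamax) is required.
Kokkos::View<unsigned long> loc_dev("iamax result");  // value type illustrative
KokkosBlas::iamax(loc_dev, x);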

Contributor Author:

Thanks @mhoemmen. The code is fixed accordingly. Basically, there is no view allocation.

@vqd8a (Contributor Author) commented Jun 27, 2019:

The code was modified so that results can be on device for TPL cuBLAS; for the 1-D input case it can either return a value on host or write to a 0-D View on device.


maxloc_type col_maxloc;
Kokkos::parallel_reduce( Kokkos::TeamThreadRange(teamMember, m_x.extent(0)), [&] (const int i, maxloc_type& thread_lmaxloc) {
mag_type val = IPT::norm (m_x(i,lid));
@kyungjoo-kim (Contributor) commented Jun 27, 2019:

@vqd8a What you are doing is a parallel reduce along a vector, which is typically long compared to the number of columns (extent(1)). Logically this can do the job, but it will be extremely slow, comparable to a serial version. You should put the outermost parallel loop over the biggest chunk of work.

@vqd8a (Contributor Author) commented Jun 30, 2019:

@kyungjoo-kim Thanks. I have fixed the implementation as per your suggestion. Please take a look. Basically, it follows the idea of other existing BLAS functions.

/// \tparam RMV 0-D or 1-D Kokkos::View specialization.
/// \tparam XMV 1-D or 2-D Kokkos::View specialization.
///
/// Special note for TPL cuBLAS: RMV must be a 0-D view and XMV must be a 1-D view, and the index returned in RMV is 1-based, since cuBLAS uses 1-based indexing for compatibility with Fortran.
Contributor:

The index needs to be the same, regardless of the implementation. If that means subtracting 1 to convert from 1-based to 0-based indexing, then please do so.

Contributor Author:

@mhoemmen Understood. I already do the subtraction when the cuBLAS functions return the value on host.
But when the cuBLAS functions return the value directly in device memory, I do not know how to do the subtraction efficiently. I do not want to launch a kernel with only one thread just to adjust a single value, so I leave the subtraction to the user's kernel.
What would be a good way to do it? Can I just return the value to the host and copy it to the device if RMV is a 0-D view?

Contributor:

This is a kokkos-kernels design decision. I would say, if the BLAS returns a 1-based index, then kokkos-kernels should return a 1-based index. Kokkos users who want a 0-based index could easily implement their own, using Kokkos::MaxLoc. I'm almost certain that a single kernel that kokkos-kernels provides would be faster than calling cuBLAS and then invoking another kernel just to decrement a single value.
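For reference, a minimal sketch of the do-it-yourself route mentioned here: a user who wants a 0-based index can reduce with Kokkos::MaxLoc directly. This is illustrative user code under assumed types (double values, int indices), not the kokkos-kernels implementation.

#include <Kokkos_Core.hpp>

// Returns the 0-based index of the entry of x with the largest magnitude.
int user_iamax_0based(const Kokkos::View<const double*>& x) {
  using reducer_type = Kokkos::MaxLoc<double, int>;
  using value_type   = reducer_type::value_type;  // has .val and .loc members

  value_type result;
  Kokkos::parallel_reduce(
      "user_iamax_0based", x.extent(0),
      KOKKOS_LAMBDA(const int i, value_type& lmax) {
        const double mag = (x(i) >= 0.0) ? x(i) : -x(i);  // |x(i)|
        if (mag > lmax.val) {
          lmax.val = mag;
          lmax.loc = i;  // 0-based, since i is the loop index
        }
      },
      reducer_type(result));
  return result.loc;
}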

@vqd8a (Contributor Author) commented Jun 30, 2019:

@srajama1 What is your opinion on this issue? Should we use 0-based index or 1-based index?

Contributor:

There is no good answer for this; we have dealt with it for decades. I recommend doing whatever is most efficient for our users; we can jump through hoops on our side to help them.

Contributor Author:

@srajama1 @mhoemmen For now, I think I will use a 0-based index for all cases, except when TPL cuBLAS iamax is used and returns the result to a 0-D view in device memory. In that case I will leave users to do the 1-based to 0-based conversion in their kernels if they need it. I added a note describing this case in KokkosBlas1_iamax.hpp.

However, I am open to changing to 0-based indexing for cuBLAS as well, or to using 1-based indexing for all cases. Please let me know.
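A hypothetical usage sketch (not code from this PR) of the exception described in the note: with the cuBLAS TPL, iamax writes a 1-based index into a 0-D device View, and the caller converts it to 0-based inside whatever kernel consumes the result. The result View's value type, n, and all names below are illustrative assumptions.

Kokkos::View<double*, Kokkos::CudaSpace> x("x", n);           // n assumed defined
Kokkos::View<unsigned long, Kokkos::CudaSpace> idx("iamax");  // 0-D result View

KokkosBlas::iamax(idx, x);  // idx() now holds a 1-based index (cuBLAS convention)

Kokkos::parallel_for("consume_iamax", 1, KOKKOS_LAMBDA(const int) {
  const auto loc = idx() - 1;  // convert 1-based (Fortran/cuBLAS) to 0-based
  x(loc) = 0.0;                // e.g. zero out the largest-magnitude entry
});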

Contributor:

Can we add a note in the wiki explaining this?

Contributor Author:

Thanks, @srajama1. I will take care of the wiki.
@mhoemmen Could you please approve this so that I can run the spotcheck and get it merged?

@@ -140,7 +144,7 @@ struct MV_Iamax_FunctorVector
const int tid = teamMember.team_rank(); // threadId

maxloc_type col_maxloc;
-Kokkos::parallel_reduce( Kokkos::TeamThreadRange(teamMember, m_x.extent(0)), [&] (const int& i, maxloc_type& thread_lmaxloc) {
+Kokkos::parallel_reduce( Kokkos::TeamThreadRange(teamMember, m_x.extent(0)), [&] (const int i, maxloc_type& thread_lmaxloc) {
Contributor:

Please label all kernels; thanks!
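For context, labeling a kernel means passing a human-readable string as the first argument of the top-level Kokkos dispatch, so the kernel shows up by name in profiling and debugging tools. A small self-contained sketch, to be run inside an initialized Kokkos program; the label text and variable names are illustrative, not the ones used in this PR:

#include <Kokkos_Core.hpp>

Kokkos::View<double*> y("y", 1000);
double sum = 0.0;
Kokkos::parallel_reduce(
    "example_labeled_reduce",  // kernel label; illustrative only
    y.extent(0),
    KOKKOS_LAMBDA(const int i, double& lsum) { lsum += y(i); },
    sum);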

Contributor Author:

@mhoemmen I just did.


#ifdef KOKKOSKERNELS_ENABLE_TPL_CUBLAS
if(std::is_same<typename Device::memory_space,Kokkos::CudaSpace>::value)
const_max_loc = h_r()-1;
Contributor:

See above.

Contributor Author:

Please see my comment below.

@@ -136,7 +136,10 @@ iamax (const RV& R, const XMV& X,
typename RV::non_const_value_type,
typename RV::non_const_value_type* >::type,
typename KokkosKernels::Impl::GetUnifiedLayout<RV>::array_layout,
typename RV::device_type,
typename Kokkos::Impl::if_c<
Contributor:

Use std::conditional (lives in <type_traits>). It works just like if_c here.
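A minimal illustration of the suggested swap: if_c<cond, A, B>::type selects A when cond is true and B otherwise, and std::conditional does the same. The type names below are placeholders, not the ones used in this PR.

#include <type_traits>

struct host_result   { };  // placeholder: e.g. a plain value returned on host
struct device_result { };  // placeholder: e.g. a 0-D View living on device

// Standard-library equivalent of Kokkos::Impl::if_c<OnHost, host_result, device_result>::type
template <bool OnHost>
using iamax_result_t =
    typename std::conditional<OnHost, host_result, device_result>::type;

static_assert(std::is_same<iamax_result_t<true>,  host_result>::value,   "");
static_assert(std::is_same<iamax_result_t<false>, device_result>::value, "");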

@vqd8a (Contributor Author) commented Jul 2, 2019:

@mhoemmen I am curious: why don't we just use if_c?

Contributor:

@vqd8a I'm pretty sure if_c is Kokkos' pre-C++11 implementation of std::conditional, and some future cleanup in Kokkos will likely remove if_c, so if nothing else this saves the work of making the change later. In general, it is better to use standard-library implementations when feasible.

@mhoemmen (Contributor) commented Jul 2, 2019:

  1. if_c is in the Impl namespace, and therefore should not be used outside of Kokkos.
  2. Prefer Standard Library features to Kokkos features that do the same thing.
  3. The one feature if_c has that std::conditional lacks is the select method.

Contributor Author:

Okay. Thanks @ndellingwood and @mhoemmen for clarifying.

Contributor Author:

@mhoemmen std::conditional is used now.

@@ -63,7 +63,7 @@ struct V_Iamax_Functor
typedef MagType mag_type;
typedef typename XV::non_const_value_type xvalue_type;
typedef Kokkos::Details::InnerProductSpaceTraits<xvalue_type> IPT;
typedef typename Kokkos::MaxLoc<mag_type,size_type>::value_type maxloc_type;
typedef typename RV::value_type value_type;
Contributor:

Prefer using alias syntax for new code.
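For reference, the same alias written both ways; the using form is what is being asked for in new code. A standard-library type stands in here for the Kokkos types quoted above.

#include <vector>

typedef std::vector<double> dense_vector_t;   // older typedef syntax
using dense_vector_u = std::vector<double>;   // C++11 using-alias syntax (preferred)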

Contributor Author:

@mhoemmen using is used instead of typedef now.

src/blas/impl/KokkosBlas1_iamax_impl.hpp (review thread resolved)
@srajama1 (Contributor):
This won't let me merge unless @mhoemmen, @kyungjoo-kim, and @ndellingwood approve.

@vqd8a (Contributor Author) commented Jul 24, 2019:

Spotchecks passed

../Kokkos/kokkos-kernels/scripts/test_all_sandia --spot-check --with-cuda-options=enable_lambda

Running on machine: white
Going to test compilers:  gcc/6.4.0 gcc/7.2.0 ibm/16.1.0 cuda/9.2.88 cuda/10.0.130
Testing compiler gcc/6.4.0
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-7.2.0-Serial-release
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
Testing compiler ibm/16.1.0
  Starting job ibm-16.1.0-Serial-release
  PASSED ibm-16.1.0-Serial-release
Testing compiler cuda/9.2.88
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
Testing compiler cuda/10.0.130
  Starting job cuda-10.0.130-Cuda_Serial-release
  PASSED cuda-10.0.130-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=952 run_time=340
cuda-9.2.88-Cuda_OpenMP-release build_time=1056 run_time=279
gcc-6.4.0-OpenMP_Serial-release build_time=475 run_time=309
gcc-7.2.0-OpenMP-release build_time=359 run_time=126
gcc-7.2.0-OpenMP_Serial-release build_time=582 run_time=365
gcc-7.2.0-Serial-release build_time=216 run_time=229
ibm-16.1.0-Serial-release build_time=1038 run_time=263
#######################################################
FAILED TESTS
#######################################################


../Kokkos/kokkos-kernels/scripts/test_all_sandia cuda --spot-check --with-cuda-options=enable_lambda --with-tpls=cublas

Running on machine: white
Going to test compilers:  cuda/9.2.88 cuda/10.0.130
Testing compiler cuda/9.2.88
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
Testing compiler cuda/10.0.130
  Starting job cuda-10.0.130-Cuda_Serial-release
  PASSED cuda-10.0.130-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=1006 run_time=341
cuda-9.2.88-Cuda_OpenMP-release build_time=1032 run_time=280
#######################################################
FAILED TESTS
#######################################################


../Kokkos/kokkos-kernels/scripts/test_all_sandia gcc --spot-check --with-tpls=blas

Running on machine: white
Going to test compilers:  gcc/6.4.0 gcc/7.2.0
Testing compiler gcc/6.4.0
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-7.2.0-Serial-release
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
#######################################################
PASSED TESTS
#######################################################
gcc-6.4.0-OpenMP_Serial-release build_time=531 run_time=277
gcc-7.2.0-OpenMP-release build_time=315 run_time=135
gcc-7.2.0-OpenMP_Serial-release build_time=383 run_time=253
gcc-7.2.0-Serial-release build_time=213 run_time=162
#######################################################
FAILED TESTS
#######################################################

@srajama1 (Contributor):
Merging is blocked because changes were requested. At least in my browser, it says changes were requested by @mhoemmen.

@vqd8a (Contributor Author) commented Jul 24, 2019:

@mhoemmen As said above, for now I opt to use a 0-based index for all cases, except when TPL cuBLAS iamax is used and returns the result to a 0-D view in device memory (1-based index).
In this 1-based case, users can use the result in their kernels (they have to decrement it by 1 before using it in the same kernel). As suggested by @srajama1, I will add a note to the wiki explaining the details.
In the future, if needed, I am willing to change to 1-based indexing for all cases.

@mhoemmen If you are okay with this, could you please approve? Thanks.

@mhoemmen (Contributor):
You should be able to dismiss my review, but if you don't know how to do that, I'll do it.

@mhoemmen mhoemmen dismissed their stale review July 24, 2019 23:05

See comments above.

@srajama1 (Contributor):
@vqd8a I didn't quite appreciate that this is just for IAMAX. That seems weird, and I am worried now. Why is this exception needed?

@vqd8a (Contributor Author) commented Jul 25, 2019:

@srajama1 The issue right now is that BLAS and cuBLAS iamax return a 1-based index, while the KokkosKernels implementation returns a 0-based index. I wanted to use a 0-based index for all cases. For BLAS and cuBLAS with an on-host result, we can easily subtract 1 from the result. But when the cuBLAS function returns the result directly to device memory, I do not know how to do the subtraction efficiently before handing it to users, since they can do the subtraction themselves when they need the index. That is why I just left the subtraction to users. I did not think this was a big problem as long as we document it in the wiki.

As suggested by @mhoemmen, we can just use a 1-based index for all cases. Modifying the KokkosKernels implementation is not difficult; I was just afraid that the many subtractions could complicate the custom reducer a bit.

However, to avoid confusion, I think I should change to 1-based indexing for all cases.
Please let me fix the code this way, and I will run the spotchecks again.

@vqd8a (Contributor Author) commented Jul 26, 2019:

Updated the code so that the returned value uses 1-based indexing.
Ran the spotchecks again:

../Kokkos/kokkos-kernels/scripts/test_all_sandia --spot-check --with-cuda-options=enable_lambda

Running on machine: white
Going to test compilers:  gcc/6.4.0 gcc/7.2.0 ibm/16.1.0 cuda/9.2.88 cuda/10.0.130
Testing compiler gcc/6.4.0
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-7.2.0-Serial-release
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
Testing compiler ibm/16.1.0
  Starting job ibm-16.1.0-Serial-release
  PASSED ibm-16.1.0-Serial-release
Testing compiler cuda/9.2.88
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
Testing compiler cuda/10.0.130
  Starting job cuda-10.0.130-Cuda_Serial-release
  PASSED cuda-10.0.130-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=957 run_time=342
cuda-9.2.88-Cuda_OpenMP-release build_time=955 run_time=278
gcc-6.4.0-OpenMP_Serial-release build_time=458 run_time=305
gcc-7.2.0-OpenMP-release build_time=371 run_time=110
gcc-7.2.0-OpenMP_Serial-release build_time=527 run_time=313
gcc-7.2.0-Serial-release build_time=211 run_time=175
ibm-16.1.0-Serial-release build_time=964 run_time=263
#######################################################
FAILED TESTS
#######################################################

../Kokkos/kokkos-kernels/scripts/test_all_sandia cuda --spot-check --with-cuda-options=enable_lambda --with-tpls=cublas

Running on machine: white
Going to test compilers:  cuda/9.2.88 cuda/10.0.130
Testing compiler cuda/9.2.88
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
Testing compiler cuda/10.0.130
  Starting job cuda-10.0.130-Cuda_Serial-release
  PASSED cuda-10.0.130-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=991 run_time=342
cuda-9.2.88-Cuda_OpenMP-release build_time=1001 run_time=278
#######################################################
FAILED TESTS
#######################################################

../Kokkos/kokkos-kernels/scripts/test_all_sandia gcc --spot-check --with-tpls=blas

Running on machine: white
Going to test compilers:  gcc/6.4.0 gcc/7.2.0
Testing compiler gcc/6.4.0
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-7.2.0-Serial-release
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
#######################################################
PASSED TESTS
#######################################################
gcc-6.4.0-OpenMP_Serial-release build_time=437 run_time=302
gcc-7.2.0-OpenMP-release build_time=335 run_time=113
gcc-7.2.0-OpenMP_Serial-release build_time=341 run_time=256
gcc-7.2.0-Serial-release build_time=196 run_time=183
#######################################################
FAILED TESTS
#######################################################

@srajama1 srajama1 merged commit 624ee31 into develop Jul 30, 2019
@srajama1 srajama1 mentioned this pull request Jul 30, 2019
@ndellingwood ndellingwood deleted the iamax branch October 29, 2020 16:01