BLAS IAMAX support #438

Merged: 13 commits merged from iamax into develop on Jul 30, 2019
Conversation

@vqd8a (Contributor) commented Jun 23, 2019:

Added Iamax support (KK implementation, TPL BLAS, TPL cuBLAS).

@mhoemmen (Contributor) left a comment:

BLAS IxAMAX functions only support the 1-D input array case. If we don't have a use case for 2-D input arrays, let's get rid of that case and thereby simplify the code.

/// corresponding entry in X.
///
/// \tparam RMV 1-D or 2-D Kokkos::View specialization.
/// \tparam XMV 1-D or 2-D Kokkos::View specialization. It must have
Contributor:

The documentation here doesn't match the "brief" description. R can't possibly have the same rank as X.

Contributor Author:

Thanks @mhoemmen. It is fixed.

const int tid = teamMember.team_rank(); // threadId

maxloc_type col_maxloc;
Kokkos::parallel_reduce( Kokkos::TeamThreadRange(teamMember, m_x.extent(0)), [&] (const int& i, maxloc_type& thread_lmaxloc) {
Contributor:

We don't need to pass int by reference. int i is fine.

Contributor Author:

@mhoemmen It is fixed.

typename RV::array_layout,
typename XMV::device_type> RV_D;
typedef MV_Iamax_FunctorVector<RV_D, XMV, mag_type, SizeType> functor_type;
RV_D r_d("r_d", r.extent(0));
Contributor:

Why do we need to allocate temporary storage here? Is the issue that r must be host storage?

BLAS IxAMAX functions only work with 1-D input arrays, so unless we have a use case for the 2-D case, why not just get rid of it and thereby reduce code complexity?

Contributor Author:

@mhoemmen Yes, r is in host space. But actually, we do not have a use case for the 2-D case; I just added it for the sake of completeness. Of course, I can remove the 2-D case.

Contributor:

The 2-D case could be useful for panel factorizations, but in that case, r would always be on device.

Contributor Author:

@mhoemmen So, I need to change the multi-vector template such that r is on device. I wasn't sure if r should be on device or on host.

Contributor:

As long as these functions never allocate Views, I'm happy. You have a View, so you can check its memory space accessibility. The 1-D input case can either return a value (on host) or write to a 0-D View (on device).
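For reference, a minimal sketch of the two 1-D usage modes described here, assuming the interface this PR adds in KokkosBlas1_iamax.hpp; the result View's value type, the variable names, and n are illustrative assumptions, not code from the PR:

#include <KokkosBlas1_iamax.hpp>

Kokkos::View<double*> x("x", n);  // 1-D input vector; n assumed defined

// Mode 1: return the index by value, on the host.
auto loc_host = KokkosBlas::iamax(x);

// Mode 2: write the index into a 0-D View, which lives in the default device
// memory space here, so no host-device round trip (and no allocation inside
// iamax) is required.
Kokkos::View<unsigned long> loc_dev("iamax result");  // value type illustrative
KokkosBlas::iamax(loc_dev, x);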

Contributor Author:

Thanks @mhoemmen. The code is fixed accordingly. Basically, there is no view allocation.

@vqd8a (Contributor Author) commented Jun 27, 2019:

The code was modified so that results can be on device for TPL cuBLAS; for the 1-D input case it can either return a value on host or write to a 0-D View on device.


maxloc_type col_maxloc;
Kokkos::parallel_reduce( Kokkos::TeamThreadRange(teamMember, m_x.extent(0)), [&] (const int i, maxloc_type& thread_lmaxloc) {
mag_type val = IPT::norm (m_x(i,lid));
@kyungjoo-kim (Contributor) commented Jun 27, 2019:

@vqd8a What you are doing is a parallel reduce along a vector, which is typically long compared to the number of columns (extent(1)). Logically this can do the job, but it will be extremely slow, comparable to a serial version. You should put the outermost parallel loop over the biggest chunk of work.

@vqd8a (Contributor Author) commented Jun 30, 2019:

@kyungjoo-kim Thanks. I have fixed the implementation as per your suggestion. Please take a look. Basically, it follows the idea of other existing BLAS functions.

/// \tparam RMV 0-D or 1-D Kokkos::View specialization.
/// \tparam XMV 1-D or 2-D Kokkos::View specialization.
///
/// Special note for TPL cuBLAS: RMV must be a 0-D view and XMV must be a 1-D view, and the index returned in RMV is 1-based, since cuBLAS uses 1-based indexing for compatibility with Fortran.
Contributor:

The index needs to be the same, regardless of the implementation. If that means subtracting 1 to convert from 1-based to 0-based indexing, then please do so.

Contributor Author:

@mhoemmen Understood. I already do the subtraction when the cuBLAS functions return the value on host.
But when the cuBLAS functions return the value directly in device memory, I do not know how to do the subtraction efficiently. I do not want to launch a kernel with only one thread just to adjust a single value, so I leave the subtraction to the user's kernel.
What would be a good way to do it? Can I just return the value to the host and copy it to the device if RMV is a 0-D view?

Contributor:

This is a kokkos-kernels design decision. I would say, if the BLAS returns a 1-based index, then kokkos-kernels should return a 1-based index. Kokkos users who want a 0-based index could easily implement their own, using Kokkos::MaxLoc. I'm almost certain that a single kernel that kokkos-kernels provides would be faster than calling cuBLAS and then invoking another kernel just to decrement a single value.
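For reference, a minimal sketch of the do-it-yourself route mentioned here: a user who wants a 0-based index can reduce with Kokkos::MaxLoc directly. This is illustrative user code under assumed types (double values, int indices), not the kokkos-kernels implementation.

#include <Kokkos_Core.hpp>

// Returns the 0-based index of the entry of x with the largest magnitude.
int user_iamax_0based(const Kokkos::View<const double*>& x) {
  using reducer_type = Kokkos::MaxLoc<double, int>;
  using value_type   = reducer_type::value_type;  // has .val and .loc members

  value_type result;
  Kokkos::parallel_reduce(
      "user_iamax_0based", x.extent(0),
      KOKKOS_LAMBDA(const int i, value_type& lmax) {
        const double mag = (x(i) >= 0.0) ? x(i) : -x(i);  // |x(i)|
        if (mag > lmax.val) {
          lmax.val = mag;
          lmax.loc = i;  // 0-based, since i is the loop index
        }
      },
      reducer_type(result));
  return result.loc;
}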

@vqd8a (Contributor Author) commented Jun 30, 2019:

@srajama1 What is your opinion on this issue? Should we use 0-based index or 1-based index?

Contributor:

There is no good answer for this; we have dealt with it for decades. I recommend doing whatever is most efficient for our users; we can jump through hoops on our side to help them.

Contributor Author:

@srajama1 @mhoemmen For now, I think I will use a 0-based index for all cases, except when TPL cuBLAS iamax is used and returns the result to a 0-D view in device memory. In that case I will leave users to do the 1-based to 0-based conversion in their kernels if they need it. I added a note describing this case in KokkosBlas1_iamax.hpp.

However, I am open to changing to 0-based indexing for cuBLAS as well, or to using 1-based indexing for all cases. Please let me know.
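A hypothetical usage sketch (not code from this PR) of the exception described in the note: with the cuBLAS TPL, iamax writes a 1-based index into a 0-D device View, and the caller converts it to 0-based inside whatever kernel consumes the result. The result View's value type, n, and all names below are illustrative assumptions.

Kokkos::View<double*, Kokkos::CudaSpace> x("x", n);           // n assumed defined
Kokkos::View<unsigned long, Kokkos::CudaSpace> idx("iamax");  // 0-D result View

KokkosBlas::iamax(idx, x);  // idx() now holds a 1-based index (cuBLAS convention)

Kokkos::parallel_for("consume_iamax", 1, KOKKOS_LAMBDA(const int) {
  const auto loc = idx() - 1;  // convert 1-based (Fortran/cuBLAS) to 0-based
  x(loc) = 0.0;                // e.g. zero out the largest-magnitude entry
});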

Contributor:

Can we add a note in the wiki explaining this?

Contributor Author:

Thanks, @srajama1. I will take care of the wiki.
@mhoemmen Could you please approve this so that I can run the spotcheck and get it merged?

@@ -140,7 +144,7 @@ struct MV_Iamax_FunctorVector
const int tid = teamMember.team_rank(); // threadId

maxloc_type col_maxloc;
-Kokkos::parallel_reduce( Kokkos::TeamThreadRange(teamMember, m_x.extent(0)), [&] (const int& i, maxloc_type& thread_lmaxloc) {
+Kokkos::parallel_reduce( Kokkos::TeamThreadRange(teamMember, m_x.extent(0)), [&] (const int i, maxloc_type& thread_lmaxloc) {
Contributor:

Please label all kernels; thanks!
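For context, labeling a kernel means passing a human-readable string as the first argument of the top-level Kokkos dispatch, so the kernel shows up by name in profiling and debugging tools. A small self-contained sketch, to be run inside an initialized Kokkos program; the label text and variable names are illustrative, not the ones used in this PR:

#include <Kokkos_Core.hpp>

Kokkos::View<double*> y("y", 1000);
double sum = 0.0;
Kokkos::parallel_reduce(
    "example_labeled_reduce",  // kernel label; illustrative only
    y.extent(0),
    KOKKOS_LAMBDA(const int i, double& lsum) { lsum += y(i); },
    sum);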

Contributor Author:

@mhoemmen I just did.


#ifdef KOKKOSKERNELS_ENABLE_TPL_CUBLAS
if(std::is_same<typename Device::memory_space,Kokkos::CudaSpace>::value)
const_max_loc = h_r()-1;
Contributor:

See above.

Contributor Author:

Please see my comment below.

@@ -136,7 +136,10 @@ iamax (const RV& R, const XMV& X,
typename RV::non_const_value_type,
typename RV::non_const_value_type* >::type,
typename KokkosKernels::Impl::GetUnifiedLayout<RV>::array_layout,
typename RV::device_type,
typename Kokkos::Impl::if_c<
Contributor:

Use std::conditional (lives in <type_traits>). It works just like if_c here.
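A minimal illustration of the suggested swap: if_c<cond, A, B>::type selects A when cond is true and B otherwise, and std::conditional does the same. The type names below are placeholders, not the ones used in this PR.

#include <type_traits>

struct host_result   { };  // placeholder: e.g. a plain value returned on host
struct device_result { };  // placeholder: e.g. a 0-D View living on device

// Standard-library equivalent of Kokkos::Impl::if_c<OnHost, host_result, device_result>::type
template <bool OnHost>
using iamax_result_t =
    typename std::conditional<OnHost, host_result, device_result>::type;

static_assert(std::is_same<iamax_result_t<true>,  host_result>::value,   "");
static_assert(std::is_same<iamax_result_t<false>, device_result>::value, "");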

@vqd8a (Contributor Author) commented Jul 2, 2019:

@mhoemmen I am curious: why don't we just use if_c?

Contributor:

@vqd8a I'm pretty sure if_c is Kokkos' pre-C++11 implementation of std::conditional, and some future cleanup in Kokkos will likely remove if_c, so if nothing else this saves the work of making the change later. In general, it is better to use standard-library implementations when feasible.

@mhoemmen (Contributor) commented Jul 2, 2019:

  1. if_c is in the Impl namespace, and therefore should not be used outside of Kokkos.
  2. Prefer Standard Library features to Kokkos features that do the same thing.
  3. The one feature if_c has that std::conditional lacks is the select method.

Contributor Author:

Okay. Thanks @ndellingwood and @mhoemmen for clarifying.

Contributor Author:

@mhoemmen std::conditional is used now.

@@ -63,7 +63,7 @@ struct V_Iamax_Functor
typedef MagType mag_type;
typedef typename XV::non_const_value_type xvalue_type;
typedef Kokkos::Details::InnerProductSpaceTraits<xvalue_type> IPT;
typedef typename Kokkos::MaxLoc<mag_type,size_type>::value_type maxloc_type;
typedef typename RV::value_type value_type;
Contributor:

Prefer using alias syntax for new code.
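For reference, the same alias written both ways; the using form is what is being asked for in new code. A standard-library type stands in here for the Kokkos types quoted above.

#include <vector>

typedef std::vector<double> dense_vector_t;   // older typedef syntax
using dense_vector_u = std::vector<double>;   // C++11 using-alias syntax (preferred)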

Contributor Author:

@mhoemmen using is used instead of typedef now.

src/blas/impl/KokkosBlas1_iamax_impl.hpp (review thread resolved)
@srajama1 (Contributor):
This won't let me merge unless @mhoemmen, @kyungjoo-kim, and @ndellingwood approve.

@vqd8a (Contributor Author) commented Jul 24, 2019:

Spotchecks passed

../Kokkos/kokkos-kernels/scripts/test_all_sandia --spot-check --with-cuda-options=enable_lambda

Running on machine: white
Going to test compilers:  gcc/6.4.0 gcc/7.2.0 ibm/16.1.0 cuda/9.2.88 cuda/10.0.130
Testing compiler gcc/6.4.0
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-7.2.0-Serial-release
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
Testing compiler ibm/16.1.0
  Starting job ibm-16.1.0-Serial-release
  PASSED ibm-16.1.0-Serial-release
Testing compiler cuda/9.2.88
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
Testing compiler cuda/10.0.130
  Starting job cuda-10.0.130-Cuda_Serial-release
  PASSED cuda-10.0.130-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=952 run_time=340
cuda-9.2.88-Cuda_OpenMP-release build_time=1056 run_time=279
gcc-6.4.0-OpenMP_Serial-release build_time=475 run_time=309
gcc-7.2.0-OpenMP-release build_time=359 run_time=126
gcc-7.2.0-OpenMP_Serial-release build_time=582 run_time=365
gcc-7.2.0-Serial-release build_time=216 run_time=229
ibm-16.1.0-Serial-release build_time=1038 run_time=263
#######################################################
FAILED TESTS
#######################################################


../Kokkos/kokkos-kernels/scripts/test_all_sandia cuda --spot-check --with-cuda-options=enable_lambda --with-tpls=cublas

Running on machine: white
Going to test compilers:  cuda/9.2.88 cuda/10.0.130
Testing compiler cuda/9.2.88
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
Testing compiler cuda/10.0.130
  Starting job cuda-10.0.130-Cuda_Serial-release
  PASSED cuda-10.0.130-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=1006 run_time=341
cuda-9.2.88-Cuda_OpenMP-release build_time=1032 run_time=280
#######################################################
FAILED TESTS
#######################################################


../Kokkos/kokkos-kernels/scripts/test_all_sandia gcc --spot-check --with-tpls=blas

Running on machine: white
Going to test compilers:  gcc/6.4.0 gcc/7.2.0
Testing compiler gcc/6.4.0
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-7.2.0-Serial-release
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
#######################################################
PASSED TESTS
#######################################################
gcc-6.4.0-OpenMP_Serial-release build_time=531 run_time=277
gcc-7.2.0-OpenMP-release build_time=315 run_time=135
gcc-7.2.0-OpenMP_Serial-release build_time=383 run_time=253
gcc-7.2.0-Serial-release build_time=213 run_time=162
#######################################################
FAILED TESTS
#######################################################

@srajama1 (Contributor):
Merging is blocked because changes were requested. At least in my browser, it says changes were requested by @mhoemmen.

@vqd8a (Contributor Author) commented Jul 24, 2019:

@mhoemmen As said above, for now I opt to use a 0-based index for all cases, except when TPL cuBLAS iamax is used and returns the result to a 0-D view in device memory (1-based index).
In this 1-based case, users can use the result in their kernels (they have to decrement it by 1 before using it in the same kernel). As suggested by @srajama1, I will add a note to the wiki explaining the details.
In the future, if needed, I am willing to change to 1-based indexing for all cases.

@mhoemmen If you are okay with this, could you please approve? Thanks.

@mhoemmen (Contributor):
You should be able to dismiss my review, but if you don't know how to do that, I'll do it.

@mhoemmen mhoemmen dismissed their stale review July 24, 2019 23:05

See comments above.

@srajama1 (Contributor):
@vqd8a I didn't quite appreciate that this is just for IAMAX. That seems weird, and I am worried now. Why is this exception needed?

@vqd8a (Contributor Author) commented Jul 25, 2019:

@srajama1 The issue right now is that BLAS and cuBLAS iamax return a 1-based index, while the KokkosKernels implementation returns a 0-based index. I wanted to use a 0-based index for all cases. For BLAS and cuBLAS with an on-host result, we can easily subtract 1 from the result. But when the cuBLAS function returns the result directly to device memory, I do not know how to do the subtraction efficiently before handing it to users, since they can do the subtraction themselves when they need the index. That is why I just left the subtraction to users. I did not think this was a big problem as long as we document it in the wiki.

As suggested by @mhoemmen, we can just use a 1-based index for all cases. Modifying the KokkosKernels implementation is not difficult; I was just afraid that the many subtractions could complicate the custom reducer a bit.

However, to avoid confusion, I think I should change to 1-based indexing for all cases.
Please let me fix the code this way, and I will run the spotchecks again.

@vqd8a (Contributor Author) commented Jul 26, 2019:

Updated the code so that the returned value uses 1-based indexing.
Ran the spotchecks again:

../Kokkos/kokkos-kernels/scripts/test_all_sandia --spot-check --with-cuda-options=enable_lambda

Running on machine: white
Going to test compilers:  gcc/6.4.0 gcc/7.2.0 ibm/16.1.0 cuda/9.2.88 cuda/10.0.130
Testing compiler gcc/6.4.0
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-7.2.0-Serial-release
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
Testing compiler ibm/16.1.0
  Starting job ibm-16.1.0-Serial-release
  PASSED ibm-16.1.0-Serial-release
Testing compiler cuda/9.2.88
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
Testing compiler cuda/10.0.130
  Starting job cuda-10.0.130-Cuda_Serial-release
  PASSED cuda-10.0.130-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=957 run_time=342
cuda-9.2.88-Cuda_OpenMP-release build_time=955 run_time=278
gcc-6.4.0-OpenMP_Serial-release build_time=458 run_time=305
gcc-7.2.0-OpenMP-release build_time=371 run_time=110
gcc-7.2.0-OpenMP_Serial-release build_time=527 run_time=313
gcc-7.2.0-Serial-release build_time=211 run_time=175
ibm-16.1.0-Serial-release build_time=964 run_time=263
#######################################################
FAILED TESTS
#######################################################

../Kokkos/kokkos-kernels/scripts/test_all_sandia cuda --spot-check --with-cuda-options=enable_lambda --with-tpls=cublas

Running on machine: white
Going to test compilers:  cuda/9.2.88 cuda/10.0.130
Testing compiler cuda/9.2.88
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
Testing compiler cuda/10.0.130
  Starting job cuda-10.0.130-Cuda_Serial-release
  PASSED cuda-10.0.130-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=991 run_time=342
cuda-9.2.88-Cuda_OpenMP-release build_time=1001 run_time=278
#######################################################
FAILED TESTS
#######################################################

../Kokkos/kokkos-kernels/scripts/test_all_sandia gcc --spot-check --with-tpls=blas

Running on machine: white
Going to test compilers:  gcc/6.4.0 gcc/7.2.0
Testing compiler gcc/6.4.0
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-7.2.0-Serial-release
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
#######################################################
PASSED TESTS
#######################################################
gcc-6.4.0-OpenMP_Serial-release build_time=437 run_time=302
gcc-7.2.0-OpenMP-release build_time=335 run_time=113
gcc-7.2.0-OpenMP_Serial-release build_time=341 run_time=256
gcc-7.2.0-Serial-release build_time=196 run_time=183
#######################################################
FAILED TESTS
#######################################################

@srajama1 srajama1 merged commit 624ee31 into develop Jul 30, 2019
@srajama1 srajama1 mentioned this pull request Jul 30, 2019
@ndellingwood ndellingwood deleted the iamax branch October 29, 2020 16:01