getrs implementation #338

Merged
merged 4 commits into from
Dec 4, 2018

Conversation

Contributor

@vqd8a vqd8a commented Oct 30, 2018

Address #332: An implementation of batched getrs (serial and team) and unit tests.

Modified InverseLU (getri) so that it simply calls SolveLU (getrs) with B set to the identity matrix.

Added new trsm interfaces to support the transpose (non-conjugate) operation in getrs, along with unit tests for these interfaces.
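
To illustrate the description above, here is a minimal sketch in plain C++ (not the Kokkos Kernels code) of a getrs-style solve built from two triangular solves, and of a getri-style inverse obtained by passing the identity as B. The names lu_nopivot, solve_lu and inverse_lu are hypothetical, and the sketch assumes dense row-major storage and an LU factorization without pivoting.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers; dense row-major storage, no pivoting assumed.
using Matrix = std::vector<double>;

// In-place LU factorization without pivoting: A = L*U, with the unit-diagonal
// L stored below the diagonal and U stored on and above it.
void lu_nopivot(Matrix& A, std::size_t n) {
  assert(A.size() == n * n);
  for (std::size_t p = 0; p < n; ++p) {
    for (std::size_t i = p + 1; i < n; ++i) {
      A[i * n + p] /= A[p * n + p];
      for (std::size_t j = p + 1; j < n; ++j)
        A[i * n + j] -= A[i * n + p] * A[p * n + j];
    }
  }
}

// getrs-style solve: overwrite B (n x nrhs) with X such that (L*U)*X = B.
void solve_lu(const Matrix& LU, Matrix& B, std::size_t n, std::size_t nrhs) {
  assert(LU.size() == n * n && B.size() == n * nrhs);
  // First, solve L*Y = B (forward substitution, unit diagonal).
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = 0; k < i; ++k)
      for (std::size_t j = 0; j < nrhs; ++j)
        B[i * nrhs + j] -= LU[i * n + k] * B[k * nrhs + j];
  // Second, solve U*X = Y (back substitution, non-unit diagonal).
  for (std::size_t i = n; i-- > 0;) {
    for (std::size_t k = i + 1; k < n; ++k)
      for (std::size_t j = 0; j < nrhs; ++j)
        B[i * nrhs + j] -= LU[i * n + k] * B[k * nrhs + j];
    for (std::size_t j = 0; j < nrhs; ++j)
      B[i * nrhs + j] /= LU[i * n + i];
  }
}

// getri-style inverse: solve (L*U)*X = I, so X is the inverse of the
// factored matrix.
Matrix inverse_lu(const Matrix& LU, std::size_t n) {
  Matrix X(n * n, 0.0);
  for (std::size_t i = 0; i < n; ++i) X[i * n + i] = 1.0;
  solve_lu(LU, X, n, n);
  return X;
}
```

Reusing the solve for the inverse keeps a single code path for both operations, at the cost of working on a full n x n right-hand side.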

@vqd8a vqd8a changed the title GETRS implementation getrs implementation Oct 30, 2018
Contributor

@kyungjoo-kim kyungjoo-kim left a comment

Looks fine if they pass the unit tests.

@kyungjoo-kim
Contributor

@ndellingwood Could we include this PR in the promotion? It won't hurt Trilinos integration testing.

@vqd8a
Contributor Author

vqd8a commented Oct 30, 2018

Thanks @kyungjoo-kim . Unit tests passed.

@kyungjoo-kim
Contributor

@vqd8a Did you also check the CMake modification to include the unit tests? The src directory should be fine, as those are merely hpp files and they are picked up by *.hpp. However, the testing files may need to be added manually in the CMake files.

//Second, compute X by solving the system U*X = Y for X
SerialTrsm<Side::Left,Uplo::Upper,Trans::NoTranspose,Diag::NonUnit,Algo::Trsm::Unblocked>::invoke(one, A, B);

return 0;
Contributor

Why return an error code if it's always zero? It would be better to follow the LAPACK error code convention here, for example by returning an error code instead of asserting on input dimension error.
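
For reference, the LAPACK convention mentioned here returns info = 0 on success and info = -i when the i-th argument has an illegal value. A minimal sketch of that convention follows, using a hypothetical check_getrs_args helper that is not part of this PR. It follows the argument order of LAPACK's xGETRS (trans, n, nrhs, A, lda, ipiv, B, ldb) and, as a simplification, accepts only upper-case trans characters.

```cpp
#include <algorithm>

// Hypothetical argument check following the LAPACK xGETRS argument order:
// (trans, n, nrhs, A, lda, ipiv, B, ldb).
// Returns 0 on success, or -i when the i-th argument has an illegal value.
int check_getrs_args(char trans, int n, int nrhs, int lda, int ldb) {
  if (trans != 'N' && trans != 'T' && trans != 'C') return -1;  // 1st: trans
  if (n < 0) return -2;                                         // 2nd: n
  if (nrhs < 0) return -3;                                      // 3rd: nrhs
  if (lda < std::max(1, n)) return -5;                          // 5th: lda
  if (ldb < std::max(1, n)) return -8;                          // 8th: ldb
  return 0;
}
```

A caller would test the returned value before solving, for example skipping the solve when check_getrs_args('N', n, nrhs, lda, ldb) is nonzero.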

Contributor

No error checking is intended, as this is designed for small batched codes, meaning the code is called inside a parallel for. Unfortunately, error checking brings some overhead, and it is difficult (maybe impossible if the device is a GPU) to handle an error during the parallel run. So the user should understand that the batched BLAS routines do not check errors, and numerical errors should be checked at a higher level when the user thinks it is necessary.
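
One possible form of such a higher-level check, sketched in plain C++ under stated assumptions: after the batched solve completes, the caller computes the residual of each system on the host and flags non-finite results. The helper name residual_max_norm and the row-major layout are hypothetical and not part of this PR.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical host-side check for one system of a batch: returns the
// max-norm of A*x - b for a row-major n x n matrix A. A non-finite residual
// (NaN or Inf, e.g. from a zero pivot) is reported as +infinity.
double residual_max_norm(const std::vector<double>& A,
                         const std::vector<double>& x,
                         const std::vector<double>& b, std::size_t n) {
  double worst = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    double r = -b[i];
    for (std::size_t j = 0; j < n; ++j) r += A[i * n + j] * x[j];
    if (!std::isfinite(r)) return std::numeric_limits<double>::infinity();
    worst = std::fmax(worst, std::fabs(r));
  }
  return worst;
}
```

The caller can then compare the returned value against a tolerance of its own choosing, outside the parallel region.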

Contributor

That's fine, but then why return an integer? Why not just make these functions void?

Contributor

I am reserving the integer return value in case an application has a strong use case that requires error checking.

(double*)B.data(), B.stride_1(),
format, (MKL_INT)vector_type::vector_length);
} else if (A.stride_1() == 1 && B.stride_1() == 1) {
mkl_dtrsm_compact(MKL_ROW_MAJOR,
Contributor

I'll continue to express my concerns about exposing TPL details in header files.

Contributor

Thanks. Your concern is well acknowledged. The MKL compact routines have the mkl prefix, so using the mkl-prefixed routines should do no harm. However, I will encapsulate all other Fortran BLAS and LAPACK interfaces, which are highly likely to conflict with users' declarations.

Contributor

@kyungjoo-kim I'm OK with that, as long as including the standard (non-batched) kokkos-kernels BLAS wrappers doesn't automatically include these headers. Thanks!

@ndellingwood
Contributor

@kyungjoo-kim if it passes spot-checks on white, bowman, and a kokkos-kernels check in Trilinos (to get a cmake check) then it will be good to merge.

@ndellingwood
Contributor

@vqd8a can you run spot-checks on white and bowman and post the results? Thanks!

@vqd8a
Contributor Author

vqd8a commented Oct 30, 2018

Thanks @ndellingwood
Spot-check on white

Running on machine: white
Going to test compilers:  cuda/9.2.88 gcc/6.4.0 gcc/7.2.0
Testing compiler cuda/9.2.88
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
  Starting job cuda-9.2.88-Cuda_Serial-release
  PASSED cuda-9.2.88-Cuda_Serial-release
Testing compiler gcc/6.4.0
  Starting job gcc-6.4.0-OpenMP-release
  PASSED gcc-6.4.0-OpenMP-release
  Starting job gcc-6.4.0-Serial-release
  PASSED gcc-6.4.0-Serial-release
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-7.2.0-Serial-release
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-9.2.88-Cuda_OpenMP-release build_time=751 run_time=604
cuda-9.2.88-Cuda_Serial-release build_time=870 run_time=1125
gcc-6.4.0-OpenMP-release build_time=288 run_time=311
gcc-6.4.0-OpenMP_Serial-release build_time=361 run_time=888
gcc-6.4.0-Serial-release build_time=273 run_time=953
gcc-7.2.0-OpenMP-release build_time=241 run_time=228
gcc-7.2.0-OpenMP_Serial-release build_time=307 run_time=863
gcc-7.2.0-Serial-release build_time=180 run_time=663
#######################################################
FAILED TESTS
#######################################################

@vqd8a
Contributor Author

vqd8a commented Oct 30, 2018

MKL test on bowman:

./KokkosKernels_UnitTest_OpenMP
[==========] Running 124 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 124 tests from openmp
...
[----------] 124 tests from openmp (83729 ms total)

[----------] Global test environment tear-down
[==========] 124 tests from 1 test case ran. (83729 ms total)
[  PASSED  ] 124 tests.
./KokkosKernels_UnitTest_Serial
[==========] Running 124 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 124 tests from serial
...
[----------] 124 tests from serial (647620 ms total)

[----------] Global test environment tear-down
[==========] 124 tests from 1 test case ran. (647621 ms total)
[  PASSED  ] 124 tests.

@vqd8a
Contributor Author

vqd8a commented Nov 1, 2018

spot-check on bowman:

Running on machine: bowman
Going to test compilers:  intel/16.4.258 intel/17.2.174 intel/18.2.199
Testing compiler intel/16.4.258
  Starting job intel-16.4.258-Serial-release
  PASSED intel-16.4.258-Serial-release
  Starting job intel-16.4.258-Pthread-release
  PASSED intel-16.4.258-Pthread-release
  Starting job intel-16.4.258-Pthread_Serial-release
  PASSED intel-16.4.258-Pthread_Serial-release
Testing compiler intel/17.2.174
  Starting job intel-17.2.174-OpenMP-release
  PASSED intel-17.2.174-OpenMP-release
  Starting job intel-17.2.174-Pthread-release
  PASSED intel-17.2.174-Pthread-release
  Starting job intel-17.2.174-Serial-release
  PASSED intel-17.2.174-Serial-release
  Starting job intel-17.2.174-OpenMP_Serial-release
  PASSED intel-17.2.174-OpenMP_Serial-release
  Starting job intel-17.2.174-Pthread_Serial-release
  PASSED intel-17.2.174-Pthread_Serial-release
Testing compiler intel/18.2.199
  Starting job intel-18.2.199-OpenMP-release
  FAILED intel-18.2.199-OpenMP-release
[==========] 360 tests from 1 test case ran. (804383 ms total)
[  PASSED  ] 356 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] openmp.batched_scalar_serial_inverselu_dcomplex
[  FAILED  ] openmp.batched_scalar_team_inverselu_dcomplex
[  FAILED  ] openmp.batched_scalar_serial_solvelu_dcomplex
[  FAILED  ] openmp.batched_scalar_team_solvelu_dcomplex

  Starting job intel-18.2.199-Pthread-release
  PASSED intel-18.2.199-Pthread-release

  Starting job intel-18.2.199-Serial-release
  FAILED intel-18.2.199-Serial-release
[==========] 360 tests from 1 test case ran. (2149582 ms total)
[  PASSED  ] 356 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] serial.batched_scalar_serial_inverselu_dcomplex
[  FAILED  ] serial.batched_scalar_team_inverselu_dcomplex
[  FAILED  ] serial.batched_scalar_serial_solvelu_dcomplex
[  FAILED  ] serial.batched_scalar_team_solvelu_dcomplex

  Starting job intel-18.2.199-OpenMP_Serial-release
  FAILED intel-18.2.199-OpenMP_Serial-release
[==========] 360 tests from 1 test case ran. (860797 ms total)
[  PASSED  ] 356 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] openmp.batched_scalar_serial_inverselu_dcomplex
[  FAILED  ] openmp.batched_scalar_team_inverselu_dcomplex
[  FAILED  ] openmp.batched_scalar_serial_solvelu_dcomplex
[  FAILED  ] openmp.batched_scalar_team_solvelu_dcomplex

  Starting job intel-18.2.199-Pthread_Serial-release
  FAILED intel-18.2.199-Pthread_Serial-release
[==========] 360 tests from 1 test case ran. (2041762 ms total)
[  PASSED  ] 356 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] serial.batched_scalar_serial_inverselu_dcomplex
[  FAILED  ] serial.batched_scalar_team_inverselu_dcomplex
[  FAILED  ] serial.batched_scalar_serial_solvelu_dcomplex
[  FAILED  ] serial.batched_scalar_team_solvelu_dcomplex

#######################################################
PASSED TESTS
#######################################################
intel-16.4.258-Pthread-release build_time=1290 run_time=2213
intel-16.4.258-Pthread_Serial-release build_time=1920 run_time=4583
intel-16.4.258-Serial-release build_time=1233 run_time=2231
intel-17.2.174-OpenMP-release build_time=1584 run_time=827
intel-17.2.174-OpenMP_Serial-release build_time=1988 run_time=3161
intel-17.2.174-Pthread-release build_time=1131 run_time=2137
intel-17.2.174-Pthread_Serial-release build_time=1735 run_time=4440
intel-17.2.174-Serial-release build_time=1111 run_time=2275
intel-18.2.199-Pthread-release build_time=1055 run_time=2153
#######################################################
FAILED TESTS
#######################################################
intel-18.2.199-OpenMP-release (test failed)
intel-18.2.199-OpenMP_Serial-release (test failed)
intel-18.2.199-Pthread_Serial-release (test failed)
intel-18.2.199-Serial-release (test failed)

Any idea why it only fails when running with intel-18.2.199 for complex double?

@vqd8a
Contributor Author

vqd8a commented Nov 5, 2018

Added an if check in SerialTrsmInternalLeftUpper<Algo::Trsm::Unblocked> so that it gives correct results for complex<double> with intel/18.2.199:

    if (p>0){
      for (int i=0;i<iend;++i)
    #if defined(KOKKOS_ENABLE_PRAGMA_UNROLL)
    #pragma unroll
    #endif
        for (int j=0;j<jend;++j)
          B0[i*bs0+j*bs1] -= a01[i*as0] * b1t[j*bs1];
    }
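
For context, here is a minimal plain C++ sketch of an unblocked left/upper triangular solve showing where such a guard would sit. It rests on assumptions the thread does not confirm: that p walks the diagonal from the last row to the first, that a01 is the part of column p above the diagonal, that b1t is row p of B, and that iend equals p. The function name and the row-major layout are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sketch: solve U*X = B in place for an upper-triangular, non-unit-diagonal
// U (m x m) and B (m x n), both row-major. B is overwritten with X.
void trsm_left_upper_unblocked(const std::vector<double>& U,
                               std::vector<double>& B, int m, int n) {
  for (int p = m - 1; p >= 0; --p) {
    // Scale row p of B by 1/U(p,p); row p of X is final after this step.
    for (int j = 0; j < n; ++j) B[p * n + j] /= U[p * m + p];
    // Rank-1 update of the rows above p. For p == 0 there are no rows above,
    // so the guard only makes the empty case explicit.
    if (p > 0) {
      for (int i = 0; i < p; ++i)
        for (int j = 0; j < n; ++j)
          B[i * n + j] -= U[i * m + p] * B[p * n + j];
    }
  }
}
```

Under these assumptions the guard is logically redundant, since the update loop is already empty when p is 0, which would be consistent with the failure being specific to one compiler version.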

@vqd8a
Contributor Author

vqd8a commented Nov 5, 2018

Re-run spotcheck on white:

Running on machine: white
Going to test compilers:  cuda/9.2.88 gcc/6.4.0 gcc/7.2.0
Testing compiler cuda/9.2.88
  Starting job cuda-9.2.88-Cuda_OpenMP-release
  PASSED cuda-9.2.88-Cuda_OpenMP-release
  Starting job cuda-9.2.88-Cuda_Serial-release
  PASSED cuda-9.2.88-Cuda_Serial-release
Testing compiler gcc/6.4.0
  Starting job gcc-6.4.0-OpenMP-release
  PASSED gcc-6.4.0-OpenMP-release
  Starting job gcc-6.4.0-Serial-release
  PASSED gcc-6.4.0-Serial-release
  Starting job gcc-6.4.0-OpenMP_Serial-release
  PASSED gcc-6.4.0-OpenMP_Serial-release
Testing compiler gcc/7.2.0
  Starting job gcc-7.2.0-OpenMP-release
  PASSED gcc-7.2.0-OpenMP-release
  Starting job gcc-7.2.0-Serial-release
  PASSED gcc-7.2.0-Serial-release
  Starting job gcc-7.2.0-OpenMP_Serial-release
  PASSED gcc-7.2.0-OpenMP_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-9.2.88-Cuda_OpenMP-release build_time=734 run_time=606
cuda-9.2.88-Cuda_Serial-release build_time=816 run_time=1066
gcc-6.4.0-OpenMP-release build_time=310 run_time=325
gcc-6.4.0-OpenMP_Serial-release build_time=302 run_time=975
gcc-6.4.0-Serial-release build_time=221 run_time=620
gcc-7.2.0-OpenMP-release build_time=198 run_time=223
gcc-7.2.0-OpenMP_Serial-release build_time=309 run_time=862
gcc-7.2.0-Serial-release build_time=174 run_time=668
#######################################################
FAILED TESTS
#######################################################

@vqd8a
Contributor Author

vqd8a commented Nov 5, 2018

Re-run spotcheck on bowman:

Running on machine: bowman
Going to test compilers:  intel/16.4.258 intel/17.2.174 intel/18.2.199
Testing compiler intel/16.4.258
  Starting job intel-16.4.258-Serial-release
  PASSED intel-16.4.258-Serial-release
  Starting job intel-16.4.258-Pthread-release
  PASSED intel-16.4.258-Pthread-release
  Starting job intel-16.4.258-Pthread_Serial-release
  PASSED intel-16.4.258-Pthread_Serial-release
Testing compiler intel/17.2.174
  Starting job intel-17.2.174-OpenMP-release
  PASSED intel-17.2.174-OpenMP-release
  Starting job intel-17.2.174-Pthread-release
  PASSED intel-17.2.174-Pthread-release
  Starting job intel-17.2.174-Serial-release
  PASSED intel-17.2.174-Serial-release
  Starting job intel-17.2.174-OpenMP_Serial-release
  PASSED intel-17.2.174-OpenMP_Serial-release
  Starting job intel-17.2.174-Pthread_Serial-release
  PASSED intel-17.2.174-Pthread_Serial-release
Testing compiler intel/18.2.199
  Starting job intel-18.2.199-OpenMP-release
  PASSED intel-18.2.199-OpenMP-release
  Starting job intel-18.2.199-Pthread-release
  PASSED intel-18.2.199-Pthread-release
  Starting job intel-18.2.199-Serial-release
  PASSED intel-18.2.199-Serial-release
  Starting job intel-18.2.199-OpenMP_Serial-release
  PASSED intel-18.2.199-OpenMP_Serial-release
  Starting job intel-18.2.199-Pthread_Serial-release
  PASSED intel-18.2.199-Pthread_Serial-release
#######################################################
PASSED TESTS
#######################################################
intel-16.4.258-Pthread-release build_time=1305 run_time=2181
intel-16.4.258-Pthread_Serial-release build_time=1903 run_time=4560
intel-16.4.258-Serial-release build_time=1228 run_time=2204
intel-17.2.174-OpenMP-release build_time=1591 run_time=800
intel-17.2.174-OpenMP_Serial-release build_time=1984 run_time=3179
intel-17.2.174-Pthread-release build_time=1127 run_time=2125
intel-17.2.174-Pthread_Serial-release build_time=1735 run_time=4394
intel-17.2.174-Serial-release build_time=1095 run_time=2262
intel-18.2.199-OpenMP-release build_time=1240 run_time=808
intel-18.2.199-OpenMP_Serial-release build_time=1869 run_time=2897
intel-18.2.199-Pthread-release build_time=1026 run_time=2175
intel-18.2.199-Pthread_Serial-release build_time=1637 run_time=4052
intel-18.2.199-Serial-release build_time=986 run_time=2174
#######################################################
FAILED TESTS
#######################################################

@vqd8a
Contributor Author

vqd8a commented Nov 5, 2018

Re-run MKL test on bowman:

./KokkosKernels_UnitTest_OpenMP
[==========] Running 272 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 272 tests from openmp
...
[----------] 272 tests from openmp (194181 ms total)

[----------] Global test environment tear-down
[==========] 272 tests from 1 test case ran. (194183 ms total)
[  PASSED  ] 272 tests.

./KokkosKernels_UnitTest_Serial
[==========] Running 272 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 272 tests from serial
...
[----------] 272 tests from serial (1422204 ms total)

[----------] Global test environment tear-down
[==========] 272 tests from 1 test case ran. (1422205 ms total)
[  PASSED  ] 272 tests.

@crtrott
Member

crtrott commented Nov 5, 2018

Do not merge yet until after the promotion.

@kyungjoo-kim
Contributor

@vqd8a The following is the right way to generate complex random numbers.

    Kokkos::Random_XorShift64_Pool<SpT> random(13245);
    {
      Kokkos::View<Kokkos::complex<value_type>*,SpT> trott("trott", 10);
      Kokkos::fill_random(trott, random, Kokkos::rand<Kokkos::Random_XorShift64<SpT>,Kokkos::complex<value_type> >::max());

      for (int i=0;i<10;++i) {
        printf(" trott = %e, %e\n", trott(i).real(), trott(i).imag());
      }
    }

@crtrott It generates random values for the imaginary part as well. Could we provide a more intuitive interface than max()? It is not obvious that the max for double is 1. Maybe an overriding interface such as mag_type max(const mag_type user_max_val)?

@crtrott
Member

crtrott commented Nov 6, 2018

What do you mean by an overriding interface? Also, from a random number generator perspective, what is a "max" value for float in the sense of a uniform distribution? I thought a bit about that and think that 1.0 just eminently makes the most sense. Effectively, what we are doing is ignoring the exponent for the purpose of defining distributions.
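
The idea of a uniform distribution with a default upper bound of 1.0 and a user-supplied alternative can be illustrated with the C++ standard library alone. This is only an analogy and makes no claim about the Kokkos interface; the helper name random_complex and its max_val parameter are hypothetical.

```cpp
#include <complex>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper using only the C++ standard library (not Kokkos):
// fills a vector with complex numbers whose real and imaginary parts are
// each drawn uniformly from [0, max_val), with max_val defaulting to 1.0.
std::vector<std::complex<double>> random_complex(std::size_t count,
                                                 std::uint64_t seed,
                                                 double max_val = 1.0) {
  std::mt19937_64 gen(seed);
  std::uniform_real_distribution<double> dist(0.0, max_val);
  std::vector<std::complex<double>> out;
  out.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    // Draw in separate statements so the order of the two draws is fixed.
    const double re = dist(gen);
    const double im = dist(gen);
    out.emplace_back(re, im);
  }
  return out;
}
```

Calling random_complex(10, 13245) uses the default bound of 1.0, while random_complex(10, 13245, 5.0) shows what a user-supplied maximum could look like.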

@kyungjoo-kim
Contributor

Okay. I understand it now.

@srajama1
Contributor

@kyungjoo-kim: Can we merge this in?

@kyungjoo-kim
Contributor

Merge them.

@ndellingwood ndellingwood merged commit 486dda8 into develop Dec 4, 2018
6 participants