New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

TPL Support for BLAS functions (nrm2, axpy, dot, gemm) using CuBLAS (Issue #247) #262

Merged

srajama1 merged 13 commits into develop from issue-247

Jul 9, 2018

Contributor

vqd8a commented Jun 18, 2018

Added cuBLAS TPL Support for nrm2, axpy, dot. More to come.

vqd8a added 3 commits

June 18, 2018 10:55


          Add cuBLAS TPL Support for nrm2, axpy, dot

eb0eab0


          Delete Test_Blas1_nrm2.hppORIG

2704c2b


          Delete KokkosBatched_Vector_test.hpp

0d670ff

vqd8a requested review from crtrott, kyungjoo-kim, ndellingwood and srajama1

June 18, 2018 17:05

vqd8a changed the title ~~TPL Support for all BLAS functions using CuBLAS (Issue 247)~~ TPL Support for all BLAS functions using CuBLAS (Issue #247)

mhoemmen reviewed

View reviewed changes

Contributor

mhoemmen left a comment

See comments. It's likely that this change will cause build warnings, due to signed / unsigned comparisons. Are you building with -Wall?

Makefile.kokkos-kernels Outdated

@@ @@ -386,6 +386,8 @@ ifeq ($(KOKKOSKERNELS_INTERNAL_INST_EXECSPACE_CUDA), 1) @@
                 endif
               endif
+              KOKKOSKERNELS_INTERNAL_SRC_BLAS += $(wildcard ${KOKKOSKERNELS_PATH}/src/impl/generated_specializations_cpp/KokkosBlas_Cuda_tpl.cpp)

Contributor

mhoemmen Jun 18, 2018

Does this even work? wildcard should expand * etc., but there's no * here.

Contributor Author

vqd8a Jun 18, 2018

@mhoemmen I did not build with -Wall. But it is still fine with -Wall. wildcard works. I use it here just for other possible PRs. To avoid confusion, I will wildcard.

Contributor Author

vqd8a Jun 18, 2018

remove wildcard

src/impl/tpls/KokkosBlas1_axpby_tpl_spec_decl.hpp Outdated

+              \
+                static void \
+                axpby (const AV& alpha, const XV& X, const BV& beta, const YV& Y) { \
+                  if((X.extent(0) < INT_MAX) && (beta == 1.0)) { \

Contributor

mhoemmen Jun 18, 2018

X.extent(0) may return size_t, which is usually unsigned. Comparison with INT_MAX may cause a build warning ("signed / unsigned comparison").

src/impl/tpls/KokkosBlas1_axpby_tpl_spec_decl.hpp Outdated

+                axpby (const AV& alpha, const XV& X, const BV& beta, const YV& Y) { \
+                  if((X.extent(0) < INT_MAX) && (beta == 1.0)) { \
+                    axpby_print_specialization<AV,XV,BV,YV>(); \
+                    int N = X.extent(0); \

Contributor

mhoemmen Jun 18, 2018

This may cause a build warning, since it assigns size_t (usually unsigned) to int. Try this instead:

const int N = static_cast<int> (X.extent (0));

src/impl/tpls/KokkosBlas1_axpby_tpl_spec_decl.hpp Outdated

+              \
+                static void \
+                axpby (const AV& alpha, const XV& X, const BV& beta, const YV& Y) { \
+                  if((X.extent(0) < INT_MAX) && (beta == 1.0f)) { \

Contributor

mhoemmen Jun 18, 2018

See above.

src/impl/tpls/KokkosBlas1_axpby_tpl_spec_decl.hpp Outdated

+                axpby (const AV& alpha, const XV& X, const BV& beta, const YV& Y) { \
+                  if((X.extent(0) < INT_MAX) && (beta == 1.0f)) { \
+                    axpby_print_specialization<AV,XV,BV,YV>(); \
+                    int N = X.extent(0); \

Contributor

mhoemmen Jun 18, 2018

See above.

src/impl/tpls/KokkosBlas1_nrm2_tpl_spec_decl.hpp Outdated

+                  if (numElems < static_cast<size_type> (INT_MAX)) { \
+                    nrm2_print_specialization<RV,XV>(); \
+                    int N = numElems; \
+                    int one = 1; \

Contributor

mhoemmen Jun 18, 2018

See above.

src/impl/tpls/KokkosBlas1_nrm2_tpl_spec_decl.hpp Outdated

+                  const size_type numElems = X.extent(0); \
+                  if (numElems < static_cast<size_type> (INT_MAX)) { \
+                    nrm2_print_specialization<RV,XV>(); \
+                    int N = numElems; \

Contributor

mhoemmen Jun 18, 2018

See above.

src/impl/tpls/KokkosBlas1_nrm2_tpl_spec_decl.hpp Outdated

+                  if (numElems < static_cast<size_type> (INT_MAX)) { \
+                    nrm2_print_specialization<RV,XV>(); \
+                    int N = numElems; \
+                    int one = 1; \

Contributor

mhoemmen Jun 18, 2018

See above.

src/impl/tpls/KokkosBlas1_nrm2_tpl_spec_decl.hpp Outdated

+                  const size_type numElems = X.extent(0); \
+                  if (numElems < static_cast<size_type> (INT_MAX)) { \
+                    nrm2_print_specialization<RV,XV>(); \
+                    int N = numElems; \

Contributor

mhoemmen Jun 18, 2018

See above.

src/impl/tpls/KokkosBlas1_nrm2_tpl_spec_decl.hpp Outdated

+                  if (numElems < static_cast<size_type> (INT_MAX)) { \
+                    nrm2_print_specialization<RV,XV>(); \
+                    int N = numElems; \
+                    int one = 1; \

Contributor

mhoemmen Jun 18, 2018

See above.


          Avoid potential build warnings (e.g. signed/unsigned comparisons)

29cd214

Contributor Author

vqd8a commented Jun 18, 2018

Thanks @mhoemmen for the comments. I will fix these.

Contributor

srajama1 commented Jun 19, 2018

Is the BLAS3 (GEMM especially) done ? We have an use for it, so it would be nice to prioritize that next if you can.

Contributor

srajama1 commented Jun 19, 2018 •

edited

Loading

Do we have to do anything in generate_makefile ? How about the integration w/ Trilinos TPL mechanisms ?

ndellingwood mentioned this pull request

TPL Support for all BLAS functions using CuBLAS #247

Closed

Contributor Author

vqd8a commented Jun 19, 2018

@srajama1 I made a minor change in Makefile.kokkos-kernels which adds the source file for handling cuBLAS.
I have not had an experience in Trilinos integration. Maybe @crtrott can answer.
I haven't done BLAS3 yet. But can work on it now.

Contributor

kyungjoo-kim commented Jun 19, 2018

I remembered that @mhoemmen commented that two versions of cublas should be harnessed for sierra build. @mhoemmen How do we test it ?

Contributor

mhoemmen commented Jun 19, 2018

@kyungjoo-kim Just don't expose any of the cublas headers in any header file, and you should be fine. The main issue is that FETI uses v2 of the cuBLAS API. If you include the v1 cuBLAS header, FETI won't build.

Contributor

mhoemmen commented Jun 19, 2018

@kyungjoo-kim It's OK to use both versions of the cuBLAS API in the same executable, but you aren't allowed to include both header files in the same compilation unit.

Contributor

mhoemmen commented Jun 19, 2018

@kyungjoo-kim If you must expose the header file (ask yourself why you really must; you almost certainly do not), then please use v2 of the cuBLAS API. That shouldn't break FETI.

crtrott requested changes

View reviewed changes

Member

crtrott left a comment

At minimum move a file

Makefile.kokkos-kernels Outdated

@@ @@ -386,6 +386,8 @@ ifeq ($(KOKKOSKERNELS_INTERNAL_INST_EXECSPACE_CUDA), 1) @@
                 endif
               endif
+              KOKKOSKERNELS_INTERNAL_SRC_BLAS += ${KOKKOSKERNELS_PATH}/src/impl/generated_specializations_cpp/KokkosBlas_Cuda_tpl.cpp

Member

crtrott Jun 19, 2018

This file should not be in a generated_specializations_cpp. That should only have auto generated files.

Contributor Author

vqd8a Jun 20, 2018

@crtrott I am moving this file to tpls

src/impl/tpls/KokkosBlas_Cuda_tpl.hpp Outdated

+              }
+              CudaBlasSingleton & CudaBlasSingleton::singleton()
+              { static CudaBlasSingleton s ; return s ; }

Member

crtrott Jun 19, 2018

Do we need to do this via a pointer or a create function or something. cublasCreate needs to be called after Kokkos::initialize().

Contributor Author

vqd8a Jun 20, 2018

@crtrott cublasCreate is already called after Kokkos::initialize()

Contributor

mhoemmen Jun 20, 2018

The singleton won't get destroyed until after main(). I'm not comfortable with the idea of CUDA state persisting after Kokkos::finalize. Why not get rid of the destructor, and have the constructor do the following?

CudaBlasSingleton::CudaBlasSingleton()
{    
  cublasCreate( & handle );
  cublasStatus_t stat = cublasCreate(&handle);
  if (stat != CUBLAS_STATUS_SUCCESS) {
    Kokkos::abort("CUBLAS initialization failed!");
  }
  Kokkos::push_finalize_hook ([&] () { cublasDestroy (handle); });
}

Contributor

mhoemmen Jun 20, 2018

Also, please never print anything (by default) unless something went wrong. Those printf statements will be very annoying if running on many nodes.

Member

crtrott commented Jun 19, 2018

v1 is deprecated. We should not include that ...

vqd8a added 3 commits

June 20, 2018 09:43


          Move KokkosBlas_Cuda_tpl.cpp to tpls

8cc6c91


          Add CUBLAS if guard

5adcfcd


          Add cuBLAS TPL Support for gemm

23c7e71

Contributor Author

vqd8a commented Jun 20, 2018

@srajama1 cuBLAS GEMM support has been added.

vqd8a changed the title ~~TPL Support for all BLAS functions using CuBLAS (Issue #247)~~ TPL Support for BLAS functions (nrm2, axpy, dot, gemm) using CuBLAS (Issue #247)

Contributor Author

vqd8a commented Jun 20, 2018

Changed PR name for partially addressing #247 for convenience. I will create separate PRs for the remaining BLAS functions.

srajama1 reviewed

View reviewed changes

Contributor

srajama1 left a comment

Is the handle that we expose for sparse cases an option here so users create the handle and destroy it both for sparse and dense KK functions ? Discussed this w/ @vqd8a already. @crtrott what do you think ? Can Tpetra create and delete the handles ?

Contributor

mhoemmen commented Jun 20, 2018

If the handle only ever gets created with the default CUDA stream, then the right way to deal with it is this:

Create it once, lazily, on demand
Add it as a Kokkos::finalize hook

If it's possible to create the handle with other CUDA streams, then users (such as Tpetra and other downstream software) need a way to create, manage, and destroy handles.

Contributor

srajama1 commented Jun 20, 2018

I was wondering if the users like Tpetra would like to control things like cublasSetMathMode, cublasSetAtomicMode, cublasSetMatrix/VectorAsync, and cuBLASSetStream. KK can leave most of it to default, but making sure we don't force it. The disadvantage of the handle is of course the users have to worry about it even w/ other libraries like MKL (or with no libraries).

mhoemmen previously requested changes

View reviewed changes

src/impl/tpls/KokkosBlas1_axpby_tpl_spec_avail.hpp Outdated

+                                Kokkos::MemoryTraits<Kokkos::Unmanaged> >, \
+>  { enum : bool { value = true }; };
+              KOKKOSBLAS1_AXPBY_TPL_SPEC_AVAIL_CUBLAS( double,                  Kokkos::LayoutLeft, Kokkos::CudaSpace)

Contributor

mhoemmen Jun 20, 2018

Shouldn't you use appropriate kokkos-kernels macros to detect whether those Scalar types (e.g., float) are enabled? Otherwise, you'll be instantiating for types for which the user did not want to instantiate. This will increase build time and library size.

src/impl/tpls/KokkosBlas_Cuda_tpl.hpp Outdated

+              }
+              CudaBlasSingleton & CudaBlasSingleton::singleton()
+              { static CudaBlasSingleton s ; return s ; }

Contributor

mhoemmen Jun 20, 2018

The singleton won't get destroyed until after main(). I'm not comfortable with the idea of CUDA state persisting after Kokkos::finalize. Why not get rid of the destructor, and have the constructor do the following?

CudaBlasSingleton::CudaBlasSingleton()
{    
  cublasCreate( & handle );
  cublasStatus_t stat = cublasCreate(&handle);
  if (stat != CUBLAS_STATUS_SUCCESS) {
    Kokkos::abort("CUBLAS initialization failed!");
  }
  Kokkos::push_finalize_hook ([&] () { cublasDestroy (handle); });
}

src/impl/tpls/KokkosBlas_Cuda_tpl.hpp Outdated

+              }
+              CudaBlasSingleton & CudaBlasSingleton::singleton()
+              { static CudaBlasSingleton s ; return s ; }

Contributor

mhoemmen Jun 20, 2018

Also, please never print anything (by default) unless something went wrong. Those printf statements will be very annoying if running on many nodes.

vqd8a added 2 commits

June 20, 2018 15:43


          Hook cublasDestroy to Kokkos::finalize

578c409


          Hook cublasDestroy to Kokkos::finalize

864c6fd

Contributor

mhoemmen commented Jun 20, 2018

I was wondering if the users like Tpetra would like to control things like cublasSetMathMode, cublasSetAtomicMode, cublasSetMatrix/VectorAsync, and cuBLASSetStream.

cublasSet{Matrix,Vector}Async appear to be redundant with respect to Kokkos::deep_copy
"math mode" only appears relevant for float and half, not for double etc.
"atomic mode" would give nondeterministic results so I'd rather not enable it unless somebody really insists

That leaves setting the stream. I could see users wanting to do that. MAGMA has a handle tied to a CUDA stream, for example. Better exposure of distinct CUDA streams would help us overlap data movement and computation.

The main value of hiding the stream is for gradually porting from existing interfaces like Teuchos::BLAS to this new interface. I think that's a good idea. It will be a while before Trilinos is ready to use multiple CUDA streams, so maybe we don't have to worry about that for now. Even three-argument Kokkos::deep_copy currently ignores its execution space instance argument: https://github.com/kokkos/kokkos/blob/d3a941925cbfb71785d8ea68259123ed52d3f9da/core/src/Kokkos_CopyViews.hpp#L1511

mhoemmen reviewed

View reviewed changes

Contributor

mhoemmen left a comment

@vqd8a Thanks for fixing the CudaBlasSingleton constructor! I appreciate also that you took out the printf statements :-) .

mhoemmen dismissed their stale review

June 20, 2018 22:19

I'm dismissing my review because I'm not sure how kokkos-kernels wants to do ETI. Also, since this is a BLAS interface and we want to use it to replace things like Teuchos::BLAS, it may make sense to instantiate for the four BLAS types, independently of kokkos-kernels' ETI settings.


          Use kokkos-kernels macros to detect Scalar types

5f7b843

Contributor Author

vqd8a commented Jun 21, 2018

Thanks @mhoemmen for your suggestions. While waiting for decision, I have just fixed the use of kokkos-kernes macros to detect Scalar types as you suggested.

mhoemmen reviewed

View reviewed changes

Contributor

mhoemmen left a comment

Just a comment -- thanks!

src/impl/tpls/KokkosBlas1_dot_tpl_spec_avail.hpp Outdated

               KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_CUBLAS( float,                   Kokkos::LayoutLeft, Kokkos::CudaSpace)
+              #endif
+              #if defined (KOKKOSKERNELS_INST_KOKKOS_COMPLEX_DOUBLE_)

Contributor

mhoemmen Jun 21, 2018

How come the complex double and complex float macros end with underscores, but the double and float ones don't?

Contributor Author

vqd8a Jun 21, 2018 •

edited

Loading

I think we can use either without or with underscore, since in the KokkosKernels_config.h, it uses

#if defined(KOKKOSKERNELS_INST_COMPLEX_DOUBLE)
#define KOKKOSKERNELS_INST_KOKKOS_COMPLEX_DOUBLE_
#endif

but I just followed the convention in hpp files in \src\impl\generated_specializations_hpp

Contributor

mhoemmen Jun 21, 2018

OK, just checking :)


          Add IF guard for checking CUBLAS enabled

31d3489

Contributor

srajama1 commented Jun 22, 2018

@crtrott : Can you check this ? Need your opinion on two things, exposing (or not exposing) the handle and the ETI use case mentioned above.

Contributor Author

vqd8a commented Jun 26, 2018

@srajama1 , I talked to @crtrott last week. @crtrott said not exposing handles could be the option for now. and leave exposing for future. If it is the case, I can run the spot-check when the testbeds are back on.

Contributor

srajama1 commented Jun 26, 2018

Agreed. Run the spot check when you can. @crtrott still needs to approve it as it is blocked on his request for changes.

crtrott approved these changes

View reviewed changes

Member

crtrott commented Jun 27, 2018

Looks good to me if you can run the spot-check. But also open a follow up: we may want to have the TPL variants independent of the INST macros. You see the INST macros don't say whether something is available, they say wether we do ETI. As long as you don't also define ETI only, all the other calls can still be made. And I think it is fine in that case to have the TPL calls available.

vqd8a added 2 commits

July 6, 2018 13:32


          Fix BLAS library path

6e1821d


          Fix BLAS library path

0d55406

Contributor Author

vqd8a commented Jul 6, 2018 •

edited

Loading

Spot check on white:
kokkos-kernels/scripts/test_all_sandia cuda --spot-check --with-cuda-options=enable_lambda --with-tpls=cublas

Going to test compilers:  cuda/8.0.44 cuda/9.0.103
Testing compiler cuda/8.0.44
  Starting job cuda-8.0.44-Cuda_OpenMP-release
  PASSED cuda-8.0.44-Cuda_OpenMP-release
  Starting job cuda-8.0.44-Cuda_Serial-release
  PASSED cuda-8.0.44-Cuda_Serial-release
Testing compiler cuda/9.0.103
  Starting job cuda-9.0.103-Cuda_OpenMP-release
  PASSED cuda-9.0.103-Cuda_OpenMP-release
  Starting job cuda-9.0.103-Cuda_Serial-release
  PASSED cuda-9.0.103-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-8.0.44-Cuda_OpenMP-release build_time=823 run_time=751
cuda-8.0.44-Cuda_Serial-release build_time=792 run_time=1078
cuda-9.0.103-Cuda_OpenMP-release build_time=862 run_time=761
cuda-9.0.103-Cuda_Serial-release build_time=800 run_time=1087

srajama1 approved these changes

View reviewed changes

Contributor

srajama1 commented Jul 9, 2018

Once you have a KNL spot check we can merge this.

Contributor

srajama1 commented Jul 9, 2018

Ignore the silly request for KNL, this shouldn't affect any KNL, so merging.

srajama1 merged commit fbf2cc8 into develop

vqd8a deleted the issue-247 branch

July 24, 2018 15:56

kokkos-devops-admin mentioned this pull request

Allow choosing coloring algorithm in multicolor GS #1199

Merged

kokkos-devops-admin mentioned this pull request

Remove non-existant subdir kokkos-kernels/common/common (#11921, #11863) #1854

Merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

mhoemmen mhoemmen left review comments

crtrott crtrott approved these changes

srajama1 srajama1 approved these changes

kyungjoo-kim Awaiting requested review from kyungjoo-kim

ndellingwood Awaiting requested review from ndellingwood

Labels

None yet