Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TPL Support for BLAS functions (nrm2, axpy, dot, gemm) using CuBLAS (Issue #247) #262

Merged
merged 13 commits into from
Jul 9, 2018

Conversation

vqd8a
Copy link
Contributor

@vqd8a vqd8a commented Jun 18, 2018

Added cuBLAS TPL Support for nrm2, axpy, dot. More to come.

@vqd8a vqd8a changed the title TPL Support for all BLAS functions using CuBLAS (Issue 247) TPL Support for all BLAS functions using CuBLAS (Issue #247) Jun 18, 2018
Copy link
Contributor

@mhoemmen mhoemmen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments. It's likely that this change will cause build warnings, due to signed / unsigned comparisons. Are you building with -Wall?

@@ -386,6 +386,8 @@ ifeq ($(KOKKOSKERNELS_INTERNAL_INST_EXECSPACE_CUDA), 1)
endif
endif

KOKKOSKERNELS_INTERNAL_SRC_BLAS += $(wildcard ${KOKKOSKERNELS_PATH}/src/impl/generated_specializations_cpp/KokkosBlas_Cuda_tpl.cpp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this even work? wildcard should expand * etc., but there's no * here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhoemmen I did not build with -Wall. But it is still fine with -Wall. wildcard works. I use it here just for other possible PRs. To avoid confusion, I will wildcard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove wildcard

\
static void \
axpby (const AV& alpha, const XV& X, const BV& beta, const YV& Y) { \
if((X.extent(0) < INT_MAX) && (beta == 1.0)) { \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X.extent(0) may return size_t, which is usually unsigned. Comparison with INT_MAX may cause a build warning ("signed / unsigned comparison").

axpby (const AV& alpha, const XV& X, const BV& beta, const YV& Y) { \
if((X.extent(0) < INT_MAX) && (beta == 1.0)) { \
axpby_print_specialization<AV,XV,BV,YV>(); \
int N = X.extent(0); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may cause a build warning, since it assigns size_t (usually unsigned) to int. Try this instead:

const int N = static_cast<int> (X.extent (0));

\
static void \
axpby (const AV& alpha, const XV& X, const BV& beta, const YV& Y) { \
if((X.extent(0) < INT_MAX) && (beta == 1.0f)) { \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

axpby (const AV& alpha, const XV& X, const BV& beta, const YV& Y) { \
if((X.extent(0) < INT_MAX) && (beta == 1.0f)) { \
axpby_print_specialization<AV,XV,BV,YV>(); \
int N = X.extent(0); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

if (numElems < static_cast<size_type> (INT_MAX)) { \
nrm2_print_specialization<RV,XV>(); \
int N = numElems; \
int one = 1; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

const size_type numElems = X.extent(0); \
if (numElems < static_cast<size_type> (INT_MAX)) { \
nrm2_print_specialization<RV,XV>(); \
int N = numElems; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

if (numElems < static_cast<size_type> (INT_MAX)) { \
nrm2_print_specialization<RV,XV>(); \
int N = numElems; \
int one = 1; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

const size_type numElems = X.extent(0); \
if (numElems < static_cast<size_type> (INT_MAX)) { \
nrm2_print_specialization<RV,XV>(); \
int N = numElems; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

if (numElems < static_cast<size_type> (INT_MAX)) { \
nrm2_print_specialization<RV,XV>(); \
int N = numElems; \
int one = 1; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

@vqd8a
Copy link
Contributor Author

vqd8a commented Jun 18, 2018

Thanks @mhoemmen for the comments. I will fix these.

@srajama1
Copy link
Contributor

Is the BLAS3 (GEMM especially) done ? We have an use for it, so it would be nice to prioritize that next if you can.

@srajama1
Copy link
Contributor

srajama1 commented Jun 19, 2018

Do we have to do anything in generate_makefile ? How about the integration w/ Trilinos TPL mechanisms ?

@vqd8a
Copy link
Contributor Author

vqd8a commented Jun 19, 2018

@srajama1 I made a minor change in Makefile.kokkos-kernels which adds the source file for handling cuBLAS.
I have not had an experience in Trilinos integration. Maybe @crtrott can answer.
I haven't done BLAS3 yet. But can work on it now.

@kyungjoo-kim
Copy link
Contributor

I remembered that @mhoemmen commented that two versions of cublas should be harnessed for sierra build. @mhoemmen How do we test it ?

@mhoemmen
Copy link
Contributor

@kyungjoo-kim Just don't expose any of the cublas headers in any header file, and you should be fine. The main issue is that FETI uses v2 of the cuBLAS API. If you include the v1 cuBLAS header, FETI won't build.

@mhoemmen
Copy link
Contributor

@kyungjoo-kim It's OK to use both versions of the cuBLAS API in the same executable, but you aren't allowed to include both header files in the same compilation unit.

@mhoemmen
Copy link
Contributor

@kyungjoo-kim If you must expose the header file (ask yourself why you really must; you almost certainly do not), then please use v2 of the cuBLAS API. That shouldn't break FETI.

Copy link
Member

@crtrott crtrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At minimum move a file

@@ -386,6 +386,8 @@ ifeq ($(KOKKOSKERNELS_INTERNAL_INST_EXECSPACE_CUDA), 1)
endif
endif

KOKKOSKERNELS_INTERNAL_SRC_BLAS += ${KOKKOSKERNELS_PATH}/src/impl/generated_specializations_cpp/KokkosBlas_Cuda_tpl.cpp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should not be in a generated_specializations_cpp. That should only have auto generated files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crtrott I am moving this file to tpls

}

CudaBlasSingleton & CudaBlasSingleton::singleton()
{ static CudaBlasSingleton s ; return s ; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do this via a pointer or a create function or something. cublasCreate needs to be called after Kokkos::initialize().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crtrott cublasCreate is already called after Kokkos::initialize()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The singleton won't get destroyed until after main(). I'm not comfortable with the idea of CUDA state persisting after Kokkos::finalize. Why not get rid of the destructor, and have the constructor do the following?

CudaBlasSingleton::CudaBlasSingleton()
{    
  cublasCreate( & handle );
  cublasStatus_t stat = cublasCreate(&handle);
  if (stat != CUBLAS_STATUS_SUCCESS) {
    Kokkos::abort("CUBLAS initialization failed!");
  }
  Kokkos::push_finalize_hook ([&] () { cublasDestroy (handle); });
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please never print anything (by default) unless something went wrong. Those printf statements will be very annoying if running on many nodes.

@crtrott
Copy link
Member

crtrott commented Jun 19, 2018

v1 is deprecated. We should not include that ...

@vqd8a
Copy link
Contributor Author

vqd8a commented Jun 20, 2018

@srajama1 cuBLAS GEMM support has been added.

@vqd8a vqd8a changed the title TPL Support for all BLAS functions using CuBLAS (Issue #247) TPL Support for BLAS functions (nrm2, axpy, dot, gemm) using CuBLAS (Issue #247) Jun 20, 2018
@vqd8a
Copy link
Contributor Author

vqd8a commented Jun 20, 2018

Changed PR name for partially addressing #247 for convenience. I will create separate PRs for the remaining BLAS functions.

Copy link
Contributor

@srajama1 srajama1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the handle that we expose for sparse cases an option here so users create the handle and destroy it both for sparse and dense KK functions ? Discussed this w/ @vqd8a already. @crtrott what do you think ? Can Tpetra create and delete the handles ?

@mhoemmen
Copy link
Contributor

If the handle only ever gets created with the default CUDA stream, then the right way to deal with it is this:

  1. Create it once, lazily, on demand
  2. Add it as a Kokkos::finalize hook

If it's possible to create the handle with other CUDA streams, then users (such as Tpetra and other downstream software) need a way to create, manage, and destroy handles.

@srajama1
Copy link
Contributor

I was wondering if the users like Tpetra would like to control things like cublasSetMathMode, cublasSetAtomicMode, cublasSetMatrix/VectorAsync, and cuBLASSetStream. KK can leave most of it to default, but making sure we don't force it. The disadvantage of the handle is of course the users have to worry about it even w/ other libraries like MKL (or with no libraries).

Kokkos::MemoryTraits<Kokkos::Unmanaged> >, \
1> { enum : bool { value = true }; };

KOKKOSBLAS1_AXPBY_TPL_SPEC_AVAIL_CUBLAS( double, Kokkos::LayoutLeft, Kokkos::CudaSpace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you use appropriate kokkos-kernels macros to detect whether those Scalar types (e.g., float) are enabled? Otherwise, you'll be instantiating for types for which the user did not want to instantiate. This will increase build time and library size.

}

CudaBlasSingleton & CudaBlasSingleton::singleton()
{ static CudaBlasSingleton s ; return s ; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The singleton won't get destroyed until after main(). I'm not comfortable with the idea of CUDA state persisting after Kokkos::finalize. Why not get rid of the destructor, and have the constructor do the following?

CudaBlasSingleton::CudaBlasSingleton()
{    
  cublasCreate( & handle );
  cublasStatus_t stat = cublasCreate(&handle);
  if (stat != CUBLAS_STATUS_SUCCESS) {
    Kokkos::abort("CUBLAS initialization failed!");
  }
  Kokkos::push_finalize_hook ([&] () { cublasDestroy (handle); });
}

}

CudaBlasSingleton & CudaBlasSingleton::singleton()
{ static CudaBlasSingleton s ; return s ; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please never print anything (by default) unless something went wrong. Those printf statements will be very annoying if running on many nodes.

@mhoemmen
Copy link
Contributor

I was wondering if the users like Tpetra would like to control things like cublasSetMathMode, cublasSetAtomicMode, cublasSetMatrix/VectorAsync, and cuBLASSetStream.

  • cublasSet{Matrix,Vector}Async appear to be redundant with respect to Kokkos::deep_copy
  • "math mode" only appears relevant for float and half, not for double etc.
  • "atomic mode" would give nondeterministic results so I'd rather not enable it unless somebody really insists

That leaves setting the stream. I could see users wanting to do that. MAGMA has a handle tied to a CUDA stream, for example. Better exposure of distinct CUDA streams would help us overlap data movement and computation.

The main value of hiding the stream is for gradually porting from existing interfaces like Teuchos::BLAS to this new interface. I think that's a good idea. It will be a while before Trilinos is ready to use multiple CUDA streams, so maybe we don't have to worry about that for now. Even three-argument Kokkos::deep_copy currently ignores its execution space instance argument: https://github.com/kokkos/kokkos/blob/d3a941925cbfb71785d8ea68259123ed52d3f9da/core/src/Kokkos_CopyViews.hpp#L1511

Copy link
Contributor

@mhoemmen mhoemmen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vqd8a Thanks for fixing the CudaBlasSingleton constructor! I appreciate also that you took out the printf statements :-) .

@mhoemmen mhoemmen dismissed their stale review June 20, 2018 22:19

I'm dismissing my review because I'm not sure how kokkos-kernels wants to do ETI. Also, since this is a BLAS interface and we want to use it to replace things like Teuchos::BLAS, it may make sense to instantiate for the four BLAS types, independently of kokkos-kernels' ETI settings.

@vqd8a
Copy link
Contributor Author

vqd8a commented Jun 21, 2018

Thanks @mhoemmen for your suggestions. While waiting for decision, I have just fixed the use of kokkos-kernes macros to detect Scalar types as you suggested.

Copy link
Contributor

@mhoemmen mhoemmen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a comment -- thanks!

KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_CUBLAS( float, Kokkos::LayoutLeft, Kokkos::CudaSpace)
#endif
#if defined (KOKKOSKERNELS_INST_KOKKOS_COMPLEX_DOUBLE_)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come the complex double and complex float macros end with underscores, but the double and float ones don't?

Copy link
Contributor Author

@vqd8a vqd8a Jun 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use either without or with underscore, since in the KokkosKernels_config.h, it uses

#if defined(KOKKOSKERNELS_INST_COMPLEX_DOUBLE)
#define KOKKOSKERNELS_INST_KOKKOS_COMPLEX_DOUBLE_
#endif

but I just followed the convention in hpp files in \src\impl\generated_specializations_hpp

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, just checking :)

@srajama1
Copy link
Contributor

@crtrott : Can you check this ? Need your opinion on two things, exposing (or not exposing) the handle and the ETI use case mentioned above.

@vqd8a
Copy link
Contributor Author

vqd8a commented Jun 26, 2018

@srajama1 , I talked to @crtrott last week. @crtrott said not exposing handles could be the option for now. and leave exposing for future. If it is the case, I can run the spot-check when the testbeds are back on.

@srajama1
Copy link
Contributor

Agreed. Run the spot check when you can. @crtrott still needs to approve it as it is blocked on his request for changes.

@crtrott
Copy link
Member

crtrott commented Jun 27, 2018

Looks good to me if you can run the spot-check. But also open a follow up: we may want to have the TPL variants independent of the INST macros. You see the INST macros don't say whether something is available, they say wether we do ETI. As long as you don't also define ETI only, all the other calls can still be made. And I think it is fine in that case to have the TPL calls available.

@vqd8a
Copy link
Contributor Author

vqd8a commented Jul 6, 2018

Spot check on white:
kokkos-kernels/scripts/test_all_sandia cuda --spot-check --with-cuda-options=enable_lambda --with-tpls=cublas

Going to test compilers:  cuda/8.0.44 cuda/9.0.103
Testing compiler cuda/8.0.44
  Starting job cuda-8.0.44-Cuda_OpenMP-release
  PASSED cuda-8.0.44-Cuda_OpenMP-release
  Starting job cuda-8.0.44-Cuda_Serial-release
  PASSED cuda-8.0.44-Cuda_Serial-release
Testing compiler cuda/9.0.103
  Starting job cuda-9.0.103-Cuda_OpenMP-release
  PASSED cuda-9.0.103-Cuda_OpenMP-release
  Starting job cuda-9.0.103-Cuda_Serial-release
  PASSED cuda-9.0.103-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-8.0.44-Cuda_OpenMP-release build_time=823 run_time=751
cuda-8.0.44-Cuda_Serial-release build_time=792 run_time=1078
cuda-9.0.103-Cuda_OpenMP-release build_time=862 run_time=761
cuda-9.0.103-Cuda_Serial-release build_time=800 run_time=1087

@srajama1
Copy link
Contributor

srajama1 commented Jul 9, 2018

Once you have a KNL spot check we can merge this.

@srajama1
Copy link
Contributor

srajama1 commented Jul 9, 2018

Ignore the silly request for KNL, this shouldn't affect any KNL, so merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants