-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TPL Support for BLAS functions (nrm2, axpy, dot, gemm) using CuBLAS (Issue #247) #262
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments. It's likely that this change will cause build warnings, due to signed / unsigned comparisons. Are you building with -Wall
?
Makefile.kokkos-kernels
Outdated
@@ -386,6 +386,8 @@ ifeq ($(KOKKOSKERNELS_INTERNAL_INST_EXECSPACE_CUDA), 1) | |||
endif | |||
endif | |||
|
|||
KOKKOSKERNELS_INTERNAL_SRC_BLAS += $(wildcard ${KOKKOSKERNELS_PATH}/src/impl/generated_specializations_cpp/KokkosBlas_Cuda_tpl.cpp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this even work? wildcard
should expand *
etc., but there's no *
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhoemmen I did not build with -Wall
. But it is still fine with -Wall
. wildcard
works. I use it here just for other possible PRs. To avoid confusion, I will wildcard
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove wildcard
\ | ||
static void \ | ||
axpby (const AV& alpha, const XV& X, const BV& beta, const YV& Y) { \ | ||
if((X.extent(0) < INT_MAX) && (beta == 1.0)) { \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X.extent(0)
may return size_t
, which is usually unsigned. Comparison with INT_MAX
may cause a build warning ("signed / unsigned comparison").
axpby (const AV& alpha, const XV& X, const BV& beta, const YV& Y) { \ | ||
if((X.extent(0) < INT_MAX) && (beta == 1.0)) { \ | ||
axpby_print_specialization<AV,XV,BV,YV>(); \ | ||
int N = X.extent(0); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may cause a build warning, since it assigns size_t (usually unsigned) to int. Try this instead:
const int N = static_cast<int> (X.extent (0));
\ | ||
static void \ | ||
axpby (const AV& alpha, const XV& X, const BV& beta, const YV& Y) { \ | ||
if((X.extent(0) < INT_MAX) && (beta == 1.0f)) { \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
axpby (const AV& alpha, const XV& X, const BV& beta, const YV& Y) { \ | ||
if((X.extent(0) < INT_MAX) && (beta == 1.0f)) { \ | ||
axpby_print_specialization<AV,XV,BV,YV>(); \ | ||
int N = X.extent(0); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
if (numElems < static_cast<size_type> (INT_MAX)) { \ | ||
nrm2_print_specialization<RV,XV>(); \ | ||
int N = numElems; \ | ||
int one = 1; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
const size_type numElems = X.extent(0); \ | ||
if (numElems < static_cast<size_type> (INT_MAX)) { \ | ||
nrm2_print_specialization<RV,XV>(); \ | ||
int N = numElems; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
if (numElems < static_cast<size_type> (INT_MAX)) { \ | ||
nrm2_print_specialization<RV,XV>(); \ | ||
int N = numElems; \ | ||
int one = 1; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
const size_type numElems = X.extent(0); \ | ||
if (numElems < static_cast<size_type> (INT_MAX)) { \ | ||
nrm2_print_specialization<RV,XV>(); \ | ||
int N = numElems; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
if (numElems < static_cast<size_type> (INT_MAX)) { \ | ||
nrm2_print_specialization<RV,XV>(); \ | ||
int N = numElems; \ | ||
int one = 1; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
Thanks @mhoemmen for the comments. I will fix these. |
Is the BLAS3 (GEMM especially) done ? We have an use for it, so it would be nice to prioritize that next if you can. |
Do we have to do anything in generate_makefile ? How about the integration w/ Trilinos TPL mechanisms ? |
@kyungjoo-kim Just don't expose any of the cublas headers in any header file, and you should be fine. The main issue is that FETI uses v2 of the cuBLAS API. If you include the v1 cuBLAS header, FETI won't build. |
@kyungjoo-kim It's OK to use both versions of the cuBLAS API in the same executable, but you aren't allowed to include both header files in the same compilation unit. |
@kyungjoo-kim If you must expose the header file (ask yourself why you really must; you almost certainly do not), then please use v2 of the cuBLAS API. That shouldn't break FETI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At minimum move a file
Makefile.kokkos-kernels
Outdated
@@ -386,6 +386,8 @@ ifeq ($(KOKKOSKERNELS_INTERNAL_INST_EXECSPACE_CUDA), 1) | |||
endif | |||
endif | |||
|
|||
KOKKOSKERNELS_INTERNAL_SRC_BLAS += ${KOKKOSKERNELS_PATH}/src/impl/generated_specializations_cpp/KokkosBlas_Cuda_tpl.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should not be in a generated_specializations_cpp. That should only have auto generated files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@crtrott I am moving this file to tpls
} | ||
|
||
CudaBlasSingleton & CudaBlasSingleton::singleton() | ||
{ static CudaBlasSingleton s ; return s ; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to do this via a pointer or a create function or something. cublasCreate needs to be called after Kokkos::initialize().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@crtrott cublasCreate is already called after Kokkos::initialize()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The singleton won't get destroyed until after main(). I'm not comfortable with the idea of CUDA state persisting after Kokkos::finalize. Why not get rid of the destructor, and have the constructor do the following?
CudaBlasSingleton::CudaBlasSingleton()
{
cublasCreate( & handle );
cublasStatus_t stat = cublasCreate(&handle);
if (stat != CUBLAS_STATUS_SUCCESS) {
Kokkos::abort("CUBLAS initialization failed!");
}
Kokkos::push_finalize_hook ([&] () { cublasDestroy (handle); });
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please never print anything (by default) unless something went wrong. Those printf
statements will be very annoying if running on many nodes.
v1 is deprecated. We should not include that ... |
@srajama1 cuBLAS GEMM support has been added. |
Changed PR name for partially addressing #247 for convenience. I will create separate PRs for the remaining BLAS functions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the handle only ever gets created with the default CUDA stream, then the right way to deal with it is this:
If it's possible to create the handle with other CUDA streams, then users (such as Tpetra and other downstream software) need a way to create, manage, and destroy handles. |
I was wondering if the users like Tpetra would like to control things like cublasSetMathMode, cublasSetAtomicMode, cublasSetMatrix/VectorAsync, and cuBLASSetStream. KK can leave most of it to default, but making sure we don't force it. The disadvantage of the handle is of course the users have to worry about it even w/ other libraries like MKL (or with no libraries). |
Kokkos::MemoryTraits<Kokkos::Unmanaged> >, \ | ||
1> { enum : bool { value = true }; }; | ||
|
||
KOKKOSBLAS1_AXPBY_TPL_SPEC_AVAIL_CUBLAS( double, Kokkos::LayoutLeft, Kokkos::CudaSpace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you use appropriate kokkos-kernels macros to detect whether those Scalar types (e.g., float) are enabled? Otherwise, you'll be instantiating for types for which the user did not want to instantiate. This will increase build time and library size.
} | ||
|
||
CudaBlasSingleton & CudaBlasSingleton::singleton() | ||
{ static CudaBlasSingleton s ; return s ; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The singleton won't get destroyed until after main(). I'm not comfortable with the idea of CUDA state persisting after Kokkos::finalize. Why not get rid of the destructor, and have the constructor do the following?
CudaBlasSingleton::CudaBlasSingleton()
{
cublasCreate( & handle );
cublasStatus_t stat = cublasCreate(&handle);
if (stat != CUBLAS_STATUS_SUCCESS) {
Kokkos::abort("CUBLAS initialization failed!");
}
Kokkos::push_finalize_hook ([&] () { cublasDestroy (handle); });
}
} | ||
|
||
CudaBlasSingleton & CudaBlasSingleton::singleton() | ||
{ static CudaBlasSingleton s ; return s ; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please never print anything (by default) unless something went wrong. Those printf
statements will be very annoying if running on many nodes.
That leaves setting the stream. I could see users wanting to do that. MAGMA has a handle tied to a CUDA stream, for example. Better exposure of distinct CUDA streams would help us overlap data movement and computation. The main value of hiding the stream is for gradually porting from existing interfaces like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vqd8a Thanks for fixing the CudaBlasSingleton constructor! I appreciate also that you took out the printf statements :-) .
I'm dismissing my review because I'm not sure how kokkos-kernels wants to do ETI. Also, since this is a BLAS interface and we want to use it to replace things like Teuchos::BLAS, it may make sense to instantiate for the four BLAS types, independently of kokkos-kernels' ETI settings.
Thanks @mhoemmen for your suggestions. While waiting for decision, I have just fixed the use of kokkos-kernes macros to detect Scalar types as you suggested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a comment -- thanks!
KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_CUBLAS( float, Kokkos::LayoutLeft, Kokkos::CudaSpace) | ||
#endif | ||
#if defined (KOKKOSKERNELS_INST_KOKKOS_COMPLEX_DOUBLE_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come the complex double and complex float macros end with underscores, but the double and float ones don't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use either without or with underscore, since in the KokkosKernels_config.h
, it uses
#if defined(KOKKOSKERNELS_INST_COMPLEX_DOUBLE)
#define KOKKOSKERNELS_INST_KOKKOS_COMPLEX_DOUBLE_
#endif
but I just followed the convention in hpp files in \src\impl\generated_specializations_hpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, just checking :)
@crtrott : Can you check this ? Need your opinion on two things, exposing (or not exposing) the handle and the ETI use case mentioned above. |
Agreed. Run the spot check when you can. @crtrott still needs to approve it as it is blocked on his request for changes. |
Looks good to me if you can run the spot-check. But also open a follow up: we may want to have the TPL variants independent of the INST macros. You see the INST macros don't say whether something is available, they say wether we do ETI. As long as you don't also define ETI only, all the other calls can still be made. And I think it is fine in that case to have the TPL calls available. |
Spot check on white:
|
Once you have a KNL spot check we can merge this. |
Ignore the silly request for KNL, this shouldn't affect any KNL, so merging. |
Added cuBLAS TPL Support for nrm2, axpy, dot. More to come.