Cuda kernels for Lower triangular solve #336

pratikvn · 2019-08-15T15:12:19Z

This PR implements the CUDA kernels for the lower triangular solver.

For CUDA versions <=9.1, one can only use csrsm_solve.

For CUDA versions >=9.2, one can use csrsm2_solve, which is more efficient. Since the previous csrsm_solve is deprecated from 10.1, we unfortunately have to switch between the two using #if defines.
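A minimal sketch of such a compile-time switch, assuming the `CUDA_VERSION` macro from `<cuda.h>` (encoded as major * 1000 + minor * 10, so 9.2 is 9020). The helper names `selected_trs_routine` and `supports_csrsm2` are hypothetical and not taken from this PR, and the fallback definition of `CUDA_VERSION` exists only so the sketch compiles without the CUDA toolkit:

```cpp
#include <cassert>
#include <string>

// CUDA_VERSION normally comes from <cuda.h>. The fallback value below is an
// assumption made only so that this sketch builds without the toolkit.
#ifndef CUDA_VERSION
#define CUDA_VERSION 10010
#endif

// Hypothetical helper: names the cuSPARSE solve routine that the
// preprocessor switch would select for the toolkit being compiled against.
inline std::string selected_trs_routine()
{
#if CUDA_VERSION >= 9020
    return "csrsm2_solve";
#else
    return "csrsm_solve";
#endif
}

// The same threshold as a runtime predicate, which is easier to unit test.
inline bool supports_csrsm2(int cuda_version)
{
    assert(cuda_version > 0);
    return cuda_version >= 9020;
}
```

Keeping the version threshold in one place like this means the rest of the kernel file does not need to repeat the numeric comparison.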

Additionally, CUDA uses different algorithms than the simple algorithms implemented in the reference and omp kernels. Hence I am not sure if it makes sense to compare the kernels as we do for other solvers.
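For context, the kind of simple algorithm meant here can be sketched as a sequential forward substitution on a CSR matrix. This is an illustrative sketch, not the code of the reference kernel in this PR; the function name `lower_trs_csr` and the use of `std::vector` are assumptions, and it presumes every row stores a nonzero diagonal entry:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves L * x = b for a lower triangular matrix L stored in CSR format,
// for a single right-hand side. Entries above the diagonal are ignored.
std::vector<double> lower_trs_csr(const std::vector<int>& row_ptrs,
                                  const std::vector<int>& col_idxs,
                                  const std::vector<double>& values,
                                  const std::vector<double>& b)
{
    const std::size_t n = b.size();
    assert(row_ptrs.size() == n + 1);
    std::vector<double> x(n, 0.0);
    for (std::size_t row = 0; row < n; ++row) {
        double sum = b[row];
        double diag = 0.0;
        for (int k = row_ptrs[row]; k < row_ptrs[row + 1]; ++k) {
            const auto col = static_cast<std::size_t>(col_idxs[k]);
            if (col < row) {
                // x[col] is already final because rows are processed in order
                sum -= values[k] * x[col];
            } else if (col == row) {
                diag = values[k];
            }
        }
        assert(diag != 0.0);  // assumption: the diagonal entry is stored
        x[row] = sum / diag;
    }
    return x;
}
```

Each row depends on all earlier rows, which is why a GPU library would typically reorder the work (for example by dependency levels) instead of following this loop directly.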

Also, I am not sure how to test the generate kernel properly (or if it is even testable), because all it does is allocate and create the information required for the csrsm2_solve calls.

There are additional parameters that can be set, for example the algo and the CUSPARSE_SOLVE_POLICY_USE_LEVEL (see documentation). Technically these should be exposed to the user, as they may want to tweak them, but I don't do that yet as it can complicate the parameter choices. I could add these if required.

Note: Currently, only the single right-hand-side version works. I tried, but I cannot get multiple rhs to work. If someone has an idea of how to solve this, I am happy to discuss. The things that I have tried:

+ Because Cusparse uses col-major rather than row-major (which Ginkgo uses), I tried (as @yhmtsai suggested) to use CUSPARSE_OPERATION_NON_TRANSPOSE for the rhs. But that does not work either. I think maybe this is not a problem, as the single right-hand side seems to work for CUSPARSE_OPERATION_NON_TRANSPOSE and, as you actually pass in both dimensions of the rhs, the cusparse function can figure it out and do the right thing.
+ I also tried transposing the rhs and the sol matrix beforehand and passing them to the cusparse function, now with the non-transpose option, but this has the same problem.

Update, for future reference: multiple RHS now works as expected on CUDA >=9.2. For CUDA versions <=9.1, multiple rhs solves are handled in a loop, with each iteration solving one rhs.
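The loop-based fallback can be pictured roughly as follows. This is a host-side sketch under assumptions, not the CUDA code from this PR: `SingleRhsSolve` and `solve_rhs_one_by_one` are hypothetical names, the single-rhs solver is passed in as a callback, and b is taken to be row-major with a row stride, as in Ginkgo's dense format:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical callback type: solves the triangular system for one rhs.
using SingleRhsSolve =
    std::function<std::vector<double>(const std::vector<double>&)>;

// b is an n x num_rhs row-major matrix with `stride` entries between the
// starts of consecutive rows. Each column is gathered into a contiguous
// vector, solved on its own, and scattered back into the result.
std::vector<double> solve_rhs_one_by_one(const SingleRhsSolve& solve_one,
                                         const std::vector<double>& b,
                                         std::size_t n, std::size_t num_rhs,
                                         std::size_t stride)
{
    assert(stride >= num_rhs);
    assert(n == 0 || b.size() >= (n - 1) * stride + num_rhs);
    std::vector<double> x(b.size(), 0.0);
    std::vector<double> column(n);
    for (std::size_t j = 0; j < num_rhs; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            column[i] = b[i * stride + j];
        }
        const std::vector<double> sol = solve_one(column);
        assert(sol.size() == n);
        for (std::size_t i = 0; i < n; ++i) {
            x[i * stride + j] = sol[i];
        }
    }
    return x;
}
```

The gather and scatter steps are what make a strided, row-major b usable with a solver that expects one contiguous vector at a time.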

codecov · 2019-08-15T15:49:23Z

Codecov Report

Merging #336 into develop will increase coverage by 0.01%.
The diff coverage is 98.62%.

@@             Coverage Diff             @@
##           develop     #336      +/-   ##
===========================================
+ Coverage    98.22%   98.23%   +0.01%     
===========================================
  Files          237      238       +1     
  Lines        17993    18076      +83     
===========================================
+ Hits         17673    17757      +84     
+ Misses         320      319       -1

| Impacted Files | Coverage Δ | |
|---|---|---|
| cuda/base/cusparse_bindings.hpp | `100% <ø> (ø)` | ⬆️ |
| core/test/solver/lower_trs.cpp | `100% <ø> (ø)` | ⬆️ |
| reference/test/solver/lower_trs_kernels.cpp | `100% <100%> (ø)` | ⬆️ |
| reference/test/solver/lower_trs.cpp | `100% <100%> (ø)` | ⬆️ |
| omp/solver/lower_trs_kernels.cpp | `100% <100%> (ø)` | ⬆️ |
| cuda/test/solver/lower_trs_kernels.cpp | `100% <100%> (ø)` | |
| include/ginkgo/core/solver/lower_trs.hpp | `100% <100%> (+9.37%)` | ⬆️ |
| reference/solver/lower_trs_kernels.cpp | `100% <100%> (ø)` | ⬆️ |
| omp/test/solver/lower_trs_kernels.cpp | `100% <100%> (ø)` | ⬆️ |
| core/solver/lower_trs.cpp | `93.54% <88.88%> (-6.46%)` | ⬇️ |
| ... and 2 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 263a455...050c727.

yhmtsai

I think it fails when b has multiple right-hand sides or its stride is not equal to 1, because CUDA's dense matrix is col-major but Ginkgo's is row-major.
A possible approach is to:
+ set trans_B = CUSPARSE_OPERATION_TRANSPOSE for CUDA versions > 9.1, and
+ do the transpose before solving for the other versions.
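The layout mismatch described above can be illustrated with a small host-side conversion. This is only a sketch with a hypothetical helper name (`row_major_to_col_major`); the PR itself would do the equivalent on the device:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies a rows x cols row-major matrix (with `stride` entries between the
// starts of consecutive rows) into a contiguous column-major buffer whose
// leading dimension is `rows`.
std::vector<double> row_major_to_col_major(const std::vector<double>& a,
                                           std::size_t rows, std::size_t cols,
                                           std::size_t stride)
{
    assert(stride >= cols);
    assert(rows == 0 || a.size() >= (rows - 1) * stride + cols);
    std::vector<double> out(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            // entry (i, j): row-major offset i * stride + j,
            // column-major offset j * rows + i
            out[j * rows + i] = a[i * stride + j];
        }
    }
    return out;
}
```

With a single right-hand side and stride 1, both layouts coincide, which is consistent with the single-rhs case working without any transpose.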

tcojean · 2019-08-22T09:21:28Z

If you want both better coverage results and to check your code with the Intel compilers (it should be fine, though), you could rebase onto the latest develop.

cuda/solver/lower_trs_kernels.cu

thoasm

My first scan over this PR. It is not complete, but I found some parts that I would like to be changed.

omp/solver/lower_trs_kernels.cpp

cuda/solver/lower_trs_kernels.cu

core/solver/lower_trs_kernels.hpp

cuda/test/solver/lower_trs_kernels.cpp

include/ginkgo/core/base/executor.hpp

omp/test/solver/lower_trs_kernels.cpp

yhmtsai

LGTM. I only have questions about gtest location.

cuda/test/solver/lower_trs_kernels.cpp

tcojean

I'd like to see a few improvements to the code, but otherwise LGTM.

core/solver/lower_trs_kernels.hpp

cuda/base/cusparse_bindings.hpp

cuda/solver/lower_trs_kernels.cu

cuda/test/solver/lower_trs_kernels.cpp

include/ginkgo/core/solver/lower_trs.hpp

thoasm

Looks good, but I have some comments.
In particular, I am worried about what happens when the number of right-hand sides specified with with_num_rhs is not the same as that of the b you give it.

include/ginkgo/core/solver/lower_trs.hpp

core/solver/lower_trs_kernels.hpp

cuda/base/cusparse_bindings.hpp

cuda/solver/lower_trs_kernels.cu

cuda/test/solver/lower_trs_kernels.cpp

core/solver/lower_trs_kernels.hpp

reference/test/solver/lower_trs.cpp

reference/test/solver/lower_trs_kernels.cpp

thoasm

One unused function, and some comments on documentation.
Looks good!

core/test/solver/lower_trs.cpp

core/solver/lower_trs.cpp

core/solver/lower_trs_kernels.hpp

cuda/base/cusparse_bindings.hpp

cuda/solver/lower_trs_kernels.cu

include/ginkgo/core/solver/lower_trs.hpp

omp/solver/lower_trs_kernels.cpp

reference/solver/lower_trs_kernels.cpp

reference/test/solver/lower_trs.cpp

cuda/test/solver/lower_trs_kernels.cpp

cuda/solver/lower_trs_kernels.cu

thoasm

Small comments.

cuda/solver/lower_trs_kernels.cu

include/ginkgo/core/solver/lower_trs.hpp

tcojean

LGTM.

thoasm

LGTM!

The Ginkgo team is proud to announce the new minor release of Ginkgo version 1.1.0. This release brings several performance improvements, adds Windows support, adds support for factorizations inside Ginkgo and a new ILU preconditioner based on the ParILU algorithm, among other things. For detailed information, check the respective issue.

Supported systems and requirements:
+ For all platforms, cmake 3.9+
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, 8.1+
  + clang: 3.9+
  + Intel compiler: 2017+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, 8.1+
  + Microsoft Visual Studio: VS 2017 15.7+
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin

The current known issues can be found in the [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues).

### Additions
+ Upper and lower triangular solvers ([#327](#327), [#336](#336), [#341](#341), [#342](#342))
+ New factorization support in Ginkgo, and addition of the ParILU algorithm ([#305](#305), [#315](#315), [#319](#319), [#324](#324))
+ New ILU preconditioner ([#348](#348), [#353](#353))
+ Windows MinGW and Cygwin support ([#347](#347))
+ Windows Visual Studio support ([#351](#351))
+ New example showing how to use ParILU as a preconditioner ([#358](#358))
+ New example on using loggers for debugging ([#360](#360))
+ Add two new 9pt and 27pt stencil examples ([#300](#300), [#306](#306))
+ Allow benchmarking CuSPARSE spmv formats through Ginkgo's benchmarks ([#303](#303))
+ New benchmark for sparse matrix format conversions ([#312](https://github.com/ginkgo-project/ginkgo/issues/312), [#317](https://github.com/ginkgo-project/ginkgo/issues/317))
+ Add conversions between CSR and Hybrid formats ([#302](#302), [#310](#310))
+ Support for sorting rows in the CSR format by column indices ([#322](#322))
+ Addition of a CUDA COO SpMM kernel for improved performance ([#345](#345))
+ Addition of a LinOp to handle perturbations of the form (identity + scalar * basis * projector) ([#334](#334))
+ New sparsity matrix representation format with Reference and OpenMP kernels ([#349](#349), [#350](#350))

### Fixes
+ Accelerate GMRES solver for CUDA executor ([#363](#363))
+ Fix BiCGSTAB solver convergence ([#359](#359))
+ Fix CGS logging by reporting the residual for every sub iteration ([#328](#328))
+ Fix CSR,Dense->Sellp conversion's memory access violation ([#295](#295))
+ Accelerate CSR->Ell,Hybrid conversions on CUDA ([#313](#313), [#318](#318))
+ Fixed slowdown of COO SpMV on OpenMP ([#340](#340))
+ Fix gcc 6.4.0 internal compiler error ([#316](#316))
+ Fix compilation issue on Apple clang++ 10 ([#322](#322))
+ Make Ginkgo able to compile on Intel 2017 and above ([#337](#337))
+ Make the benchmarks spmv/solver use the same matrix formats ([#366](#366))
+ Fix self-written isfinite function ([#348](#348))
+ Fix Jacobi issues shown by cuda-memcheck

### Tools and ecosystem improvements
+ Multiple improvements to the CI system and tools ([#296](#296), [#311](#311), [#365](#365))
+ Multiple improvements to the Ginkgo containers ([#328](#328), [#361](#361))
+ Add sonarqube analysis to Ginkgo ([#304](#304), [#308](#308), [#309](#309))
+ Add clang-tidy and iwyu support to Ginkgo ([#298](#298))
+ Improve Ginkgo's support of xSDK M12 policy by adding the `TPL_` arguments to CMake ([#300](#300))
+ Add support for the xSDK R7 policy ([#325](#325))
+ Fix examples in html documentation ([#367](#367))

Related PR: #370

pratikvn requested review from thoasm, yhmtsai, hartwiganzt and tcojean August 15, 2019 15:13

pratikvn self-assigned this Aug 15, 2019

pratikvn added the mod:cuda, is:new-feature, type:solver and 1:ST:WIP labels Aug 15, 2019

pratikvn force-pushed the trs-cuda-kernels branch 3 times, most recently from 012ceda to 6558a53 Compare August 16, 2019 09:54

pratikvn added and removed the mod:cuda label Aug 19, 2019

yhmtsai reviewed Aug 19, 2019

pratikvn added the 1:ST:ready-for-review label and removed the 1:ST:WIP label Aug 19, 2019

pratikvn force-pushed the trs-cuda-kernels branch from 22fd1db to 213d3ae Compare August 19, 2019 13:26

pratikvn added the 1:ST:WIP label and removed the 1:ST:ready-for-review label Aug 20, 2019

pratikvn force-pushed the trs-cuda-kernels branch 2 times, most recently from d1a715d to a271679 Compare August 23, 2019 12:58

yhmtsai previously requested changes Aug 26, 2019

pratikvn added the 1:ST:ready-for-review label and removed the 1:ST:WIP label Aug 27, 2019

thoasm previously requested changes Aug 27, 2019

pratikvn force-pushed the trs-cuda-kernels branch 2 times, most recently from ed252d5 to 7992508 Compare August 29, 2019 10:30

pratikvn force-pushed the trs-cuda-kernels branch from b529557 to 11b7867 Compare September 2, 2019 08:46

yhmtsai requested changes Sep 2, 2019

pratikvn force-pushed the trs-cuda-kernels branch from 11b7867 to 36f475d Compare September 2, 2019 13:45

Move the cusparse struct to LowerTrs class.

d8d1bb8

+ Add a kernel to create and destroy the struct.
+ Remove the now not-needed clear kernel.
+ Add an additional transposability check to allocate memory for the temp transpose vector only if needed.

pratikvn force-pushed the trs-cuda-kernels branch from 36f475d to d8d1bb8 Compare September 2, 2019 13:55

thoasm mentioned this pull request Sep 3, 2019

Added ILU preconditioner #338

Merged

7 tasks

tcojean mentioned this pull request Sep 3, 2019

Initial code for the HIP executor #344

Merged

12 tasks

yhmtsai approved these changes Sep 3, 2019

tcojean requested changes Sep 3, 2019

Move the SolveStruct creation and destr to the class constr, destr.

ec12e2f

+ Review update. Improve the ifdef checking.

tcojean reviewed Sep 4, 2019

thoasm requested changes Sep 4, 2019

pratikvn force-pushed the trs-cuda-kernels branch 3 times, most recently from fdb1978 to 0b012e7 Compare September 5, 2019 12:23

Move the LowerTrs core tests back to core/test. Review updates.

90b6f0a

pratikvn force-pushed the trs-cuda-kernels branch from 0b012e7 to 90b6f0a Compare September 5, 2019 12:24

thoasm reviewed Sep 5, 2019

yhmtsai reviewed Sep 5, 2019

thoasm reviewed Sep 6, 2019

tcojean self-requested a review September 6, 2019 10:04

tcojean approved these changes Sep 6, 2019

pratikvn force-pushed the trs-cuda-kernels branch from f1fceba to 7339ce5 Compare September 6, 2019 10:10

Review updates.

050c727

+ Fix the SolveStruct namespace clarification.
+ Add a proper free for the workspace.
+ Some doc clarifications.

pratikvn force-pushed the trs-cuda-kernels branch from 7339ce5 to 050c727 Compare September 6, 2019 11:51

thoasm approved these changes Sep 6, 2019

pratikvn merged commit 6d420ff into develop Sep 6, 2019

pratikvn deleted the trs-cuda-kernels branch September 6, 2019 13:24
