Updating Ginkgo common kernels to Kokkos integration #1682
-
Hi David, thanks for letting us know! In case you are just using a plain triangular solver with any specific configuration (
-
The cuSPARSE library used for both runs should be the same, so I assume that the "internal changes" you are referring to are Ginkgo-internal changes.
I looked it up again: my state of the code reports "Ginkgo version 1.5.0", but this means the code is somewhere between 1.4.0 and 1.5.0. However, I couldn't spot any interface changes, and the interface I used compiled and worked for both Ginkgo versions. What I understand from your answer is: the performance drop is plausible and can be explained by internal changes in Ginkgo. I will then replace the cuSPARSE version (from Ginkgo) with a HIP and a CUDA version, as the required code is really minimal. Thanks for the insight!
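For a dense upper-triangular system, that replacement is essentially a single BLAS call. A minimal CUDA sketch (hipBLAS mirrors this one-to-one as `hipblasDtrsv`; error checking omitted):

```cpp
#include <cublas_v2.h>

// Solve the dense upper-triangular system A * x = b in place on the device:
// d_A is the n-by-n column-major matrix, d_x holds b on entry and x on exit.
void dense_upper_trsv(cublasHandle_t handle, int n, const double* d_A,
                      double* d_x)
{
    cublasDtrsv(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                CUBLAS_DIAG_NON_UNIT, n, d_A, /*lda=*/n, d_x, /*incx=*/1);
}
```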
-
Hey there,
since Ginkgo v1.8.0 now has official support for Kokkos data types, we updated our Ginkgo integration and moved away from the experimental setup relying on the Ginkgo-internal kernel launch, towards the official Kokkos integration.
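At its core, the new setup creates the Ginkgo executor from the Kokkos runtime and then writes the custom kernels as plain Kokkos code. A minimal sketch of the entry point (the helper name `create_default_executor` is how I remember the extension's API, so treat the exact spelling as an assumption):

```cpp
#include <Kokkos_Core.hpp>

#include <ginkgo/extensions/kokkos.hpp>
#include <ginkgo/ginkgo.hpp>

int main(int argc, char* argv[])
{
    Kokkos::ScopeGuard kokkos(argc, argv);
    // Create a Ginkgo executor matching Kokkos' default execution space,
    // e.g. a CudaExecutor when Kokkos is built with CUDA (assumed helper).
    auto exec = gko::ext::kokkos::create_default_executor();
    // ... build matrices and solvers on `exec` as before ...
}
```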
I have now taken the time to re-run (parts of) the experiments we published with the experimental interface (with the same setup, i.e., an NVIDIA A100 and CUDA 11.5.4), and I thought it would be interesting to share them here.
Unfortunately, the comparison is not only between the two interfaces, but also between the referenced commit above (~v1.5.0) and the latest Ginkgo release (v1.8.0).
If we look at the matrix assembly in isolation, which is the part now executed by Kokkos, the comparison looks as follows:
comparison-assemblyt.pdf
I.e., for small matrices the Kokkos assembly routine seems to perform slightly better. However, both are very efficient, and the differences really amount to only a few milliseconds.
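For reference, the assembly now boils down to a plain Kokkos `parallel_for` filling Ginkgo's `device_matrix_data`, along the lines of Ginkgo's kokkos-assembly example. Here is a sketch with a 1D three-point stencil standing in for our actual assembly (the mapped member names `row_idxs`/`col_idxs`/`values` are assumptions from memory):

```cpp
#include <Kokkos_Core.hpp>

#include <ginkgo/extensions/kokkos.hpp>
#include <ginkgo/ginkgo.hpp>

// Assemble a 1D three-point stencil as COO-style device data and read it
// into a CSR matrix; entries are sorted, row i > 0 starts at index 3*i - 1.
void assemble(gko::matrix::Csr<double, int>* matrix)
{
    auto exec = matrix->get_executor();
    const auto n = static_cast<int>(matrix->get_size()[0]);
    gko::device_matrix_data<double, int> md(exec, matrix->get_size(),
                                            3 * n - 2);
    auto k_md = gko::ext::kokkos::map_data(md);
    Kokkos::parallel_for(
        "assemble", n, KOKKOS_LAMBDA(int row) {
            auto idx = row > 0 ? 3 * row - 1 : 0;
            if (row > 0) {  // lower off-diagonal entry
                k_md.row_idxs[idx] = row;
                k_md.col_idxs[idx] = row - 1;
                k_md.values[idx] = -1.0;
                ++idx;
            }
            k_md.row_idxs[idx] = row;  // diagonal entry
            k_md.col_idxs[idx] = row;
            k_md.values[idx] = 2.0;
            if (row < n - 1) {  // upper off-diagonal entry
                ++idx;
                k_md.row_idxs[idx] = row;
                k_md.col_idxs[idx] = row + 1;
                k_md.values[idx] = -1.0;
            }
        });
    matrix->read(std::move(md));
}
```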
However, comparing the "solve time" for the QR decomposition with the triangular solver between these two versions gave me the following:
comparison-mapt.pdf
which was a bit surprising, as there was no change from the user perspective between both approaches. After a while, I found that the performance of the triangular solver was significantly different:
comparison-trsvt.pdf
What we do is certainly not the primary use case of Ginkgo: we plug a dense matrix into the `UpperTrs` solver class (which internally converts it to a sparse format) and then solve with it, i.e., the plot above measures the `_triangularSolver->apply(...)` call. So I went ahead and used a triangular solver provided by cuBLAS, intended for dense data structures, which you can see in the plot as well. Of course cuBLAS is more efficient, but until now the performance of the 'sparse' triangular solver from Ginkgo was totally fine. I was wondering: do you have a specific reason in mind for these performance differences, or is there something we should take into account? I skimmed through the changelog, but couldn't find anything noteworthy.

Also: the same comparison for the iterative solvers (CG) is running right now. I can share the results once the run is completed.
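For completeness, here is a sketch of what the plot times on the Ginkgo side (names simplified; our member `_triangularSolver` corresponds to `solver` below, and error handling is omitted):

```cpp
#include <memory>

#include <ginkgo/ginkgo.hpp>

using Dense = gko::matrix::Dense<double>;

// Build an UpperTrs solver from a dense matrix (Ginkgo converts it to a
// sparse format internally during generate) and apply it to b, writing x.
void solve(std::shared_ptr<const gko::Executor> exec,
           std::shared_ptr<Dense> dense_mtx, const Dense* b, Dense* x)
{
    auto solver = gko::solver::UpperTrs<double, int>::build()
                      .on(exec)
                      ->generate(dense_mtx);
    // This apply is what the trsv comparison above measures.
    solver->apply(b, x);
}
```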