Updating Ginkgo common kernels to Kokkos integration #1682
-
Hi David, thanks for letting us know! In case you are just using a plain triangular solver with any specific configuration (
-
The cuSPARSE library used for both runs should be the same, so I assume that the "internal changes" you are referring to are Ginkgo-internal changes.
I looked it up again: my state of the code reports "Ginkgo version 1.5.0", but this means the code is somewhere between 1.4.0 and 1.5.0. However, I couldn't spot any interface changes, and the interface I used compiled and worked for both Ginkgo versions. What I understand from your answer is: the performance drop is plausible and can be explained by internal changes in Ginkgo. I will then replace the cuSPARSE version (from Ginkgo) with a HIP and a CUDA version, as the required code is really minimal. Thanks for the insight!
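For a dense upper-triangular system, that replacement is essentially a single BLAS call. A minimal CUDA sketch (hipBLAS mirrors this one-to-one as `hipblasDtrsv`; error checking omitted):

```cpp
#include <cublas_v2.h>

// Solve the dense upper-triangular system A * x = b in place on the device:
// d_A is the n-by-n column-major matrix, d_x holds b on entry and x on exit.
void dense_upper_trsv(cublasHandle_t handle, int n, const double* d_A,
                      double* d_x)
{
    cublasDtrsv(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                CUBLAS_DIAG_NON_UNIT, n, d_A, /*lda=*/n, d_x, /*incx=*/1);
}
```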
-
Hey there,
since Ginkgo v1.8.0 now has official support for Kokkos data types, we updated our Ginkgo integration and moved away from the experimental setup relying on the Ginkgo-internal kernel launch, towards the official Kokkos integration.
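At its core, the new setup creates the Ginkgo executor from the Kokkos runtime and then writes the custom kernels as plain Kokkos code. A minimal sketch of the entry point (the helper name `create_default_executor` is how I remember the extension's API, so treat the exact spelling as an assumption):

```cpp
#include <Kokkos_Core.hpp>

#include <ginkgo/extensions/kokkos.hpp>
#include <ginkgo/ginkgo.hpp>

int main(int argc, char* argv[])
{
    Kokkos::ScopeGuard kokkos(argc, argv);
    // Create a Ginkgo executor matching Kokkos' default execution space,
    // e.g. a CudaExecutor when Kokkos is built with CUDA (assumed helper).
    auto exec = gko::ext::kokkos::create_default_executor();
    // ... build matrices and solvers on `exec` as before ...
}
```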
I have now taken the time to re-run (parts of) the experiments we published with the experimental interface (with the same setup, i.e., an NVIDIA A100 and CUDA 11.5.4), and I thought it would be interesting to share them here.
Unfortunately, the comparison is not only between the two interfaces, but also between the referenced commit above (~v1.5.0) and the latest Ginkgo release (v1.8.0).
If we look at the matrix assembly in isolation, which is the part now executed by Kokkos, the comparison looks as follows:
comparison-assemblyt.pdf
I.e., for small matrices the Kokkos assembly routine seems to perform slightly better. However, both are very efficient, and the differences really amount to only a few milliseconds.
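For reference, the assembly now boils down to a plain Kokkos `parallel_for` filling Ginkgo's `device_matrix_data`, along the lines of Ginkgo's kokkos-assembly example. Here is a sketch with a 1D three-point stencil standing in for our actual assembly (the mapped member names `row_idxs`/`col_idxs`/`values` are assumptions from memory):

```cpp
#include <Kokkos_Core.hpp>

#include <ginkgo/extensions/kokkos.hpp>
#include <ginkgo/ginkgo.hpp>

// Assemble a 1D three-point stencil as COO-style device data and read it
// into a CSR matrix; entries are sorted, row i > 0 starts at index 3*i - 1.
void assemble(gko::matrix::Csr<double, int>* matrix)
{
    auto exec = matrix->get_executor();
    const auto n = static_cast<int>(matrix->get_size()[0]);
    gko::device_matrix_data<double, int> md(exec, matrix->get_size(),
                                            3 * n - 2);
    auto k_md = gko::ext::kokkos::map_data(md);
    Kokkos::parallel_for(
        "assemble", n, KOKKOS_LAMBDA(int row) {
            auto idx = row > 0 ? 3 * row - 1 : 0;
            if (row > 0) {  // lower off-diagonal entry
                k_md.row_idxs[idx] = row;
                k_md.col_idxs[idx] = row - 1;
                k_md.values[idx] = -1.0;
                ++idx;
            }
            k_md.row_idxs[idx] = row;  // diagonal entry
            k_md.col_idxs[idx] = row;
            k_md.values[idx] = 2.0;
            if (row < n - 1) {  // upper off-diagonal entry
                ++idx;
                k_md.row_idxs[idx] = row;
                k_md.col_idxs[idx] = row + 1;
                k_md.values[idx] = -1.0;
            }
        });
    matrix->read(std::move(md));
}
```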
However, comparing the "solve time" for the QR decomposition with the triangular solver between these two versions gave me the following:
comparison-mapt.pdf
which was a bit surprising, as there was no change from the user perspective between both approaches. After a while, I found that the performance of the triangular solver was significantly different:
comparison-trsvt.pdf
What we do is certainly not the primary use case of Ginkgo: we plug a dense matrix into the `UpperTrs` solver class (which internally converts it to a sparse format) and then solve with it, i.e., the plot above measures the `_triangularSolver->apply(...)` call. So I went ahead and used a triangular solver provided by cuBLAS, intended for dense data structures, which you can see in the plot as well. Of course cuBLAS is more efficient, but until now the performance of the 'sparse' triangular solver from Ginkgo was totally fine. I was wondering: do you have a specific reason in mind for these performance differences, or is there something we should take into account? I skimmed through the changelog, but couldn't find anything noteworthy.

Also: the same comparison for the iterative solvers (CG) is running right now. I can share the results once the run is completed.
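For completeness, here is a sketch of what the plot times on the Ginkgo side (names simplified; our member `_triangularSolver` corresponds to `solver` below, and error handling is omitted):

```cpp
#include <memory>

#include <ginkgo/ginkgo.hpp>

using Dense = gko::matrix::Dense<double>;

// Build an UpperTrs solver from a dense matrix (Ginkgo converts it to a
// sparse format internally during generate) and apply it to b, writing x.
void solve(std::shared_ptr<const gko::Executor> exec,
           std::shared_ptr<Dense> dense_mtx, const Dense* b, Dense* x)
{
    auto solver = gko::solver::UpperTrs<double, int>::build()
                      .on(exec)
                      ->generate(dense_mtx);
    // This apply is what the trsv comparison above measures.
    solver->apply(b, x);
}
```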