Add templated implementations of BCSR matrix operations #293

Draft: A-CGray wants to merge 1 commit into base: master


A-CGray (Contributor) commented Feb 13, 2024

I was looking into implementing transpose multiplication for BCSR matrices when I discovered that the BCSR operations have separate implementations for each possible block size. I think we could get the same performance for most of the defined operations by templating on the block size, which would drastically reduce the amount of duplicated code.

So far I've added templated implementations of the two basic MatVec operations. I'm opening this PR to get @gjkennedy's opinion before I continue converting more functions.
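To make the idea concrete, here is a minimal sketch of a MatVec product templated on the block size. It is not the code in this PR: the name `BlockMatVecMult`, the argument names `rowp` and `cols`, and the row-major storage of each dense block are all assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical sketch, not the implementation in this PR.
// Computes y = A * x for a BCSR matrix with square bsize x bsize blocks.
// Assumed layout: rowp[i]..rowp[i+1] indexes the blocks of block row i,
// cols[jp] is the block column of block jp, and each block is stored
// row-major in A, starting at A[bsize * bsize * jp].
template <int bsize>
void BlockMatVecMult(int nrows, const int *rowp, const int *cols,
                     const double *A, const double *x, double *y) {
  assert(nrows >= 0);
  constexpr int b2 = bsize * bsize;

  for (int i = 0; i < nrows; i++) {
    // Accumulate the result for this block row in a local array.
    double yi[bsize] = {0.0};

    for (int jp = rowp[i]; jp < rowp[i + 1]; jp++) {
      const double *a = &A[b2 * jp];
      const double *xj = &x[bsize * cols[jp]];

      // bsize is a compile-time constant, so the compiler is free to
      // unroll and vectorise these two loops.
      for (int ii = 0; ii < bsize; ii++) {
        for (int jj = 0; jj < bsize; jj++) {
          yi[ii] += a[bsize * ii + jj] * xj[jj];
        }
      }
    }

    for (int ii = 0; ii < bsize; ii++) {
      y[bsize * i + ii] = yi[ii];
    }
  }
}
```

Because the loop bounds are known at compile time, a single template like this can stand in for one handwritten function per block size, for example `BlockMatVecMult<6>(nrows, rowp, cols, A, x, y)`.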

A-CGray requested a review from gjkennedy February 13, 2024 18:56

A-CGray (Contributor, Author) commented Feb 23, 2024

Timing suggests that the templated implementation of the MatVec product is as fast as the handwritten version, and both are faster than the existing generic implementation that uses BLAS.

The timings below are for MatVec products computed on a single core using the stiffness matrix from one of my wingbox cases (~420 kDOF). I'm assuming parallelisation won't change the comparison, since the code being changed only affects the local block operations.

All compiled with EXTRA_CC_FLAGS = -fPIC -O3 -march=core-avx2 -mtune=core-avx2 -Wall

Using generic implementation (BCSRMatVecMult):

Timed for: 278 loops, best of 5
    time per loop: best=17.039 ms, mean=17.388 ± 0.3 ms

Using blocksize=6 specific implementation (BCSRMatVecMult6):

Timed for: 516 loops, best of 5
    time per loop: best=8.406 ms, mean=9.354 ± 1.0 ms

Using templated function (BCSRBlockMatVecMult<6>):

Timed for: 522 loops, best of 5
    time per loop: best=8.784 ms, mean=9.342 ± 0.2 ms
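For anyone who wants to run a similar comparison directly in C++, the sketch below reports the best and mean time per loop in milliseconds. It is not the tool that produced the numbers above; `TimingResult` and `TimeLoops` are hypothetical names, and the callable passed in would wrap whichever MatVec variant is being measured.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper, not part of this PR: best and mean wall-clock time
// per call, in milliseconds.
struct TimingResult {
  double best_ms;
  double mean_ms;
};

// Calls func() nloops times and times each call with a monotonic clock.
template <typename Func>
TimingResult TimeLoops(Func &&func, int nloops) {
  assert(nloops > 0);
  using Clock = std::chrono::steady_clock;

  double best = std::numeric_limits<double>::infinity();
  double total = 0.0;

  for (int i = 0; i < nloops; i++) {
    const auto t0 = Clock::now();
    func();
    const auto t1 = Clock::now();

    const double ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count();
    best = std::min(best, ms);
    total += ms;
  }

  return {best, total / nloops};
}
```

The best time is usually the more stable figure for comparing kernels, since the mean picks up noise from other activity on the machine.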
