Performance suboptimal for small matrices #29

xor2k · 2025-02-15T09:59:20Z

Dear AOCL Team,

I'm currently working on improving NumPy's matmul for the strided case, and I ran a large grid search over different BLAS frameworks; see

numpy/numpy#23752 (comment)

Here is a repost of the plots:

blas_benchmark_v2.pdf

The plots show the performance improvement of each BLAS framework, including the cost of copying, over naïve matrix multiplication.

AOCL is based on BLIS. For the case n=100, AOCL clearly provides a substantial improvement over BLIS (see the purple shimmer). However, that is not the case for smaller matrices. Some countermeasures appear to have been taken, and they left a triangular pattern in the performance chart.

I wonder whether these plots could help improve performance for smaller matrices. If you are interested, I can run more benchmarks, produce more plots like these, and provide some code.
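For readers who want to reproduce this kind of comparison, here is a minimal sketch. It is not the code behind the plots. It times `a @ b` on strided views against the same product computed after copying both operands to contiguous memory, which is roughly the "BLAS plus copying" variant described above. The helper names (`strided_operands`, `matmul_direct`, `matmul_with_copy`, `speedup`), the stride step of 2, and the sizes are hypothetical choices for illustration. Whether NumPy dispatches the strided product to BLAS or to its own fallback loop depends on the NumPy version and the strides, so the ratio is only meaningful for one specific build.

```python
import timeit

import numpy as np


def strided_operands(n, step=2, dtype=np.float64, seed=0):
    """Return two n x n non-contiguous views (hypothetical test setup)."""
    rng = np.random.default_rng(seed)
    base_a = rng.standard_normal((n * step, n * step)).astype(dtype)
    base_b = rng.standard_normal((n * step, n * step)).astype(dtype)
    # Taking every `step`-th row and column yields views with non-unit strides.
    return base_a[::step, ::step], base_b[::step, ::step]


def matmul_direct(a, b):
    """Multiply the strided views as they are."""
    return a @ b


def matmul_with_copy(a, b):
    """Copy both operands to contiguous memory first, then multiply."""
    return np.ascontiguousarray(a) @ np.ascontiguousarray(b)


def speedup(n, step=2, dtype=np.float64, repeats=5, number=20):
    """Ratio of direct time to copy-then-multiply time (>1 means copying wins)."""
    a, b = strided_operands(n, step=step, dtype=dtype)
    t_direct = min(timeit.repeat(lambda: matmul_direct(a, b),
                                 repeat=repeats, number=number))
    t_copy = min(timeit.repeat(lambda: matmul_with_copy(a, b),
                               repeat=repeats, number=number))
    return t_direct / t_copy


if __name__ == "__main__":
    # Sizes and precisions are illustrative; a real grid search would cover more.
    for dtype in (np.float32, np.float64):
        for n in (4, 16, 100):
            print(np.dtype(dtype).name, n, speedup(n, dtype=dtype))
```

Running the script against NumPy builds linked to different BLAS libraries, and reporting the precision alongside each result, gives one data point per library for every matrix size.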

Best from Berlin, Michael

kvaragan · 2025-02-17T07:01:43Z

Hi Michael,
Thanks for sharing this data. It is very useful.
Which data precision are we dealing with: single or double?
Which version of AOCL did you try?
