This is an implementation of sgemm_kernel on L1d cache.
Test CPU: AMD Zen2 4750G Single Core @ 4.45GHz
Peak performance = 4.45GHz x (8 x 2 + 8 x 2) = 142.4 GFLOPS
Test case:
./sgemm_l1d 24 64
This means to compute a sgemm_kernel(m = 24, k = 64), while n is fixed to 24.
Test output:
sgemm_kernel_x64_fma(24, 24, 64): time = 0.520822 us, perf = 141.560930 GFLOPS. sgemm_naive result: naive.bin sgemm_kernel_x64_fma_m4n24 result: tuned.bin Use fp_diff(https://github.com/pigirons/fp_diff) to compare the results.
Efficiency = 141.56093 / 142.4 = 99.4%
Diff results:
$./fp_diff naive.bin tuned.bin Array Count: 576 Average Absolute Error: 4.387564e-07 Max Absolute Error Position: 102 Max Absolute Error: 3.814697e-06 Average Relative Error: 2.836071e-08 Max Relative Error Position: 102 Max Relative Error: 2.285095e-07 Number of Relative Error(>1e-1): 0 Number of Relative Error(>1e-2): 0 Number of Relative Error(>1e-3): 0 Number of Relative Error(>1e-4): 0 Number of Relative Error(>1e-5): 0