
[Bug]: gemm_ex performance with a_type,b_type,c_type,d_type f16_r and compute_type f32_r #1543

Open
IMbackK opened this issue Jan 25, 2025 · 0 comments


IMbackK commented Jan 25, 2025

ROCm 6.3

On an MI100,

rocblas-bench -f gemm_ex --transposeA T --transposeB N -m 4096 -n 512 -k 4096 --alpha 1 --a_type f16_r --lda 4096 --b_type f16_r --ldb 4096 --beta 0 --c_type f32_r --ldc 4096 --d_type f32_r --ldd 4096 --compute_type f32_r

achieves about 90 TFLOPS. Meanwhile,

rocblas-bench -f gemm_ex --transposeA T --transposeB N -m 4096 -n 512 -k 4096 --alpha 1 --a_type f16_r --lda 4096 --b_type f16_r --ldb 4096 --beta 0 --c_type f16_r --ldc 4096 --d_type f16_r --ldd 4096 --compute_type f16_r

achieves only about 14 TFLOPS which, while seeming a bit low given the throughput of V_PK_FMA_F16, makes sense: the MI100 has V_MFMA_F32_16X16X16F16, but V_MFMA_F16_16X16X16F16 does not exist.

However

rocblas-bench -f gemm_ex --transposeA T --transposeB N -m 4096 -n 512 -k 4096 --alpha 1 --a_type f16_r --lda 4096 --b_type f16_r --ldb 4096 --beta 0 --c_type f16_r --ldc 4096 --d_type f16_r --ldd 4096 --compute_type f32_r

also achieves only about 14 TFLOPS. That is slower than requesting c_type/d_type f32_r and downcasting afterwards, which doesn't make much sense, as rocBLAS should just do that internally.
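For illustration, here is a minimal NumPy sketch of the workaround implied above: run the GEMM with f32 output, then downcast to f16 in a separate pass. NumPy stands in for the rocblas_gemm_ex call, and both function names are hypothetical, not part of any rocBLAS API.

```python
import numpy as np

def gemm_f16_in_f32_out(a, b):
    # f16 inputs with f32 accumulation and f32 output
    # (the fast ~90 TFLOPS configuration from the first benchmark)
    return np.matmul(a.astype(np.float32), b.astype(np.float32))

def gemm_f16_out_via_downcast(a, b):
    # Same GEMM followed by a separate downcast to f16 -- reportedly
    # faster end-to-end than asking for d_type f16_r directly.
    return gemm_f16_in_f32_out(a, b).astype(np.float16)

a = np.random.rand(64, 32).astype(np.float16)
b = np.random.rand(32, 48).astype(np.float16)
d = gemm_f16_out_via_downcast(a, b)
print(d.dtype, d.shape)  # float16 (64, 48)
```

The point of the report is that rocBLAS could perform this f32-compute-then-downcast sequence itself whenever d_type is f16_r and compute_type is f32_r, rather than falling onto the slow path.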
