
[Bug]: gemm_ex performance with a_type,b_type,c_type,d_type f16_r and compute_type f32_r #1543

Open
IMbackK opened this issue Jan 25, 2025 · 0 comments


IMbackK commented Jan 25, 2025

ROCm 6.3

On an MI100,

rocblas-bench -f gemm_ex --transposeA T --transposeB N -m 4096 -n 512 -k 4096 --alpha 1 --a_type f16_r --lda 4096 --b_type f16_r --ldb 4096 --beta 0 --c_type f32_r --ldc 4096 --d_type f32_r --ldd 4096 --compute_type f32_r

achieves about 90 TFLOPS. Meanwhile,

rocblas-bench -f gemm_ex --transposeA T --transposeB N -m 4096 -n 512 -k 4096 --alpha 1 --a_type f16_r --lda 4096 --b_type f16_r --ldb 4096 --beta 0 --c_type f16_r --ldc 4096 --d_type f16_r --ldd 4096 --compute_type f16_r

achieves only about 14 TFLOPS which, while seeming a bit low given the throughput of V_PK_FMA_F16, makes sense: the MI100 has V_MFMA_F32_16X16X16F16, but V_MFMA_F16_16X16X16F16 does not exist.

However

rocblas-bench -f gemm_ex --transposeA T --transposeB N -m 4096 -n 512 -k 4096 --alpha 1 --a_type f16_r --lda 4096 --b_type f16_r --ldb 4096 --beta 0 --c_type f16_r --ldc 4096 --d_type f16_r --ldd 4096 --compute_type f32_r

also achieves only about 14 TFLOPS. That is slower than requesting c_type/d_type f32_r and downcasting afterwards, which doesn't make much sense, as rocBLAS should just do that internally.
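For illustration, here is a minimal NumPy sketch of the workaround implied above: run the GEMM with f32 output, then downcast to f16 in a separate pass. NumPy stands in for the rocblas_gemm_ex call, and both function names are hypothetical, not part of any rocBLAS API.

```python
import numpy as np

def gemm_f16_in_f32_out(a, b):
    # f16 inputs with f32 accumulation and f32 output
    # (the fast ~90 TFLOPS configuration from the first benchmark)
    return np.matmul(a.astype(np.float32), b.astype(np.float32))

def gemm_f16_out_via_downcast(a, b):
    # Same GEMM followed by a separate downcast to f16 -- reportedly
    # faster end-to-end than asking for d_type f16_r directly.
    return gemm_f16_in_f32_out(a, b).astype(np.float16)

a = np.random.rand(64, 32).astype(np.float16)
b = np.random.rand(32, 48).astype(np.float16)
d = gemm_f16_out_via_downcast(a, b)
print(d.dtype, d.shape)  # float16 (64, 48)
```

The point of the report is that rocBLAS could perform this f32-compute-then-downcast sequence itself whenever d_type is f16_r and compute_type is f32_r, rather than falling onto the slow path.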
