ROCm 6.3
On MI100
Achieves about 90 TFlops. Meanwhile
Achieves about 14 TFlops, which seems a bit low given the throughput of V_PK_FMA_F16, but makes sense as we have V_MFMA_F32_16X16X16F16 while V_MFMA_F16_16X16X16F16 does not exist.
However
is also only about 14 TFlops, which is slower than using f32_r for c_type and d_type and downcasting afterwards. That doesn't make much sense, as rocBLAS should just do that itself.
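For anyone reproducing these numbers: the TFlops figures are assumed here to follow the usual GEMM convention of counting 2*m*n*k floating-point operations per call (one multiply and one add per inner-product term). A minimal sketch of that arithmetic is below. The helper name `gemm_tflops` and the 8192 problem size are hypothetical and only for illustration; they are not the sizes used for the measurements above.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of a single m x n x k GEMM in TFlops.

    Assumes the conventional count of 2*m*n*k floating-point operations.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e12


def gemm_seconds(m: int, n: int, k: int, tflops: float) -> float:
    """Time one m x n x k GEMM would take at the given sustained TFlops."""
    if tflops <= 0:
        raise ValueError("tflops must be positive")
    return 2.0 * m * n * k / (tflops * 1e12)


if __name__ == "__main__":
    size = 8192  # hypothetical square problem size, not taken from the report
    for rate in (90.0, 14.0):  # the two throughput levels reported above
        ms = gemm_seconds(size, size, size, rate) * 1e3
        print(f"{rate:5.1f} TFlops -> {ms:.2f} ms per {size}^3 GEMM")
```

Timing the call yourself and feeding the elapsed seconds to `gemm_tflops` gives a number that can be compared directly with the 90 and 14 TFlops cases.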