
kernel optimized for A100 #25

Open
lisuying214 opened this issue Oct 12, 2024 · 3 comments
@lisuying214

Thank you for the great work and experiments. We want to test the throughput on an A100 with batch size = 16. Have you tried a kernel optimized for the A100, or is there something I can refer to?

@happierpig (Collaborator)

Hi @lisuying214 ,

Thanks for your question! The kernels in Atom are specifically optimized for Ada GPUs. Performance on the A100 will degrade significantly due to the A100's poor CUDA core throughput. I suspect optimizing the dequantization process in the Atom kernel will be crucial for A100 performance. Please refer to this recent work for A100 evaluations (https://arxiv.org/pdf/2405.04532).

@lisuying214 (Author)

@happierpig
Dear author, thanks for your reply and the recommended paper!
However, I checked the A100's CUDA core throughput: the A100's FP32/FP16 throughput is indeed worse than the RTX 4090's, but why not use Tensor Cores for the dequantization process in the Atom kernel? The Tensor Cores in both the A100 and the RTX 4090 support FP32/FP16 and INT8.
Looking forward to your reply, thanks again!

@happierpig (Collaborator)

@lisuying214 ,

As referenced in `__device__ __forceinline__ void dequant(`, the dequantization process in Atom is an element-wise operation: it needs an outer product of scales, which cannot be mapped onto the Tensor Cores.
