
kernel optimized for A100 #25

Open
lisuying214 opened this issue Oct 12, 2024 · 3 comments
@lisuying214

Thank you for the great work and experiments. We want to test the throughput on an A100 with batch size = 16. Have you tried a kernel optimized for the A100, or is there something I can refer to?

@happierpig (Collaborator)

Hi @lisuying214 ,

Thanks for your question! The kernels in Atom are specifically optimized for Ada GPUs. Performance on the A100 will degrade significantly due to the A100's poor CUDA core throughput. I suspect optimizing the dequantization process in the Atom kernel will be crucial for A100 performance. Please refer to this recent work for A100 evaluations (https://arxiv.org/pdf/2405.04532).

@lisuying214 (Author)

@happierpig
Dear author, thanks for your reply and the recommended paper!
However, I checked the A100's CUDA core throughput: the A100's FP32/FP16 throughput is indeed worse than the RTX 4090's, but why not use Tensor Cores for the dequantization process in the Atom kernel? The Tensor Cores in both the A100 and the RTX 4090 support FP32/FP16 and INT8.
Looking forward to your reply, thanks again!

@happierpig (Collaborator)

@lisuying214 ,

As referenced in `__device__ __forceinline__ void dequant(`, the dequantization process in Atom is an element-wise operation: it needs an outer product of scales, which cannot be mapped onto the Tensor Cores.
