[QST] Hopper mixed precision gemm always worse than FP8 #1549
Comments
Could you share more info on what exact C++ kernel is being picked in both cases? You may have to pick a custom tile size instead of what the builder provides by default; the default ones are more optimized for compute-bound cases.
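For illustration, a minimal sketch (not the poster's actual kernel) of passing an explicit CTA tile to the CUTLASS 3.x CollectiveBuilder instead of relying on its defaults; the element types, layouts, alignments, and the 128x16x64 tile below are assumptions for the example only:

```cpp
#include "cutlass/cutlass.h"
#include "cutlass/gemm/collective/collective_builder.hpp"

// Explicit M x N x K CTA tile instead of the builder's compute-bound default.
using TileShape    = cute::Shape<cute::_128, cute::_16, cute::_64>;
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor,    8,   // A: type / layout / alignment (assumed)
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,   // B: type / layout / alignment (assumed)
    float,                                               // accumulator
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;
```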
@IonThruster full code is here. I've played with tiles; this is the best config.
@IonThruster for the FP8 version, you can just look at my old post #1139
@divchenko This behavior is expected with the current implementation. I have not done a deep dive into the performance, but I have a theory that may explain the behavior you observe. In a compute-bound case we typically have a large MMA tile, while in the memory-bound case the tiles are much smaller, as in your example. My theory is that the conversion cost is exposed in the memory-bound case. DISCLAIMER: I don't have data supporting what I've said above. It could be completely wrong; it is just a hunch :)
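To make the "conversion cost" concrete, here is a rough sketch (assumed packing and scaling scheme, not CUTLASS's actual converter) of the per-element work a mixed-input mainloop does before each MMA: unpacking int4 values and converting them to half with a group scale. With a tiny N tile there is little tensor-core math to hide this latency behind:

```cpp
#include <cuda_fp16.h>
#include <cstdint>

// Dequantize two signed 4-bit values packed into one byte into a half2,
// applying a per-group scale. "Low nibble first" is an assumed layout.
__device__ inline __half2 dequant_int4x2(uint8_t packed, __half scale) {
  int lo = packed & 0x0F;
  if (lo > 7) lo -= 16;          // sign-extend the low nibble (two's complement)
  int hi = packed >> 4;
  if (hi > 7) hi -= 16;          // sign-extend the high nibble
  __half2 v = __halves2half2(__int2half_rn(lo), __int2half_rn(hi));
  return __hmul2(v, __half2half2(scale));   // apply the scaling-group scale
}
```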
Thanks @rawnhenry. The memory-bound case for FP8 (where I have 64x16x256 tiles) actually works quite well, reaching close to 60% of memory bandwidth. It's the mixed precision case with the 128x16x64 tile (the K tile is restricted to at most 64, the scaling group size) that doesn't work well.
I'm doing an A (4-bit) x B (FP16) matmul with a large A and a small B. I expect it to beat an FP8 matmul, since it should be memory-bound.
In reality, it seems to always be worse.
Example:
Kernel code is here: https://gist.github.com/divchenko/9b02f40ae109e8dc8549afbde059d32e
It's called from Python:
The best performance I can get is with the stream-K scheduler (K is indeed large), but it's still very low on memory bandwidth (~20%).
The persistent tile scheduler is way worse for both the TMA and TMA cooperative kernel schedules.
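As a reference point, a sketch of where the tile scheduler is selected in CUTLASS 3.x: it is the last template argument of `cutlass::gemm::kernel::GemmUniversal`, so the stream-K vs. persistent comparison above amounts to swapping that tag. `CollectiveMainloop` and `CollectiveEpilogue` stand in for collectives built elsewhere (e.g., with the CollectiveBuilder):

```cpp
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

using ProblemShape = cute::Shape<int, int, int, int>;   // (M, N, K, L)

// Stream-K tile scheduler (the variant that worked best here).
using GemmKernelStreamK = cutlass::gemm::kernel::GemmUniversal<
    ProblemShape, CollectiveMainloop, CollectiveEpilogue,
    cutlass::gemm::StreamKScheduler>;

// Default persistent tile scheduler, for comparison.
using GemmKernelPersistent = cutlass::gemm::kernel::GemmUniversal<
    ProblemShape, CollectiveMainloop, CollectiveEpilogue,
    cutlass::gemm::PersistentScheduler>;

using GemmStreamK = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelStreamK>;
```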
The FP8 implementation can reach ~60% of memory bandwidth and hence is faster, even though it reads ~2x more bytes.
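A quick back-of-the-envelope with the numbers above, assuming A traffic dominates (per "large A, small B"), shows how the bandwidth-utilization gap outweighs the 2x byte savings:

```cpp
// time ~ bytes_moved / achieved_bandwidth (peak bandwidth normalized to 1).
constexpr double bytes_per_A_elem_int4 = 0.5;   // 4-bit A
constexpr double bytes_per_A_elem_fp8  = 1.0;   // 8-bit A
constexpr double bw_frac_int4 = 0.20;           // ~20% of peak observed (mixed dtype)
constexpr double bw_frac_fp8  = 0.60;           // ~60% of peak observed (FP8)

constexpr double t_int4 = bytes_per_A_elem_int4 / bw_frac_int4;  // 2.5
constexpr double t_fp8  = bytes_per_A_elem_fp8  / bw_frac_fp8;   // ~1.67
// => FP8 ends up roughly 1.5x faster despite reading ~2x the A bytes.
```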
Am I missing anything? Thank you!