Implementing matrix multiplication based on lookup tables #1851
vlasenkoalexey asked this question in Q&A (unanswered)
Replies: 1 comment 2 replies
-
Did you manage to do it? I was interested in it but didn't find anything other than this.
-
The idea is simple: for quantized networks that use int8 or int4 weights, instead of doing the matrix multiplication directly we can replace the per-element products with table lookups. The lookup table would not be too big: for int4xbfloat16 it is just 16x65536 ≈ 1M entries, and 16x that (≈ 16M entries) for int8xbfloat16. There are a couple of publications proposing a similar idea with promising results: https://arxiv.org/pdf/2206.09557.pdf and https://arxiv.org/pdf/2005.09904.pdf
The benefit is that GPUs don't natively support int4xbfloat16 or int8xbfloat16 matmuls, so an efficiently implemented lookup-table kernel might outperform the usual approach of dequantizing the weights and running a regular matmul.
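To make the table size concrete, here is one way the int4xbfloat16 table could be built (a rough PyTorch/NumPy sketch of my understanding, not code from the papers; the int8xbfloat16 case is identical with 256 weight values instead of 16):

```python
import numpy as np
import torch

# All 2^16 possible bfloat16 bit patterns, reinterpreted as bfloat16 values.
bits = np.arange(2**16, dtype=np.uint16)
acts = torch.from_numpy(bits.view(np.int16)).view(torch.bfloat16).to(torch.float32)

# All 16 possible int4 weight values.
weights = torch.arange(-8, 8, dtype=torch.float32)

# lut[w, a] holds the product of weight value w and activation bit pattern a.
lut = weights[:, None] * acts[None, :]   # shape (16, 65536)
# Some bit patterns decode to NaN/Inf; a real kernel would only index those
# entries if the activation itself were NaN/Inf.
```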
Sample code for the int8xint8 case could look roughly like the following NumPy reference (just to illustrate the lookup logic, not an efficient kernel):
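```python
import numpy as np

# Table with one entry per (int8, int8) pair: lut[a + 128, b + 128] = a * b.
vals = np.arange(-128, 128, dtype=np.int32)
lut = np.outer(vals, vals)               # 256 x 256 entries

def lut_matmul(A, B):
    """Multiply int8 matrices A [M, K] and B [K, N] using table lookups only."""
    M, K = A.shape
    _, N = B.shape
    Ai = A.astype(np.int32) + 128        # shift to non-negative table indices
    Bi = B.astype(np.int32) + 128
    C = np.zeros((M, N), dtype=np.int32)
    for i in range(M):
        for j in range(N):
            # K multiplications replaced by K gathers from the table
            C[i, j] = lut[Ai[i, :], Bi[:, j]].sum()
    return C

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(8, 16), dtype=np.int8)
B = rng.integers(-128, 128, size=(16, 4), dtype=np.int8)
assert np.array_equal(lut_matmul(A, B), A.astype(np.int32) @ B.astype(np.int32))
```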
I tried to hack the matmul example, but realized that to make it work I need element-wise access to the tensors (to get A[i,:] and B[:,j] and use their values as table indices), which isn't directly supported in Triton, and I can't figure out how to express this with blocks.
Any suggestions on how to code this in Triton?
Would it be possible to write an efficient implementation for this approach?
Any pointers are appreciated.