llamafile_sgemm API - INT8 implementation #10912
Conversation
(force-pushed 85c5280 → d70f5fc)
Hi @ggerganov,
We will need to merge #10714 first, since there may be some conflicts.
Sure. Please let me know once that PR is merged. I will fix any conflicts in mine and resubmit my PR.
I'll try to submit it today. 🤞 |
(force-pushed d70f5fc → a8d3700)
I have made the changes suggested by @Djip007 and pushed them. @slaren / @Djip007 / @ggerganov Please review the changes.
These are quick personal comments; wait for @slaren / @ggerganov before making changes. Also, I read it very quickly.
@amritahs-ibm
(force-pushed dc23ee5 → 4147962)
All comments are addressed except for the last MMA one. The updated patch has been committed.
@Djip007 Also, in the case of PowerPC's MMA for the int8 data type, the MMA engine requires the data to be packed in a different way. So I came up with a specific function for int8 (i.e., packNormal) to do the packing. Please find below the MMA guide:
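To illustrate the kind of repacking the comment above describes, here is a minimal scalar sketch (the function name and tile shape are illustrative assumptions, not the PR's actual `packNormal` code). POWER10's MMA int8 instructions consume rank-4 updates, so each 16-byte operand holds a 4-row × 4-element int8 sub-tile; this routine rearranges a row-major int8 tile into that 4×4-interleaved order:

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical sketch: repack a row-major int8 tile (rows x k, leading
 * dimension lda) into 4x4 blocks, so that each group of 16 consecutive
 * bytes holds 4 rows x 4 consecutive K elements. rows and k are assumed
 * to be multiples of 4 for simplicity; real kernels handle remainders. */
void pack_int8_4x4(const int8_t *src, int lda, int rows, int k, int8_t *dst) {
    for (int r0 = 0; r0 < rows; r0 += 4) {         /* 4-row panel   */
        for (int k0 = 0; k0 < k; k0 += 4) {        /* 4-wide K step */
            for (int r = 0; r < 4; ++r) {
                for (int kk = 0; kk < 4; ++kk) {
                    *dst++ = src[(r0 + r) * lda + k0 + kk];
                }
            }
        }
    }
}
```

With this layout, a kernel can feed each 16-byte group straight into one MMA operand register instead of gathering strided bytes in the inner loop.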
@Djip007 Could you please help me on my latest comment?
This change upstreams llamafile's CPU matrix multiplication kernels for ppc64le using MMA builtins for the quantised int8 datatype. This change results in a 10%-70% improvement in total speed (i.e., all tokens / total time) across various batch sizes. The patch is tested with Meta-Llama-3-8B, Mistral-7B, and Llama-2-7B-chat-hf models on an IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
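For readers unfamiliar with the MMA builtins the commit message refers to: the hardware performs rank-4 int8 outer-product accumulates into 4×4 int32 accumulator tiles. A scalar emulation of that per-instruction step looks roughly like the following (a sketch of the arithmetic only; the real instructions have specific signedness rules and operate on vector-pair registers, which are not modeled here):

```c
#include <stdint.h>
#include <assert.h>

/* Scalar emulation of one rank-4 int8 outer-product accumulate:
 * acc[i][j] += sum over r of a[i][r] * b[j][r], for a 4x4 tile.
 * This is the per-instruction building block an MMA int8 GEMM
 * kernel repeats along the K dimension. */
void ger_rank4_i8(int32_t acc[4][4], const int8_t a[4][4], const int8_t b[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int r = 0; r < 4; ++r)
                acc[i][j] += (int32_t)a[i][r] * (int32_t)b[j][r];
}
```

A full kernel keeps several such accumulator tiles live, steps K four elements at a time over packed operands, and writes the int32 accumulators out (scaled back to the model's dequantised domain) once the K loop finishes.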
I have addressed all the comments except for the MMA repacking one. As it is an optimization over the current code, I can take it up in a follow-on patch.