
llamafile_sgemm API - INT8 implementation #10912

Merged
1 commit merged into ggerganov:master from sgemm_q8 on Jan 8, 2025

Conversation

amritahs-ibm
Contributor

@amritahs-ibm amritahs-ibm commented Dec 20, 2024

This change upstreams llamafile's CPU matrix multiplication kernels for ppc64le, using MMA builtins for the quantised int8 data type.

This change results in a 10%-70% improvement in total speed (i.e. all tokens / total time) across various batch sizes.

The patch is tested with the Meta-Llama-3-8B, Mistral-7B and Llama-2-7B-chat-hf models on an IBM POWER10 machine.
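
For context, below is a minimal scalar reference of what such int8 kernels compute for Q8_0-style data. The block layout mirrors ggml's block_q8_0 (32 int8 quants plus a per-block scale; ggml stores the scale as fp16, shown as float here for simplicity), and the function name is purely illustrative, not the actual kernel code:

```cpp
// Illustrative scalar reference only; the real kernels tile this loop and
// feed 4x4 output tiles to the POWER10 MMA accumulators.
#include <cstdint>

constexpr int QK8_0 = 32;          // quants per block, as in ggml

struct block_q8_0 {
    float  d;                      // per-block scale (fp16 in ggml)
    int8_t qs[QK8_0];              // quantized values
};

// C[m][n] = sum_k A[m][k] * B[n][k], with A and B given as rows of Q8_0 blocks
static void gemm_q8_ref(int M, int N, int K_blocks,
                        const block_q8_0 *A, const block_q8_0 *B, float *C) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float sum = 0.0f;
            for (int kb = 0; kb < K_blocks; ++kb) {
                const block_q8_0 &a = A[m * K_blocks + kb];
                const block_q8_0 &b = B[n * K_blocks + kb];
                int32_t dot = 0;   // int8 products accumulated in int32
                for (int i = 0; i < QK8_0; ++i) {
                    dot += int32_t(a.qs[i]) * int32_t(b.qs[i]);
                }
                sum += a.d * b.d * float(dot);
            }
            C[m * N + n] = sum;    // row/column-major details omitted
        }
    }
}
```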

@github-actions github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 20, 2024
@amritahs-ibm amritahs-ibm force-pushed the sgemm_q8 branch 2 times, most recently from 85c5280 to d70f5fc on December 20, 2024 06:22
@amritahs-ibm
Contributor Author

Hi @ggerganov,
Could you please help review this PR, or suggest any actions required from me to get this patch reviewed?

@slaren
Collaborator

slaren commented Dec 23, 2024

We will need to merge #10714 first, since there may be some conflicts.

@amritahs-ibm
Contributor Author

Sure. Please let me know once that PR is merged. I will fix any conflicts in mine and resubmit my PR.

@Djip007
Contributor

Djip007 commented Dec 24, 2024

I'll try to submit it today. 🤞

@amritahs-ibm
Contributor Author

I have made the changes suggested by @Djip007 and pushed them. @slaren / @Djip007 / @ggerganov, please review the changes.

Contributor

@Djip007 Djip007 left a comment

These are quick personal comments; wait for @slaren / @ggerganov before making changes. Also, I read it very quickly.

5 review comments on ggml/src/ggml-cpu/llamafile/sgemm.cpp (outdated, resolved)
@Djip007
Contributor

Djip007 commented Dec 28, 2024

@amritahs-ibm
I didn't realize that MMA was "Matrix Multiply Accelerate".
Have you seen what was done with the amx or "aarch64" kernels? There is now a "simple" framework for creating kernels that can "repack" the weights so they fit the structure needed for such matrix ops.
That way A is repacked once at load time, and only B needs to be repacked at runtime.
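
A minimal sketch of that repack-once-at-load idea is below; the names (PackedWeights, repack_weights_for_mma) and the 4x4 tile size are hypothetical, not the actual amx/aarch64 interfaces:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical packed-weight container: weights are re-laid-out once when the
// model is loaded so each MMA tile reads contiguous memory at runtime.
struct PackedWeights {
    int rows, cols;
    std::vector<int8_t> data;      // interleaved 4x4 tiles
};

// Repack a row-major int8 weight matrix into 4x4 tiles (load time, done once).
PackedWeights repack_weights_for_mma(const int8_t *w, int rows, int cols) {
    PackedWeights p{rows, cols, std::vector<int8_t>(size_t(rows) * cols)};
    size_t out = 0;
    for (int r0 = 0; r0 < rows; r0 += 4) {         // assumes rows % 4 == 0
        for (int c0 = 0; c0 < cols; c0 += 4) {     // assumes cols % 4 == 0
            for (int r = r0; r < r0 + 4; ++r) {
                for (int c = c0; c < c0 + 4; ++c) {
                    p.data[out++] = w[size_t(r) * cols + c];
                }
            }
        }
    }
    return p;
}
// At runtime only the activations (B) still need per-call packing; the kernel
// streams p.data tile by tile straight into the MMA accumulators.
```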

@amritahs-ibm amritahs-ibm force-pushed the sgemm_q8 branch 2 times, most recently from dc23ee5 to 4147962 on December 30, 2024 07:27
@amritahs-ibm
Contributor Author

All comments are addressed except for the last MMA one. The updated patch has been committed.
I will look into the MMA comment and get back to you.

@amritahs-ibm
Contributor Author

amritahs-ibm commented Jan 2, 2025

@amritahs-ibm I didn't realize that MMA was "Matrix Multiply Accelerate". Have you seen what was done with the amx or "aarch64" kernels? There is now a "simple" framework for creating kernels that can "repack" the weights so they fit the structure needed for such matrix ops. That way A is repacked once at load time, and only B needs to be repacked at runtime.

@Djip007
Are you referring to the gemm4xN and gemmMx4 functions in tinyBLAS_Q0_AVX?

Also, in the case of PowerPC's MMA for the int8 data type, the MMA engine requires the data to be packed in a different way, so I came up with a specific function for int8 (i.e. packNormal) to do the packing.

Please find below the MMA guide:
https://www.redbooks.ibm.com/redpapers/pdfs/redp5612.pdf
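
For reference, here is a minimal sketch of how the POWER10 MMA int8 builtins are typically driven with GCC's __builtin_mma_* interface (compiled with -mcpu=power10). The function name and packing assumptions are illustrative only, not the actual sgemm.cpp code:

```cpp
// Illustrative only; the real kernels pack operands via packNormal before
// feeding the accumulators.
#include <altivec.h>
#include <cstdint>

typedef vector unsigned char vec_t;   // 16 bytes per VSX register

// Accumulate a 4x4 int32 tile from pre-packed 16-byte operands.
void mma_tile_step(const vec_t *a_packed, const vec_t *b_packed,
                   int steps, vector signed int out[4]) {
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);                  // zero the accumulator
    for (int i = 0; i < steps; ++i) {
        // Rank-4 update: each entry of the 4x4 tile accumulates a 4-byte dot
        // product of its row of A and column of B (see the MMA redbook for
        // the exact signed/unsigned treatment of the two operands).
        __builtin_mma_xvi8ger4pp(&acc, a_packed[i], b_packed[i]);
    }
    __builtin_mma_disassemble_acc(out, &acc);       // spill the 4x4 int32 tile
}
```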

@amritahs-ibm
Contributor Author

@Djip007 Could you please help me with my latest comment?

This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le using MMA
builtins for the quantised int8 data type.

This change results in a 10%-70% improvement
in total speed (i.e. all tokens / total time), across
various batch sizes.

The patch is tested with Meta-Llama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
@amritahs-ibm
Contributor Author

I have addressed all the comments except for the MMA repacking one. As it is an optimization over the current code, I can take it up in a follow-on patch.
@slaren @Djip007 @ggerganov
Could one of you approve the patch and get it merged?

@ggerganov ggerganov merged commit 8cef75c into ggerganov:master Jan 8, 2025
48 checks passed