
llamafile_sgemm API - INT8 implementation #10912

Merged
1 commit merged into ggerganov:master from sgemm_q8 on Jan 8, 2025

Conversation

amritahs-ibm
Contributor

@amritahs-ibm amritahs-ibm commented Dec 20, 2024

This change upstreams llamafile's CPU matrix multiplication kernels for ppc64le, using MMA builtins for the quantised int8 data type.

This change results in a 10%-70% improvement in total speed (i.e. all tokens / total time) across various batch sizes.

The patch is tested with the Meta-Llama-3-8B, Mistral-7B and Llama-2-7B-chat-hf models on an IBM POWER10 machine.
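
For context, below is a minimal scalar reference of what such int8 kernels compute for Q8_0-style data. The block layout mirrors ggml's block_q8_0 (32 int8 quants plus a per-block scale; ggml stores the scale as fp16, shown as float here for simplicity), and the function name is purely illustrative, not the actual kernel code:

```cpp
// Illustrative scalar reference only; the real kernels tile this loop and
// feed 4x4 output tiles to the POWER10 MMA accumulators.
#include <cstdint>

constexpr int QK8_0 = 32;          // quants per block, as in ggml

struct block_q8_0 {
    float  d;                      // per-block scale (fp16 in ggml)
    int8_t qs[QK8_0];              // quantized values
};

// C[m][n] = sum_k A[m][k] * B[n][k], with A and B given as rows of Q8_0 blocks
static void gemm_q8_ref(int M, int N, int K_blocks,
                        const block_q8_0 *A, const block_q8_0 *B, float *C) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float sum = 0.0f;
            for (int kb = 0; kb < K_blocks; ++kb) {
                const block_q8_0 &a = A[m * K_blocks + kb];
                const block_q8_0 &b = B[n * K_blocks + kb];
                int32_t dot = 0;   // int8 products accumulated in int32
                for (int i = 0; i < QK8_0; ++i) {
                    dot += int32_t(a.qs[i]) * int32_t(b.qs[i]);
                }
                sum += a.d * b.d * float(dot);
            }
            C[m * N + n] = sum;    // row/column-major details omitted
        }
    }
}
```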

@github-actions github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 20, 2024
@amritahs-ibm amritahs-ibm force-pushed the sgemm_q8 branch 2 times, most recently from 85c5280 to d70f5fc on December 20, 2024 06:22
@amritahs-ibm
Contributor Author

Hi @ggerganov,
Could you please help review this PR, or suggest any actions required from me to get this patch reviewed?

@slaren
Collaborator

slaren commented Dec 23, 2024

We will need to merge #10714 first, since there may be some conflicts.

@amritahs-ibm
Contributor Author

Sure. Please let me know once that PR is merged. I will fix any conflicts in mine and resubmit my PR.

@Djip007
Contributor

Djip007 commented Dec 24, 2024

I'll try to submit it today. 🤞

@amritahs-ibm
Contributor Author

I have made the changes suggested by @Djip007 and pushed them. @slaren / @Djip007 / @ggerganov, please review the changes.

Contributor

@Djip007 Djip007 left a comment

These are quick personal comments; wait for @slaren / @ggerganov before making changes. Also, I read it very quickly.

5 review comments on ggml/src/ggml-cpu/llamafile/sgemm.cpp (outdated, resolved)
@Djip007
Contributor

Djip007 commented Dec 28, 2024

@amritahs-ibm
I didn't realize that MMA was "Matrix Multiply Accelerate".
Have you seen what was done with the amx or "aarch64" kernels? There is now a "simple" framework for creating kernels that can "repack" the weights so they fit the structure needed for such matrix ops.
That way A is repacked once at load time, and only B needs to be repacked at runtime.
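
A minimal sketch of that repack-once-at-load idea is below; the names (PackedWeights, repack_weights_for_mma) and the 4x4 tile size are hypothetical, not the actual amx/aarch64 interfaces:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical packed-weight container: weights are re-laid-out once when the
// model is loaded so each MMA tile reads contiguous memory at runtime.
struct PackedWeights {
    int rows, cols;
    std::vector<int8_t> data;      // interleaved 4x4 tiles
};

// Repack a row-major int8 weight matrix into 4x4 tiles (load time, done once).
PackedWeights repack_weights_for_mma(const int8_t *w, int rows, int cols) {
    PackedWeights p{rows, cols, std::vector<int8_t>(size_t(rows) * cols)};
    size_t out = 0;
    for (int r0 = 0; r0 < rows; r0 += 4) {         // assumes rows % 4 == 0
        for (int c0 = 0; c0 < cols; c0 += 4) {     // assumes cols % 4 == 0
            for (int r = r0; r < r0 + 4; ++r) {
                for (int c = c0; c < c0 + 4; ++c) {
                    p.data[out++] = w[size_t(r) * cols + c];
                }
            }
        }
    }
    return p;
}
// At runtime only the activations (B) still need per-call packing; the kernel
// streams p.data tile by tile straight into the MMA accumulators.
```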

@amritahs-ibm amritahs-ibm force-pushed the sgemm_q8 branch 2 times, most recently from dc23ee5 to 4147962 on December 30, 2024 07:27
@amritahs-ibm
Contributor Author

All comments are addressed except for the last MMA one. The updated patch has been committed.
I will look into the MMA comment and get back to you.

@amritahs-ibm
Contributor Author

amritahs-ibm commented Jan 2, 2025

@amritahs-ibm I didn't realize that MMA was "Matrix Multiply Accelerate". Have you seen what was done with the amx or "aarch64" kernels? There is now a "simple" framework for creating kernels that can "repack" the weights so they fit the structure needed for such matrix ops. That way A is repacked once at load time, and only B needs to be repacked at runtime.

@Djip007
Are you referring to the gemm4xN and gemmMx4 functions in tinyBLAS_Q0_AVX?

Also, in the case of PowerPC's MMA for the int8 data type, the MMA engine requires the data to be packed in a different way, so I came up with a specific function for int8 (i.e. packNormal) to do the packing.

Please find below the MMA guide:
https://www.redbooks.ibm.com/redpapers/pdfs/redp5612.pdf
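
For reference, here is a minimal sketch of how the POWER10 MMA int8 builtins are typically driven with GCC's __builtin_mma_* interface (compiled with -mcpu=power10). The function name and packing assumptions are illustrative only, not the actual sgemm.cpp code:

```cpp
// Illustrative only; the real kernels pack operands via packNormal before
// feeding the accumulators.
#include <altivec.h>
#include <cstdint>

typedef vector unsigned char vec_t;   // 16 bytes per VSX register

// Accumulate a 4x4 int32 tile from pre-packed 16-byte operands.
void mma_tile_step(const vec_t *a_packed, const vec_t *b_packed,
                   int steps, vector signed int out[4]) {
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);                  // zero the accumulator
    for (int i = 0; i < steps; ++i) {
        // Rank-4 update: each entry of the 4x4 tile accumulates a 4-byte dot
        // product of its row of A and column of B (see the MMA redbook for
        // the exact signed/unsigned treatment of the two operands).
        __builtin_mma_xvi8ger4pp(&acc, a_packed[i], b_packed[i]);
    }
    __builtin_mma_disassemble_acc(out, &acc);       // spill the 4x4 int32 tile
}
```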

@amritahs-ibm
Contributor Author

@Djip007 Could you please help me with my latest comment?

This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le using MMA
builtins for the quantised int8 data type.

This change results in a 10%-70% improvement
in total speed (i.e. all tokens / total time), across
various batch sizes.

The patch is tested with Meta-Llama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
@amritahs-ibm
Contributor Author

I have addressed all the comments except for the MMA repacking one. As it is an optimization over the current code, I can take it up in a follow-on patch.
@slaren @Djip007 @ggerganov
Could one of you approve the patch and get it merged?

@ggerganov ggerganov merged commit 8cef75c into ggerganov:master Jan 8, 2025
48 checks passed