
ggml-cpu: enable IBM NNPA Vector Intrinsics #14303


Closed
taronaeo wants to merge 22 commits

Conversation

taronaeo (Contributor)

This pull request enables the IBM NNPA instruction set for IBM z16 and later mainframes on the s390x platform. The change mainly targets FP16 -> FP32 and FP32 -> FP16 data conversions.
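For a concrete picture of the instructions involved, here is a minimal sketch of a scalar FP32 -> FP16 conversion using the NNPA intrinsics that recent GCC/Clang expose in `vecintrin.h` (built with e.g. `-march=z16 -mzvector`). The exact macro names used inside ggml may differ, so treat this as an assumption rather than the PR's actual code:

```c
#include <stdint.h>
#include <vecintrin.h> // IBM z vector intrinsics; build with -march=z16 -mzvector

// Sketch: convert one FP32 value to IEEE FP16 via NNPA.
// VCRNF rounds two FP32 vectors into the 16-bit NNP internal format,
// and VCNF converts that internal format to IEEE FP16.
static inline uint16_t fp32_to_fp16_nnpa(float f) {
    vector float          v_f    = vec_splats(f);
    vector float          v_zero = vec_splats(0.0f);
    vector unsigned short v_nnp  = vec_round_from_fp32(v_f, v_zero, 0); // VCRNF
    vector unsigned short v_h    = vec_convert_to_fp16(v_nnp, 0);       // VCNF
    return vec_extract(v_h, 0);
}
```

Without NNPA, this conversion falls back to a scalar bit-manipulation routine, so even the single-value case benefits from the hardware instructions.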

Verification

To ensure that this implementation did not break anything, the NNPA instruction set has been tested on the following models:

  • Tested IBM Granite 3.3 (F32, F16, Q4_0, Q4_1, Q3_K, Q4_K, Q5_K)
  • Please suggest additional models to test in this PR

Performance Results

I used IBM Granite 3.3 for the performance tests. We see a performance improvement of roughly 37% in F16 token generation (1.46 → 2.00 t/s in tg128), which is the expected outcome.

Before NNPA Instruction Set

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 4 | pp512 | 31.56 ± 0.23 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 4 | tg128 | 1.75 ± 0.01 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 4 | pp512 | 30.94 ± 0.20 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 4 | tg128 | 1.46 ± 0.01 |

After NNPA Instruction Set

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 4 | pp512 | 30.56 ± 2.42 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 4 | tg128 | 1.77 ± 0.01 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 4 | pp512 | 30.86 ± 0.23 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 4 | tg128 | 2.00 ± 0.01 |

Note

Tests were conducted on an IBM z16 mainframe with 2 IFLs (4 vCores) and 64 GB of memory on z/VM (Type-2).
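The tables above follow the standard `llama-bench` output format; an invocation along these lines (the model path is a placeholder) matches the threads/pp512/tg128 configuration shown:

```sh
# -t 4 matches the 4 vCores; -p 512 / -n 128 correspond to the pp512 / tg128 tests
./build/bin/llama-bench -m ./models/granite-3.3-2b-f16.gguf -t 4 -p 512 -n 128
```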

Please review this pull request and consider merging into the main repository. Thank you!

taronaeo added 11 commits June 20, 2025 21:44
slaren (Member) commented Jun 20, 2025

There are also the functions ggml_cpu_fp32_to_fp16 and ggml_cpu_fp16_to_fp32 that are used to convert vectors, and may benefit from this instruction set.

taronaeo (Contributor, Author)

> There are also the functions ggml_cpu_fp32_to_fp16 and ggml_cpu_fp16_to_fp32 that are used to convert vectors, and may benefit from this instruction set.

Good catch! Let me look into it and merge it with this PR where possible :)
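For illustration, a vectorized FP16 -> FP32 row conversion in the spirit of `ggml_cpu_fp16_to_fp32` might look like the sketch below; the function name, scalar fallback, and guard macro are assumptions, not the code merged here:

```c
#include <stdint.h>
#include <vecintrin.h> // IBM z vector intrinsics; build with -march=z16 -mzvector

// Hypothetical sketch: widen n IEEE FP16 values to FP32, 8 at a time.
// VCFN converts FP16 to the NNP internal format; VCLFNH/VCLFNL lengthen
// the high/low halves of that vector to FP32.
void fp16_to_fp32_row(const uint16_t *x, float *y, int64_t n) {
    int64_t i = 0;
#if defined(__NNPA__)
    for (; i + 7 < n; i += 8) {
        vector unsigned short v_h = vec_xl(0, x + i);              // 8 halfs
        vector unsigned short v_n = vec_convert_from_fp16(v_h, 0); // VCFN
        vec_xst(vec_extend_to_fp32_hi(v_n, 0), 0, y + i);          // VCLFNH
        vec_xst(vec_extend_to_fp32_lo(v_n, 0), 0, y + i + 4);      // VCLFNL
    }
#endif
    for (; i < n; ++i) {
        y[i] = fp16_to_fp32_scalar(x[i]); // hypothetical scalar fallback
    }
}
```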

github-actions bot added the `documentation` and `ggml` labels Jun 20, 2025
taronaeo added 4 commits June 20, 2025 22:46
@taronaeo taronaeo marked this pull request as draft June 20, 2025 17:56
taronaeo added 6 commits June 21, 2025 02:01
taronaeo (Contributor, Author) commented Jun 21, 2025

Hey @slaren, I've implemented our SIMD in the suggested functions, but they reside within ggml-impl.h, which means the s390x-specific macros are not defined during the build, and there is no way to target a specific architecture unless the code lives inside ggml-cpu.

Do you happen to know whether those two functions should be moved into ggml-cpu instead? I don't think it would be nice to duplicate the s390x-specific macros in both ggml/src/ggml-cpu/CMakeLists.txt and ggml/src/CMakeLists.txt.

Edit: Ignore what I wrote above; I was looking at the wrong functions all this while. My bad.

taronaeo (Contributor, Author)

Closing, superseded by #14317

@taronaeo taronaeo closed this Jun 21, 2025