ggml-cpu: enable IBM NNPA Vector Intrinsics #14317


Open · wants to merge 63 commits into master

Conversation

@taronaeo (Contributor) commented Jun 21, 2025

This pull request enables the IBM NNPA instruction set for IBM z16 and later mainframes on the s390x platform. The change mainly targets FP16 -> FP32 and FP32 -> FP16 data conversions.

Note: This PR supersedes #14303 because that implementation was wrong.
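
For readers unfamiliar with the facility: as I understand it, NNPA converts between IEEE FP16 and FP32 by going through the NNP internal 16-bit format, so a single conversion is a two-step operation. The sketch below only illustrates that idea and is not this PR's code; the `<vecintrin.h>` builtin names (`vec_convert_from_fp16`, `vec_extend_to_fp32_hi`), the `-march=z16 -mzvector` build flags, and the `nnpa_compute_fp16_to_fp32` helper name are my assumptions, not confirmed details of this change.

```c
// Illustrative sketch only (not this PR's code): converting one IEEE FP16
// value to FP32 with the z16 NNP-assist conversion builtins, assumed to be
// exposed in <vecintrin.h> when building with -march=z16 -mzvector.
#include <stdint.h>
#include <vecintrin.h>

typedef __vector unsigned short vec_u16x8;  // 8 x 16-bit lanes

static inline float nnpa_compute_fp16_to_fp32(uint16_t h) {
    vec_u16x8 v_h  = vec_splats((unsigned short) h);  // broadcast the FP16 bit pattern
    vec_u16x8 v_hd = vec_convert_from_fp16(v_h, 0);   // IEEE FP16 -> NNP 16-bit format (assumed builtin)
    return vec_extend_to_fp32_hi(v_hd, 0)[0];         // widen to FP32, take lane 0 (assumed builtin)
}
```

In the actual conversion loops the same builtins would operate on full eight-element vectors rather than one splatted value; the single-value form above just keeps the sketch short.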

Verification

To verify that this implementation does not break anything, the NNPA code path has been tested with the following models:

  • IBM Granite 3.3 (F32, F16, Q4_0, Q4_1, Q3_K, Q4_K, Q5_K)
  • Please request additional models you would like tested in this PR

Performance Results

IBM Granite 3.3 is used for the performance tests. We observe a performance improvement of roughly 0.80% for F16 prompt processing (pp512) and 29.73% for F16 token generation (tg128), which is the expected outcome.

Before NNPA Instruction Set

| model              | size     | params | backend | threads | test  | t/s          |
| ------------------ | -------- | ------ | ------- | ------- | ----- | ------------ |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 4       | pp512 | 31.56 ± 0.23 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 4       | tg128 | 1.75 ± 0.01  |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 4       | pp512 | 30.94 ± 0.20 |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 4       | tg128 | 1.46 ± 0.01  |

After NNPA Instruction Set

| model              | size     | params | backend | threads | test  | t/s          |
| ------------------ | -------- | ------ | ------- | ------- | ----- | ------------ |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 4       | pp512 | 31.49 ± 0.69 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 4       | tg128 | 1.74 ± 0.02  |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 4       | pp512 | 31.19 ± 0.20 |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 4       | tg128 | 1.97 ± 0.06  |

Note

Tests were conducted on an IBM z16 Mainframe with 2 IFLs (4 vCores) and 64 GB Memory on z/VM (Type-2)

SIMD activations for ggml_compute_fp16_to_fp32 and ggml_compute_fp32_to_fp16 are ready. However, I was unable to find a way to make the s390x platform detection macros usable in ggml-impl.h, so for now the correct implementation is left in place there until this can be fixed.
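
To make the macro problem concrete, the guard in ggml-impl.h would need to look roughly like the sketch below; getting the NNPA feature macro visible at that point is the unresolved part. GGML_COMPUTE_FP16_TO_FP32/FP32_TO_FP16 and the scalar ggml_compute_* helpers are existing ggml names, `__s390x__` is a real compiler define on s390x targets, while `__NNPA__` and the `nnpa_compute_*` helpers are placeholders of mine, not the final solution.

```c
// Shape of the guard being discussed (illustration, not the final code).
#if defined(__s390x__) && defined(__NNPA__)
    // SIMD path using the NNPA conversion builtins
    // (nnpa_compute_* are placeholder names, see the sketch earlier in this PR)
    #define GGML_COMPUTE_FP16_TO_FP32(x) nnpa_compute_fp16_to_fp32(x)
    #define GGML_COMPUTE_FP32_TO_FP16(x) nnpa_compute_fp32_to_fp16(x)
#else
    // portable scalar C fallback
    #define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
    #define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
#endif
```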

Please review this pull request and consider merging it into the main repository. Thank you!

taronaeo added 30 commits June 21, 2025 14:46
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 4a9f60c)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8d4a798)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0ff0d65)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 2f58bbc)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 01b9294)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
taronaeo added 23 commits June 21, 2025 19:00
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
This reverts commit 157f856.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 157f856)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
This reverts commit 18d79e1.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
@github-actions bot added the documentation (Improvements or additions to documentation) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 21, 2025
taronaeo added 2 commits June 21, 2025 23:34
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
@slaren (Member) commented Jun 22, 2025

> SIMD activations for ggml_compute_fp16_to_fp32 and ggml_compute_fp32_to_fp16 are ready. However, I was unable to find a way to make the s390x platform detection macros usable in ggml-impl.h, so for now the correct implementation is left in place there until this can be fixed.

This code is a bit messy because in the past there was no separation between the ggml core and the CPU backend. I think what we should do is keep only the basic C implementation of these functions/macros in ggml-base, and move the optimized versions to ggml-cpu.
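
A rough sketch of the layering slaren describes, not a decided design: ggml-base would keep only the portable scalar conversion, and the CPU backend would select an optimized override per architecture. `ggml_fp16_to_fp32`/`ggml_fp32_to_fp16` are the existing public scalar API; `GGML_CPU_FP16_TO_FP32` and `nnpa_compute_fp16_to_fp32` are placeholder names of mine.

```c
// Sketch of the proposed base/CPU split (placeholder names, not real layout).
#include <stdint.h>

typedef uint16_t ggml_fp16_t;

/* ggml-base (ggml-impl.h / ggml.c): portable scalar reference, always built */
float       ggml_fp16_to_fp32(ggml_fp16_t h);
ggml_fp16_t ggml_fp32_to_fp16(float f);

/* ggml-cpu (per-arch SIMD mappings): override where the hardware allows it */
#if defined(__s390x__) && defined(__NNPA__)
    #define GGML_CPU_FP16_TO_FP32(x) nnpa_compute_fp16_to_fp32(x)  // NNPA fast path
#else
    #define GGML_CPU_FP16_TO_FP32(x) ggml_fp16_to_fp32(x)          // scalar fallback from ggml-base
#endif
```

That way ggml-impl.h would not need to see any platform detection macros at all, which sidesteps the issue described in the PR description.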

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Labels: documentation, ggml
2 participants