
ggml-cpu: enable IBM NNPA Vector Intrinsics #14303


Closed
taronaeo wants to merge 22 commits

Conversation

taronaeo (Contributor)

This pull request enables the IBM NNPA instruction set for IBM z16 and later mainframes on the s390x platform. The change mainly targets FP16 -> FP32 and FP32 -> FP16 data conversions.
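For a concrete picture of the instructions involved, here is a minimal sketch of a scalar FP32 -> FP16 conversion using the NNPA intrinsics that recent GCC/Clang expose in `vecintrin.h` (built with e.g. `-march=z16 -mzvector`). The exact macro names used inside ggml may differ, so treat this as an assumption rather than the PR's actual code:

```c
#include <stdint.h>
#include <vecintrin.h> // IBM z vector intrinsics; build with -march=z16 -mzvector

// Sketch: convert one FP32 value to IEEE FP16 via NNPA.
// VCRNF rounds two FP32 vectors into the 16-bit NNP internal format,
// and VCNF converts that internal format to IEEE FP16.
static inline uint16_t fp32_to_fp16_nnpa(float f) {
    vector float          v_f    = vec_splats(f);
    vector float          v_zero = vec_splats(0.0f);
    vector unsigned short v_nnp  = vec_round_from_fp32(v_f, v_zero, 0); // VCRNF
    vector unsigned short v_h    = vec_convert_to_fp16(v_nnp, 0);       // VCNF
    return vec_extract(v_h, 0);
}
```

Without NNPA, this conversion falls back to a scalar bit-manipulation routine, so even the single-value case benefits from the hardware instructions.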

Verification

To ensure that this implementation did not break anything, the NNPA instruction set has been tested on the following models:

  • Tested IBM Granite 3.3 (F32, F16, Q4_0, Q4_1, Q3_K, Q4_K, Q5_K)
  • Please suggest additional models to test in this PR

Performance Results

I used IBM Granite 3.3 for the performance tests. We see a performance improvement of roughly 37% in F16 token generation (1.46 → 2.00 t/s in tg128), which is the expected outcome.

Before NNPA Instruction Set

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 4 | pp512 | 31.56 ± 0.23 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 4 | tg128 | 1.75 ± 0.01 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 4 | pp512 | 30.94 ± 0.20 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 4 | tg128 | 1.46 ± 0.01 |

After NNPA Instruction Set

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 4 | pp512 | 30.56 ± 2.42 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 4 | tg128 | 1.77 ± 0.01 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 4 | pp512 | 30.86 ± 0.23 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 4 | tg128 | 2.00 ± 0.01 |

Note

Tests were conducted on an IBM z16 mainframe with 2 IFLs (4 vCores) and 64 GB of memory on z/VM (Type-2).
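The tables above follow the standard `llama-bench` output format; an invocation along these lines (the model path is a placeholder) matches the threads/pp512/tg128 configuration shown:

```sh
# -t 4 matches the 4 vCores; -p 512 / -n 128 correspond to the pp512 / tg128 tests
./build/bin/llama-bench -m ./models/granite-3.3-2b-f16.gguf -t 4 -p 512 -n 128
```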

Please review this pull request and consider merging into the main repository. Thank you!

taronaeo added 11 commits June 20, 2025 21:44
slaren (Member) commented Jun 20, 2025

There are also the functions ggml_cpu_fp32_to_fp16 and ggml_cpu_fp16_to_fp32 that are used to convert vectors, and may benefit from this instruction set.

taronaeo (Contributor, Author)

> There are also the functions ggml_cpu_fp32_to_fp16 and ggml_cpu_fp16_to_fp32 that are used to convert vectors, and may benefit from this instruction set.

Good catch! Let me look into it and merge it with this PR where possible :)
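For illustration, a vectorized FP16 -> FP32 row conversion in the spirit of `ggml_cpu_fp16_to_fp32` might look like the sketch below; the function name, scalar fallback, and guard macro are assumptions, not the code merged here:

```c
#include <stdint.h>
#include <vecintrin.h> // IBM z vector intrinsics; build with -march=z16 -mzvector

// Hypothetical sketch: widen n IEEE FP16 values to FP32, 8 at a time.
// VCFN converts FP16 to the NNP internal format; VCLFNH/VCLFNL lengthen
// the high/low halves of that vector to FP32.
void fp16_to_fp32_row(const uint16_t *x, float *y, int64_t n) {
    int64_t i = 0;
#if defined(__NNPA__)
    for (; i + 7 < n; i += 8) {
        vector unsigned short v_h = vec_xl(0, x + i);              // 8 halfs
        vector unsigned short v_n = vec_convert_from_fp16(v_h, 0); // VCFN
        vec_xst(vec_extend_to_fp32_hi(v_n, 0), 0, y + i);          // VCLFNH
        vec_xst(vec_extend_to_fp32_lo(v_n, 0), 0, y + i + 4);      // VCLFNL
    }
#endif
    for (; i < n; ++i) {
        y[i] = fp16_to_fp32_scalar(x[i]); // hypothetical scalar fallback
    }
}
```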

github-actions bot added the `documentation` and `ggml` labels Jun 20, 2025
taronaeo added 4 commits June 20, 2025 22:46
@taronaeo taronaeo marked this pull request as draft June 20, 2025 17:56
taronaeo added 6 commits June 21, 2025 02:01
taronaeo (Contributor, Author) commented Jun 21, 2025

Hey @slaren, I've implemented our SIMD in the suggested functions, but they reside within ggml-impl.h, which means the s390x-specific macros are not defined during the build, and there is no way to target a specific architecture unless the code lives inside ggml-cpu.

Do you happen to know whether those two functions should be moved into ggml-cpu instead? I don't think it would be nice to duplicate the s390x-specific macros in both ggml/src/ggml-cpu/CMakeLists.txt and ggml/src/CMakeLists.txt.

Edit: Ignore what I wrote above; I was looking at the wrong functions all this while. My bad.

taronaeo (Contributor, Author)

Closing, superseded by #14317

@taronaeo taronaeo closed this Jun 21, 2025