ggml-cpu: enable IBM NNPA Vector Intrinsics #14317


Open · wants to merge 63 commits into master

Conversation

@taronaeo (Contributor) commented Jun 21, 2025

This pull request enables the IBM NNPA instruction set for IBM z16 and later mainframes on the s390x platform. The change mainly targets FP16 -> FP32 and FP32 -> FP16 data conversions.

Note: This PR supersedes #14303 because that implementation was wrong.
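
For readers unfamiliar with the facility: as I understand it, NNPA converts between IEEE FP16 and FP32 by going through the NNP internal 16-bit format, so a single conversion is a two-step operation. The sketch below only illustrates that idea and is not this PR's code; the `<vecintrin.h>` builtin names (`vec_convert_from_fp16`, `vec_extend_to_fp32_hi`), the `-march=z16 -mzvector` build flags, and the `nnpa_compute_fp16_to_fp32` helper name are my assumptions, not confirmed details of this change.

```c
// Illustrative sketch only (not this PR's code): converting one IEEE FP16
// value to FP32 with the z16 NNP-assist conversion builtins, assumed to be
// exposed in <vecintrin.h> when building with -march=z16 -mzvector.
#include <stdint.h>
#include <vecintrin.h>

typedef __vector unsigned short vec_u16x8;  // 8 x 16-bit lanes

static inline float nnpa_compute_fp16_to_fp32(uint16_t h) {
    vec_u16x8 v_h  = vec_splats((unsigned short) h);  // broadcast the FP16 bit pattern
    vec_u16x8 v_hd = vec_convert_from_fp16(v_h, 0);   // IEEE FP16 -> NNP 16-bit format (assumed builtin)
    return vec_extend_to_fp32_hi(v_hd, 0)[0];         // widen to FP32, take lane 0 (assumed builtin)
}
```

In the actual conversion loops the same builtins would operate on full eight-element vectors rather than one splatted value; the single-value form above just keeps the sketch short.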

Verification

To verify that this implementation does not break anything, the NNPA code path has been tested with the following models:

  • IBM Granite 3.3 (F32, F16, Q4_0, Q4_1, Q3_K, Q4_K, Q5_K)
  • Please request additional models you would like tested in this PR

Performance Results

IBM Granite 3.3 is used for the performance tests. We observe a performance improvement of roughly 0.80% for F16 prompt processing (pp512) and 29.73% for F16 token generation (tg128), which is the expected outcome.

Before NNPA Instruction Set

| model              | size     | params | backend | threads | test  | t/s          |
| ------------------ | -------- | ------ | ------- | ------- | ----- | ------------ |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 4       | pp512 | 31.56 ± 0.23 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 4       | tg128 | 1.75 ± 0.01  |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 4       | pp512 | 30.94 ± 0.20 |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 4       | tg128 | 1.46 ± 0.01  |

After NNPA Instruction Set

| model              | size     | params | backend | threads | test  | t/s          |
| ------------------ | -------- | ------ | ------- | ------- | ----- | ------------ |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 4       | pp512 | 31.49 ± 0.69 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 4       | tg128 | 1.74 ± 0.02  |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 4       | pp512 | 31.19 ± 0.20 |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 4       | tg128 | 1.97 ± 0.06  |

Note

Tests were conducted on an IBM z16 Mainframe with 2 IFLs (4 vCores) and 64 GB Memory on z/VM (Type-2)

SIMD activations for ggml_compute_fp16_to_fp32 and ggml_compute_fp32_to_fp16 are ready. However, I was unable to find a way to make the s390x platform detection macros usable in ggml-impl.h, so for now the correct implementation is left in place there until this can be fixed.
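
To make the macro problem concrete, the guard in ggml-impl.h would need to look roughly like the sketch below; getting the NNPA feature macro visible at that point is the unresolved part. GGML_COMPUTE_FP16_TO_FP32/FP32_TO_FP16 and the scalar ggml_compute_* helpers are existing ggml names, `__s390x__` is a real compiler define on s390x targets, while `__NNPA__` and the `nnpa_compute_*` helpers are placeholders of mine, not the final solution.

```c
// Shape of the guard being discussed (illustration, not the final code).
#if defined(__s390x__) && defined(__NNPA__)
    // SIMD path using the NNPA conversion builtins
    // (nnpa_compute_* are placeholder names, see the sketch earlier in this PR)
    #define GGML_COMPUTE_FP16_TO_FP32(x) nnpa_compute_fp16_to_fp32(x)
    #define GGML_COMPUTE_FP32_TO_FP16(x) nnpa_compute_fp32_to_fp16(x)
#else
    // portable scalar C fallback
    #define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
    #define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
#endif
```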

Please review this pull request and consider merging it into the main repository. Thank you!

taronaeo added 30 commits June 21, 2025 14:46
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 4a9f60c)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8d4a798)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0ff0d65)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 2f58bbc)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 01b9294)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
taronaeo added 23 commits June 21, 2025 19:00
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
This reverts commit 157f856.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 157f856)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
This reverts commit 18d79e1.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
@github-actions bot added the documentation (Improvements or additions to documentation) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 21, 2025
taronaeo added 2 commits June 21, 2025 23:34
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
@slaren (Member) commented Jun 22, 2025

> SIMD activations for ggml_compute_fp16_to_fp32 and ggml_compute_fp32_to_fp16 are ready. However, I was unable to find a way to make the s390x platform detection macros usable in ggml-impl.h, so for now the correct implementation is left in place there until this can be fixed.

This code is a bit messy because in the past there was no separation between the ggml core and the CPU backend. I think what we should do is keep only the basic C implementation of these functions/macros in ggml-base, and move the optimized versions to ggml-cpu.
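
A rough sketch of the layering slaren describes, not a decided design: ggml-base would keep only the portable scalar conversion, and the CPU backend would select an optimized override per architecture. `ggml_fp16_to_fp32`/`ggml_fp32_to_fp16` are the existing public scalar API; `GGML_CPU_FP16_TO_FP32` and `nnpa_compute_fp16_to_fp32` are placeholder names of mine.

```c
// Sketch of the proposed base/CPU split (placeholder names, not real layout).
#include <stdint.h>

typedef uint16_t ggml_fp16_t;

/* ggml-base (ggml-impl.h / ggml.c): portable scalar reference, always built */
float       ggml_fp16_to_fp32(ggml_fp16_t h);
ggml_fp16_t ggml_fp32_to_fp16(float f);

/* ggml-cpu (per-arch SIMD mappings): override where the hardware allows it */
#if defined(__s390x__) && defined(__NNPA__)
    #define GGML_CPU_FP16_TO_FP32(x) nnpa_compute_fp16_to_fp32(x)  // NNPA fast path
#else
    #define GGML_CPU_FP16_TO_FP32(x) ggml_fp16_to_fp32(x)          // scalar fallback from ggml-base
#endif
```

That way ggml-impl.h would not need to see any platform detection macros at all, which sidesteps the issue described in the PR description.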

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Labels: documentation, ggml
2 participants