
ggml-cpu: enable IBM NNPA Vector Intrinsics #14303

Closed

wants to merge 22 commits into from

Commits
4a9f60c
ggml-cpu: add nnpa compile flag
taronaeo Jun 20, 2025
8d4a798
ggml-cpu: add fp16->fp32 nnpa first
taronaeo Jun 20, 2025
0ff0d65
ggml-cpu: add fp32->fp16
taronaeo Jun 20, 2025
a316d1b
ggml-cpu: attempt direct reference
taronaeo Jun 20, 2025
ff70b3a
Revert "ggml-cpu: attempt direct reference"
taronaeo Jun 20, 2025
2f58bbc
ggml-cpu: better variable names
taronaeo Jun 20, 2025
ae9c5f9
ggml-cpu: add ggml fp16->fp32 and fp32->fp16 scalar simd
taronaeo Jun 20, 2025
a88843a
ggml-cpu: switch fp16->fp32 to inline asm and test
taronaeo Jun 20, 2025
70ff4e6
Revert "ggml-cpu: switch fp16->fp32 to inline asm and test"
taronaeo Jun 20, 2025
2b4892e
ggml-cpu: chore: remove todo comments about inline asm
taronaeo Jun 20, 2025
01b9294
docs: update s390x docs
taronaeo Jun 20, 2025
dca6c74
ggml-cpu: add nnpa intrinsics for batched fp32->fp16
taronaeo Jun 20, 2025
6b4469b
ggml-cpu: fix wrong displacement
taronaeo Jun 20, 2025
c9d0f36
ggml-cpu: change vector load displacement too
taronaeo Jun 20, 2025
22669f3
ggml-cpu: fix wrong vector intrinsic func
taronaeo Jun 20, 2025
5530bec
ggml-cpu: add sigint for gdb to break
taronaeo Jun 20, 2025
5d478c7
wip: move vector store to tmp variable for debugging
taronaeo Jun 20, 2025
dc29eed
wip: change vector to scalar data type
taronaeo Jun 20, 2025
1be4514
wip: vec_round_from_fp32 seem to be throwing rounding errors
taronaeo Jun 20, 2025
733066b
Revert "wip: vec_round_from_fp32 seem to be throwing rounding errors"
taronaeo Jun 20, 2025
5d84579
wip: double check original impl
taronaeo Jun 20, 2025
3f0cbf7
wip: add missing import
taronaeo Jun 20, 2025
39 changes: 28 additions & 11 deletions docs/build-s390x.md
@@ -28,8 +28,9 @@ cmake --build build --config Release -j $(nproc)
```

**Notes**:
- For faster repeated compilation, install [ccache](https://ccache.dev/)
- By default, VXE/VXE2 is enabled. To disable it (not recommended):

- For faster repeated compilation, install [ccache](https://ccache.dev/)
- By default, VXE/VXE2 is enabled. To disable it (not recommended):

```bash
cmake -S . -B build \
@@ -41,18 +42,29 @@ cmake --build build --config Release -j $(nproc)
cmake --build build --config Release -j $(nproc)
```

- For debug builds:
- By default, NNPA is enabled when available. To disable it (not recommended):

```bash
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DGGML_NNPA=OFF

cmake --build build --config Release -j $(nproc)
```

- For debug builds:

```bash
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Debug \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS

cmake --build build --config Debug -j $(nproc)
```

- For static builds, add `-DBUILD_SHARED_LIBS=OFF`:
- For static builds, add `-DBUILD_SHARED_LIBS=OFF`:

```bash
cmake -S . -B build \
@@ -101,27 +113,33 @@ All models need to be converted to Big-Endian. You can achieve this in three cases
```

For example,

```bash
python3 gguf-py/gguf/scripts/gguf_convert_endian.py granite-3.3-2b-instruct-le.f16.gguf BIG
mv granite-3.3-2b-instruct-le.f16.gguf granite-3.3-2b-instruct-be.f16.gguf
```

**Notes:**

- The GGUF endian conversion script may not support all data types at the moment and may fail for some models/quantizations. When that happens, please try manually converting the safetensors model to GGUF Big-Endian via Step 2.

## IBM Accelerators

### 1. SIMD Acceleration

Only available in IBM z15 or later system with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z14 or EC13. In such systems, the APIs can still run but will use a scalar implementation.
Only available in IBM z15 or later system with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z14/arch12. In such systems, the APIs can still run but will use a scalar implementation.

### 2. NNPA Vector Intrinsics Acceleration

### 2. zDNN Accelerator
Only available in IBM z16 or later system with the `-DGGML_NNPA=ON` (turned on when available) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z15/arch13. In such systems, the APIs can still run but will use a scalar implementation.

*Only available in IBM z16 or later system. No direction at the moment.*
### 3. zDNN Accelerator

### 3. Spyre Accelerator
_Only available in IBM z16 or later system. No direction at the moment._

*No direction at the moment.*
### 4. Spyre Accelerator

_No direction at the moment._

## Performance Tuning

@@ -154,4 +172,3 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongly
2. **Other Questions**

Please reach out directly to [aionz@us.ibm.com](mailto:aionz@us.ibm.com).

4 changes: 4 additions & 0 deletions docs/build.md
@@ -557,6 +557,10 @@ ninja

To read documentation for how to build on Android, [click here](./android.md)

## IBM Z & LinuxONE

To read documentation for how to build on IBM Z & LinuxONE, [click here](./build-s390x.md)

## Notes about GPU-accelerated backends

The GPU may still be used to accelerate some parts of the computation even when using the `-ngl 0` option. You can fully disable GPU acceleration by using `--device none`.
1 change: 1 addition & 0 deletions ggml/CMakeLists.txt
@@ -131,6 +131,7 @@ option(GGML_RVV "ggml: enable rvv" ON)
option(GGML_RV_ZFH "ggml: enable riscv zfh" OFF)
option(GGML_XTHEADVECTOR "ggml: enable xtheadvector" OFF)
option(GGML_VXE "ggml: enable vxe" ON)
option(GGML_NNPA "ggml: enable nnpa" ON)

option(GGML_CPU_ALL_VARIANTS "ggml: build all variants of the CPU backend (requires GGML_BACKEND_DL)" OFF)
set(GGML_CPU_ARM_ARCH "" CACHE STRING "ggml: CPU architecture for ARM")
1 change: 1 addition & 0 deletions ggml/include/ggml-cpu.h
@@ -101,6 +101,7 @@ extern "C" {
GGML_BACKEND_API int ggml_cpu_has_riscv_v (void);
GGML_BACKEND_API int ggml_cpu_has_vsx (void);
GGML_BACKEND_API int ggml_cpu_has_vxe (void);
GGML_BACKEND_API int ggml_cpu_has_nnpa (void);
GGML_BACKEND_API int ggml_cpu_has_wasm_simd (void);
GGML_BACKEND_API int ggml_cpu_has_llamafile (void);

7 changes: 7 additions & 0 deletions ggml/src/ggml-cpu/CMakeLists.txt
@@ -427,6 +427,7 @@ function(ggml_add_cpu_backend_variant_impl tag_name)

# TODO: Separation to determine activation of VX/VXE/VXE2
if (${S390X_M} MATCHES "8561|8562")
set(GGML_NNPA OFF)
message(STATUS "z15 target")
list(APPEND ARCH_FLAGS -march=z15)
elseif (${S390X_M} MATCHES "3931")
@@ -443,8 +444,14 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
endif()

if (GGML_VXE)
message(STATUS "VX/VXE/VXE2 enabled")
list(APPEND ARCH_FLAGS -mvx -mzvector)
endif()

if (GGML_NNPA)
target_compile_definitions(${GGML_CPU_NAME} PRIVATE GGML_NNPA)
message(STATUS "NNPA enabled")
endif()
elseif (CMAKE_SYSTEM_PROCESSOR MATCHES "wasm")
message(STATUS "Wasm detected")
list (APPEND GGML_CPU_SOURCES ggml-cpu/arch/wasm/quants.c)
12 changes: 9 additions & 3 deletions ggml/src/ggml-cpu/ggml-cpu-impl.h
@@ -62,11 +62,17 @@ struct ggml_compute_params {
#if defined(__s390x__) && defined(__VEC__)
#ifndef __VXE__
#define __VXE__
#endif
#endif // __VXE__
#ifndef __VXE2__
#define __VXE2__
#endif
#endif
#endif // __VXE2__
#endif // __s390x__ && __VEC__

#if defined(__s390x__) && defined(GGML_NNPA)
#ifndef __NNPA__
#define __NNPA__
#endif // __NNPA__
#endif // __s390x__ && GGML_NNPA

#if defined(__ARM_FEATURE_SVE)
#include <sys/prctl.h>
27 changes: 27 additions & 0 deletions ggml/src/ggml-cpu/ggml-cpu.c
@@ -3137,6 +3137,25 @@ void ggml_cpu_fp32_to_fp16(const float * x, ggml_fp16_t * y, int64_t n) {
_mm_storel_epi64((__m128i *)(y + i), y_vec);
}
#endif

#if defined(__NNPA__)
for (; i + 7 < n; i += 8) {
uint16_t tmp[8];
float32x4_t v_x1 = vec_xl(0, x + i + 0);
float32x4_t v_x2 = vec_xl(0, x + i + 4);
uint16x8_t v_dlf16 = vec_round_from_fp32(v_x1, v_x2, 0);
vec_xst(v_dlf16, 0, tmp);
// raise(SIGINT);
}
// TODO: Enable bottom code once checks are done
// for (; i + 3 < n; i += 4) {
// float32x4_t v_x = vec_xl(i, x);
// float32x4_t v_zero = vec_splats(0.0f);
// uint16x4_t v_dlf16 = vec_round_from_fp32(v_x, v_zero, 0);
// vec_xst(v_dlf16, i, (uint16_t *)y);
// }
#endif

for (; i < n; ++i) {
y[i] = GGML_FP32_TO_FP16(x[i]);
}
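
The NNPA loop above is still WIP: the converted halfwords land in a temporary buffer and are never written to `y`, and the commented-out `raise(SIGINT)` is a debugging aid. A minimal sketch of where this appears to be heading, assuming the 16-bit output of `vec_round_from_fp32` is the layout intended for `y` and that `vec_xl`/`vec_xst` cover the unaligned accesses:

```c
// Hypothetical completed conversion (not part of this WIP change).
static void ggml_cpu_fp32_to_fp16_nnpa_sketch(const float * x, ggml_fp16_t * y, int64_t n) {
    int64_t i = 0;
    for (; i + 7 < n; i += 8) {
        float32x4_t v_x1 = vec_xl(0, x + i + 0);                // load 4 floats
        float32x4_t v_x2 = vec_xl(0, x + i + 4);                // load the next 4 floats
        uint16x8_t  v_y  = vec_round_from_fp32(v_x1, v_x2, 0);  // 8 x fp32 -> 8 x 16-bit
        vec_xst(v_y, 0, (uint16_t *)(y + i));                   // store straight into y
    }
    for (; i < n; ++i) {
        y[i] = GGML_FP32_TO_FP16(x[i]);                         // scalar tail
    }
}
```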
@@ -3364,6 +3383,14 @@ int ggml_cpu_has_vxe(void) {
#endif
}

int ggml_cpu_has_nnpa(void) {
#if defined(GGML_NNPA)
return 1;
#else
return 0;
#endif
}
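
A minimal sketch of how a caller could query the new flag through the public header; the value only reflects whether the backend was compiled with `GGML_NNPA`, not whether the running machine actually has the facility:

```c
// Minimal usage sketch (assumes the ggml-cpu.h from this change is on the include path).
#include <stdio.h>
#include "ggml-cpu.h"

int main(void) {
    printf("NNPA compiled in: %s\n", ggml_cpu_has_nnpa() ? "yes" : "no");
    return 0;
}
```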

int ggml_cpu_has_neon(void) {
#if defined(__ARM_ARCH) && defined(__ARM_NEON)
return 1;
3 changes: 3 additions & 0 deletions ggml/src/ggml-cpu/ggml-cpu.cpp
@@ -578,6 +578,9 @@ static ggml_backend_feature * ggml_backend_cpu_get_features(ggml_backend_reg_t r
if (ggml_cpu_has_vxe()) {
features.push_back({ "VXE", "1" });
}
if (ggml_cpu_has_nnpa()) {
features.push_back({ "NNPA", "1" });
}
if (ggml_cpu_has_wasm_simd()) {
features.push_back({ "WASM_SIMD", "1" });
}
25 changes: 21 additions & 4 deletions ggml/src/ggml-cpu/simd-mappings.h
@@ -1,6 +1,7 @@
#pragma once

#include "ggml-cpu-impl.h"
#include <signal.h>

//
// simd mappings
@@ -922,7 +923,7 @@ static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
#define GGML_F32_STEP 32
#define GGML_F32_EPR 4

#define GGML_F32x4 __vector float
#define GGML_F32x4 float32x4_t
#define GGML_F32x4_ZERO vec_splats(0.0f)
#define GGML_F32x4_SET1 vec_splats
#define GGML_F32x4_LOAD(p) vec_xl(0, p)
@@ -962,7 +963,12 @@ static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
#define GGML_F16_STEP GGML_F32_STEP
#define GGML_F16_EPR GGML_F32_EPR

static inline __vector float __lzs_f16cx4_load(const ggml_fp16_t * x) {
static inline float32x4_t __lzs_f16cx4_load(const ggml_fp16_t * x) {
#ifdef __NNPA__
uint16x8_t v_x = vec_xl(0, (const ggml_fp16_t *)x);
uint16x8_t nnpa_dlf16 = vec_convert_from_fp16(v_x, 0);
return vec_extend_to_fp32_hi(nnpa_dlf16, 0);
#else
float tmp[4];

for (int i = 0; i < 4; i++) {
@@ -972,18 +978,29 @@ static inline __vector float __lzs_f16cx4_load(const ggml_fp16_t * x) {
// note: keep type-cast here to prevent compiler bugs
// see: https://github.com/ggml-org/llama.cpp/issues/12846
return vec_xl(0, (const float *)(tmp));
#endif
}

static inline void __lzs_f16cx4_store(ggml_fp16_t * x, __vector float y) {
static inline void __lzs_f16cx4_store(ggml_fp16_t * x, float32x4_t v_y) {
#ifdef __NNPA__
float32x4_t v_zero = vec_splats(0.0f);
uint16x8_t v_x = vec_round_from_fp32(v_y, v_zero, 0);
x[0] = vec_extract(v_x, 0);
x[1] = vec_extract(v_x, 1);
x[2] = vec_extract(v_x, 2);
x[3] = vec_extract(v_x, 3);
raise(SIGINT);
#else
float arr[4];

// note: keep type-cast here to prevent compiler bugs
// see: https://github.com/ggml-org/llama.cpp/issues/12846
vec_xst(y, 0, (float *)(arr));
vec_xst(v_y, 0, (float *)(arr));

for (int i = 0; i < 4; i++) {
x[i] = GGML_FP32_TO_FP16(arr[i]);
}
#endif
}
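
To illustrate how the load/store pair is meant to be used together, here is a small hedged sketch (the helper name is hypothetical; it assumes an s390x build where `vec_splats` and `vec_mul` are available from the z/vector extension):

```c
// Hypothetical helper, not part of this change: scale four fp16 values in place,
// going fp16 -> fp32 via __lzs_f16cx4_load and back via __lzs_f16cx4_store.
static inline void scale_f16x4(ggml_fp16_t * v, float s) {
    float32x4_t f = __lzs_f16cx4_load(v);   // NNPA conversion when __NNPA__ is defined
    f = vec_mul(f, vec_splats(s));          // plain vector multiply in fp32
    __lzs_f16cx4_store(v, f);               // convert back and store four halfwords
}
```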

#define GGML_F16_VEC GGML_F32x4
22 changes: 22 additions & 0 deletions ggml/src/ggml-impl.h
@@ -322,6 +322,7 @@ GGML_API void ggml_aligned_free(void * ptr, size_t size);
// 16-bit float
// on Arm, we use __fp16
// on x86, we use uint16_t
// on s390x, we use ZDNN_DLFLOAT16 with NNPA
//
// for old CUDA compilers (<= 11), we use uint16_t: ref https://github.com/ggml-org/llama.cpp/pull/10616
// for MUSA compilers , we use uint16_t: ref https://github.com/ggml-org/llama.cpp/pull/11843
@@ -417,6 +418,27 @@ GGML_API void ggml_aligned_free(void * ptr, size_t size);
#define GGML_FP16_TO_FP32(x) GGML_COMPUTE_FP16_TO_FP32(x)
#define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)

#elif defined(__NNPA__)

#define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
#define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)

#define GGML_FP16_TO_FP32(x) GGML_COMPUTE_FP16_TO_FP32(x)
#define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)

static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
uint16x8_t v_h = vec_splats(h);
uint16x8_t nnpa_dlf16 = vec_convert_from_fp16(v_h, 0);
return vec_extend_to_fp32_hi(nnpa_dlf16, 0)[0];
}

static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
float32x4_t v_f = vec_splats(f);
float32x4_t v_zero = vec_splats(0.0f);
uint16x8_t v_h = vec_round_from_fp32(v_f, v_zero, 0);
return vec_extract(v_h, 0);
}
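
For orientation, a hedged sketch of how the macros above get exercised by generic ggml code on such a build (hypothetical helper, assuming an NNPA-enabled s390x compile):

```c
// Push one value through the NNPA-backed scalar conversions defined above.
// Note: GGML_FP32_TO_FP16 emits the 16-bit pattern produced by vec_round_from_fp32,
// which GGML_FP16_TO_FP32 then feeds back through vec_convert_from_fp16.
static inline float ggml_fp16_roundtrip_sketch(float x) {
    ggml_fp16_t h = GGML_FP32_TO_FP16(x);
    return GGML_FP16_TO_FP32(h);
}
```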

#else

// FP16 <-> FP32