Conversation

@taronaeo (Collaborator) commented Feb 22, 2025

This pull request integrates the SIMD instruction set, via vecintrin.h, into llama.cpp on the s390x platform.
The SIMD code paths currently cover the following ggml_vec_dot functions (a minimal sketch of the pattern follows the table):

| Function | Implementation | Remarks |
| --- | --- | --- |
| ggml_vec_dot_f32 | IMPLEMENTED | Noticed a hotspot on an assembly vector-load call. Will fix in another PR. |
| ggml_vec_dot_f16 | IMPLEMENTED | Noticed a hotspot on an assembly vector-load call. Will fix in another PR. |
| ggml_vec_dot_q4_0_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_q4_1_q8_1 | IMPLEMENTED | |
| ggml_vec_dot_q8_0_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_q4_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_q5_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_q6_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_iq4_nl_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_iq4_xs_q8_K | IMPLEMENTED | |
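
For context, these kernels share one basic shape: load 16-byte vectors with the vecintrin.h built-ins, accumulate with fused multiply-adds, and reduce the lanes at the end. Below is a minimal sketch of that pattern for the F32 dot product; it is illustrative only (not the PR's actual code), and it assumes n is a multiple of 4 and compilation with -mvx -mzvector on s390x:

```c
#include <vecintrin.h>

// Illustrative sketch, not the PR's code: F32 dot product using the
// z/Architecture vector built-ins. Assumes n % 4 == 0.
float vec_dot_f32_sketch(const int n, const float *x, const float *y) {
    __vector float acc = vec_splats(0.0f);      // four running partial sums
    for (int i = 0; i < n; i += 4) {
        __vector float vx = vec_xl(0, x + i);   // 16-byte vector load
        __vector float vy = vec_xl(0, y + i);
        acc = vec_madd(vx, vy, acc);            // acc += vx * vy (fused)
    }
    // horizontal reduction of the four lanes
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```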

Verification

To ensure that this implementation does not break anything, the SIMD code paths have been tested on the following models:

  • Tested IBM Granite 3.0 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
  • Tested IBM Granite 3.1 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
  • Please suggest additional models for testing in this PR

Performance Results

The performance results below use IBM Granite 3.1, as it has a better neural network than 3.0.

Before SIMD Instruction Set

| model | size | parameters | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | pp512 | 16.66 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | pp512 | 16.30 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | pp512 | 23.31 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | pp512 | 26.52 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | pp512 | 29.73 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | pp512 | 23.91 ± 0.05 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | pp512 | 16.73 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | pp512 | 12.62 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | pp512 | 23.88 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | pp512 | 21.59 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | tg128 | 8.20 ± 0.07 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | tg128 | 9.70 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | tg128 | 14.48 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | tg128 | 15.95 ± 0.06 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | tg128 | 19.80 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | tg128 | 14.89 ± 0.06 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | tg128 | 10.94 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | tg128 | 8.53 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | tg128 | 14.38 ± 0.07 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | tg128 | 13.22 ± 0.02 |

After SIMD Instruction Set

| model | size | parameters | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | pp512 | 85.46 ± 0.09 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | pp512 | 35.39 ± 0.13 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | pp512 | 121.46 ± 0.81 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | pp512 | 123.79 ± 0.40 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | pp512 | 137.36 ± 0.52 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | pp512 | 118.88 ± 0.56 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | pp512 | 111.65 ± 0.38 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | pp512 | 101.94 ± 0.59 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | pp512 | 94.28 ± 0.18 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | pp512 | 99.43 ± 0.87 |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | tg128 | 14.27 ± 0.29 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | tg128 | 13.97 ± 0.11 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | tg128 | 69.33 ± 1.41 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | tg128 | 65.97 ± 1.71 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | tg128 | 57.82 ± 0.60 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | tg128 | 72.14 ± 0.70 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | tg128 | 70.34 ± 0.69 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | tg128 | 63.45 ± 0.68 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | tg128 | 60.09 ± 1.33 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | tg128 | 66.48 ± 1.29 |

Note

Tests were conducted on an IBM z15 mainframe with 8 IFLs (cores) and 64 GB of memory, running in an LPAR.

Please review this pull request and consider merging into the main repository. Thank you!

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
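
The ggml_vec_mad_f32 path activated above follows the same load / fused-multiply-add / store shape. A hedged sketch (illustrative names, not the PR's actual code, assuming n % 4 == 0):

```c
#include <vecintrin.h>

// Illustrative sketch of the vec_mad pattern: y[i] += x[i] * v,
// four floats per iteration. Not the PR's actual code.
void vec_mad_f32_sketch(const int n, float *y, const float *x, const float v) {
    const __vector float vv = vec_splats(v);   // broadcast the scalar
    for (int i = 0; i < n; i += 4) {
        __vector float vx = vec_xl(0, x + i);
        __vector float vy = vec_xl(0, y + i);
        vy = vec_madd(vx, vv, vy);             // vy += vx * v
        vec_xst(vy, 0, y + i);                 // store the updated lanes back
    }
}
```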
* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
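
As a rough illustration of how such feature macros typically gate the SIMD paths at compile time (the actual guard structure in ggml-cpu-impl.h may differ):

```c
// Hypothetical sketch of compile-time gating with the __VXE__/__VXE2__
// macros added in this PR; the real guards may be structured differently.
#if defined(__VXE__) || defined(__VXE2__)
#include <vecintrin.h>
#define GGML_SIMD                            // enable generic SIMD code paths
typedef __vector float float32x4_t;          // illustrative vector typedef
#endif
```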
256 is the cache line size for s390x platforms

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
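
For illustration, a 256-byte-aligned allocation can be expressed with C11's aligned_alloc as below. This is a sketch of the idea only, not ggml_aligned_malloc's actual implementation:

```c
#include <stdlib.h>

#define S390X_CACHE_LINE 256  // cache line size on s390x (per this commit)

// Illustrative sketch only; ggml_aligned_malloc is implemented differently.
void * alloc_cache_aligned(size_t size) {
    // aligned_alloc requires size to be a multiple of the alignment,
    // so round it up first.
    size_t padded = (size + S390X_CACHE_LINE - 1) & ~(size_t)(S390X_CACHE_LINE - 1);
    return aligned_alloc(S390X_CACHE_LINE, padded);
}
```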
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Feb 22, 2025
@taronaeo (Collaborator, Author) commented:

I have fixed all the problems and re-tested the implementation to ensure that it works as intended. No issues so far; let me know how I should proceed with this PR.

@ericcurtin (Collaborator) reviewed:

LGTM... Looking forward to llama.cpp on mainframe!

ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
@taronaeo (Collaborator, Author) commented:

It appears that these failing unit tests stem from being unable to download a model from Hugging Face. In run number 2, the server unit test threw the following error, which points directly at the test being unable to download a model:

Run number 2 error details
==================================== ERRORS ====================================
________________ ERROR at setup of test_with_and_without_draft _________________

    @pytest.fixture(scope="module", autouse=True)
    def fixture_create_server():
>       return create_server()

unit/test_speculative.py:21: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
unit/test_speculative.py:14: in create_server
    server.model_draft = download_file(MODEL_DRAFT_FILE_URL)
utils.py:410: in download_file
    wget.download(url, out=output_file)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/wget.py:526: in download
    (tmpfile, headers) = ulib.urlretrieve(binurl, tmpfile, callback)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:241: in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:216: in urlopen
    return opener.open(url, data, timeout)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:525: in open
    response = meth(req, response)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:634: in http_response
    response = self.parent.error(
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:563: in error
    return self._call_chain(*args)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:496: in _call_chain
    result = func(*args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <urllib.request.HTTPDefaultErrorHandler object at 0x7f861c9f0f50>
req = <urllib.request.Request object at 0x7f861ca31ed0>
fp = <http.client.HTTPResponse object at 0x7f861ca2ab90>, code = 504
msg = 'Gateway Time-out'
hdrs = <http.client.HTTPMessage object at 0x7f861ca32c10>

    def http_error_default(self, req, fp, code, msg, hdrs):
>       raise HTTPError(req.full_url, code, msg, hdrs, fp)
E       urllib.error.HTTPError: HTTP Error 504: Gateway Time-out

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:643: HTTPError
---------------------------- Captured stdout setup -----------------------------
Downloading https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf to ./tmp/stories15M-q4_0.gguf
=========================== short test summary info ============================
ERROR unit/test_speculative.py::test_with_and_without_draft - urllib.error.HTTPError: HTTP Error 504: Gateway Time-out
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
====== 122 passed, 3 skipped, 108 deselected, 1 error in 94.14s (0:01:34) ======
Error: Process completed with exit code 1.

In run number 3, the server-windows unit test threw the following error, which appears to be the same problem: the test is unable to download a model.

Run number 3 error details
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
100  3035  100  3035    0     0    300      0  0:00:10  0:00:10 --:--:--   744

0.10.309.401 E common_download_file: invalid http status code received: 504

0.10.314.217 E common_iniWaiting for server to start...
-------------------------- Captured stdout teardown ---------------------------
Stopping server with pid=6332
=========================== short test summary info ===========================
FAILED unit/test_basic.py::test_server_start_simple - TimeoutError: Server did not start within 12 seconds
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!
===================== 1 failed, 108 deselected in 12.99s ======================
Error: Process completed with exit code 1.

In both cases the Hugging Face server returned a 504 status code. I believe this is unrelated to my code, unless I am missing something here.

Let me know how this PR can proceed given these sporadic unit-test failures.

@ggerganov (Member) commented:

> It appears that these failing unit tests point towards not being able to download a model from HuggingFace.

Yes, these runs fail from time to time for some reason - not related to this PR.

@ericcurtin ericcurtin merged commit af7747c into ggml-org:master Feb 22, 2025
47 checks passed
orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025
* ggml: add s390x ARCH_FLAGS for compilation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add SIMD for s390x using vector intrinsics

SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix missing escape character in GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix s390x GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: full SIMD activation for F32,F16 s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add option to disable s390x VXE/VXE2

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: change vecintrin.h include to ggml-cpu-impl

* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* cmake: add s390x target detection for VX/VXE/VXE2

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: move s390x vector intrinsics to ggml-cpu-impl.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x Q8_0 SIMD

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: correct documentation for Q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x reduce code complexity Q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x bugfix typo Q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activated for Q4_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x inline vec_reve

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for Q4_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add VXE backend feature

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: remove test.py

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for quantize_row_q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for quantize_row_q8_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for iq4_xs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: bugfix iq4_xs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for iq4_nl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add float, double, and long vector data type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: clean up iq4_xs SIMD

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix improper use of restrict keyword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: update warning message for ggml_vec_tbl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: switch to restrict for iq4_nl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: slight dot product speed improvement for q4_1_q8_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for q6_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add missing `_t` to ggml_int8x16x4_t

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix missing `_t` for ggml_vec_xl_s8x4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix more missing `_t`

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add unroll and prefetch to Q8_0

increase of 3.86% for prompt processing and 32.22% for token generation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: patch Q8_0 to use proper vector sizes

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: optimise Q8_0 dot prod compute kernel further

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add unroll and prefetch to Q4_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: refactor Q6_K variable naming for readability

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q6_K typos

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for Q5_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix wrong char*x16_t naming

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: Q5_K y0 wrong signness

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for Q4_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q4_K invalid vector intrinsics

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: simplify ggml_padd_s16 compute kernel

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: correct ggml-cpu vxe wording

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: change ggml_aligned_malloc alignment to 256

256 is the cache line size for s390x platforms

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: resolve pr merge via cherry-pick 225bbbf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml : fix LoongArch compile error with 128-bit SIMD (ggml-org#11701)

* ggml: resolve pr merge via cherry-pick 4571953

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Jinyang He <hejinyang@loongson.cn>
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025