Conversation

@taronaeo (Collaborator) commented Feb 22, 2025

This pull request integrates the SIMD instruction set, via vecintrin.h, into llama.cpp on the s390x platform.
The SIMD code paths currently cover the following ggml_vec_dot functions (a minimal sketch of the pattern follows the table):

| Function | Implementation | Remarks |
| --- | --- | --- |
| ggml_vec_dot_f32 | IMPLEMENTED | Noticed a hotspot on an assembly vector-load call. Will fix in another PR. |
| ggml_vec_dot_f16 | IMPLEMENTED | Noticed a hotspot on an assembly vector-load call. Will fix in another PR. |
| ggml_vec_dot_q4_0_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_q4_1_q8_1 | IMPLEMENTED | |
| ggml_vec_dot_q8_0_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_q4_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_q5_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_q6_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_iq4_nl_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_iq4_xs_q8_K | IMPLEMENTED | |
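
For context, these kernels share one basic shape: load 16-byte vectors with the vecintrin.h built-ins, accumulate with fused multiply-adds, and reduce the lanes at the end. Below is a minimal sketch of that pattern for the F32 dot product; it is illustrative only (not the PR's actual code), and it assumes n is a multiple of 4 and compilation with -mvx -mzvector on s390x:

```c
#include <vecintrin.h>

// Illustrative sketch, not the PR's code: F32 dot product using the
// z/Architecture vector built-ins. Assumes n % 4 == 0.
float vec_dot_f32_sketch(const int n, const float *x, const float *y) {
    __vector float acc = vec_splats(0.0f);      // four running partial sums
    for (int i = 0; i < n; i += 4) {
        __vector float vx = vec_xl(0, x + i);   // 16-byte vector load
        __vector float vy = vec_xl(0, y + i);
        acc = vec_madd(vx, vy, acc);            // acc += vx * vy (fused)
    }
    // horizontal reduction of the four lanes
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```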

Verification

To ensure that this implementation does not break anything, the SIMD code paths have been tested on the following models:

  • Tested IBM Granite 3.0 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
  • Tested IBM Granite 3.1 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
  • Please suggest additional models for testing in this PR

Performance Results

The performance results below use IBM Granite 3.1, as it has a better neural network than 3.0.

Before SIMD Instruction Set

| model | size | parameters | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | pp512 | 16.66 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | pp512 | 16.30 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | pp512 | 23.31 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | pp512 | 26.52 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | pp512 | 29.73 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | pp512 | 23.91 ± 0.05 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | pp512 | 16.73 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | pp512 | 12.62 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | pp512 | 23.88 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | pp512 | 21.59 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | tg128 | 8.20 ± 0.07 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | tg128 | 9.70 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | tg128 | 14.48 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | tg128 | 15.95 ± 0.06 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | tg128 | 19.80 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | tg128 | 14.89 ± 0.06 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | tg128 | 10.94 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | tg128 | 8.53 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | tg128 | 14.38 ± 0.07 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | tg128 | 13.22 ± 0.02 |

After SIMD Instruction Set

| model | size | parameters | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | pp512 | 85.46 ± 0.09 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | pp512 | 35.39 ± 0.13 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | pp512 | 121.46 ± 0.81 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | pp512 | 123.79 ± 0.40 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | pp512 | 137.36 ± 0.52 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | pp512 | 118.88 ± 0.56 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | pp512 | 111.65 ± 0.38 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | pp512 | 101.94 ± 0.59 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | pp512 | 94.28 ± 0.18 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | pp512 | 99.43 ± 0.87 |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | tg128 | 14.27 ± 0.29 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | tg128 | 13.97 ± 0.11 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | tg128 | 69.33 ± 1.41 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | tg128 | 65.97 ± 1.71 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | tg128 | 57.82 ± 0.60 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | tg128 | 72.14 ± 0.70 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | tg128 | 70.34 ± 0.69 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | tg128 | 63.45 ± 0.68 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | tg128 | 60.09 ± 1.33 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | tg128 | 66.48 ± 1.29 |

Note

Tests were conducted on an IBM z15 mainframe with 8 IFLs (cores) and 64 GB of memory, running in an LPAR.

Please review this pull request and consider merging into the main repository. Thank you!

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
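
The ggml_vec_mad_f32 path activated above follows the same load / fused-multiply-add / store shape. A hedged sketch (illustrative names, not the PR's actual code, assuming n % 4 == 0):

```c
#include <vecintrin.h>

// Illustrative sketch of the vec_mad pattern: y[i] += x[i] * v,
// four floats per iteration. Not the PR's actual code.
void vec_mad_f32_sketch(const int n, float *y, const float *x, const float v) {
    const __vector float vv = vec_splats(v);   // broadcast the scalar
    for (int i = 0; i < n; i += 4) {
        __vector float vx = vec_xl(0, x + i);
        __vector float vy = vec_xl(0, y + i);
        vy = vec_madd(vx, vv, vy);             // vy += vx * v
        vec_xst(vy, 0, y + i);                 // store the updated lanes back
    }
}
```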
* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
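
As a rough illustration of how such feature macros typically gate the SIMD paths at compile time (the actual guard structure in ggml-cpu-impl.h may differ):

```c
// Hypothetical sketch of compile-time gating with the __VXE__/__VXE2__
// macros added in this PR; the real guards may be structured differently.
#if defined(__VXE__) || defined(__VXE2__)
#include <vecintrin.h>
#define GGML_SIMD                            // enable generic SIMD code paths
typedef __vector float float32x4_t;          // illustrative vector typedef
#endif
```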
256 is the cache line size for s390x platforms

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
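
For illustration, a 256-byte-aligned allocation can be expressed with C11's aligned_alloc as below. This is a sketch of the idea only, not ggml_aligned_malloc's actual implementation:

```c
#include <stdlib.h>

#define S390X_CACHE_LINE 256  // cache line size on s390x (per this commit)

// Illustrative sketch only; ggml_aligned_malloc is implemented differently.
void * alloc_cache_aligned(size_t size) {
    // aligned_alloc requires size to be a multiple of the alignment,
    // so round it up first.
    size_t padded = (size + S390X_CACHE_LINE - 1) & ~(size_t)(S390X_CACHE_LINE - 1);
    return aligned_alloc(S390X_CACHE_LINE, padded);
}
```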
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Feb 22, 2025
@taronaeo (Collaborator, Author) commented:

I have fixed all the problems and re-tested the implementation to ensure that it works as intended. No issues so far; let me know how I should proceed with this PR.

@ericcurtin (Collaborator) reviewed:

LGTM... Looking forward to llama.cpp on mainframe!

ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
@taronaeo (Collaborator, Author) commented:

It appears that these failing unit tests stem from being unable to download a model from Hugging Face. In run number 2, the server unit test threw the following error, which points directly at the test being unable to download a model:

Run number 2 error details
==================================== ERRORS ====================================
________________ ERROR at setup of test_with_and_without_draft _________________

    @pytest.fixture(scope="module", autouse=True)
    def fixture_create_server():
>       return create_server()

unit/test_speculative.py:21: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
unit/test_speculative.py:14: in create_server
    server.model_draft = download_file(MODEL_DRAFT_FILE_URL)
utils.py:410: in download_file
    wget.download(url, out=output_file)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/wget.py:526: in download
    (tmpfile, headers) = ulib.urlretrieve(binurl, tmpfile, callback)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:241: in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:216: in urlopen
    return opener.open(url, data, timeout)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:525: in open
    response = meth(req, response)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:634: in http_response
    response = self.parent.error(
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:563: in error
    return self._call_chain(*args)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:496: in _call_chain
    result = func(*args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <urllib.request.HTTPDefaultErrorHandler object at 0x7f861c9f0f50>
req = <urllib.request.Request object at 0x7f861ca31ed0>
fp = <http.client.HTTPResponse object at 0x7f861ca2ab90>, code = 504
msg = 'Gateway Time-out'
hdrs = <http.client.HTTPMessage object at 0x7f861ca32c10>

    def http_error_default(self, req, fp, code, msg, hdrs):
>       raise HTTPError(req.full_url, code, msg, hdrs, fp)
E       urllib.error.HTTPError: HTTP Error 504: Gateway Time-out

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:643: HTTPError
---------------------------- Captured stdout setup -----------------------------
Downloading https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf to ./tmp/stories15M-q4_0.gguf
=========================== short test summary info ============================
ERROR unit/test_speculative.py::test_with_and_without_draft - urllib.error.HTTPError: HTTP Error 504: Gateway Time-out
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
====== 122 passed, 3 skipped, 108 deselected, 1 error in 94.14s (0:01:34) ======
Error: Process completed with exit code 1.

In run number 3, the server-windows unit test threw the following error, which appears to be the same problem: the test is unable to download a model.

Run number 3 error details
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
100  3035  100  3035    0     0    300      0  0:00:10  0:00:10 --:--:--   744

0.10.309.401 E common_download_file: invalid http status code received: 504

0.10.314.217 E common_iniWaiting for server to start...
-------------------------- Captured stdout teardown ---------------------------
Stopping server with pid=6332
=========================== short test summary info ===========================
FAILED unit/test_basic.py::test_server_start_simple - TimeoutError: Server did not start within 12 seconds
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!
===================== 1 failed, 108 deselected in 12.99s ======================
Error: Process completed with exit code 1.

In both cases the Hugging Face server returned a 504 status code. I believe this is unrelated to my code, unless I am missing something here.

Let me know how this PR can proceed given these sporadic unit-test failures.

@ggerganov (Member) commented:

> It appears that these failing unit tests point towards not being able to download a model from HuggingFace.

Yes, these runs fail from time to time for some reason - not related to this PR.

@ericcurtin ericcurtin merged commit af7747c into ggml-org:master Feb 22, 2025
47 checks passed
orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025
* ggml: add s390x ARCH_FLAGS for compilation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add SIMD for s390x using vector intrinsics

SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix missing escape character in GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix s390x GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: full SIMD activation for F32,F16 s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add option to disable s390x VXE/VXE2

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: change vecintrin.h include to ggml-cpu-impl

* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* cmake: add s390x target detection for VX/VXE/VXE2

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: move s390x vector intrinsics to ggml-cpu-impl.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x Q8_0 SIMD

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: correct documentation for Q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x reduce code complexity Q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x bugfix typo Q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activated for Q4_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x inline vec_reve

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for Q4_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add VXE backend feature

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: remove test.py

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for quantize_row_q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for quantize_row_q8_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for iq4_xs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: bugfix iq4_xs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for iq4_nl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add float, double, and long vector data type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: clean up iq4_xs SIMD

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix improper use of restrict keyword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: update warning message for ggml_vec_tbl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: switch to restrict for iq4_nl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: slight dot product speed improvement for q4_1_q8_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for q6_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add missing `_t` to ggml_int8x16x4_t

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix missing `_t` for ggml_vec_xl_s8x4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix more missing `_t`

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add unroll and prefetch to Q8_0

increase of 3.86% for prompt processing and 32.22% for token generation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: patch Q8_0 to use proper vector sizes

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: optimise Q8_0 dot prod compute kernel further

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add unroll and prefetch to Q4_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: refactor Q6_K variable naming for readability

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q6_K typos

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for Q5_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix wrong char*x16_t naming

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: Q5_K y0 wrong signness

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for Q4_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q4_K invalid vector intrinsics

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: simplify ggml_padd_s16 compute kernel

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: correct ggml-cpu vxe wording

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: change ggml_aligned_malloc alignment to 256

256 is the cache line size for s390x platforms

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: resolve pr merge via cherry-pick 225bbbf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml : fix LoongArch compile error with 128-bit SIMD (ggml-org#11701)

* ggml: resolve pr merge via cherry-pick 4571953

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Jinyang He <hejinyang@loongson.cn>
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025