feature: add blis and other BLAS implementation support #1502

Merged (4 commits) on May 20, 2023

@zenixls2 zenixls2 commented May 17, 2023

Library: https://github.com/flame/blis (master)

Benchmark comparison against OpenBLAS (Ubuntu package from focal-updates, version 0.3.8+ds-1ubuntu0.20.04.1, amd64):

Hardware setup:

  • 12th Gen Intel(R) Core(TM) i7-12700 (12 cores, 20 threads)
  • 2 × 16 GiB DIMM, synchronous, 4800 MHz

OpenBLAS (OpenMP build, OMP_NUM_THREADS=14)

Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;       1; 11008;  4096;   128;    11542724608;             81141;    142.26
        1;       1; 11008;  4096;   128;    11542724608;             56359;    204.81
        2;       1; 11008;  4096;   128;    11542724608;             55481;    208.05
        3;       1; 11008;  4096;   128;    11542724608;             56389;    204.70
        4;       1; 11008;  4096;   128;    11542724608;             55802;    206.85
        5;       1; 11008;  4096;   128;    11542724608;             56172;    205.49
        6;       1; 11008;  4096;   128;    11542724608;             56069;    205.87
        7;       1; 11008;  4096;   128;    11542724608;             56091;    205.79
        8;       1; 11008;  4096;   128;    11542724608;             56036;    205.99
        9;       1; 11008;  4096;   128;    11542724608;             56178;    205.47
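
The columns in the table above are related by simple arithmetic, which makes the figures easy to sanity-check. The sketch below assumes that Required_FLOPS counts one multiply and one add per inner-product term (2 · SizeX · SizeY · SizeZ) and that gigaFLOPS is Required_FLOPS divided by the elapsed time. Both assumptions reproduce the printed numbers, but the helper names are hypothetical and are not taken from the benchmark program.

```python
# Sanity check of the benchmark columns (helper names are hypothetical).


def required_flops(size_x: int, size_y: int, size_z: int) -> int:
    """FLOPs for the matmul, assuming 2 operations per multiply-accumulate."""
    return 2 * size_x * size_y * size_z


def gigaflops(flops: int, elapsed_us: int) -> float:
    """Throughput in GFLOPS given the elapsed time in microseconds."""
    seconds = elapsed_us / 1_000_000
    return flops / seconds / 1e9


flops = required_flops(11008, 4096, 128)
first_iteration = gigaflops(flops, 81141)   # iteration 0 of the OpenBLAS run
second_iteration = gigaflops(flops, 56359)  # iteration 1 of the OpenBLAS run

print(flops)
print(f"{first_iteration:.2f} {second_iteration:.2f}")
```

The same two functions apply unchanged to the BLIS and MKL rows, since all three runs use the same matrix shape.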

BLIS (OpenMP, BLIS_NUM_THREADS=14)

cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;       1; 11008;  4096;   128;    11542724608;             67620;    170.70
        1;       1; 11008;  4096;   128;    11542724608;             37163;    310.60
        2;       1; 11008;  4096;   128;    11542724608;             34049;    339.00
        3;       1; 11008;  4096;   128;    11542724608;             37200;    310.29
        4;       1; 11008;  4096;   128;    11542724608;             37662;    306.48
        5;       1; 11008;  4096;   128;    11542724608;             37220;    310.12
        6;       1; 11008;  4096;   128;    11542724608;             37129;    310.88
        7;       1; 11008;  4096;   128;    11542724608;             37251;    309.86
        8;       1; 11008;  4096;   128;    11542724608;             37341;    309.12
        9;       1; 11008;  4096;   128;    11542724608;             37738;    305.86

Average                                                                        298.29
=====================================================================================

Intel MKL BLAS (Intel10_64lp, OMP_NUM_THREADS=14)


Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;       1; 11008;  4096;   128;    11542724608;             64675;    178.47
        1;       1; 11008;  4096;   128;    11542724608;             38275;    301.57
        2;       1; 11008;  4096;   128;    11542724608;             38270;    301.61
        3;       1; 11008;  4096;   128;    11542724608;             38418;    300.45
        4;       1; 11008;  4096;   128;    11542724608;             38323;    301.20
        5;       1; 11008;  4096;   128;    11542724608;             38446;    300.23
        6;       1; 11008;  4096;   128;    11542724608;             38306;    301.33
        7;       1; 11008;  4096;   128;    11542724608;             38396;    300.62
        8;       1; 11008;  4096;   128;    11542724608;             38417;    300.46
        9;       1; 11008;  4096;   128;    11542724608;             38363;    300.88

Average                                                                        288.68
=====================================================================================
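
The OpenBLAS table has no average row, so the three runs are easier to compare once the same statistic is computed for each. The sketch below copies the gigaFLOPS columns from the tables above and prints the mean over all ten iterations together with the ratio to the OpenBLAS run. The variable names are illustrative, and a single run per library on one machine is only a rough indication, not a rigorous benchmark.

```python
from statistics import mean

# gigaFLOPS columns copied from the three tables above.
results = {
    "openblas": [142.26, 204.81, 208.05, 204.70, 206.85,
                 205.49, 205.87, 205.79, 205.99, 205.47],
    "blis": [170.70, 310.60, 339.00, 310.29, 306.48,
             310.12, 310.88, 309.86, 309.12, 305.86],
    "mkl": [178.47, 301.57, 301.61, 300.45, 301.20,
            300.23, 301.33, 300.62, 300.46, 300.88],
}

averages = {name: mean(values) for name, values in results.items()}
baseline = averages["openblas"]

for name, avg in averages.items():
    # A ratio above 1 means faster than the OpenBLAS run on this matmul shape.
    print(f"{name:9s} avg={avg:7.2f} GFLOPS  vs openblas={avg / baseline:.2f}x")
```

The computed BLIS and MKL means agree with the Average rows printed by the benchmark. In every run iteration 0 is markedly slower than the rest, so the mean over all ten iterations understates the steady-state throughput somewhat.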

I'm also looking forward to someone making use of BLIS's typed API for further optimization.

@zenixls2

GGML_USE_BLIS is not actually used yet.
BLIS-specific optimizations could be applied based on it later.

@zenixls2 zenixls2 force-pushed the feature/blis branch 2 times, most recently from a1eba8e to 4cb20e5 Compare May 19, 2023 06:17
@zenixls2

I've noticed that ggerganov/whisper.cpp#927 has an implementation that makes other BLAS vendors selectable.
However, the cmake_minimum_required version might have to be raised, because that approach relies on:

  • BLA_PKGCONFIG_BLAS (new in CMake 3.25)
  • BLA_SIZEOF_INTEGER (new in CMake 3.22)

I will change the title and rework the changes to align with the whisper.cpp PR.
cc @ggerganov

@zenixls2 zenixls2 changed the title feature: add blis support feature: add blis and other BLAS implementation support May 19, 2023
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@zenixls2 zenixls2 requested a review from ggerganov May 20, 2023 08:36
@ggerganov ggerganov merged commit 07e9ace into ggerganov:master May 20, 2023
ggerganov added a commit that referenced this pull request May 20, 2023
@ggerganov

Sorry, I forgot to run the CI. We need to fix the builds before merging.
Please open a new PR.

ggerganov added a commit to JohannesGaessler/llama.cpp that referenced this pull request May 20, 2023
)

* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ggerganov added a commit to JohannesGaessler/llama.cpp that referenced this pull request May 20, 2023
ggerganov added a commit that referenced this pull request May 20, 2023
…oadcasting for ggml_mul (#1483)

* Broadcasting for ggml_mul

* CUDA kernel for ggml_mul, norms in VRAM

* GPU weights not in RAM, direct loading with cuFile

* fixup! GPU weights not in RAM, direct loading with cuFile

* fixup! GPU weights not in RAM, direct loading with cuFile

* define default model path once, sync path with readme (#1366)

* ~7% faster Q5_1 AVX2 code (#1477)

* convert.py: Support models which are stored in a single pytorch_model.bin (#1469)

* Support models in a single pytorch_model.bin

* Remove spurious line with typo

* benchmark-matmul: Print the average of the test results (#1490)

* Remove unused n_parts parameter (#1509)

* Fixes #1511 lambda issue for w64devkit (mingw) (#1513)

* Fix for w64devkit and mingw

* make kv_f16 the default for api users (#1517)

* minor : fix compile warnings

* readme : adds WizardLM to the list of supported models (#1485)

* main : make reverse prompt option act as a stop token in non-interactive mode (#1032)

* Make reverse prompt option act as a stop token in non-interactive scenarios

* Making requested review changes

* Update gpt_params_parse and fix a merge error

* Revert "Update gpt_params_parse and fix a merge error"

This reverts commit 2bb2ff1.

* Update gpt_params_parse and fix a merge error take 2

* examples : add persistent chat (#1495)

* examples : add persistent chat

* examples : fix whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* tests : add missing header

* ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508)

* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0

* llama : bump LLAMA_FILE_VERSION to 3

* cuda : update Q4 and Q8 dequantize kernels

* ggml : fix AVX dot products

* readme : update performance table + hot topics

* ggml : fix scalar implementation of Q4_1 dot

* llama : fix compile warnings in llama_set_state_data()

* llama : fix name shadowing and C4146 (#1526)

* Fix name shadowing and C4146

* Fix if macros not using defined when required

* Update llama-util.h

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update llama-util.h

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Code style

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix for mingw (#1462)

* llama : add llama_init_backend() API (close #1527)

* feature : add blis and other BLAS implementation support (#1502)

* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Revert "feature : add blis and other BLAS implementation support (#1502)"

This reverts commit 07e9ace.

* GPU weights not in RAM, direct loading with cuFile

* llama : code style fixes + progress print fix

* ggml : ggml_mul better broadcast support

* cmake : workarounds for cufile when CMake version < 3.25

* gg rebase fixup

* Loop in llama.cpp, fixed progress callback

* Attempt clang-tidy fix

* llama : fix vram size computation

* Add forgotten fclose()

---------

Co-authored-by: András Salamon <ott2@users.noreply.github.com>
Co-authored-by: Ilya Kurdyukov <59548320+ilyakurdyukov@users.noreply.github.com>
Co-authored-by: Tom Jobbins <784313+TheBloke@users.noreply.github.com>
Co-authored-by: rankaiyx <rankaiyx@rankaiyx.com>
Co-authored-by: Stephan Walter <stephan@walter.name>
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
Co-authored-by: Erik Scholz <Green-Sky@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: David Kennedy <dakennedyd@gmail.com>
Co-authored-by: Jason McCartney <jmac@theroot.org>
Co-authored-by: Evan Jones <evan.q.jones@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zenix <zenixls2@gmail.com>