
[BesTLA] Refactor quantization-related kernels #209

Merged 111 commits into main on May 7, 2024

Conversation

@luoyu-intel (Contributor) commented Apr 9, 2024

Type of Change

This PR speeds up next-token inference on all platforms. It adds support for asymmetric weights with comp_int8 (previously, asymmetric quantization fell back to comp_fp32), and it supports low-end devices that lack AVX_VNNI instructions. 2-bit + asymmetric weights are also supported in this PR.
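Next-token decoding multiplies a single activation row against the weight matrix (M = 1), which is exactly the shape the new GEMV kernels target. A minimal sketch of the auto-dispatch idea, using hypothetical function names rather than BesTLA's actual API:

```cpp
#include <cstddef>

// Hypothetical kernel entry points, standing in for BesTLA's real kernels.
void gemv_kernel(const float* A, const float* B, float* C,
                 std::size_t N, std::size_t K);
void gemm_kernel(const float* A, const float* B, float* C,
                 std::size_t M, std::size_t N, std::size_t K);

// Route by shape: M == 1 (next-token decode) goes to the GEMV kernel,
// larger M (prompt processing) goes to the tiled GEMM kernel.
void launch_matmul(const float* A, const float* B, float* C,
                   std::size_t M, std::size_t N, std::size_t K) {
  if (M == 1) {
    gemv_kernel(A, B, C, N, K);
  } else {
    gemm_kernel(A, B, C, M, N, K);
  }
}
```

The full change list: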

  • add AVX512 sgemv and igemv kernels
  • add s8s8 kernels for AVX2 devices
  • add AVX2 sgemv and igemv kernels
  • auto-dispatch the problem to a GEMM or GEMV kernel (sketched above)
  • fusion support for the *gemv kernels
  • add 2-bit kernels (see the unpacking sketch after this list)
  • store the low bits instead of the high bits (to enable the u8s8 dot product without AVX_VNNI)
  • support asymmetric weights for u8s8 and s8s8
  • refactor int3 quantization with sequential compression
  • add an AVX2 version of AVX_VNNI, for both GEMM and GEMV
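As an illustration of the 2-bit and low-bit items above, here is a hedged scalar sketch of unpacking 2-bit weights stored four per byte, first element in the low bits; the layout, names, and zero-point handling are assumptions for illustration, not BesTLA's actual storage format:

```cpp
#include <cstddef>
#include <cstdint>

// Decode n 2-bit quantized weights packed four per byte, lowest bits first,
// subtracting a zero point (asymmetric quantization). Illustrative only.
void unpack_int2(const uint8_t* packed, int8_t* out, std::size_t n,
                 int8_t zero_point) {
  for (std::size_t i = 0; i < n; ++i) {
    const uint8_t byte = packed[i / 4];
    const unsigned shift = static_cast<unsigned>(i % 4) * 2;  // low bits first
    const int8_t q = static_cast<int8_t>((byte >> shift) & 0x3);
    out[i] = static_cast<int8_t>(q - zero_point);
  }
}
```

Keeping each element in the low bits makes the unpack a plain shift-and-mask per element, with no cross-byte carries.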

These tasks are planned for the next PR:

  • support compute_dtype=f16, for GNR
  • Int5, Int6, Int7

To get the best performance, use:

  • GCC >= 11
  • MSVC >= 1930 (VS2022)
  • DPCPP >= 2024.0

Highlights

  1. alg=asym: >40% faster for next-token generation.
  2. AVX2 devices: >40% faster for next-token generation.
  3. AVX2 devices without AVX_VNNI: >80% faster for next-token generation.

NBits Integer Support Matrix

| Weight Dtype | Alg      | Compute Dtype      | ISA                                   |
|--------------|----------|--------------------|---------------------------------------|
| Int4         | sym+asym | FP32               | AVX2, AVX512F                         |
| Int3         | sym+asym | FP32               | AVX2, AVX512F                         |
| Int2         | sym+asym | FP32               | AVX2, AVX512F                         |
| Int4         | sym+asym | BF16               | AVX512_BF16, AMX_BF16                 |
| Int3         | sym+asym | BF16               | AVX512_BF16, AMX_BF16                 |
| Int2         | sym+asym | BF16               | AVX512_BF16, AMX_BF16                 |
| Int4         | sym+asym | FP16               | AVX512_FP16, AMX_FP16                 |
| Int3         | sym+asym | FP16               | AVX512_FP16, AMX_FP16                 |
| Int2         | sym+asym | FP16               | AVX512_FP16, AMX_FP16                 |
| Int4         | sym+asym | Int8 (u8s8 & s8s8) | AVX2, AVX_VNNI, AVX512_VNNI, AMX_INT8 |
| Int3         | sym+asym | Int8 (u8s8 & s8s8) | AVX2, AVX_VNNI, AVX512_VNNI, AMX_INT8 |
| Int2         | sym+asym | Int8 (u8s8 & s8s8) | AVX2, AVX_VNNI, AVX512_VNNI, AMX_INT8 |
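The Int8 rows include plain AVX2, i.e. devices without any VNNI extension. On those targets a u8s8 dot product can be emulated with the classic AVX2 pair `_mm256_maddubs_epi16` + `_mm256_madd_epi16`; a hedged sketch of that building block (not BesTLA's actual kernel code):

```cpp
#include <immintrin.h>

// u8s8 dot-product step on plain AVX2 (no AVX_VNNI).
// _mm256_maddubs_epi16 multiplies unsigned a_u8 by signed b_s8 and adds
// adjacent byte pairs into int16 with saturation, so real kernels must keep
// operand magnitudes small enough that saturation cannot occur.
// _mm256_madd_epi16 against all-ones then widens the pair sums to int32.
static inline __m256i dot_u8s8_avx2(__m256i acc, __m256i a_u8, __m256i b_s8) {
  const __m256i pair16 = _mm256_maddubs_epi16(a_u8, b_s8);
  const __m256i quad32 = _mm256_madd_epi16(pair16, _mm256_set1_epi16(1));
  return _mm256_add_epi32(acc, quad32);
}
```

With AVX_VNNI available, the same step collapses to a single `_mm256_dpbusd_epi32(acc, a_u8, b_s8)`.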

NBits Float Support Matrix

| Weight Dtype | Compute Dtype | ISA                   |
|--------------|---------------|-----------------------|
| NF4          | FP32          | AVX2, AVX512F         |
| FP4          | FP32          | AVX2, AVX512F         |
| FP8          | FP32          | AVX2, AVX512F         |
| NF4          | BF16          | AVX512_BF16, AMX_BF16 |
| FP4          | BF16          | AVX512_BF16, AMX_BF16 |
| FP8          | BF16          | AVX512_BF16, AMX_BF16 |
| NF4          | FP16          | AVX512_FP16, AMX_FP16 |
| FP4          | FP16          | AVX512_FP16, AMX_FP16 |
| FP8          | FP16          | AVX512_FP16, AMX_FP16 |
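NF4 and FP4 are 4-bit codebook formats, so dequantization amounts to a 16-entry table lookup per nibble multiplied by the per-group scale. A minimal scalar sketch; the table contents, nibble order, and grouping are format-specific assumptions, not BesTLA's actual layout:

```cpp
#include <cstddef>
#include <cstdint>

// Dequantize n 4-bit codebook weights (NF4/FP4 style): each nibble indexes a
// 16-entry lookup table, scaled by the group's scale. Illustrative only.
void dequant_4bit_lut(const uint8_t* packed, const float* lut /* 16 entries */,
                      float scale, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    const uint8_t byte = packed[i / 2];
    const uint8_t nib = (i % 2 == 0) ? static_cast<uint8_t>(byte & 0xF)  // low nibble first
                                     : static_cast<uint8_t>(byte >> 4);
    out[i] = lut[nib] * scale;
  }
}
```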

@luoyu-intel (Contributor, Author) commented Apr 17, 2024

llama2-7b int4 on 12900K: this PR is ~40% faster for prompt=32, and ~20% faster for prompt=1024
llama2-7b int4 on MTL-155H: this PR is ~50% faster for prompt=32, and ~42% faster for prompt=1024

@luoyu-intel (Contributor, Author) commented:

For group_size=32, weight_dtype=int4, scale_dtype=bf16, compute_dtype=int8, the BesTLA GEMV kernels have the same performance as GGML's q4_0 kernels.

luoyu-intel force-pushed the vector_dot branch 2 times, most recently from 270e6d5 to dbccb9c on April 17, 2024 08:19
@luoyu-intel (Contributor, Author) commented Apr 17, 2024

Int3 and int4 weights are now ready on AVX2 devices. Supported quantization parameters:
int3:

.\quant_llama.exe --model_file llama2-f16.bin --out_file llama2-q3j-g128-int8_bf16.bin --nthread 16 --group_size 128 --compute_dtype int8 --scale_dtype bf16 --weight_dtype int3

fastest int4:

.\quant_llama.exe --model_file path\llama2-f16.bin --out_file llama2-q4j-g-1-int8_bf16.bin --nthread 16 --group_size -1 --compute_dtype int8 --scale_dtype bf16

same as llama.cpp q4_0:

.\quant_llama.exe --model_file path\llama2-f16.bin --out_file llama2-q4j-g32-int8_bf16.bin --nthread 16 --group_size 32 --compute_dtype int8 --scale_dtype bf16

Recommended number of runtime threads for hybrid CPUs: P+E, or P*2+E (P = performance cores, E = efficiency cores).
E.g. 12900K: 16 or 24; 155H: 14 or 20.

luoyu-intel changed the title from "[BesTLA] Add matrix-vector kernels" to "[BesTLA] Refactor quantization-related kernels" on Apr 24, 2024

@ThanatosShinji (Contributor) commented Apr 30, 2024

CPU: Intel(R) Core(TM) Ultra 7 155H; model: llama2-7b

weight_dtype=int2, alg=asym, group_size=128, scale_dtype=bf16, compute_dtype=int8, 27.00 tokens/s:

 Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun too!
The only problem was that she had a baby brother who really needed the attention of her mamma and papa all the time every day of course she
model_print_timings:        load time =   274.56 ms
model_print_timings:      sample time =     6.78 ms /    32 runs   (    0.21 ms per token)
model_print_timings: prompt eval time =   273.43 ms /    32 tokens (    8.54 ms per token)
model_print_timings:        eval time =  1147.80 ms /    31 runs   (   37.03 ms per token)
model_print_timings:       total time =  1450.85 ms
========== eval time log of each prediction ==========
prediction   0, time: 273.43ms
prediction   1, time: 36.41ms
prediction   2, time: 36.27ms
prediction   3, time: 36.59ms
prediction   4, time: 36.30ms

weight_dtype=int4, alg=sym, group_size=128, scale_dtype=bf16, compute_dtype=int8, 21.72 tokens/s:

 Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing it!
When I was very young, my parents would take me all over the place for adventures. We went camping, hiking,
model_print_timings:        load time =   312.94 ms
model_print_timings:      sample time =     6.31 ms /    32 runs   (    0.20 ms per token)
model_print_timings: prompt eval time =   309.58 ms /    32 tokens (    9.67 ms per token)
model_print_timings:        eval time =  1427.12 ms /    31 runs   (   46.04 ms per token)
model_print_timings:       total time =  1775.39 ms
========== eval time log of each prediction ==========
prediction   0, time: 309.58ms
prediction   1, time: 45.49ms
prediction   2, time: 46.81ms
prediction   3, time: 45.38ms
prediction   4, time: 46.10ms

@ThanatosShinji (Contributor) commented Apr 30, 2024

CPU: 13900; model: llama2-7b

weight_dtype=int2, alg=asym, group_size=128, scale_dtype=bf16, compute_dtype=int8, 28.05 tokens/s:

 Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun without any supervision or super strict rules on her waistcoat length hours only allowed to be outdoors between the months of October through December where the
model_print_timings:        load time =   151.77 ms
model_print_timings:      sample time =     6.71 ms /    32 runs   (    0.21 ms per token)
model_print_timings: prompt eval time =   148.32 ms /    32 tokens (    4.64 ms per token)
model_print_timings:        eval time =  1105.23 ms /    31 runs   (   35.65 ms per token)
model_print_timings:       total time =  1284.27 ms
========== eval time log of each prediction ==========
prediction   0, time: 148.32ms
prediction   1, time: 35.17ms
prediction   2, time: 34.63ms
prediction   3, time: 34.49ms
prediction   4, time: 34.40ms

weight_dtype=int4, alg=sym, group_size=128, scale_dtype=bf16, compute_dtype=int8, 19.69 tokens/s:

Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing it.
As she grew into an adult, she decided that the best way to do this was as a travel writer. She loved writing, and she
model_print_timings:        load time =   171.36 ms
model_print_timings:      sample time =     6.66 ms /    32 runs   (    0.21 ms per token)
model_print_timings: prompt eval time =   164.40 ms /    32 tokens (    5.14 ms per token)
model_print_timings:        eval time =  1574.00 ms /    31 runs   (   50.77 ms per token)
model_print_timings:       total time =  1772.23 ms
========== eval time log of each prediction ==========
prediction   0, time: 164.40ms
prediction   1, time: 49.76ms
prediction   2, time: 51.45ms
prediction   3, time: 50.28ms
prediction   4, time: 49.44ms

@ThanatosShinji (Contributor) commented Apr 30, 2024

CPU: AMD Ryzen 7 3700X 8-Core Processor (AVX2 without AVX_VNNI); model: llama2-7b

llama.cpp q4_0

.\main.exe -m llama-2-7b.Q4_0.gguf -t 16 -n 16 -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
llama_print_timings:        load time =     895.97 ms
llama_print_timings:      sample time =       0.57 ms /    16 runs   (    0.04 ms per token, 28268.55 tokens per second)
llama_print_timings: prompt eval time =    1612.76 ms /    32 tokens (   50.40 ms per token,    19.84 tokens per second)
llama_print_timings:        eval time =    2791.14 ms /    15 runs   (  186.08 ms per token,     5.37 tokens per second)
llama_print_timings:       total time =    4410.18 ms /    47 tokens

This PR: group_size=32, weight_dtype=int4, scale_dtype=bf16, compute_dtype=int8, alg=sym

 .\run_llama.exe -m llama-q4j.bin -t 16 -n 16 -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
model_print_timings:        load time =   692.79 ms
model_print_timings:      sample time =     5.74 ms /    16 runs   (    0.36 ms per token)
model_print_timings: prompt eval time =   692.38 ms /    32 tokens (   21.64 ms per token)
model_print_timings:        eval time =  2421.93 ms /    15 runs   (  161.46 ms per token)
model_print_timings:       total time =  3128.16 ms

group_size=128

model_print_timings:        load time =   608.46 ms
model_print_timings:      sample time =     6.46 ms /    16 runs   (    0.40 ms per token)
model_print_timings: prompt eval time =   607.75 ms /    32 tokens (   18.99 ms per token)
model_print_timings:        eval time =  2103.54 ms /    15 runs   (  140.24 ms per token)
model_print_timings:       total time =  2725.94 ms

weight_dtype=int2, alg=asym

model_print_timings:        load time =   637.32 ms
model_print_timings:      sample time =     5.67 ms /    16 runs   (    0.35 ms per token)
model_print_timings: prompt eval time =   636.89 ms /    32 tokens (   19.90 ms per token)
model_print_timings:        eval time =  1530.53 ms /    15 runs   (  102.04 ms per token)
model_print_timings:       total time =  2180.79 ms

@zhewang1-intc (Contributor) commented:

Will we add a BesTLA release tag for this PR? I think QBits should adapt to this great refactor.

@luoyu-intel (Contributor, Author) replied:

> Will we add a BesTLA release tag for this PR? I think QBits should adapt to this great refactor.

Yes, a release tag will be created after this PR.

luoyu-intel merged commit 7d49516 into main on May 7, 2024
12 checks passed
luoyu-intel deleted the vector_dot branch on May 21, 2024 03:24