
[BesTLA] Refactor quantization-related kernels #209

Merged 111 commits into main on May 7, 2024

Conversation

@luoyu-intel (Contributor) commented Apr 9, 2024

Type of Change

This PR speeds up next-token inference on all platforms. It adds support for asymmetric weights with comp_int8 (previously, asymmetric quantization fell back to comp_fp32), and it supports low-end devices that lack AVX_VNNI instructions. 2-bit + asymmetric weights are also supported in this PR.
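Next-token decoding multiplies a single activation row against the weight matrix (M = 1), which is exactly the shape the new GEMV kernels target. A minimal sketch of the auto-dispatch idea, using hypothetical function names rather than BesTLA's actual API:

```cpp
#include <cstddef>

// Hypothetical kernel entry points, standing in for BesTLA's real kernels.
void gemv_kernel(const float* A, const float* B, float* C,
                 std::size_t N, std::size_t K);
void gemm_kernel(const float* A, const float* B, float* C,
                 std::size_t M, std::size_t N, std::size_t K);

// Route by shape: M == 1 (next-token decode) goes to the GEMV kernel,
// larger M (prompt processing) goes to the tiled GEMM kernel.
void launch_matmul(const float* A, const float* B, float* C,
                   std::size_t M, std::size_t N, std::size_t K) {
  if (M == 1) {
    gemv_kernel(A, B, C, N, K);
  } else {
    gemm_kernel(A, B, C, M, N, K);
  }
}
```

The full change list: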

  • add AVX512 sgemv and igemv kernels
  • add s8s8 kernels for AVX2 devices
  • add AVX2 sgemv and igemv kernels
  • auto-dispatch the problem to a GEMM or GEMV kernel (sketched above)
  • fusion support for the *gemv kernels
  • add 2-bit kernels (see the unpacking sketch after this list)
  • store the low bits instead of the high bits (to enable the u8s8 dot product without AVX_VNNI)
  • support asymmetric weights for u8s8 and s8s8
  • refactor int3 quantization with sequential compression
  • add an AVX2 version of AVX_VNNI, for both GEMM and GEMV
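As an illustration of the 2-bit and low-bit items above, here is a hedged scalar sketch of unpacking 2-bit weights stored four per byte, first element in the low bits; the layout, names, and zero-point handling are assumptions for illustration, not BesTLA's actual storage format:

```cpp
#include <cstddef>
#include <cstdint>

// Decode n 2-bit quantized weights packed four per byte, lowest bits first,
// subtracting a zero point (asymmetric quantization). Illustrative only.
void unpack_int2(const uint8_t* packed, int8_t* out, std::size_t n,
                 int8_t zero_point) {
  for (std::size_t i = 0; i < n; ++i) {
    const uint8_t byte = packed[i / 4];
    const unsigned shift = static_cast<unsigned>(i % 4) * 2;  // low bits first
    const int8_t q = static_cast<int8_t>((byte >> shift) & 0x3);
    out[i] = static_cast<int8_t>(q - zero_point);
  }
}
```

Keeping each element in the low bits makes the unpack a plain shift-and-mask per element, with no cross-byte carries.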

These tasks are planned for the next PR:

  • support compute_dtype=f16, for GNR
  • Int5, Int6, Int7

To get the best performance, use:

  • GCC >= 11
  • MSVC >= 1930 (VS2022)
  • DPCPP >= 2024.0

Highlights

  1. alg=asym: >40% faster for next-token generation.
  2. AVX2 devices: >40% faster for next-token generation.
  3. AVX2 devices without AVX_VNNI: >80% faster for next-token generation.

NBits Integer Support Matrix

| Weight Dtype | Alg      | Compute Dtype      | ISA                                   |
|--------------|----------|--------------------|---------------------------------------|
| Int4         | sym+asym | FP32               | AVX2, AVX512F                         |
| Int3         | sym+asym | FP32               | AVX2, AVX512F                         |
| Int2         | sym+asym | FP32               | AVX2, AVX512F                         |
| Int4         | sym+asym | BF16               | AVX512_BF16, AMX_BF16                 |
| Int3         | sym+asym | BF16               | AVX512_BF16, AMX_BF16                 |
| Int2         | sym+asym | BF16               | AVX512_BF16, AMX_BF16                 |
| Int4         | sym+asym | FP16               | AVX512_FP16, AMX_FP16                 |
| Int3         | sym+asym | FP16               | AVX512_FP16, AMX_FP16                 |
| Int2         | sym+asym | FP16               | AVX512_FP16, AMX_FP16                 |
| Int4         | sym+asym | Int8 (u8s8 & s8s8) | AVX2, AVX_VNNI, AVX512_VNNI, AMX_INT8 |
| Int3         | sym+asym | Int8 (u8s8 & s8s8) | AVX2, AVX_VNNI, AVX512_VNNI, AMX_INT8 |
| Int2         | sym+asym | Int8 (u8s8 & s8s8) | AVX2, AVX_VNNI, AVX512_VNNI, AMX_INT8 |
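The Int8 rows include plain AVX2, i.e. devices without any VNNI extension. On those targets a u8s8 dot product can be emulated with the classic AVX2 pair `_mm256_maddubs_epi16` + `_mm256_madd_epi16`; a hedged sketch of that building block (not BesTLA's actual kernel code):

```cpp
#include <immintrin.h>

// u8s8 dot-product step on plain AVX2 (no AVX_VNNI).
// _mm256_maddubs_epi16 multiplies unsigned a_u8 by signed b_s8 and adds
// adjacent byte pairs into int16 with saturation, so real kernels must keep
// operand magnitudes small enough that saturation cannot occur.
// _mm256_madd_epi16 against all-ones then widens the pair sums to int32.
static inline __m256i dot_u8s8_avx2(__m256i acc, __m256i a_u8, __m256i b_s8) {
  const __m256i pair16 = _mm256_maddubs_epi16(a_u8, b_s8);
  const __m256i quad32 = _mm256_madd_epi16(pair16, _mm256_set1_epi16(1));
  return _mm256_add_epi32(acc, quad32);
}
```

With AVX_VNNI available, the same step collapses to a single `_mm256_dpbusd_epi32(acc, a_u8, b_s8)`.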

NBits Float Support Matrix

| Weight Dtype | Compute Dtype | ISA                   |
|--------------|---------------|-----------------------|
| NF4          | FP32          | AVX2, AVX512F         |
| FP4          | FP32          | AVX2, AVX512F         |
| FP8          | FP32          | AVX2, AVX512F         |
| NF4          | BF16          | AVX512_BF16, AMX_BF16 |
| FP4          | BF16          | AVX512_BF16, AMX_BF16 |
| FP8          | BF16          | AVX512_BF16, AMX_BF16 |
| NF4          | FP16          | AVX512_FP16, AMX_FP16 |
| FP4          | FP16          | AVX512_FP16, AMX_FP16 |
| FP8          | FP16          | AVX512_FP16, AMX_FP16 |
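NF4 and FP4 are 4-bit codebook formats, so dequantization amounts to a 16-entry table lookup per nibble multiplied by the per-group scale. A minimal scalar sketch; the table contents, nibble order, and grouping are format-specific assumptions, not BesTLA's actual layout:

```cpp
#include <cstddef>
#include <cstdint>

// Dequantize n 4-bit codebook weights (NF4/FP4 style): each nibble indexes a
// 16-entry lookup table, scaled by the group's scale. Illustrative only.
void dequant_4bit_lut(const uint8_t* packed, const float* lut /* 16 entries */,
                      float scale, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    const uint8_t byte = packed[i / 2];
    const uint8_t nib = (i % 2 == 0) ? static_cast<uint8_t>(byte & 0xF)  // low nibble first
                                     : static_cast<uint8_t>(byte >> 4);
    out[i] = lut[nib] * scale;
  }
}
```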

@luoyu-intel (Contributor, Author) commented Apr 17, 2024

llama2-7b int4 on 12900K: this PR is ~40% faster for prompt=32, and ~20% faster for prompt=1024
llama2-7b int4 on MTL-155H: this PR is ~50% faster for prompt=32, and ~42% faster for prompt=1024

@luoyu-intel (Contributor, Author) commented:

For group_size=32, weight_dtype=int4, scale_dtype=bf16, compute_dtype=int8, the BesTLA GEMV kernels have the same performance as GGML's q4_0 kernels.

luoyu-intel force-pushed the vector_dot branch 2 times, most recently from 270e6d5 to dbccb9c on April 17, 2024 08:19
@luoyu-intel (Contributor, Author) commented Apr 17, 2024

Int3 and int4 weights are now ready on AVX2 devices. Supported quantization parameters:
int3:

.\quant_llama.exe --model_file llama2-f16.bin --out_file llama2-q3j-g128-int8_bf16.bin --nthread 16 --group_size 128 --compute_dtype int8 --scale_dtype bf16 --weight_dtype int3

fastest int4:

.\quant_llama.exe --model_file path\llama2-f16.bin --out_file llama2-q4j-g-1-int8_bf16.bin --nthread 16 --group_size -1 --compute_dtype int8 --scale_dtype bf16

same as llama.cpp q4_0:

.\quant_llama.exe --model_file path\llama2-f16.bin --out_file llama2-q4j-g32-int8_bf16.bin --nthread 16 --group_size 32 --compute_dtype int8 --scale_dtype bf16

Recommended number of runtime threads for hybrid CPUs: P+E, or P*2+E (P = performance cores, E = efficiency cores).
E.g. 12900K: 16 or 24; 155H: 14 or 20.

luoyu-intel changed the title from "[BesTLA] Add matrix-vector kernels" to "[BesTLA] Refactor quantization-related kernels" on Apr 24, 2024

@ThanatosShinji (Contributor) commented Apr 30, 2024

CPU: Intel(R) Core(TM) Ultra 7 155H; model: llama2-7b

weight_dtype=int2, alg=asym, group_size=128, scale_dtype=bf16, compute_dtype=int8, 27.00 tokens/s:

 Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun too!
The only problem was that she had a baby brother who really needed the attention of her mamma and papa all the time every day of course she
model_print_timings:        load time =   274.56 ms
model_print_timings:      sample time =     6.78 ms /    32 runs   (    0.21 ms per token)
model_print_timings: prompt eval time =   273.43 ms /    32 tokens (    8.54 ms per token)
model_print_timings:        eval time =  1147.80 ms /    31 runs   (   37.03 ms per token)
model_print_timings:       total time =  1450.85 ms
========== eval time log of each prediction ==========
prediction   0, time: 273.43ms
prediction   1, time: 36.41ms
prediction   2, time: 36.27ms
prediction   3, time: 36.59ms
prediction   4, time: 36.30ms

weight_dtype=int4, alg=sym, group_size=128, scale_dtype=bf16, compute_dtype=int8, 21.72 tokens/s:

 Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing it!
When I was very young, my parents would take me all over the place for adventures. We went camping, hiking,
model_print_timings:        load time =   312.94 ms
model_print_timings:      sample time =     6.31 ms /    32 runs   (    0.20 ms per token)
model_print_timings: prompt eval time =   309.58 ms /    32 tokens (    9.67 ms per token)
model_print_timings:        eval time =  1427.12 ms /    31 runs   (   46.04 ms per token)
model_print_timings:       total time =  1775.39 ms
========== eval time log of each prediction ==========
prediction   0, time: 309.58ms
prediction   1, time: 45.49ms
prediction   2, time: 46.81ms
prediction   3, time: 45.38ms
prediction   4, time: 46.10ms

@ThanatosShinji (Contributor) commented Apr 30, 2024

CPU: 13900; model: llama2-7b

weight_dtype=int2, alg=asym, group_size=128, scale_dtype=bf16, compute_dtype=int8, 28.05 tokens/s:

 Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun without any supervision or super strict rules on her waistcoat length hours only allowed to be outdoors between the months of October through December where the
model_print_timings:        load time =   151.77 ms
model_print_timings:      sample time =     6.71 ms /    32 runs   (    0.21 ms per token)
model_print_timings: prompt eval time =   148.32 ms /    32 tokens (    4.64 ms per token)
model_print_timings:        eval time =  1105.23 ms /    31 runs   (   35.65 ms per token)
model_print_timings:       total time =  1284.27 ms
========== eval time log of each prediction ==========
prediction   0, time: 148.32ms
prediction   1, time: 35.17ms
prediction   2, time: 34.63ms
prediction   3, time: 34.49ms
prediction   4, time: 34.40ms

weight_dtype=int4, alg=sym, group_size=128, scale_dtype=bf16, compute_dtype=int8, 19.69 tokens/s:

Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing it.
As she grew into an adult, she decided that the best way to do this was as a travel writer. She loved writing, and she
model_print_timings:        load time =   171.36 ms
model_print_timings:      sample time =     6.66 ms /    32 runs   (    0.21 ms per token)
model_print_timings: prompt eval time =   164.40 ms /    32 tokens (    5.14 ms per token)
model_print_timings:        eval time =  1574.00 ms /    31 runs   (   50.77 ms per token)
model_print_timings:       total time =  1772.23 ms
========== eval time log of each prediction ==========
prediction   0, time: 164.40ms
prediction   1, time: 49.76ms
prediction   2, time: 51.45ms
prediction   3, time: 50.28ms
prediction   4, time: 49.44ms

@ThanatosShinji (Contributor) commented Apr 30, 2024

CPU: AMD Ryzen 7 3700X 8-Core Processor (AVX2 without AVX_VNNI); model: llama2-7b

llama.cpp q4_0

.\main.exe -m llama-2-7b.Q4_0.gguf -t 16 -n 16 -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
llama_print_timings:        load time =     895.97 ms
llama_print_timings:      sample time =       0.57 ms /    16 runs   (    0.04 ms per token, 28268.55 tokens per second)
llama_print_timings: prompt eval time =    1612.76 ms /    32 tokens (   50.40 ms per token,    19.84 tokens per second)
llama_print_timings:        eval time =    2791.14 ms /    15 runs   (  186.08 ms per token,     5.37 tokens per second)
llama_print_timings:       total time =    4410.18 ms /    47 tokens

This PR: group_size=32, weight_dtype=int4, scale_dtype=bf16, compute_dtype=int8, alg=sym

 .\run_llama.exe -m llama-q4j.bin -t 16 -n 16 -p "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
model_print_timings:        load time =   692.79 ms
model_print_timings:      sample time =     5.74 ms /    16 runs   (    0.36 ms per token)
model_print_timings: prompt eval time =   692.38 ms /    32 tokens (   21.64 ms per token)
model_print_timings:        eval time =  2421.93 ms /    15 runs   (  161.46 ms per token)
model_print_timings:       total time =  3128.16 ms

group_size=128

model_print_timings:        load time =   608.46 ms
model_print_timings:      sample time =     6.46 ms /    16 runs   (    0.40 ms per token)
model_print_timings: prompt eval time =   607.75 ms /    32 tokens (   18.99 ms per token)
model_print_timings:        eval time =  2103.54 ms /    15 runs   (  140.24 ms per token)
model_print_timings:       total time =  2725.94 ms

weight_dtype=int2, alg=asym

model_print_timings:        load time =   637.32 ms
model_print_timings:      sample time =     5.67 ms /    16 runs   (    0.35 ms per token)
model_print_timings: prompt eval time =   636.89 ms /    32 tokens (   19.90 ms per token)
model_print_timings:        eval time =  1530.53 ms /    15 runs   (  102.04 ms per token)
model_print_timings:       total time =  2180.79 ms

@zhewang1-intc (Contributor) commented:

Will we add a BesTLA release tag for this PR? I think QBits should adapt to this great refactor.

@luoyu-intel (Contributor, Author) replied:

> Will we add a BesTLA release tag for this PR? I think QBits should adapt to this great refactor.

Yes, a release tag will be created after this PR.

luoyu-intel merged commit 7d49516 into main on May 7, 2024
12 checks passed
luoyu-intel deleted the vector_dot branch on May 21, 2024 03:24