[BesTLA] Refactor quantization-related kernels #209
Conversation
llama2-7b int4 on 12900K: this PR is ~40% faster for prompt=32 and ~20% faster for prompt=1024.
Force-pushed from 270e6d5 to dbccb9c.
Ready for int3 and int4 weights on AVX2 devices. Supported quantization parameters (see the sketch below):
fastest int4:
same as llama.cpp q4_0:
Recommended number of runtime threads for hybrid CPUs: P+E, or P*2+E.
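For context on these settings, here is a minimal, self-contained sketch of symmetric group-wise int4 quantization, the q4_0-style scheme the parameters above describe (alg=sym, one scale per group). The function name, the group size of 32, and the exact rounding rule are illustrative assumptions, not BesTLA's actual kernel code:

```cpp
// Hypothetical sketch: symmetric group-wise int4 quantization.
// Not BesTLA code; names and rounding details are assumptions.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Quantize one group of `gs` weights to int4 codes in [-8, 7],
// returning the single per-group scale (stored e.g. as bf16).
static float quantize_group_sym(const float* w, int gs, int8_t* q) {
  float amax = 0.f;
  for (int i = 0; i < gs; ++i) amax = std::max(amax, std::fabs(w[i]));
  const float scale = amax / 7.f;  // map the largest magnitude to +/-7
  const float inv = scale > 0.f ? 1.f / scale : 0.f;
  for (int i = 0; i < gs; ++i) {
    const int v = static_cast<int>(std::lround(w[i] * inv));
    q[i] = static_cast<int8_t>(std::clamp(v, -8, 7));
  }
  return scale;
}

int main() {
  const int gs = 32;  // q4_0-style group size
  std::vector<float> w(gs);
  for (int i = 0; i < gs; ++i) w[i] = 0.01f * static_cast<float>(i - 16);
  std::vector<int8_t> q(gs);
  const float scale = quantize_group_sym(w.data(), gs, q.data());
  // Dequantize as w' = scale * q; reconstruction error is bounded by scale/2.
  float max_err = 0.f;
  for (int i = 0; i < gs; ++i)
    max_err = std::max(max_err, std::fabs(w[i] - scale * q[i]));
  std::printf("scale=%g max_err=%g\n", scale, max_err);
  return 0;
}
```

A larger group_size means fewer scales to store and load during the GEMM, trading some accuracy for speed.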
https://inteltf-jenk.sh.intel.com/job/neural_speed_extension/114/
CPU: Intel(R) Core(TM) Ultra 7 155H, model: llama2-7b
weight_dtype=int2, alg=asym, group_size=128, scale_dtype=bf16, compute_dtype=int8: 27.00 tokens/s
weight_dtype=int4, alg=sym, group_size=128, scale_dtype=bf16, compute_dtype=int8: 21.72 tokens/s

CPU: 13900, model: llama2-7b
weight_dtype=int2, alg=asym, group_size=128, scale_dtype=bf16, compute_dtype=int8: 28.05 tokens/s
weight_dtype=int4, alg=sym, group_size=128, scale_dtype=bf16, compute_dtype=int8: 19.69 tokens/s
This reverts commit 7dc4dd8.
CPU: AMD Ryzen 7 3700X 8-Core Processor (AVX2 without AVX_VNNI), model: llama2-7b
llama.cpp q4_0
this PR: group_size=32, weight_dtype=int4, scale_dtype=bf16, compute_dtype=int8, alg=sym
group_size=128
weight_dtype=int2, alg=asym
Will we add a BesTLA release tag for this PR? I think QBits should adapt to this great refactor.
Yes, a release tag will be created after this PR.
Type of Change
This PR will speed up next-token inference on all platforms. It will support asymmetric weights for comp_int8 (previously, asymmetric weights fell back to comp_fp32). It also supports low-end devices that do not have AVX_VNNI instructions. 2-bit asymmetric weights will also be supported in this PR.
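As a sketch of why asymmetric weights can stay on the int8 compute path: with w ≈ s_w·(q_w − zp), the zero point folds into a per-row sum of the quantized activations, so the inner loop remains a plain int8 dot product. The following self-contained example (illustrative values and names, not BesTLA code) checks that identity against an fp32 reference:

```cpp
// Hypothetical sketch: asymmetric int4 weights on an int8 compute path.
// Values, scales, and names are illustrative assumptions.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  const int k = 8;
  // Quantized int8 activations and unsigned 4-bit weight codes
  // (stored widened here for clarity).
  std::vector<int8_t> qa = {3, -5, 7, 1, -2, 4, 0, 6};
  std::vector<int8_t> qw = {1, 14, 3, 9, 0, 7, 12, 5};
  const int zp = 8;        // weight zero point (alg=asym)
  const float sa = 0.05f;  // activation scale
  const float sw = 0.02f;  // weight scale

  int32_t acc = 0, asum = 0;
  for (int i = 0; i < k; ++i) {
    acc += int32_t(qa[i]) * qw[i];  // the multiply-accumulate int8 ISAs speed up
    asum += qa[i];                  // activation sum, computed once per row
  }
  // dot(a, w) ~= sa*sw * (sum(qa*qw) - zp * sum(qa))
  const float result = sa * sw * static_cast<float>(acc - zp * asum);

  // Reference: dequantize first, then do the fp32 dot product.
  float ref = 0.f;
  for (int i = 0; i < k; ++i) ref += (sa * qa[i]) * (sw * (qw[i] - zp));

  // The two paths agree up to float rounding.
  std::printf("int8 path=%f fp32 ref=%f\n", result, ref);
  return 0;
}
```

Because the zero-point correction costs only one extra integer sum per row, asymmetric weights no longer need to fall back to the fp32 compute path.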
These tasks are planned for the next PR:
To get the best performance:
GCC >= 11
MSVC >= 1930 (VS2022)
DPCPP >= 2024.0
Highlights
NBits Integer Support Matrix
NBits Float Support Matrix