Our kernels are based on the x64 template library BesTLA.
Limited by the graph framework, we only add kernels that accept float tensors as input and output.
input dtype | output dtype | compute type | compute ISA |
---|---|---|---|
float32 | float32 | float32 | AVX2 |
float32 | float32 | float32 | AVX512F |
float32<sup>1</sup> | float32<sup>2</sup> | int8 | AVX512_VNNI |
float32<sup>1</sup> | float32<sup>2</sup> | int8 | AVX512BW |
float32<sup>1</sup> | float32<sup>2</sup> | int8 | AVX_VNNI |
float32<sup>1</sup> | float32<sup>2</sup> | int8 | AMX_INT8 |
float32<sup>1</sup> | float32<sup>2</sup> | int8 | AVX2 |
float32/bf16 | float32/bf16 | bf16 | AMX_BF16 |
float32/fp16 | float32/fp16 | fp16 | AVX512_FP16 |
<sup>1</sup>: per-batch and per-K group-wise dynamic quantization of the input tensor, where the per-K group size follows the quantization group size of the weight tensor; both symmetric and asymmetric quantization are supported.
<sup>2</sup>: per-batch dynamic dequantization of the output tensor.
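To make footnote 1 concrete, below is a minimal NumPy sketch of per-K group-wise symmetric dynamic quantization of an activation tensor; the shapes, the group size of 128, and the helper name are illustrative assumptions, not the kernel's actual implementation.

```python
import numpy as np

def dynamic_quant_per_k_group(x: np.ndarray, group_size: int = 128):
    """Symmetric int8 quantization with one scale per (batch row, K group).

    x has shape (batch, K); the K dimension is split into groups of `group_size`,
    mirroring the weight's quantization group size."""
    batch, k = x.shape
    groups = x.reshape(batch, k // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 127.0   # one scale per group
    q = np.clip(np.round(groups / scales), -127, 127).astype(np.int8)
    return q.reshape(batch, k), scales.squeeze(-1)

x = np.random.randn(2, 256).astype(np.float32)
q, scales = dynamic_quant_per_k_group(x, group_size=128)
# Dequantize and check that the round-trip error stays small.
dequant = (q.reshape(2, -1, 128) * scales[..., None]).reshape(2, 256)
print(np.abs(dequant - x).max())
```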
dtype | algo | group size |
---|---|---|
int4 | symmetric or asymmetric | multiplier of 8, -1<sup>1</sup> |
int3 | symmetric or asymmetric | multiplier of 8, -1<sup>1</sup> |
int2 | symmetric or asymmetric | multiplier of 8, -1<sup>1</sup> |
int5 | symmetric or asymmetric | multiplier of 8, -1<sup>1</sup> |
int6 | symmetric or asymmetric | multiplier of 8, -1<sup>1</sup> |
int7 | symmetric or asymmetric<sup>2</sup> | multiplier of 8, -1<sup>1</sup> |
int1 | symmetric or asymmetric | multiplier of 8, -1<sup>1</sup> |
int8<sup>3</sup> | symmetric | multiplier of 8, -1<sup>1</sup> |
fp4 | | multiplier of 8 |
nf4 | | multiplier of 8 |
<sup>1</sup>: group size = -1 means per-channel quantization along the output channel (i.e., the group size equals the input channel size); see the sketch after the notes below.
<sup>2</sup>: int7 with asymmetric quantization may cause numeric overflow if the device only has AVX2 (without AVX_VNNI) or computes with AVX512BW.
<sup>3</sup>: int8 may cause numeric overflow if the device only has AVX2 (without AVX_VNNI) or computes with AVX512BW.
NOTE:
- AMX_INT8 requires the group size to be aligned to 128 (best hardware efficiency).
- int1, int2 and int3 have accuracy loss with RTN quantization.
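On the weight side, footnote 1's group-size semantics can be sketched as follows. This is a minimal NumPy illustration assuming a (K, N) weight layout, with the int4 values kept in int8 storage for readability rather than the packed format BesTLA actually uses.

```python
import numpy as np

def quant_weight_sym_int4(w: np.ndarray, group_size: int = 128):
    """Symmetric group-wise int4 quantization of a (K, N) weight matrix.

    group_size=-1 collapses to per-output-channel quantization: one scale per
    column, spanning the whole input channel dimension K."""
    k, n = w.shape
    gs = k if group_size == -1 else group_size
    groups = w.reshape(k // gs, gs, n)                              # groups along K
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0        # int4 range [-7, 7]
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)   # 4-bit values, unpacked here
    return q.reshape(k, n), scales.squeeze(1)                       # (K, N), (K // gs, N)

w = np.random.randn(256, 64).astype(np.float32)
q_g128, s_g128 = quant_weight_sym_int4(w, group_size=128)  # two groups of 128 along K
q_pc, s_pc = quant_weight_sym_int4(w, group_size=-1)       # per output channel (footnote 1)
```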
Hybrid quantization combinations are also supported, e.g. int4 x int2 mixed quantization. Each model can have its own quantization configuration, which tells the engine which quantization parameters to apply to each weight; this allows different layers to use different quantization bits, algorithms, and group sizes. Refer to the llama int2 & int4 mixed example (L252).
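Purely as an illustration of such a per-layer configuration, a hypothetical int4 x int2 mixed setup could be expressed as a mapping from layer-name patterns to quantization parameters; the pattern syntax and dictionary layout below are assumptions, not the engine's actual configuration format.

```python
import fnmatch

# Hypothetical int4 x int2 mixed configuration: most layers use int4, while
# layers matched by a pattern are pushed down to int2 (or kept at int8).
layer_quant_config = {
    "default":            {"bits": 4, "alg": "sym",  "group_size": 128},
    "model.layers.*.mlp": {"bits": 2, "alg": "asym", "group_size": 32},
    "model.layers.0.*":   {"bits": 8, "alg": "sym",  "group_size": -1},  # keep the first layer high precision
}

def config_for(layer_name: str) -> dict:
    """Return the most specific matching entry, falling back to the default."""
    matches = [p for p in layer_quant_config
               if p != "default" and fnmatch.fnmatch(layer_name, p)]
    return layer_quant_config[max(matches, key=len)] if matches else layer_quant_config["default"]

print(config_for("model.layers.5.mlp"))   # -> int2 config
print(config_for("model.layers.5.attn"))  # -> default int4 config
```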
We support three kinds of kernel fusion for transformer models: QKV, MHA (multi-head attention), and FFN (feed-forward network) fusion.
fusion type | models | runtime ISA |
---|---|---|
QKV | GPT-J, LLaMA | AMX_INT8, AVX512_VNNI, AVX512BW, AVX512F, AMX_BF16, AVX_VNNI, AVX2 |
FFN | GPT-J, LLaMA, BLOOM, ChatGLM, Falcon, MPT | AMX_INT8, AVX512_VNNI, AVX512BW, AVX512F, AMX_BF16, AVX_VNNI, AVX2 |
MHA | Refer to the fused-attention doc for details | |
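To show the idea behind the QKV fusion listed above (independent of the BesTLA kernels themselves), the sketch below concatenates the Q, K and V projection weights so the three projections run as a single GEMM; the shapes are arbitrary.

```python
import numpy as np

# Hypothetical sizes: hidden size d, sequence length s.
d, s = 64, 8
x = np.random.randn(s, d).astype(np.float32)                        # input activations
w_q, w_k, w_v = (np.random.randn(d, d).astype(np.float32) for _ in range(3))

# Unfused: three separate GEMMs.
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Fused: one GEMM against the concatenated weight, then split the result.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=1)                     # (d, 3d)
qkv = x @ w_qkv                                                     # (s, 3d)
q_f, k_f, v_f = np.split(qkv, 3, axis=1)

assert np.allclose(q, q_f) and np.allclose(k, k_f) and np.allclose(v, v_f)
```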
codename | weight config | runtime ISA |
---|---|---|
Sapphire Rapids, Emerald Rapids | symmetric int4, group size=128, compute type=int8 | AMX_INT8 |
Ice Lake, Cascade Lake, Cooper Lake, Tiger Lake, Rocket Lake | symmetric int4, group size=128, compute type=int8 | AVX512_VNNI |
Skylake, Cannon Lake | symmetric int4, group size=128, compute type=int8 | AVX512BW |
Alder Lake (12th Gen), Raptor Lake (13th and 14th Gen) | symmetric int4, group size=128, compute type=int8 | AVX_VNNI |
Older architectures (before 12th Gen) | symmetric int4, group size=128, compute type=int8 | AVX2 |
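To pick the applicable row at runtime, the CPU feature flags can be inspected. A minimal sketch, assuming the third-party py-cpuinfo package is installed; the flag names and the preference order are assumptions, not something BesTLA exposes.

```python
import cpuinfo  # pip install py-cpuinfo

# Map CPU feature flags (as reported by py-cpuinfo) to the runtime ISA column above,
# from the most to the least capable int8 path.
ISA_BY_FLAG = [
    ("amx_int8", "AMX_INT8"),
    ("avx512_vnni", "AVX512_VNNI"),
    ("avx512bw", "AVX512BW"),
    ("avx_vnni", "AVX_VNNI"),
    ("avx2", "AVX2"),
]

def best_int8_isa() -> str:
    flags = set(cpuinfo.get_cpu_info().get("flags", []))
    for flag, isa in ISA_BY_FLAG:
        if flag in flags:
            return isa
    return "none"

if __name__ == "__main__":
    # The recommended weight config is the same for every row:
    # symmetric int4, group size=128, compute type=int8.
    print(f"Detected runtime ISA: {best_int8_isa()}")
```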
- `sym int4 group=128 comp_dtype=int8` has almost the same accuracy as `group=32`, but is much faster (validated with LLaMA2-7B).
- `sym int5 group=-1 comp_dtype=int8` is the fastest configuration for the first token with good accuracy (validated with LLaMA2-7B).
- `sym int3 group=128 comp_dtype=int8` is the fastest configuration for the next token with good accuracy (validated with LLaMA2-7B).
NOTE:
- group_size=-1 gives the smallest model size and the best performance, but it requires a model fine-tuned with INC; otherwise it may have lower accuracy than smaller group sizes.
- group_size=128 is a balance of accuracy and speed if you only want RTN quantization.
- group_size=32, scale_dtype=bf16, compute_dtype=int8, alg=sym equals llama.cpp's Q4_0.
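As a concrete way to hold the recommended settings, they can be gathered into a plain configuration object before being passed to whatever quantization entry point you use; the QuantConfig class below is a hypothetical container, not a BesTLA or engine API.

```python
from dataclasses import dataclass

@dataclass
class QuantConfig:
    """Hypothetical container for the quantization knobs discussed above."""
    weight_dtype: str = "int4"   # int1..int8, fp4, nf4
    alg: str = "sym"             # "sym" or "asym"
    group_size: int = 128        # multiple of 8, or -1 for per-channel
    compute_dtype: str = "int8"  # int8, bf16, fp16, fp32
    scale_dtype: str = "fp32"

# Balanced default (validated with LLaMA2-7B): near group=32 accuracy, much faster.
balanced = QuantConfig(weight_dtype="int4", alg="sym", group_size=128, compute_dtype="int8")

# Fastest first token with good accuracy: int5 with per-channel groups.
fastest_first_token = QuantConfig(weight_dtype="int5", alg="sym", group_size=-1, compute_dtype="int8")

# Fastest next token with good accuracy: int3 with group size 128.
fastest_next_token = QuantConfig(weight_dtype="int3", alg="sym", group_size=128, compute_dtype="int8")
```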