diff --git a/doc/graph/fusion_patterns/fusion_patterns.md b/doc/graph/fusion_patterns/fusion_patterns.md index b39374e1571..b76446667df 100644 --- a/doc/graph/fusion_patterns/fusion_patterns.md +++ b/doc/graph/fusion_patterns/fusion_patterns.md @@ -73,6 +73,7 @@ ReduceProd | ReduceSum] |:--------|:-----------------------------| | Scaled Dot-Product Attention | Refer to @ref dev_guide_graph_sdpa for more details. | | Grouped Query Attention | Refer to @ref dev_guide_graph_gqa for more details. | +| Scaled Dot-Product Attention with Compressed Key/Value | Refer to @ref dev_guide_graph_sdpa_compressed_kv for more details. | | Gated Multi-Layer Perceptron (Gated-MLP) | Refer to @ref dev_guide_graph_gated_mlp for more details. | | Convolution + BiasAdd\f$^?\f$ + BatchNormInference\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Convolution Neural Networks, for example ResNet, ResNext, SSD, etc. | | ConvTranspose + BiasAdd\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Generative Adversarial Networks. | diff --git a/doc/graph/fusion_patterns/images/compressed_sdpa_pattern.png b/doc/graph/fusion_patterns/images/compressed_sdpa_pattern.png new file mode 100644 index 00000000000..b0563e7fb0e Binary files /dev/null and b/doc/graph/fusion_patterns/images/compressed_sdpa_pattern.png differ diff --git a/doc/graph/fusion_patterns/sdpa_with_compressed_kv.md b/doc/graph/fusion_patterns/sdpa_with_compressed_kv.md new file mode 100644 index 00000000000..e7a55ef571c --- /dev/null +++ b/doc/graph/fusion_patterns/sdpa_with_compressed_kv.md @@ -0,0 +1,119 @@ +SDPA with Compressed Key and Value {#dev_guide_graph_sdpa_compressed_kv} +======================================================================== + +## Overview + +int4 and int8 compressions for Key and Value are exploited in fused Scaled +Dot-Product Attention (SDPA)[1] to reduce the memory footprint of generative +inference of LLM, especially when KV cache mechanism is adopted. Specifically, +Key and Value tensors are stored using lower precision data types like int4 and +int8 to reduce memory usage, and are subsequently de-quantized to wider floating +point data types such as f16 and bf16 for computation. + +Note that grouped quantization is required to improve the model accuracy, +especially for int4 data types. In this case, group size is needed as an +attribute for quantization, which indicates the number of elements that share +the same scaling factor and zero-points in each quantization group. + +The notations used in this topic are: + +- N: The mini-batch size. +- H: The head number. +- S: The sequence length. +- D: The size of each head. +- G: The group size. + +## SDPA Pattern + +The SDPA pattern with compressed Key and Value is defined as a directional +acyclic graph (DAG) using oneDNN Graph API. oneDNN extends +[SDPA pattern](@ref dev_guide_graph_sdpa) to support the following three kinds +of compressed SDPA patterns: + +1. SDPA with compressed Key and Value. +2. SDPA with floating-point Key and compressed Value. +3. SDPA with compressed Key and floating-point Value. + +The floating-point data types include f32, f16 and bf16, and the compressed +data type refers to low-precision integral data types, including int4 (u4/s4) +and int8 (u8/s8) data types. + +In oneDNN Graph API, we support quantization through a pattern with quantization +operations such as [DynamicDequantize](@ref dev_guide_op_dynamicdequantize) and +[DynamicQuantize](@ref dev_guide_op_dynamicquantize). The supported pattern is +as follows. The blue nodes are required while the brown nodes are optional. + +![compressed SDPA pattern](images/compressed_sdpa_pattern.png) + +Compared to a typical SDPA pattern, there are a few differences: + +1. Two additional DynamicDequantize operations are applied to the input Key and +Value to convert the integral values to floating-point values. +2. Apart from the Query, Key and Value inputs, the pattern requires additional +quantization information such as scale and zero-points for the dequantization of +Key and Value tensors. Currently, oneDNN only supports grouped quantization +on one dimension; specifically, the shapes of scale and zero-points for Key and +Value de-quantization should be (N, H, S, D/G). +3. Additionally, the `group_shape` attribute of the quantization operations must +be specified as (1, 1, 1, G) for Key and Value dequantization. + +## Data Types + +oneDNN supports the following combinations of data types for Query, Key, Value, +output, scale for Key, zero-points for Key, scale for Value and zero-points for +Value: + +| Query | Key | Scale_K | Zp_K | Value | Scale_V | Zp_V | Output | +|:--------|:--------|:--------|:----------------|:-------|:--------|:----------------|:-------| +| dt_fp | dt_int | dt_fp | u4,s4,u8,s8,s32 | dt_int | dt_fp | u4,s4,u8,s8,s32 | dt_fp | +| dt_fp | dt_int | dt_fp | u4,s4,u8,s8,s32 | dt_fp | N/A | N/A | dt_fp | +| dt_fp | dt_fp | N/A | N/A | dt_int | dt_fp | u4,s4,u8,s8,s32 | dt_fp | + +Notes: +- dt_fp can be: f16, bf16 or f32. +- dt_int can be: u8, s8, u4 or s4. +- zero-point inputs are optional. + +You can specify the data type via the input and output data type fields of +logical tensors for each operation. The definition of the data types and support +status on different CPU and GPU platforms follow the general description in +@ref dev_guide_data_types. + +### Floating-point Math Mode + +You should set the floating-point math mode +(@ref dev_guide_attributes_fpmath_mode) when using SDPA with compressed Key and +Value. Generally, the math mode should align with the data type of the Query, +which indicates the computation data type. Additionally, the second boolean +flag, `apply_to_int`, should be set to true. You can configure these attribute +values using the `set_fpmath_mode` API +(@ref dnnl::graph::graph::set_fpmath_mode) on the graph object. + +## Implementation Limitations + +- oneDNN primitive-based SDPA with compressed Key and Value is implemented as +a reference implementation on both Intel Architecture Processors and Intel +Graphics Products. The reference implementation requires memory to store the +intermediate results of the dot products between Query and Key which takes +\f$O(S^2)\f$ memory. It may lead to Out-of-Memory error when computing long +sequence length inputs on platforms with limited memory. +- The compressed SDPA patterns functionally support all input shapes meeting +the shape requirements of each operation in the graph. +- CPU + - oneDNN does not provide optimized implementation on CPU currently. All + executions will be implemented with the primitive-based reference + computation. +- GPU + - Optimized implementation is available for 4D Q/K/V tensors with the shape + defined as (N, H, S, D) for Query and Value, (N, H, D, S) for Key, + (N, H, D/G, S) for scales and zero-points of Key (if available) and + (N, H, S, D/G) for scales and zero-points of Value (if available). + - Optimized implementation is available for compressed SDPA with `f16` + computation data type on Intel Graphics Products with Intel(R) Xe Matrix + Extensions (Intel(R) XMX) support. + - If int4 zero-points are specified, optimized implementation will be only + available when the group size equals 16. + +## References + +[1] Attention is all you need, https://arxiv.org/abs/1706.03762v7