enable qkv concat layer #958
Merged
Conversation
IlyasMoutawwakil added a commit that referenced this pull request on Dec 5, 2024:
* add page attention implementation, remove jit logic
* add support in transformers 4.45
* fix config (#935)
* move patch model to init
* refine class IPEXPagedCache's update method (#945): replace tensor on xpu with a List to avoid memory copy; split IPEXPagedCache's update function into `update_for_prefill` and `update_for_decode`
* fix bug when doing beam search (#954)
* enable qkv concat layer (#958): enable qkv; split key/value into 2 lists
* add xpu cache optimization
* xpu mlp optimization
* optimize cache ops in xpu, improve beam search
* enable gpt2; falcon has a core dump error in PagedAttention.single_query_cached_kv_attention (#979): enable new_decoder_arch falcon; only keep 1 config; rm autocast
* fix unit test case, CPU part is OK; enable Falcon-7B for XPU (#992): fix bug when running IPEXCausalModel forward directly; fix bug when using `save_pretrain`; add LinearGelu op support for XPU; fix and adjust unit test cases
* skip assisted decoding unit test for models using paged attention (#998): XPU CI tests almost all pass
* fix ci config (#1010)
* fix tests versions (#1011): fix ci config; fix test versions; fix ipex version
* fix torch test version (#1012)
* use python3.9 test (#1013)
* change ipex transformers limited version in setup (#1015): fix inc tests
* add XPU LinearAddAdd op (#1017)
* fix bert and vit patch (#1022): fix vit and bert save
* paged attn (#1024): fix reorder cache for non-patched models; disable torch < 2.3 tests since torch < 2.4 won't be used; fix beam search test; fix cache selection; upgrade to transformers 4.46; change ipex test yaml transformers version to 4.46
* set device the same as the origin model (#1031): fix device
* simplify IPEXModel (#1032): simplify forward and save_pretrained since there is no jit support; rm warmup because there is no jit mode anymore; simplify forward for the causal lm model; fix paged pkv forward; disable use_cache when just running forward
* nice code (#1035)
* paged attn (#1036): nice code; device type adjustment
* enable torch.compile for non-generation tasks on CPU (#1037): add no_grad in forward; warm up the compiled model; disable compile for models that are not ready; set system-level optimizations for torch.compile; set the minimum torch version for compiling
* fix ipex upload and update readme (#1045): fix readme and push-to-hub support; rm export in tests; test with torch 2.5.*
* fix tests (#1047): add patched tests; change forward to generate; fix test model name
* patch gpt2 block forward for passing input_lens (#1050): fix forward without pkv; revert causal lm tests

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>; Liu, Kaixuan <kaixuan.liu@intel.com>; jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: jiqing-feng <jiqing.feng@intel.com>, kaixuanliu <kaixuan.liu@intel.com>, Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
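One of the commits above splits IPEXPagedCache's update function into `update_for_prefill` and `update_for_decode`. A minimal toy sketch of that split is below, assuming a flat-slot paged layout; the class name, tensor shapes, and method signatures are illustrative assumptions, not the actual optimum-intel implementation.

```python
import torch

class PagedCacheSketch:
    """Toy paged KV cache: storage is (num_blocks, block_size, head_dim),
    addressed by flat slot indices (block * block_size + offset)."""

    def __init__(self, num_blocks: int, block_size: int, head_dim: int):
        self.block_size = block_size
        self.key_cache = torch.zeros(num_blocks, block_size, head_dim)
        self.value_cache = torch.zeros(num_blocks, block_size, head_dim)

    def update_for_prefill(self, key, value, slots):
        # Prefill path: scatter the whole prompt's keys/values in one batched write.
        d = self.key_cache.shape[-1]
        self.key_cache.view(-1, d)[slots] = key
        self.value_cache.view(-1, d)[slots] = value

    def update_for_decode(self, key, value, slots):
        # Decode path: append a single new token's key/value per sequence.
        # (In the real change the two paths use different, specialized ops;
        # here the write itself is the same for simplicity.)
        d = self.key_cache.shape[-1]
        self.key_cache.view(-1, d)[slots] = key
        self.value_cache.view(-1, d)[slots] = value

# Usage: prefill a 3-token prompt into block 0, then decode one more token.
cache = PagedCacheSketch(num_blocks=2, block_size=4, head_dim=8)
prompt_kv = torch.ones(3, 8)
cache.update_for_prefill(prompt_kv, prompt_kv, torch.tensor([0, 1, 2]))
step_kv = torch.full((1, 8), 2.0)
cache.update_for_decode(step_kv, step_kv, torch.tensor([3]))
```

Separating the two paths lets prefill use one large batched scatter while decode stays a cheap per-token write, which is the motivation stated in the commit title.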
Enable the QKV concat linear layer in Llama, which brings a ~10% speed-up on CPU.
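The idea behind the optimization can be sketched as follows: the three attention projections are concatenated into a single linear layer so one large GEMM replaces three smaller ones. This is a minimal illustration, not the actual PR code; the `q_proj`/`k_proj`/`v_proj` names follow the usual transformers attention-module convention and are assumptions here.

```python
import torch
import torch.nn as nn

def fuse_qkv(q_proj: nn.Linear, k_proj: nn.Linear, v_proj: nn.Linear) -> nn.Linear:
    """Concatenate three projection layers into a single fused linear."""
    assert q_proj.in_features == k_proj.in_features == v_proj.in_features
    out_features = q_proj.out_features + k_proj.out_features + v_proj.out_features
    has_bias = q_proj.bias is not None
    fused = nn.Linear(q_proj.in_features, out_features, bias=has_bias)
    with torch.no_grad():
        # nn.Linear weights are (out_features, in_features); stack along dim 0.
        fused.weight.copy_(torch.cat([q_proj.weight, k_proj.weight, v_proj.weight], dim=0))
        if has_bias:
            fused.bias.copy_(torch.cat([q_proj.bias, k_proj.bias, v_proj.bias], dim=0))
    return fused

# One fused matmul replaces three; the output is split back into q, k, v.
hidden = 64
q, k, v = (nn.Linear(hidden, hidden) for _ in range(3))
fused = fuse_qkv(q, k, v)
x = torch.randn(2, hidden)
q_out, k_out, v_out = fused(x).split(hidden, dim=-1)
```

The speed-up comes from issuing one larger matrix multiply instead of three, which amortizes kernel-launch and memory-traffic overhead; the split afterward is a cheap view operation.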