enable qkv concat layer #958
Merged
Conversation
IlyasMoutawwakil added a commit that referenced this pull request on Dec 5, 2024:
* add page attention implementation, remove jit logic
* add support in transformers 4.45
* fix config (#935)
* move patch model to init
* refine class IPEXPagedCache's update method (#945): replace tensor on xpu with a List to avoid memory copy; split IPEXPagedCache's update function into `update_for_prefill` and `update_for_decode`
* fix bug when doing beam search (#954)
* enable qkv concat layer (#958): enable qkv; split key/value into 2 lists
* add xpu cache optimization
* xpu mlp optimization
* optimize cache ops in xpu, improve beam search
* enable gpt2; falcon has a core dump error in PagedAttention.single_query_cached_kv_attention (#979): enable new_decoder_arch falcon; only keep 1 config; rm autocast
* fix unit test case, CPU part is OK; enable Falcon-7B for XPU (#992): fix bug when running IPEXCausalModel forward directly; fix bug when using `save_pretrain`; add LinearGelu op support for XPU; fix and adjust unit test cases
* skip assisted decoding unit test for models using paged attention (#998): XPU CI tests almost all pass
* fix ci config (#1010)
* fix tests versions (#1011): fix ci config; fix test versions; fix ipex version
* fix torch test version (#1012)
* use python3.9 test (#1013)
* change ipex transformers limited version in setup (#1015): fix inc tests
* add XPU LinearAddAdd op (#1017)
* fix bert and vit patch (#1022): fix vit and bert save
* paged attn (#1024): fix reorder cache for non-patched models; disable torch < 2.3 tests since torch < 2.4 won't be used; fix beam search test; fix cache selection; upgrade to transformers 4.46; change ipex test yaml transformers version to 4.46
* set device the same as the origin model (#1031): fix device
* simplify IPEXModel (#1032): simplify forward and save_pretrained since there is no jit support; rm warmup because there is no jit mode anymore; simplify forward for the causal lm model; fix paged pkv forward; disable use_cache when just running forward
* nice code (#1035)
* paged attn (#1036): nice code; device type adjustment
* enable torch.compile for non-generation tasks on CPU (#1037): add no_grad in forward; warm up the compiled model; disable compile for models that are not ready; set system-level optimizations for torch.compile; set the minimum torch version for compiling
* fix ipex upload and update readme (#1045): fix readme and push-to-hub support; rm export in tests; test with torch 2.5.*
* fix tests (#1047): add patched tests; change forward to generate; fix test model name
* patch gpt2 block forward for passing input_lens (#1050): fix forward without pkv; revert causal lm tests

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>; Liu, Kaixuan <kaixuan.liu@intel.com>; jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: jiqing-feng <jiqing.feng@intel.com>, kaixuanliu <kaixuan.liu@intel.com>, Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
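One of the commits above splits IPEXPagedCache's update function into `update_for_prefill` and `update_for_decode`. A minimal toy sketch of that split is below, assuming a flat-slot paged layout; the class name, tensor shapes, and method signatures are illustrative assumptions, not the actual optimum-intel implementation.

```python
import torch

class PagedCacheSketch:
    """Toy paged KV cache: storage is (num_blocks, block_size, head_dim),
    addressed by flat slot indices (block * block_size + offset)."""

    def __init__(self, num_blocks: int, block_size: int, head_dim: int):
        self.block_size = block_size
        self.key_cache = torch.zeros(num_blocks, block_size, head_dim)
        self.value_cache = torch.zeros(num_blocks, block_size, head_dim)

    def update_for_prefill(self, key, value, slots):
        # Prefill path: scatter the whole prompt's keys/values in one batched write.
        d = self.key_cache.shape[-1]
        self.key_cache.view(-1, d)[slots] = key
        self.value_cache.view(-1, d)[slots] = value

    def update_for_decode(self, key, value, slots):
        # Decode path: append a single new token's key/value per sequence.
        # (In the real change the two paths use different, specialized ops;
        # here the write itself is the same for simplicity.)
        d = self.key_cache.shape[-1]
        self.key_cache.view(-1, d)[slots] = key
        self.value_cache.view(-1, d)[slots] = value

# Usage: prefill a 3-token prompt into block 0, then decode one more token.
cache = PagedCacheSketch(num_blocks=2, block_size=4, head_dim=8)
prompt_kv = torch.ones(3, 8)
cache.update_for_prefill(prompt_kv, prompt_kv, torch.tensor([0, 1, 2]))
step_kv = torch.full((1, 8), 2.0)
cache.update_for_decode(step_kv, step_kv, torch.tensor([3]))
```

Separating the two paths lets prefill use one large batched scatter while decode stays a cheap per-token write, which is the motivation stated in the commit title.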
Enable the QKV concat linear layer in Llama, which brings a ~10% speed-up on CPU.
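The idea behind the optimization can be sketched as follows: the three attention projections are concatenated into a single linear layer so one large GEMM replaces three smaller ones. This is a minimal illustration, not the actual PR code; the `q_proj`/`k_proj`/`v_proj` names follow the usual transformers attention-module convention and are assumptions here.

```python
import torch
import torch.nn as nn

def fuse_qkv(q_proj: nn.Linear, k_proj: nn.Linear, v_proj: nn.Linear) -> nn.Linear:
    """Concatenate three projection layers into a single fused linear."""
    assert q_proj.in_features == k_proj.in_features == v_proj.in_features
    out_features = q_proj.out_features + k_proj.out_features + v_proj.out_features
    has_bias = q_proj.bias is not None
    fused = nn.Linear(q_proj.in_features, out_features, bias=has_bias)
    with torch.no_grad():
        # nn.Linear weights are (out_features, in_features); stack along dim 0.
        fused.weight.copy_(torch.cat([q_proj.weight, k_proj.weight, v_proj.weight], dim=0))
        if has_bias:
            fused.bias.copy_(torch.cat([q_proj.bias, k_proj.bias, v_proj.bias], dim=0))
    return fused

# One fused matmul replaces three; the output is split back into q, k, v.
hidden = 64
q, k, v = (nn.Linear(hidden, hidden) for _ in range(3))
fused = fuse_qkv(q, k, v)
x = torch.randn(2, hidden)
q_out, k_out, v_out = fused(x).split(hidden, dim=-1)
```

The speed-up comes from issuing one larger matrix multiply instead of three, which amortizes kernel-launch and memory-traffic overhead; the split afterward is a cheap view operation.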