[2/N][Refactor] Refactor V1 attention for better extensibility #1995
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1995 +/- ##
==========================================
- Coverage 76.31% 76.28% -0.03%
==========================================
Files 116 117 +1
Lines 13238 13278 +40
==========================================
+ Hits 10102 10129 +27
- Misses 3136 3149 +13
Force-pushed from e5eff76 to 8915a39.
Let's wait for #1979 to be merged first.
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
@wangxiyuan The CI has passed. |
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
Signed-off-by: shen-shanshan <467638484@qq.com>
@wangxiyuan I have adjusted this PR and kept only the methods that need to be overridden by child classes. I will handle the rest in later PRs.
@wangxiyuan The CI has passed. Can this be merged?
… MoE layers (#3)

* feat(performance): support `GroupedMatmulSwigluQuant` in `W8A8_DYNAMIC` quantized MoE layers
* fix(lint): fix lint
* fix(bug): fix bug
* feat(ops): enable grouped_matmul_swiglu_quant by default
* fix(test): fix broken test
* fix(test): temporarily skip broken test due to OOM
* fix(test): change bias1 to tensor
* fix(bug): update group_list handling and weight scale in dynamic methods
* feat(ops): replace all split gmm and swiglu
* feat(quantization): split w4a8 and w8a8 apply
* fix(test): replace w8a8 function in apply
* feat(cumsum): add cumsum_group_list function for group list processing
* [Doc] Add container image save/load FAQ for offline environments (vllm-project#2347)
  Adds a Docker export/import guide for air-gapped environments. No user-facing change; not tested (documentation only). vLLM version: v0.10.0; vLLM main: vllm-project/vllm@d16aa3d
  Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
* [Bugfix] fix the OOM when chunked prefill is used with long contexts such as 64k (vllm-project#2319)
  The attention mask was declared in mla.py; the splitfuse mask is not needed for MLA chunked prefill, and it causes memory problems with long contexts such as 64k or 128k. vLLM version: v0.10.0; vLLM main: vllm-project/vllm@14a5d90
  Signed-off-by: haojiangzheng <justineric096@gmail.com>
* [Quickfix] Add the missing `apply_router_weight_on_input` in FusedMoE init (vllm-project#2348)
  Quick fix on vllm-project#2268 (comment). No user-facing change; CI passed with existing tests. vLLM version: v0.10.0; vLLM main: vllm-project/vllm@6807af8
  Signed-off-by: MengqingCao <cmq0113@163.com>
* [2/N][Refactor] Refactor V1 attention for better extensibility (vllm-project#1995)
  Refactor V1 attention for better extensibility (in preparation for the torchair attention refactor). Main change: move each kind of forward into its own method, e.g. `_forward_prefill_no_cache()`, `_forward_prefill_cache_hit()`, `_forward_decode_only()`, `_forward_v1_style()`. No user-facing change. vLLM version: v0.10.0; vLLM main: vllm-project/vllm@14a5d90
  Signed-off-by: shen-shanshan <467638484@qq.com>
* [Misc] Remove redundant imported `envs`, using `envs_ascend` instead (vllm-project#2193)

  ```python
  import vllm.envs as envs_vllm
  import vllm_ascend.envs as envs_ascend
  ```

  vLLM version: v0.10.0; vLLM main: vllm-project/vllm@71683ca
  Signed-off-by: shen-shanshan <467638484@qq.com>
* feat(torchair): consider not using gmmswigluquant when torchair is enabled
* fix(dtype): unify `w1_scale` dtype

---------

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Signed-off-by: haojiangzheng <justineric096@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: jack <QwertyJack@users.noreply.github.com>
Co-authored-by: zhenghaojiang <zhjoneson@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
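As a rough illustration of what the `GroupedMatmulSwigluQuant` fusion in the commits above computes, here is a hypothetical, unfused reference in plain NumPy: a per-expert (grouped) matmul, a SwiGLU activation, then per-token dynamic int8 quantization. The function name, shapes, and quantization scheme here are assumptions for illustration, not the contract of the actual NPU operator.

```python
import numpy as np

def swiglu(x):
    # Split the projection into gate and up halves; apply SiLU to the gate.
    gate, up = np.split(x, 2, axis=-1)
    return up * (gate / (1.0 + np.exp(-gate)))  # SiLU(gate) * up

def grouped_matmul_swiglu_quant(tokens, weights, group_list):
    """Unfused sketch: tokens [N, D], per-expert weights [E, D, 2H],
    group_list = cumulative token counts per expert (tokens sorted by expert)."""
    outs, scales, start = [], [], 0
    for expert, end in enumerate(group_list):
        h = swiglu(tokens[start:end] @ weights[expert])    # grouped matmul + activation
        s = np.abs(h).max(axis=-1, keepdims=True) / 127.0  # per-token dynamic scale
        outs.append(np.round(h / np.maximum(s, 1e-8)).astype(np.int8))
        scales.append(s)
        start = end
    return np.concatenate(outs), np.concatenate(scales)
```

A fused kernel would perform these three steps in one pass per expert group, avoiding the intermediate fp tensors that this reference materializes.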
What this PR does / why we need it?
Refactor V1 Attention for better extensibility (prepared for torchair attention refactor).
Main changes:
- Move each kind of forward into its own method, e.g. `_forward_prefill_no_cache()`, `_forward_prefill_cache_hit()`, `_forward_decode_only()`, `_forward_v1_style()`.
Does this PR introduce any user-facing change?
No.
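The split described above can be sketched as a dispatch pattern: `forward()` only routes on the attention state, and each case lives in its own overridable method. This is a minimal, hypothetical sketch; the `_forward_*` names come from the PR, but `AttnState`, the class names, and the string return values are illustrative stand-ins, not the actual vllm-ascend code.

```python
from enum import Enum, auto
from types import SimpleNamespace

class AttnState(Enum):
    PREFILL_NO_CACHE = auto()
    PREFILL_CACHE_HIT = auto()
    DECODE_ONLY = auto()
    V1_STYLE = auto()

class AttentionImpl:
    """forward() only dispatches; each attention case has its own method."""
    def forward(self, q, k, v, meta):
        handlers = {
            AttnState.PREFILL_NO_CACHE: self._forward_prefill_no_cache,
            AttnState.PREFILL_CACHE_HIT: self._forward_prefill_cache_hit,
            AttnState.DECODE_ONLY: self._forward_decode_only,
        }
        return handlers.get(meta.attn_state, self._forward_v1_style)(q, k, v, meta)

    def _forward_prefill_no_cache(self, q, k, v, meta):
        return "prefill_no_cache"
    def _forward_prefill_cache_hit(self, q, k, v, meta):
        return "prefill_cache_hit"
    def _forward_decode_only(self, q, k, v, meta):
        return "decode_only"
    def _forward_v1_style(self, q, k, v, meta):
        return "v1_style"

class TorchairAttentionImpl(AttentionImpl):
    # A child class overrides only the branch it needs to change.
    def _forward_decode_only(self, q, k, v, meta):
        return "torchair_decode_only"
```

This is why extensibility improves: a torchair backend can subclass and replace a single `_forward_*` method instead of re-implementing one monolithic forward.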
How was this patch tested?
unit test:
e2e test: