vllm-ascend support chunked prefill #1172
Conversation
Signed-off-by: fems14 <1804143737@qq.com>
vllm_ascend/attention/mla_v1.py
Outdated
max_context_chunk = (self.chunked_prefill_workspace_size //
                     num_prefills_with_context_cpu)
max_context_chunk = round_down(
    max_context_chunk, self.block_size)  # 待确认是否需要是block_size的倍数 (TODO: confirm whether this must be a multiple of block_size)
Do not use Chinese in code.
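For context, here is a minimal standalone sketch of the chunk-sizing logic being discussed; `round_down` is reimplemented locally and all the sizes are hypothetical, so this is an illustration rather than the actual vllm-ascend code:

```python
def round_down(value: int, multiple: int) -> int:
    # Largest multiple of `multiple` that is less than or equal to value.
    return value - value % multiple

# Hypothetical sizes: split the shared prefill workspace across the prefill
# requests that still carry context, then align each request's chunk to the
# KV-cache block size so chunk boundaries coincide with cache blocks.
chunked_prefill_workspace_size = 128 * 1024  # tokens (assumed)
num_prefills_with_context = 3                # requests (assumed)
block_size = 128                             # KV-cache block size (assumed)

max_context_chunk = chunked_prefill_workspace_size // num_prefills_with_context
max_context_chunk = round_down(max_context_chunk, block_size)
print(max_context_chunk)  # 43648, a multiple of 128
```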
vllm_ascend/ascend_config.py
Outdated
self.expert_tensor_parallel_size = int(
    additional_config.get("expert_tensor_parallel_size", 0))
self.expert_map_path = additional_config.get("expert_map_path", None)
self.new_chunked = additional_config.get("new_chunked", False)
Rename this to something like chunked_prefill_for_mla, and update the additional_config documentation as well.
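For illustration, enabling the renamed option from the user side might look roughly like the following; the model name is a placeholder and the key is the rename proposed above, so treat this as a sketch rather than the documented API:

```python
from vllm import LLM

# Hypothetical usage: pass the renamed flag through additional_config,
# the same dictionary the quoted ascend_config.py snippet reads from.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder model
    additional_config={
        "chunked_prefill_for_mla": True,
    },
)
```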
vLLM uses chunked prefill to schedule requests by default; maybe we should remove this.
attn_state = AscendAttentionState.SpecDecoding
# splitfuse
elif not ascend_config.ascend_scheduler_config.enabled or self.chunked_prefill_enabled:
elif self.chunked_prefill_enabled:
This part of the code still has problems. My suggestion is to leave this part unchanged and remove the flash_attention and vanilla_mla paths in the MLA backend.
dtype=query.dtype,
device=query.device)
# current request is chunked in prefill, disable flash attention with chunked prefill
vanilla_chunked_prefill_mla(
Please remove the vanilla_chunked_prefill_mla and flash_attention paths; we no longer need them.
scale=self.scale,
alibi_slopes=None,
causal=True)
elif attn_metadata.attn_state in [
We don't need attn_state any more. You can ignore all the attn_state values and just execute ring attention in forward_prefill by default.
we'll remove it in the future once pta is upgraded.
Signed-off-by: fems14 <1804143737@qq.com>
@pytest.mark.skipif(os.getenv("VLLM_USE_V1") == "0",
                    reason="new chunked only support on v1")
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("max_tokens", [1])
Please set max_tokens to a larger value to make this test meaningful.
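A sketch of what the suggested adjustment could look like; the model list, the value 32, and the test body are assumptions for illustration only:

```python
import os

import pytest

MODELS = ["deepseek-ai/DeepSeek-V2-Lite"]  # placeholder model list


@pytest.mark.skipif(os.getenv("VLLM_USE_V1") == "0",
                    reason="new chunked only support on v1")
@pytest.mark.parametrize("model", MODELS)
# A larger max_tokens exercises decode steps after the chunked prefill
# instead of stopping after a single generated token.
@pytest.mark.parametrize("max_tokens", [32])
def test_chunked_prefill(model: str, max_tokens: int) -> None:
    ...  # run generation with and without chunked prefill and compare outputs
```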
additional_config={
    'ascend_scheduler_config': {
        'enabled': True
    },
I think you should enable chunked_prefill_for_mla instead of ascend_scheduler_config?
@MengqingCao this is used to compare the results with chunked prefill enabled versus disabled. L52 enables chunked prefill, and here it is disabled by using the ascend scheduler, so the test is fine.
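For readers following the thread, the comparison pattern described here could be sketched as below; the prompt, model name, and assertion granularity are assumptions, not the actual test code:

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]  # placeholder prompt
sampling = SamplingParams(temperature=0.0, max_tokens=32)

# Baseline: vLLM v1 schedules requests with chunked prefill by default.
chunked_llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite")  # placeholder model
chunked_out = chunked_llm.generate(prompts, sampling)

# Reference: enabling the ascend scheduler disables chunked prefill,
# so the two runs should produce identical greedy outputs.
reference_llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder model
    additional_config={"ascend_scheduler_config": {"enabled": True}},
)
reference_out = reference_llm.generate(prompts, sampling)

assert chunked_out[0].outputs[0].text == reference_out[0].outputs[0].text
```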
Signed-off-by: fems14 <1804143737@qq.com>
don't forget to remove chunked_prefill_for_mla once pta is upgraded.
| `expert_tensor_parallel_size` | str | `0` | Expert tensor parallel size for the model to use. |
| `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used in the RLHF case. |
| `expert_map_path` | str | None | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
| `chunked_prefill_for_mla` | bool | `False` | Whether to enable the fused operator-like chunked_prefill. |
This config is only used temporarily; we'll remove this value once pta is upgraded. Also, currently setting this value to True will lead to a known error.
@ganyi1996ppo @Yikun @jianzs I talked with @fems14 offline. This change needs more work based on the new version of pta; she will update it in the future. Currently, this change is backward compatible, so I merged it to unblock the feature. If you have any questions, feel free to keep reviewing. A follow-up PR is welcome. Thanks.
### What this PR does / why we need it?
vllm-ascend support chunked prefill for MLA

---------

Signed-off-by: fems14 <1804143737@qq.com>

…t#1240) vllm-ascend support chunked prefill for MLA (main). Related PR: vllm-project#1172

---------

Signed-off-by: fems14 <1804143737@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
What this PR does / why we need it?
vllm-ascend support chunked prefill for MLA
Does this PR introduce any user-facing change?
How was this patch tested?