
Conversation


@fems14 (Contributor) commented Jun 11, 2025

What this PR does / why we need it?

vllm-ascend now supports chunked prefill for MLA.

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: fems14 <1804143737@qq.com>
@Yikun added the long-term-test and ready-for-test labels Jun 12, 2025
max_context_chunk = (self.chunked_prefill_workspace_size //
num_prefills_with_context_cpu)
max_context_chunk = round_down(
max_context_chunk, self.block_size)  # TODO: confirm whether this needs to be a multiple of block_size
Collaborator

Do not use Chinese in code comments.
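For context, here is a minimal sketch of the block-aligned chunk sizing this hunk computes. The numbers and the round_down helper below are illustrative assumptions, not the actual vllm-ascend values:

```python
def round_down(x: int, multiple: int) -> int:
    # Assumed behavior of the helper: floor x to a multiple of `multiple`.
    return (x // multiple) * multiple

# Hypothetical sizes for illustration only.
chunked_prefill_workspace_size = 128 * 1024  # workspace tokens shared by all chunked requests
num_prefills_with_context = 4                # requests that still have cached context to attend to
block_size = 128                             # KV-cache block size

# Each request gets an equal slice of the workspace ...
max_context_chunk = chunked_prefill_workspace_size // num_prefills_with_context
# ... rounded down to a block boundary so context gathers stay block-aligned.
max_context_chunk = round_down(max_context_chunk, block_size)
print(max_context_chunk)  # 32768
```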

self.expert_tensor_parallel_size = int(
additional_config.get("expert_tensor_parallel_size", 0))
self.expert_map_path = additional_config.get("expert_map_path", None)
self.new_chunked = additional_config.get("new_chunked", False)
Collaborator

Rename this to something like chunked_prefill_for_mla, and update the doc for additional_config as well.
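As a rough illustration of the suggested rename (a sketch only; the real AscendConfig has more fields and may parse this differently):

```python
class AscendConfig:
    def __init__(self, additional_config: dict | None = None):
        additional_config = additional_config or {}
        # Renamed from "new_chunked" as suggested; default False keeps current behavior.
        self.chunked_prefill_for_mla = additional_config.get(
            "chunked_prefill_for_mla", False)
```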

@ganyi1996ppo (Collaborator) commented Jun 12, 2025

vLLM uses chunked prefill to schedule requests by default; maybe we should remove this.

attn_state = AscendAttentionState.SpecDecoding
# splitfuse
elif not ascend_config.ascend_scheduler_config.enabled or self.chunked_prefill_enabled:
elif self.chunked_prefill_enabled:
Collaborator

This part of the code still has problems. My suggestion is to leave this part unchanged and remove the flash_attention and vanilla_mla paths in the mla_backend.

dtype=query.dtype,
device=query.device)
# current requests is chunked in prefill, disable flash attention with chunked prefill
vanilla_chunked_prefill_mla(
Collaborator

Please remove the vanilla_chunked_prefill_mla and flash_attention paths; we no longer need them.

scale=self.scale,
alibi_slopes=None,
causal=True)
elif attn_metadata.attn_state in [
Collaborator

We don't need attn_state any more. You can ignore all the attn_state checks and just execute ring attention in forward_prefill by default.
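For readers following the thread, a rough sketch of what chunk-wise context attention in prefill has to do: attend to the cached context one chunk at a time and merge the partial outputs with their log-sum-exp weights. This is plain PyTorch for illustration only, not the kernel this PR ends up calling:

```python
import torch

def merge_attn_states(out_a, lse_a, out_b, lse_b):
    # Combine two partial attention results using their log-sum-exp weights.
    lse = torch.logaddexp(lse_a, lse_b)
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)
    return w_a * out_a + w_b * out_b, lse

def chunked_context_attention(q, k_ctx, v_ctx, max_context_chunk, scale):
    # q: [num_query_tokens, heads, dim]; k_ctx, v_ctx: [context_len, heads, dim].
    # No causal mask is needed here because all cached context precedes the queries;
    # attention over the new (in-flight) tokens would be a separate causal call.
    out, lse = None, None
    for start in range(0, k_ctx.shape[0], max_context_chunk):
        k = k_ctx[start:start + max_context_chunk]
        v = v_ctx[start:start + max_context_chunk]
        scores = torch.einsum("qhd,khd->hqk", q, k) * scale           # [heads, q, chunk]
        chunk_lse = torch.logsumexp(scores, dim=-1).transpose(0, 1)   # [q, heads]
        chunk_out = torch.einsum("hqk,khd->qhd", torch.softmax(scores, dim=-1), v)
        if out is None:
            out, lse = chunk_out, chunk_lse
        else:
            out, lse = merge_attn_states(out, lse, chunk_out, chunk_lse)
    return out
```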

Collaborator

we'll remove it in the future once pta is upgraded.

Signed-off-by: fems14 <1804143737@qq.com>
@github-actions bot added the documentation label Jun 12, 2025
fems14 added 3 commits June 12, 2025 15:36
Signed-off-by: fems14 <1804143737@qq.com>
Signed-off-by: fems14 <1804143737@qq.com>
Signed-off-by: fems14 <1804143737@qq.com>
@pytest.mark.skipif(os.getenv("VLLM_USE_V1") == "0",
reason="new chunked prefill is only supported on v1")
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("max_tokens", [1])
Collaborator

Please set max_tokens to a larger value to make this test meaningful.
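A hedged sketch of what the suggestion points at: raise max_tokens (32 here is an arbitrary illustrative value) and use greedy sampling so the comparison is deterministic. The model name is a placeholder:

```python
import pytest
from vllm import SamplingParams

MODELS = ["deepseek-ai/DeepSeek-V2-Lite"]  # placeholder; the real test defines its own MODELS list

@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("max_tokens", [32])  # long enough that a wrong prefill changes the generated text
def test_chunked_prefill_mla(model: str, max_tokens: int) -> None:
    sampling_params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    # Run the model with and without chunked prefill and compare the generated text
    # (see the configuration sketch further down for the two engine setups).
```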

additional_config={
'ascend_scheduler_config': {
'enabled': True
},
Collaborator

I think you should enable chunked_prefill_for_mla instead of ascend_scheduler_config?

Collaborator

@MengqingCao this is used to compare the results with chunked prefill enabled and disabled. L52 enables chunked prefill, and here it is disabled by using the ascend scheduler. So the test is fine.
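To make the reply above concrete, a rough sketch of the two engine setups being compared. This follows the additional_config convention shown in this PR; the model name, prompts, and token budget are placeholders:

```python
from vllm import LLM, SamplingParams

prompts = ["The quick brown fox", "vLLM is"]
params = SamplingParams(temperature=0.0, max_tokens=32)

# Run 1: vLLM's default V1 scheduling with chunked prefill, exercising the new MLA path.
# A small token budget forces long prompts to actually be chunked.
llm_chunked = LLM(model="deepseek-ai/DeepSeek-V2-Lite",
                  enable_chunked_prefill=True,
                  max_num_batched_tokens=256)
chunked_texts = [o.outputs[0].text for o in llm_chunked.generate(prompts, params)]
del llm_chunked

# Run 2: the ascend scheduler, which does not chunk prefills, as the reference result.
llm_baseline = LLM(model="deepseek-ai/DeepSeek-V2-Lite",
                   additional_config={"ascend_scheduler_config": {"enabled": True}})
baseline_texts = [o.outputs[0].text for o in llm_baseline.generate(prompts, params)]

# With greedy sampling the two paths should produce identical text.
assert chunked_texts == baseline_texts
```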

fems14 added 6 commits June 12, 2025 17:04
Signed-off-by: fems14 <1804143737@qq.com>
Signed-off-by: fems14 <1804143737@qq.com>
Signed-off-by: fems14 <1804143737@qq.com>
Signed-off-by: fems14 <1804143737@qq.com>
Signed-off-by: fems14 <1804143737@qq.com>
Signed-off-by: fems14 <1804143737@qq.com>
@wangxiyuan merged commit ab5d110 into vllm-project:main Jun 14, 2025
18 checks passed
@wangxiyuan (Collaborator) left a comment

don't forget to remove chunked_prefill_for_mla once pta is upgraded.

| `expert_tensor_parallel_size` | str | `0` | Expert tensor parallel size the model to use. |
| `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used by rlhf case. |
| `expert_map_path` | str | None | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
| `chunked_prefill_for_mla` | bool | `False` | Whether to enable the fused operator-like chunked_prefill. |
Collaborator

This config is only used temporarily; we'll remove it once pta is upgraded. Currently, setting this value to True will lead to a known error.
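For readers of the table above, enabling the flag would look roughly like the snippet below (the model name is a placeholder, and per the comment above the flag should stay at its False default until the pta upgrade lands):

```python
from vllm import LLM

# Opt in to the MLA chunked-prefill path via vllm-ascend's additional_config.
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite",
          additional_config={"chunked_prefill_for_mla": True})
```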


@wangxiyuan (Collaborator) commented Jun 14, 2025

@ganyi1996ppo @Yikun @jianzs I talked with @fems14 offline. This change needs more work based on the new version of pta. She will update it in the future. Currently, this change is backward compatible, so I merged it to unblock the feature. If you have any questions, feel free to keep reviewing. A follow-up PR is welcome. Thanks.

fems14 added a commit to fems14/vllm-ascend that referenced this pull request Jun 16, 2025
### What this PR does / why we need it?
vllm-ascend support chunked prefill for MLA


---------

Signed-off-by: fems14 <1804143737@qq.com>
ganyi1996ppo pushed a commit that referenced this pull request Jun 16, 2025
### What this PR does / why we need it?
vllm-ascend support chunked prefill for MLA
main related PR: #1172

---------


Signed-off-by: fems14 <1804143737@qq.com>
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
…t#1240)

vllm-ascend support chunked prefill for MLA
main related PR: vllm-project#1172

---------


Signed-off-by: fems14 <1804143737@qq.com>
yiz-liu pushed a commit to yiz-liu/vllm-ascend that referenced this pull request Jun 17, 2025
…t#1240)

### What this PR does / why we need it?
vllm-ascend support chunked prefill for MLA
main related PR: vllm-project#1172

---------


Signed-off-by: fems14 <1804143737@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
shiyuan680 pushed a commit to raindaywhu/vllm-ascend that referenced this pull request Jul 7, 2025
vllm-ascend support chunked prefill for MLA

---------

Signed-off-by: fems14 <1804143737@qq.com>
@Yikun mentioned this pull request Aug 8, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
### What this PR does / why we need it?
vllm-ascend support chunked prefill for MLA


---------

Signed-off-by: fems14 <1804143737@qq.com>
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
### What this PR does / why we need it?
vllm-ascend support chunked prefill for MLA


---------

Signed-off-by: fems14 <1804143737@qq.com>

Labels

documentation, long-term-test, module:core, module:tests, ready-for-test


5 participants