vllm-ascend support chunked prefill #1172
Conversation
Signed-off-by: fems14 <1804143737@qq.com>
vllm_ascend/attention/mla_v1.py
Outdated
max_context_chunk = (self.chunked_prefill_workspace_size //
                     num_prefills_with_context_cpu)
max_context_chunk = round_down(
    max_context_chunk, self.block_size)  # 待确认是否需要是block_size的倍数 (TODO: confirm whether this must be a multiple of block_size)
Do not use Chinese in code.
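For context, here is a minimal standalone sketch of the chunk-sizing logic being discussed; `round_down` is reimplemented locally and all the sizes are hypothetical, so this is an illustration rather than the actual vllm-ascend code:

```python
def round_down(value: int, multiple: int) -> int:
    # Largest multiple of `multiple` that is less than or equal to value.
    return value - value % multiple

# Hypothetical sizes: split the shared prefill workspace across the prefill
# requests that still carry context, then align each request's chunk to the
# KV-cache block size so chunk boundaries coincide with cache blocks.
chunked_prefill_workspace_size = 128 * 1024  # tokens (assumed)
num_prefills_with_context = 3                # requests (assumed)
block_size = 128                             # KV-cache block size (assumed)

max_context_chunk = chunked_prefill_workspace_size // num_prefills_with_context
max_context_chunk = round_down(max_context_chunk, block_size)
print(max_context_chunk)  # 43648, a multiple of 128
```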
vllm_ascend/ascend_config.py
Outdated
self.expert_tensor_parallel_size = int(
    additional_config.get("expert_tensor_parallel_size", 0))
self.expert_map_path = additional_config.get("expert_map_path", None)
self.new_chunked = additional_config.get("new_chunked", False)
Rename this to something like chunked_prefill_for_mla, and update the additional_config documentation as well.
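For illustration, enabling the renamed option from the user side might look roughly like the following; the model name is a placeholder and the key is the rename proposed above, so treat this as a sketch rather than the documented API:

```python
from vllm import LLM

# Hypothetical usage: pass the renamed flag through additional_config,
# the same dictionary the quoted ascend_config.py snippet reads from.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder model
    additional_config={
        "chunked_prefill_for_mla": True,
    },
)
```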
vLLM uses chunked prefill to schedule requests by default; maybe we should remove this.
attn_state = AscendAttentionState.SpecDecoding
# splitfuse
elif not ascend_config.ascend_scheduler_config.enabled or self.chunked_prefill_enabled:
elif self.chunked_prefill_enabled:
This part of the code still has problems. My suggestion is to leave this part unchanged and remove the flash_attention and vanilla_mla paths in the MLA backend.
dtype=query.dtype,
device=query.device)
# current request is chunked in prefill, disable flash attention with chunked prefill
vanilla_chunked_prefill_mla(
Please remove the vanilla_chunked_prefill_mla and flash_attention paths; we no longer need them.
scale=self.scale,
alibi_slopes=None,
causal=True)
elif attn_metadata.attn_state in [
We don't need attn_state any more. You can ignore all the attn_state values and just execute ring attention in forward_prefill by default.
we'll remove it in the future once pta is upgraded.
Signed-off-by: fems14 <1804143737@qq.com>
@pytest.mark.skipif(os.getenv("VLLM_USE_V1") == "0",
                    reason="new chunked only support on v1")
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("max_tokens", [1])
Please set max_tokens to a larger value to make this test meaningful.
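A sketch of what the suggested adjustment could look like; the model list, the value 32, and the test body are assumptions for illustration only:

```python
import os

import pytest

MODELS = ["deepseek-ai/DeepSeek-V2-Lite"]  # placeholder model list


@pytest.mark.skipif(os.getenv("VLLM_USE_V1") == "0",
                    reason="new chunked only support on v1")
@pytest.mark.parametrize("model", MODELS)
# A larger max_tokens exercises decode steps after the chunked prefill
# instead of stopping after a single generated token.
@pytest.mark.parametrize("max_tokens", [32])
def test_chunked_prefill(model: str, max_tokens: int) -> None:
    ...  # run generation with and without chunked prefill and compare outputs
```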
additional_config={
    'ascend_scheduler_config': {
        'enabled': True
    },
I think you should enable chunked_prefill_for_mla instead of ascend_scheduler_config?
@MengqingCao this is used to compare the results with chunked prefill enabled versus disabled. L52 enables chunked prefill, and here it is disabled by using the ascend scheduler, so the test is fine.
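For readers following the thread, the comparison pattern described here could be sketched as below; the prompt, model name, and assertion granularity are assumptions, not the actual test code:

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]  # placeholder prompt
sampling = SamplingParams(temperature=0.0, max_tokens=32)

# Baseline: vLLM v1 schedules requests with chunked prefill by default.
chunked_llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite")  # placeholder model
chunked_out = chunked_llm.generate(prompts, sampling)

# Reference: enabling the ascend scheduler disables chunked prefill,
# so the two runs should produce identical greedy outputs.
reference_llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder model
    additional_config={"ascend_scheduler_config": {"enabled": True}},
)
reference_out = reference_llm.generate(prompts, sampling)

assert chunked_out[0].outputs[0].text == reference_out[0].outputs[0].text
```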
Signed-off-by: fems14 <1804143737@qq.com>
don't forget to remove chunked_prefill_for_mla once pta is upgraded.
| `expert_tensor_parallel_size` | str | `0` | Expert tensor parallel size for the model to use. |
| `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used in the RLHF case. |
| `expert_map_path` | str | None | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
| `chunked_prefill_for_mla` | bool | `False` | Whether to enable the fused operator-like chunked_prefill. |
This config is only used temporarily; we'll remove this value once pta is upgraded. Also, currently setting this value to True will lead to a known error.
@ganyi1996ppo @Yikun @jianzs I talked with @fems14 offline. This change needs more work based on the new version of pta; she will update it in the future. Currently, this change is backward compatible, so I merged it to unblock the feature. If you have any questions, feel free to keep reviewing. A follow-up PR is welcome. Thanks.
### What this PR does / why we need it?
vllm-ascend support chunked prefill for MLA

---------

Signed-off-by: fems14 <1804143737@qq.com>

…t#1240) vllm-ascend support chunked prefill for MLA (main). Related PR: vllm-project#1172

---------

Signed-off-by: fems14 <1804143737@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
What this PR does / why we need it?
vllm-ascend support chunked prefill for MLA
Does this PR introduce any user-facing change?
How was this patch tested?