
Conversation

@wangxiyuan (Collaborator) commented on Nov 29, 2025

Set enable_chunked_prefill to True by default for e2e tests to keep the same behavior as vLLM.
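
For context, here is a minimal usage sketch of what the new default means for an e2e test. VllmRunner is the e2e test wrapper whose signature is changed below; the import path, model name, and generate_greedy call are illustrative assumptions, not taken from this PR.

```python
# Illustrative sketch only: import path, model name, and method name are assumed.
from tests.e2e.conftest import VllmRunner

prompts = ["Hello, my name is"]

# With this change, leaving enable_chunked_prefill unset runs the test with
# chunked prefill enabled, matching vLLM's default behavior.
with VllmRunner("Qwen/Qwen2.5-0.5B-Instruct") as runner:
    outputs = runner.generate_greedy(prompts, max_tokens=32)

# A test that still needs the old behavior has to opt out explicitly.
with VllmRunner("Qwen/Qwen2.5-0.5B-Instruct",
                enable_chunked_prefill=False) as runner:
    outputs = runner.generate_greedy(prompts, max_tokens=32)
```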

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request enables chunked prefill by default in the test suite by changing the default value of enable_chunked_prefill in VllmRunner. This is accompanied by un-skipping a prefix caching test, presumably now fixed, and cleaning up a redundant parameter in another test.

My main concern is regarding the handling of chunked prefill for MLA models. The change enables chunked prefill for all models in tests, but there's an unused configuration parameter chunked_prefill_for_mla that suggests this behavior should be conditional for MLA models. This inconsistency could be confusing and lead to unexpected behavior. Please see my detailed comment on this.

Otherwise, the changes look consistent with the goal of enabling chunked prefill in CI.

```diff
     tensor_parallel_size: int = 1,
     block_size: int = 16,
-    enable_chunked_prefill: bool = False,
+    enable_chunked_prefill: bool = True,
```

Severity: high

Enabling enable_chunked_prefill by default will affect MLA (Multi-head Latent Attention) models in tests. The logic in the _build_attn_state function in vllm_ascend/worker/model_runner_v1.py will now set the attention state to ChunkedPrefill for all models, including MLA models.

However, I found a chunked_prefill_for_mla configuration in vllm_ascend/ascend_config.py that seems intended to control this behavior for MLA models, but it is currently unused. This inconsistency is confusing and could lead to unexpected behavior for developers trying to configure the feature.

If chunked prefill is now fully supported for MLA models and should be on by default, please consider removing the unused chunked_prefill_for_mla configuration to avoid confusion.

If chunked prefill for MLA is experimental or should be opt-in, the logic in _build_attn_state should be updated to respect this configuration. For example:

```python
# In vllm_ascend/worker/model_runner_v1.py, _build_attn_state
...
        elif self.scheduler_config.enable_chunked_prefill:
            if self.vllm_config.model_config.use_mla and not self.ascend_config.chunked_prefill_for_mla:
                attn_state = AscendAttentionState.PrefillCacheHit
            else:
                attn_state = AscendAttentionState.ChunkedPrefill
...
```

Given this PR changes the default for all tests, clarifying the intended behavior for MLA models is important for maintainability.
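
For reference, a flag like chunked_prefill_for_mla is usually surfaced through additional_config. The following is a minimal sketch assuming a plain dict lookup; only the option name comes from this review, while the surrounding class and attribute handling are hypothetical, not the actual ascend_config.py implementation.

```python
# Hypothetical sketch of parsing the opt-in flag; not the real AscendConfig code.
class AscendConfig:
    def __init__(self, vllm_config):
        additional_config = getattr(vllm_config, "additional_config", None) or {}
        # Keep chunked prefill off for MLA models unless explicitly enabled.
        self.chunked_prefill_for_mla = bool(
            additional_config.get("chunked_prefill_for_mla", False))
```

A test that wants MLA chunked prefill would then pass something like additional_config={"chunked_prefill_for_mla": True} when constructing the engine.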

@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@MengqingCao added the ready (read for review) and ready-for-test (start test by label for PR) labels on Nov 29, 2025
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

Labels: module:tests, ready (read for review), ready-for-test (start test by label for PR)
