[CI] enable chunked prefill by default #4569
base: main
Conversation
55385aa to e2e4c4a
Code Review
This pull request enables chunked prefill by default in the test suite by changing the default value of enable_chunked_prefill in VllmRunner. This is accompanied by un-skipping a prefix caching test, presumably now fixed, and cleaning up a redundant parameter in another test.
My main concern is regarding the handling of chunked prefill for MLA models. The change enables chunked prefill for all models in tests, but there's an unused configuration parameter chunked_prefill_for_mla that suggests this behavior should be conditional for MLA models. This inconsistency could be confusing and lead to unexpected behavior. Please see my detailed comment on this.
Otherwise, the changes look consistent with the goal of enabling chunked prefill in CI.
     tensor_parallel_size: int = 1,
     block_size: int = 16,
-    enable_chunked_prefill: bool = False,
+    enable_chunked_prefill: bool = True,
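For tests that still need the legacy non-chunked path after this default flip, the flag can be passed explicitly. A minimal sketch, assuming the usual context-manager usage of VllmRunner from the test conftest; the import path, model name, and prompt are placeholders, and generate_greedy is assumed to behave like the vLLM test helper of the same name:

```python
# Hypothetical opt-out: a test that still wants chunked prefill disabled.
from tests.conftest import VllmRunner  # import path is an assumption


def test_generate_without_chunked_prefill():
    prompts = ["Hello, my name is"]
    # Explicitly override the new default (enable_chunked_prefill=True).
    with VllmRunner("Qwen/Qwen2.5-0.5B-Instruct",
                    enable_chunked_prefill=False) as runner:
        outputs = runner.generate_greedy(prompts, max_tokens=16)
    assert len(outputs) == len(prompts)
```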
Enabling enable_chunked_prefill by default will affect MLA (Multi-head Latent Attention) models in tests. The logic in the _build_attn_state function of vllm_ascend/worker/model_runner_v1.py will now set the attention state to ChunkedPrefill for all models, including MLA models.
However, I found a configuration chunked_prefill_for_mla in vllm_ascend/ascend_config.py that seems intended to control this behavior for MLA models, but it is currently unused. This is confusing and could lead to unexpected behavior for developers trying to configure this feature.
If chunked prefill is now fully supported for MLA models and should be on by default, please consider removing the unused chunked_prefill_for_mla configuration to avoid confusion.
If chunked prefill for MLA is experimental or should be opt-in, the logic in _build_attn_state should be updated to respect this configuration. For example:
# In vllm_ascend/worker/model_runner_v1.py, _build_attn_state
...
elif self.scheduler_config.enable_chunked_prefill:
    if self.vllm_config.model_config.use_mla and not self.ascend_config.chunked_prefill_for_mla:
        attn_state = AscendAttentionState.PrefillCacheHit
    else:
        attn_state = AscendAttentionState.ChunkedPrefill
...

Given this PR changes the default for all tests, clarifying the intended behavior for MLA models is important for maintainability.
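If the opt-in route is chosen instead, the user-facing side might look like the sketch below. It assumes chunked_prefill_for_mla is read as a top-level key of vLLM's additional_config (the way other ascend_config fields are parsed) and that additional_config is forwarded through LLM/EngineArgs; the model name is only a placeholder for an MLA model.

```python
# Sketch: explicitly opting an MLA model into chunked prefill.
# The additional_config key is an assumption; check vllm_ascend/ascend_config.py
# for how chunked_prefill_for_mla is actually parsed.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder MLA model
    enable_chunked_prefill=True,           # scheduler-level switch
    additional_config={"chunked_prefill_for_mla": True},
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```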
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Set enable_chunked_prefill to True for e2e tests by default to keep the same behavior as vLLM.
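With the flipped default, an e2e test no longer has to pass the flag to match vLLM's behavior. A minimal sketch, under the same VllmRunner assumptions as above (import path, model, and prompt are placeholders):

```python
# After this PR, chunked prefill is on unless a test turns it off explicitly.
from tests.conftest import VllmRunner  # import path is an assumption


def test_generate_with_default_chunked_prefill():
    with VllmRunner("Qwen/Qwen2.5-0.5B-Instruct") as runner:
        outputs = runner.generate_greedy(["The capital of France is"], max_tokens=8)
    assert outputs
```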