[Perf] Delete redundant operations in model_runner and forward_context #3677
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

- If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the codebase by moving some checks into utility functions and removing redundant operations. The changes in model_runner_v1.py correctly fix a bug where stale data was being used and remove redundant code, improving correctness and clarity. However, the new utility function has_layer_idx in utils.py introduces a critical bug due to a flawed caching mechanism. I've provided a comment with a suggested fix for this issue.
```python
def has_layer_idx(model_instance: torch.nn.Module) -> bool:
    global _HAS_LAYER_IDX
    if _HAS_LAYER_IDX is None:
        _HAS_LAYER_IDX = model_instance is not None and \
            hasattr(model_instance, "model") and \
            hasattr(model_instance.model, "start_layer")
    return _HAS_LAYER_IDX
```
The current implementation of has_layer_idx uses a global variable _HAS_LAYER_IDX to cache its result. This caching mechanism is flawed because the function's outcome depends on the model_instance argument, which is not guaranteed to be the same across all calls.
For instance, set_ascend_forward_context can be invoked with model_instance=None (e.g., from kv_connector_no_forward), which would cause _HAS_LAYER_IDX to be cached as False. Any subsequent calls, even with a valid model_instance, would then incorrectly return False, preventing features that rely on layer_idx from being enabled.
Since this check is inexpensive, I recommend removing the caching mechanism to fix this bug. The global variable _HAS_LAYER_IDX should also be removed.
```python
def has_layer_idx(model_instance: torch.nn.Module) -> bool:
    return (model_instance is not None and hasattr(model_instance, "model") and
            hasattr(model_instance.model, "start_layer"))
```
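To make the failure mode concrete, here is a minimal, self-contained sketch contrasting the cached variant from the diff with the uncached suggestion. The `_Toy` and `_Inner` modules are hypothetical stand-ins for a model exposing `model.start_layer`, not actual vLLM Ascend classes:

```python
import torch

_HAS_LAYER_IDX = None  # module-level cache, as in the patch under review


def has_layer_idx_cached(model_instance: torch.nn.Module) -> bool:
    """Cached variant from the diff: the first result sticks forever."""
    global _HAS_LAYER_IDX
    if _HAS_LAYER_IDX is None:
        _HAS_LAYER_IDX = model_instance is not None and \
            hasattr(model_instance, "model") and \
            hasattr(model_instance.model, "start_layer")
    return _HAS_LAYER_IDX


def has_layer_idx(model_instance: torch.nn.Module) -> bool:
    """Uncached variant from the suggestion: re-evaluated on every call."""
    return (model_instance is not None and hasattr(model_instance, "model") and
            hasattr(model_instance.model, "start_layer"))


class _Inner(torch.nn.Module):  # hypothetical stand-in for model.model
    def __init__(self):
        super().__init__()
        self.start_layer = 0


class _Toy(torch.nn.Module):  # hypothetical stand-in for the full model
    def __init__(self):
        super().__init__()
        self.model = _Inner()


# A call path such as kv_connector_no_forward may pass model_instance=None first.
print(has_layer_idx_cached(None))    # False, and now cached globally
print(has_layer_idx_cached(_Toy()))  # still False -> stale cache, wrong answer
print(has_layer_idx(None))           # False
print(has_layer_idx(_Toy()))         # True
```

With the cache in place, a single early call that passes `None` pins the result to `False` for the rest of the process, while the uncached version answers correctly on every call.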
This pull request has conflicts, please resolve those before we can evaluate the pull request.
…rd_context (#3775)

cherry pick #3677

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
What this PR does / why we need it?
Remove redundant operations from `model_runner` and `forward_context`. This optimization can significantly reduce the idle time (bubble) before decoding when running models with small parameter counts (e.g., Qwen/Qwen2.5-0.5B). Testing on 800I A2, the bubble is reduced from 3.8 ms to 2.8 ms:

Before
After

Does this PR introduce any user-facing change?
No
How was this patch tested?