[Bugfix] Reset all unused positions to prevent out-of-bounds in GatherV3 #1416
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main    #1416      +/-   ##
==========================================
- Coverage   27.39%   27.21%   -0.19%
==========================================
  Files          56       56
  Lines        6191     6214      +23
==========================================
- Hits         1696     1691       -5
- Misses       4495     4523      +28
==========================================

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Yikun left a comment
Considering that we have been digging into this problem for a very long time, it would be great to have a regression test (UT or e2e test) to catch it and prevent the break from happening again.
Feel free to add the test in a separate PR.
Let's add the UT later once the test framework is strong enough.
Maybe we should re-evaluate the code that generates the various attention masks to identify any conflicts with the padding logic, and add some tests for the masks, since such tests do not exist in vLLM. A rough sketch of what such a test could check is below.
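A regression test along these lines might look roughly like the following sketch. The buffer layout and names are illustrative only (not the runner's actual code); it checks just the invariant that padded slots carry no stale indices that could exceed the mask bounds.

```python
import torch


def test_padded_positions_stay_within_mask():
    # Simulate a previous, larger batch leaving stale position values behind
    # in a persistent positions buffer (values >= mask length).
    max_num_tokens, mask_len = 8, 4
    positions = torch.full((max_num_tokens,), 7, dtype=torch.long)

    # The current step schedules fewer tokens, but the ACL graph pads the
    # batch, so the kernel reads num_input_tokens entries.
    total_num_scheduled_tokens, num_input_tokens = 2, 6
    positions[:total_num_scheduled_tokens] = torch.tensor([0, 1])

    # The behavior under test: every padded slot is cleared so no stale
    # index survives into the gather.
    positions[total_num_scheduled_tokens:num_input_tokens] = 0

    # Every index the gather would use must lie inside the attention mask.
    assert int(positions[:num_input_tokens].max()) < mask_len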
What this PR does / why we need it?
Reset all unused positions in `NPUModelRunner` to prevent out-of-bounds asserts in the `GatherV3` operator.

Currently, in `get_splitfuse_attn_mask`, the `position` tensor may contain values that exceed the dimensions of the attention mask, triggering a `GatherV3` boundary-check failure. These invalid indices originate from stale "dirty" entries left over in `position` by the padding logic of the ACL graph. Specifically, in `_process_reqs`, the variable `num_input_tokens` is always greater than or equal to `total_num_scheduled_tokens`, so any positions not explicitly cleared from a previous batch persist and cause this sporadic error.

Note that in the original vLLM implementation, masks are constructed internally from other arguments, so these lingering values never surface. On the Ascend platform, however, where split-fuse attention requires externally supplied masks, these residual indices become critical and lead to this elusive, hard-to-reproduce failure.

The fix is to explicitly reset (zero out) all unused entries in the `position` tensor before it reaches `GatherV3`, ensuring that every index lies within the valid range of the attention mask.

Closes: #1038
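For illustration, a minimal sketch of that reset, assuming a persistent `positions` buffer and the `total_num_scheduled_tokens` / `num_input_tokens` values described above. The function name is hypothetical; the actual change lives in `NPUModelRunner._process_reqs`, whose buffer handling may differ in detail.

```python
import torch


def reset_unused_positions(positions: torch.Tensor,
                           total_num_scheduled_tokens: int,
                           num_input_tokens: int) -> None:
    """Zero out padded slots so every index handed to GatherV3 stays valid.

    num_input_tokens >= total_num_scheduled_tokens when the ACL graph pads
    the batch; only the leading slots were overwritten with fresh positions,
    so anything beyond them may still hold values from a previous batch.
    """
    if num_input_tokens > total_num_scheduled_tokens:
        positions[total_num_scheduled_tokens:num_input_tokens].zero_()
```

Called before the positions slice is handed to `get_splitfuse_attn_mask`, a reset like this keeps every gathered index inside the externally supplied mask.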
Does this PR introduce any user-facing change?
No
How was this patch tested?