Conversation

@yiz-liu yiz-liu commented Jun 25, 2025

What this PR does / why we need it?

Reset all unused positions in NPUModelRunner to prevent out-of-bounds asserts in the GatherV3 operator.

Currently, in get_splitfuse_attn_mask, the position tensor may contain values that exceed the dimensions of the attention mask, triggering a GatherV3 boundary check failure. These invalid indices originate from stale “dirty” entries left over in position due to padding logic in the ACL graph. Specifically, in _process_reqs, the variable num_input_tokens is always greater than or equal to total_num_scheduled_tokens, so any positions not explicitly cleared from a previous batch will persist and cause this sporadic error.

Note that in the original vLLM implementation, masks are constructed internally from other arguments, so these lingering values never surface. On the Ascend platform, however, where split-fuse attention requires externally supplied masks, these residual indices become critical and lead to this elusive, hard-to-reproduce failure.

The fix is to explicitly reset or zero out all unused entries in the position tensor before passing it to GatherV3, ensuring that every index lies within the valid range of the attention mask.
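The essence of the fix can be sketched as follows. This is a minimal, framework-agnostic illustration using NumPy, not the actual `NPUModelRunner` code; the helper name `prepare_positions` is hypothetical, while `num_input_tokens` and `total_num_scheduled_tokens` follow the PR description:

```python
import numpy as np

def prepare_positions(positions: np.ndarray,
                      new_positions: np.ndarray,
                      num_input_tokens: int) -> np.ndarray:
    """Write this batch's positions, then zero the padded tail.

    `positions` is a persistent buffer reused across batches. Without the
    final reset, entries beyond the current batch keep stale values from a
    previous (longer) batch and can index past the attention mask, which is
    exactly what trips the GatherV3 boundary check.
    """
    n = len(new_positions)           # total_num_scheduled_tokens
    assert n <= num_input_tokens     # ACL-graph padding only ever extends the batch
    positions[:n] = new_positions
    positions[n:num_input_tokens] = 0  # the fix: clear the stale "dirty" entries
    return positions[:num_input_tokens]

# A previous batch left large position values in the buffer...
buf = np.array([7, 8, 9, 10, 11], dtype=np.int64)
# ...and the new batch schedules only 2 tokens but pads to 4.
out = prepare_positions(buf, np.array([0, 1], dtype=np.int64), 4)
print(out)  # [0 1 0 0] -- every index now lies within the mask's valid range
```

Without the zeroing line, the padded slots would still hold 9 and 10 from the earlier batch, which is how a batch-size-dependent, sporadic out-of-bounds arises.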

Closes: #1038

Does this PR introduce any user-facing change?

No

How was this patch tested?

…rror

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

codecov bot commented Jun 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 27.21%. Comparing base (c30ddb8) to head (0f3fa7b).
⚠️ Report is 550 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1416      +/-   ##
==========================================
- Coverage   27.39%   27.21%   -0.19%     
==========================================
  Files          56       56              
  Lines        6191     6214      +23     
==========================================
- Hits         1696     1691       -5     
- Misses       4495     4523      +28     
Flag        Coverage Δ
unittests   27.21% <ø> (-0.19%) ⬇️

@Yikun Yikun changed the title [main][Bugfix] Fix GatherV3 bug [Bugfix] Reset all unused positions to prevent out-of-bounds in GatherV3 Jun 25, 2025
@Yikun Yikun added the ready read for review label Jun 25, 2025
Collaborator

@Yikun Yikun left a comment


Considering that we have been digging into this problem for a very long time, it would be great to have a regression test (a unit test or e2e test) to catch this and prevent it from breaking again.

Feel free to add test in a separate PR.

@wangxiyuan
Collaborator

Let's add the unit test later, once the test framework is strong enough.

@wangxiyuan wangxiyuan merged commit 2690697 into vllm-project:main Jun 26, 2025
28 checks passed
@yiz-liu yiz-liu deleted the fix-gatherv3-main branch June 26, 2025 02:34
Collaborator Author

yiz-liu commented Jun 26, 2025

Considering that we have been digging into this problem for a very long time, it would be great to have a regression test (a unit test or e2e test) to catch this and prevent it from breaking again.

Feel free to add test in a separate PR.

Maybe we should re-evaluate the code that generates the various attention masks to identify any conflicts with the padding logic, and add some tests for the masks, since such tests do not exist in vLLM.
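Such a regression test might look like the following sketch. It is hypothetical and not part of this PR: `gather_mask_rows` merely mimics the GatherV3 boundary check on the host side, and the test asserts that a properly reset (zero-padded) position tensor can never index past the mask:

```python
import numpy as np

def gather_mask_rows(attn_mask: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """Host-side stand-in for GatherV3: every index must be in range."""
    if positions.min() < 0 or positions.max() >= attn_mask.shape[0]:
        raise IndexError("position index out of attention-mask bounds")
    return attn_mask[positions]

def test_padded_positions_stay_in_bounds() -> None:
    max_seq_len = 4
    # A lower-triangular causal mask, as used by split-fuse attention.
    attn_mask = np.tril(np.ones((max_seq_len, max_seq_len), dtype=bool))
    # Positions from the current batch, with the padded tail reset to 0.
    positions = np.array([2, 3, 0, 0], dtype=np.int64)
    rows = gather_mask_rows(attn_mask, positions)
    assert rows.shape == (max_seq_len, max_seq_len)
    # A stale entry beyond the mask (e.g. 9 left over from a longer batch)
    # must be rejected, reproducing the original failure mode.
    dirty = np.array([2, 3, 9, 0], dtype=np.int64)
    try:
        gather_mask_rows(attn_mask, dirty)
    except IndexError:
        return
    raise AssertionError("stale out-of-range position was not caught")

test_padded_positions_stay_in_bounds()
```

Keeping the bounds check explicit on the host makes the failure deterministic and easy to assert on, instead of surfacing as a sporadic device-side GatherV3 abort.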

@Yikun Yikun added long-term-test enable long term test for PR ready-for-test start test by label for PR labels Jun 26, 2025
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Jun 30, 2025
…rV3 (vllm-project#1416)

venus-taibai pushed a commit to venus-taibai/vllm-ascend that referenced this pull request Sep 17, 2025
…f-bounds in GatherV3 (vllm-project#1416)

Merge branch wengang/cherry-pick-1416 of git@code.alipay.com:Theta/vllm-ascend.git into dev-v0.9.1.0622
https://code.alipay.com/Theta/vllm-ascend/pull_requests/222

Reviewed-by: 子宏 <tanzhiqiang.tzq@antgroup.com>


* [Bugfix] Reset all unused positions to prevent out-of-bounds in GatherV3 (vllm-project#1416)
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
…rV3 (vllm-project#1416)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
…rV3 (vllm-project#1416)


Labels

long-term-test enable long term test for PR ready read for review ready-for-test start test by label for PR

Development

Successfully merging this pull request may close these issues.

[Bug]: Failed to complete vllm benchmark after enable VLLM_USE_V1=1 due to gather_v3 error
