[0.9.1][Perf] Use fused ops npu_top_k_top_p #1920
Merged
What this PR does / why we need it?
Use the fused op torch_npu.npu_top_k_top_p(logits, p, k) when both p and k are not None; otherwise fall back to the original implementation. The replacement takes place automatically when VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1.
This patch uses npu_top_k_top_p, which requires torch_npu>=2.5.1.post1.dev20250619.
This modification is backported from https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/patch/worker/patch_common/patch_sampler.py
PR: #1308
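For reference, the semantics the fused op combines are top-k filtering followed by top-p (nucleus) filtering of the logits. Below is a minimal pure-Python sketch of that two-stage filter on a single logit vector; it is illustrative only (the actual npu_top_k_top_p kernel operates on batched tensors on the NPU), and the function name and list-based interface here are made up for the example.

```python
import math

def top_k_top_p(logits, k, p):
    """Illustrative sketch: keep the k highest logits, then keep the
    smallest prefix of those whose softmax probabilities sum to at
    least p; mask everything else to -inf."""
    # Top-k: indices of the k largest logits, highest first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = order[:k]

    # Softmax over the kept logits (numerically stabilized).
    m = max(logits[i] for i in kept)
    exps = {i: math.exp(logits[i] - m) for i in kept}
    z = sum(exps.values())

    # Top-p: keep the smallest prefix with cumulative probability >= p.
    out = [float("-inf")] * len(logits)
    cum = 0.0
    for i in kept:
        out[i] = logits[i]
        cum += exps[i] / z
        if cum >= p:
            break
    return out

# Example: with k=3, p=0.9, only the two highest logits survive,
# since their probabilities already sum past 0.9.
print(top_k_top_p([2.0, 1.0, 0.0, -1.0], k=3, p=0.9))
```

The fused NPU kernel performs both stages in one launch, avoiding the intermediate sort/mask tensors that the unfused PyTorch path materializes.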
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests and e2e tests.