[0.7.3] Optimize apply_penalties & topKtopP for both V0/V1 Engine #525

linfeng-yuan · 2025-04-14T22:54:02Z

This PR optimizes the apply_penalties and topKtopP implementations in both the V0 and V1 Engines by avoiding torch.scatter and matrix indexing operations.
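To illustrate what replacing matrix indexing can look like for the penalty step, here is a minimal sketch of a repetition penalty written with `torch.where` rather than boolean-mask indexing. It is not the code from this PR: the function name `apply_repetition_penalty` and the `seen_mask` argument are hypothetical, and the mask marking tokens that already appeared is assumed to be built elsewhere.

```python
import torch


def apply_repetition_penalty(
    logits: torch.Tensor,                # [num_seqs, vocab_size]
    seen_mask: torch.Tensor,             # [num_seqs, vocab_size], bool (hypothetical input)
    repetition_penalties: torch.Tensor,  # [num_seqs]
) -> torch.Tensor:
    """Penalize already-seen tokens using only elementwise ops (no boolean indexing)."""
    # Broadcast each sequence's penalty across the vocabulary;
    # tokens that have not been seen get a neutral factor of 1.0.
    penalties = torch.where(
        seen_mask,
        repetition_penalties.unsqueeze(1),
        torch.ones_like(logits),
    )
    # Positive logits are divided by the penalty, non-positive ones multiplied,
    # so a penalty > 1 always lowers the score of a seen token.
    return torch.where(logits > 0, logits / penalties, logits * penalties)
```

Because every step is a dense elementwise operation over the full `[num_seqs, vocab_size]` tensor, the output shape is static, which is presumably why this style suits accelerators where data-dependent indexing is slow.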

We verified the functionality of this PR using Qwen2.5-72B-Instruct. At a concurrency of 40 and with post-processing parameters set to "temperature": 0.3, "top_k": 100, "top_p": 0.9, "repetition_penalty": 1.01, the average decoding time was reduced from 300ms to 50ms.
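For the top-k/top-p step, one scatter-free approach is to sort once, read a per-row cutoff value with `gather`, and mask the unsorted logits by comparison, so nothing has to be scattered back into the original order. The sketch below is an assumption about how such a filter could be written, not the PR's implementation; the name `mask_top_k_top_p` is hypothetical. It also accepts `None` for either parameter. Ties at the cutoff are kept, so it may retain slightly more tokens than an index-based version.

```python
from typing import Optional

import torch


def mask_top_k_top_p(
    logits: torch.Tensor,       # [num_seqs, vocab_size]
    k: Optional[torch.Tensor],  # [num_seqs] integer tensor, or None
    p: Optional[torch.Tensor],  # [num_seqs] float tensor, or None
) -> torch.Tensor:
    """Set logits outside the top-k / top-p set to -inf without torch.scatter."""
    if k is None and p is None:
        return logits
    vocab_size = logits.size(-1)
    neg_inf = float("-inf")

    if k is not None:
        sorted_logits, _ = logits.sort(dim=-1, descending=False)
        # Position of the k-th largest logit in ascending order.
        kth_index = (vocab_size - k.to(torch.long)).clamp(min=0).unsqueeze(1)
        kth_value = sorted_logits.gather(-1, kth_index)
        # Everything strictly below the k-th largest value is discarded.
        logits = logits.masked_fill(logits < kth_value, neg_inf)

    if p is not None:
        probs = logits.softmax(dim=-1)
        sorted_probs, _ = probs.sort(dim=-1, descending=False)
        cumulative = sorted_probs.cumsum(dim=-1)
        # Low-probability tail whose total mass fits within 1 - p is dropped.
        drop = cumulative <= (1.0 - p.unsqueeze(1))
        # Clamp so that the most likely token is always kept.
        num_dropped = drop.sum(dim=-1, keepdim=True).clamp(max=vocab_size - 1)
        cutoff = sorted_probs.gather(-1, num_dropped)
        logits = logits.masked_fill(probs < cutoff, neg_inf)

    return logits
```

With the settings quoted above, `k` would hold 100 and `p` would hold 0.9 for every sequence in the batch, and the logits would be divided by the temperature of 0.3 before sampling.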

vllm_ascend/sample/sampler_v1.py

vllm_ascend/sample/ops/ascend_topk_topp_sampler.py

Signed-off-by: linfeng-yuan <1102311262@qq.com>

linfeng-yuan force-pushed the v0.7.3-dev branch 6 times, most recently from 873e680 to 7f46d64 on April 16, 2025 05:41

wangxiyuan changed the title ~~Optimize apply_penalties & topKtopP for both V0/V1 Engine~~ [0.7.3] Optimize apply_penalties & topKtopP for both V0/V1 Engine Apr 17, 2025

ganyi1996ppo reviewed Apr 21, 2025

vllm_ascend/sample/sampler_v1.py (outdated, resolved)

ganyi1996ppo reviewed Apr 21, 2025

vllm_ascend/sample/ops/ascend_topk_topp_sampler.py (resolved)

wangxiyuan mentioned this pull request Apr 27, 2025

[Release]: vLLM Ascend v0.7.3 release checklist #644

Closed

46 tasks

linfeng-yuan added 2 commits April 28, 2025 13:36

perf(npu): greatly accelerate post-processing on Ascend platform

b55ffca

Signed-off-by: linfeng-yuan <1102311262@qq.com>

refactor: support scenarios where top_p or top_k is None

d377ba3

Signed-off-by: linfeng-yuan <1102311262@qq.com>

linfeng-yuan force-pushed the v0.7.3-dev branch from 7f46d64 to d377ba3 on April 28, 2025 05:36

ganyi1996ppo approved these changes Apr 28, 2025

wangxiyuan approved these changes Apr 28, 2025

ganyi1996ppo merged commit 2204e4d into vllm-project:v0.7.3-dev Apr 28, 2025
11 checks passed

3 participants
