@linfeng-yuan linfeng-yuan commented Apr 14, 2025

This PR optimizes the apply_penalties and topKtopP (top-k/top-p sampling) implementations in both the V0 and V1 engines by avoiding torch.scatter and matrix indexing operations.

We verified the functionality of this PR using Qwen2.5-72B-Instruct. At a concurrency of 40, with the post-processing parameters set to "temperature": 0.3, "top_k": 100, "top_p": 0.9, "repetition_penalty": 1.01, the average decoding time dropped from 300 ms to 50 ms.
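
For readers unfamiliar with the idea, here is a minimal sketch of how top-k/top-p filtering can be written without torch.scatter. It is not the code from this PR; the function name `top_k_top_p_no_scatter` and its signature are hypothetical, and it assumes a single scalar `k` and `p` for the whole batch. Instead of building a mask in sorted order and scattering it back to the original token positions, it derives a per-row cutoff value from the sorted logits and applies it with a plain comparison on the unsorted logits.

```python
import torch


def top_k_top_p_no_scatter(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    """Hypothetical helper: mask logits outside top-k / top-p via thresholds.

    logits has shape (batch, vocab). No scatter is used; the sorted order is
    only needed to find a per-row cutoff value.
    """
    # One descending sort per row; the sort indices are not needed.
    sorted_logits, _ = torch.sort(logits, dim=-1, descending=True)

    # Top-k: the k-th largest value in each row is the cutoff, shape (batch, 1).
    kth = sorted_logits[:, k - 1 : k]
    sorted_logits = sorted_logits.masked_fill(sorted_logits < kth, float("-inf"))

    # Top-p over the top-k survivors: keep a token while the probability mass
    # accumulated *before* it is still below p (so the top token always stays).
    probs = torch.softmax(sorted_logits, dim=-1)
    mass_before = torch.cumsum(probs, dim=-1) - probs
    num_keep = (mass_before < p).sum(dim=-1, keepdim=True).clamp(min=1)

    # The smallest kept logit is the top-p cutoff; never go below the top-k one.
    cutoff = torch.gather(sorted_logits, -1, num_keep - 1)
    threshold = torch.maximum(kth, cutoff)

    # Apply the cutoff directly in the original token order.
    return logits.masked_fill(logits < threshold, float("-inf"))


example = torch.tensor([[1.0, 3.0, 2.0, 0.0]])
filtered = top_k_top_p_no_scatter(example, k=3, p=0.8)
```

One trade-off of the threshold approach is that tokens whose logit ties exactly with the cutoff are all kept, so a row can retain slightly more than k tokens. Whether that matters depends on the sampler's requirements.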

@linfeng-yuan linfeng-yuan force-pushed the v0.7.3-dev branch 6 times, most recently from 873e680 to 7f46d64 Compare April 16, 2025 05:41
@wangxiyuan wangxiyuan changed the title Optimize apply_penalties & topKtopP for both V0/V1 Engine [0.7.3] Optimize apply_penalties & topKtopP for both V0/V1 Engine Apr 17, 2025
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@ganyi1996ppo ganyi1996ppo merged commit 2204e4d into vllm-project:v0.7.3-dev Apr 28, 2025
11 checks passed