
Conversation

@linfeng-yuan (Collaborator)

What this PR does / why we need it?

As in pull/525, this PR optimizes the apply_penalties and top-k/top-p sampling (topKtopP) implementations in both the V0 and V1 engines by avoiding torch.scatter and matrix-indexing operations.
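For illustration, a scatter-free top-k/top-p filter can be built by deriving a per-row probability threshold and comparing the unsorted distribution against it. The sketch below is illustrative only, not the PR's actual diff, and assumes a single k and p shared across the batch:

```python
import torch


def filter_top_k_top_p(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    """Mask logits outside the top-k / top-p set without torch.scatter.

    Hypothetical sketch: assumes one k and one p for the whole batch;
    ties at the threshold may keep a few extra tokens.
    """
    # Top-k: keep tokens whose logit is >= the k-th largest in each row.
    kth_largest = torch.topk(logits, k, dim=-1).values[..., -1, None]
    logits = logits.masked_fill(logits < kth_largest, float("-inf"))

    # Top-p: per row, find the smallest probability still inside the
    # nucleus, then mask the original (unsorted) probs against it.
    probs = logits.softmax(dim=-1)
    sorted_probs, _ = torch.sort(probs, descending=True, dim=-1)
    cum = torch.cumsum(sorted_probs, dim=-1)
    # A sorted token falls outside the nucleus once the probability mass
    # accumulated before it already exceeds p.
    outside = (cum - sorted_probs) > p
    threshold = sorted_probs.masked_fill(outside, 1.0).min(dim=-1, keepdim=True).values
    return logits.masked_fill(probs < threshold, float("-inf"))
```

The same threshold-comparison idea extends to per-request k and p tensors via broadcasting; the point is that masking against a threshold replaces the sort-mask-scatter round trip.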

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This patch was tested with vLLM v0.9.0, torch 2.5.1, and torch_npu 2.5.1 (both the PyPI release of torch_npu and the newest internal beta). At a concurrency of 58, with the post-processing parameters "temperature": 0.2, "top_k": 1000, "top_p": 0.92, the average sampling time per decode step dropped from 90 ms to 8 ms.
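A number like the 90 ms → 8 ms above can be sanity-checked with a micro-benchmark of the filtering step. The harness below is hypothetical: the shapes, the filter_top_k_top_p helper from the sketch above, and the torch.npu.synchronize call on Ascend are assumptions, not taken from the PR.

```python
import time

import torch
import torch_npu  # noqa: F401  # registers the "npu" device on Ascend

batch, vocab = 58, 152064  # illustrative shapes matching the concurrency above
logits = torch.randn(batch, vocab, device="npu")

for _ in range(10):  # warm-up
    filter_top_k_top_p(logits / 0.2, k=1000, p=0.92)
torch.npu.synchronize()

start = time.perf_counter()
iters = 100
for _ in range(iters):
    filtered = filter_top_k_top_p(logits / 0.2, k=1000, p=0.92)  # temperature 0.2
    probs = filtered.softmax(dim=-1)
    next_tokens = torch.multinomial(probs, num_samples=1)
torch.npu.synchronize()
print(f"avg sampling step: {(time.perf_counter() - start) / iters * 1e3:.2f} ms")
```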

@wangxiyuan added the ready (read for review) label on Jun 6, 2025
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan force-pushed the sampler_optimization branch from 83094bb to f7d8c2b on Jun 6, 2025 at 13:58
from vllm.v1.sample.ops.topk_topp_sampler import TopKTopPSampler, random_sample


class AscendTopKTopPSampler(TopKTopPSampler):
Collaborator
I think this is duplicated with #970
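For context, such a subclass typically plugs the optimized filter into the sampler's native path. A rough sketch follows; it is hypothetical, the PR's actual override is not reproduced in this thread, and only the TopKTopPSampler and random_sample names are taken from the import shown above:

```python
import torch
from vllm.v1.sample.ops.topk_topp_sampler import TopKTopPSampler, random_sample


class AscendTopKTopPSampler(TopKTopPSampler):
    def forward_native(self, logits, generators, k, p):
        # filter_top_k_top_p is the illustrative scatter-free helper
        # sketched in the PR description; k and p are assumed scalar here.
        logits = filter_top_k_top_p(logits, k, p)
        probs = logits.softmax(dim=-1, dtype=torch.float32)
        return random_sample(probs, generators)
```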

@wangxiyuan removed the ready (read for review) label on Jun 7, 2025
@github-actions

This pull request has conflicts; please resolve them before we can evaluate the pull request.
