[V1][Sampler] Faster top-k only implementation #15478

njhill · 2025-03-25T15:43:13Z

Adds a faster code path for batches that use top-k but no top-p.

For a 128k vocab, batch size 1024, and 500 ops on an A100, where the max top-k is 10:

Before: 11.571 sec
After: 2.136 sec

Signed-off-by: Nick Hill <nhill@redhat.com>

vllm/v1/sample/ops/topk_topp_sampler.py

NickLucche

I tested this on TPU and it won't work out of the box due to a broadcasting issue.

njhill · 2025-03-25T17:06:02Z

@NickLucche that's strange. Which op has that issue?

NickLucche · 2025-03-25T17:26:22Z

Not too surprising: torch_xla has stricter broadcasting rules.
This is the first error I encountered:

F0325 16:28:32.957930 1304047 debug_macros.h:21] Non-OK-status: status.status()
Status: INVALID_ARGUMENT: Input dimension should be either 1 or equal to the output dimension it is broadcasting into; the 0th operand dimension is 4, the 0th output dimension is 1.
*** Begin stack trace ***
	tsl::CurrentStackTrace[abi:cxx11]()
	xla::Shape const* ConsumeValue<xla::Shape const*>(absl::lts_20230802::StatusOr<xla::Shape const*>&&)
	torch_xla::ShapeHelper::ShapeOfXlaOp(xla::XlaOp)
	torch_xla::InferOutputShape(absl::lts_20230802::Span<xla::Shape const>, std::function<xla::XlaOp (absl::lts_20230802::Span<xla::XlaOp const>)> const&)
	
	
	torch_xla::XlaNode::GetOpShape(std::function<xla::Shape ()> const&) const
	torch_xla::XlaNode::XlaNode(torch::lazy::OpKind, c10::ArrayRef<torch::lazy::Value>, std::function<xla::Shape ()> const&, unsigned long, torch::lazy::hash_t)
	torch_xla::Gather::Gather(torch::lazy::Value const&, long, torch::lazy::Value const&)
	std::shared_ptr<torch::lazy::Node> torch_xla::MakeNode<torch_xla::Gather, torch::lazy::Value, long&, torch::lazy::Value>(torch::lazy::Value&&, long&, torch::lazy::Value&&)
	torch_xla::tensor_methods::gather(c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > const&, long, c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > const&)
	torch_xla::XLANativeFunctions::gather(at::Tensor const&, long, at::Tensor const&, bool)
	
	
	at::_ops::gather::redispatch(c10::DispatchKeySet, at::Tensor const&, long, at::Tensor const&, bool)
	
	
	at::_ops::gather::call(at::Tensor const&, long, at::Tensor const&, bool)

on the .gather op. I expanded k but then ran into another issue.
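The rule quoted in the error message can be written as a small shape check. This is a plain-Python sketch, not torch_xla code: `xla_broadcast_ok` is a hypothetical helper, and the example shapes are assumptions chosen only to match the dimensions named in the message (operand dimension 4, output dimension 1), not the actual tensors in this PR.

```python
def xla_broadcast_ok(operand_shape, output_shape):
    """Check the rule from the error message: each operand dimension
    must be 1 or equal to the output dimension it broadcasts into."""
    if len(operand_shape) != len(output_shape):
        return False
    return all(
        op_dim == 1 or op_dim == out_dim
        for op_dim, out_dim in zip(operand_shape, output_shape)
    )


# Assumed shapes for illustration only.
print(xla_broadcast_ok((4, 1), (1, 1)))  # False: 4 cannot broadcast into 1
print(xla_broadcast_ok((1, 1), (4, 1)))  # True: 1 broadcasts into 4
print(xla_broadcast_ok((4, 1), (4, 1)))  # True: dimensions already match
```

Under this rule, expanding the smaller operand so that its leading dimension already matches is one way to satisfy the check, which is consistent with the "expanded k" workaround mentioned above.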

WoosukKwon · 2025-03-25T21:38:54Z

vllm/v1/sample/ops/topk_topp_sampler.py

    """
-    if k is None and p is None:
+    if p is None:
+        if k is None:


Do we have a unit test checking the correctness of this?

We should really have blanket coverage for this kind of thing, including different combinations of parameters (e.g. top-k with and without top-p). I'm not sure whether we do, though. I will check and add a unit test to compare the two implementations.
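A comparison test of this kind could look roughly like the sketch below. It is not the test added in this PR: it uses plain Python lists instead of tensors so that it runs without PyTorch, and `top_k_by_sort`, `top_k_by_threshold`, and `check_equivalence` are hypothetical names. Inputs are drawn without duplicates, so ties at the cutoff are deliberately out of scope here.

```python
import heapq
import math
import random

NEG_INF = -math.inf


def top_k_by_sort(row, k):
    """Reference: sort indices ascending and keep only the last k."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    keep = set(order[len(row) - k:])
    return [v if i in keep else NEG_INF for i, v in enumerate(row)]


def top_k_by_threshold(row, k):
    """Candidate: find the k-th largest value and mask everything below it."""
    threshold = heapq.nlargest(k, row)[-1]
    return [v if v >= threshold else NEG_INF for v in row]


def check_equivalence(trials=200, vocab=50, seed=0):
    """Compare both versions on random rows with distinct values."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Sampling from a range without replacement guarantees no ties.
        row = [float(v) for v in rng.sample(range(10 * vocab), vocab)]
        k = rng.randint(1, vocab)
        assert top_k_by_sort(row, k) == top_k_by_threshold(row, k)
    return True


check_equivalence()
```

A fuller test would also vary the batch size and mix rows with and without top-k, as suggested above.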

NickLucche

I tested this version again today and it's working on TPU too. Nice one @njhill, thanks!
I was wondering whether we could still factor this top-k optimization out into its own function so I can call it from the TPU side.
We agreed with @WoosukKwon to try to keep things separated, so I'd like to keep forward_tpu around.

NickLucche · 2025-03-26T10:15:49Z

Something like a5bf849#diff-6047245d864bf5fd68b5b947b735beca94723bad40d20bfc0803d9b3eea5c1edR121-R136.
What do you think? Of course I'd wait for this PR to land and then rebase; I've shamelessly just copy-pasted your code there.

njhill · 2025-03-26T14:36:17Z

Thanks @NickLucche, I've split it into a separate function. And @WoosukKwon, I've added a correctness test.

WoosukKwon

LGTM! Thanks for addressing my comments.

hyeygit · 2025-03-30T19:19:25Z

@njhill really neat idea to threshold the logits! However, I think one corner case where this would break is when the logits contain duplicate elements equal to the cutoff value (i.e. top_k_mask). For example, given an input of [1, 2, 2, 2, 3] and k=3, the current apply_top_k_only would return [-inf, 2, 2, 2, 3], while the correct result should be [-inf, -inf, 2, 2, 3].

In #15736 I use a similar thresholding logic for top-p, but introduced a small random perturbation to break the ties. Maybe the same idea can be used here for top-k as well.
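The corner case can be reproduced with a small sketch. This is plain Python on a single row, not the `apply_top_k_only` code from the PR; `threshold_top_k` and `strict_top_k` are hypothetical names, and the strict version resolves ties by position only as an assumption for illustration.

```python
import heapq
import math

NEG_INF = -math.inf


def threshold_top_k(row, k):
    """Mask everything strictly below the k-th largest value."""
    cutoff = heapq.nlargest(k, row)[-1]
    return [v if v >= cutoff else NEG_INF for v in row]


def strict_top_k(row, k):
    """Keep exactly k entries; ties are resolved by position (assumption)."""
    order = sorted(range(len(row)), key=lambda i: row[i])  # stable, ascending
    keep = set(order[len(row) - k:])
    return [v if i in keep else NEG_INF for i, v in enumerate(row)]


row = [1.0, 2.0, 2.0, 2.0, 3.0]
print(threshold_top_k(row, 3))  # [-inf, 2.0, 2.0, 2.0, 3.0]
print(strict_top_k(row, 3))     # [-inf, -inf, 2.0, 2.0, 3.0]
```

With a pure value threshold, every element equal to the cutoff survives, so four values remain although k is 3.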

[V1][Sampler] Faster top-k only implementation

bcee0c4

Signed-off-by: Nick Hill <nhill@redhat.com>

njhill requested review from WoosukKwon, alexm-redhat, comaniac, robertgshaw2-redhat and ywang96 as code owners March 25, 2025 15:43

mergify bot added the v1 label Mar 25, 2025

njhill mentioned this pull request Mar 25, 2025

[V1][TPU] Speed up top-k on TPU by using torch.topk #15242

Merged

NickLucche approved these changes Mar 25, 2025

Also in-place cumsum for top-p

7156150

Signed-off-by: Nick Hill <nhill@redhat.com>

WoosukKwon reviewed Mar 25, 2025

Add comments

1feffb0

Signed-off-by: Nick Hill <nhill@redhat.com>

WoosukKwon reviewed Mar 25, 2025

Add comments about in-place logits updates.

be9e5d7

Signed-off-by: Nick Hill <nhill@redhat.com>

NickLucche requested changes Mar 26, 2025

NickLucche mentioned this pull request Mar 26, 2025

[V1][TPU] Enable Top K #15489

Merged

njhill added 2 commits March 26, 2025 07:17

Add test

c09dd00

Signed-off-by: Nick Hill <nhill@redhat.com>

Move to separate function per @NickLucche's request

e47f5b9

Signed-off-by: Nick Hill <nhill@redhat.com>

njhill added the ready label Mar 26, 2025

WoosukKwon approved these changes Mar 26, 2025

njhill merged commit 35fad35 into vllm-project:main Mar 26, 2025
39 checks passed

njhill deleted the torch-topk branch March 26, 2025 17:56

hyeygit mentioned this pull request Mar 30, 2025

[V1][TPU] TPU-optimized top-p implementation (avoids scattering). #15736

Merged

NickLucche mentioned this pull request Apr 1, 2025

[Core] Optimize topp/topk calculation in sampler #12156

Closed

Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025

[V1][Sampler] Faster top-k only implementation (vllm-project#15478)

0e57df7

Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025

[V1][Sampler] Faster top-k only implementation (vllm-project#15478)

c116565

Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>

ckhordiasma mentioned this pull request Apr 17, 2025

[do not merge] pr test for nm changes into 2.20 red-hat-data-services/vllm#107

Closed

lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025

[V1][Sampler] Faster top-k only implementation (vllm-project#15478)

2b30424

Signed-off-by: Nick Hill <nhill@redhat.com>

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

[V1][Sampler] Faster top-k only implementation (vllm-project#15478)

eaded4b

Signed-off-by: Nick Hill <nhill@redhat.com>

RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025

[V1][Sampler] Faster top-k only implementation (vllm-project#15478)

c7eb537

Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
