[V1][TPU] Enable Top K #15489
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀 […]
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 817c4f1 to a5bf849.
Let's hold until main is fixed, to reduce entropy.

This pull request has merge conflicts that must be resolved before it can be merged.
Thank you @NickLucche for this PR, and thank you @njhill for fixing the batch case for top-k! In #15242 I only tested with the microbenchmark and test_sampler.py (the scalar case only), not realizing that […]. One thing to note is that on TPU […]
Force-pushed from fac0e65 to 4759b89.
Force-pushed from 4759b89 to 0b022fb.
The main blocker for this PR is topk recompilation:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

B, V = 3, 64
device = xm.xla_device()
logits = torch.randn(B, V, device=device)

# Pre-compile once with a fixed k, then reset the metrics so the
# next report shows only newly triggered compilations.
k = 3
top_k_mask = logits.topk(k, dim=1).values
top_k_mask = top_k_mask.cpu()
print(met.short_metrics_report())
met.clear_all()

# Run with different k values.
for k in [1, V]:
    top_k_mask = logits.topk(k, dim=1).values
    top_k_mask = top_k_mask.cpu()
    print(met.short_metrics_report())  # shows it compiles again for each new k
    met.clear_all()
```

Is there something I am missing? @hyeygit @yaochengji
@NickLucche thank you for raising this. I think it's because the return shape of `torch.topk` depends on `k`, so an XLA recompilation is triggered every time `k` changes. If this k-induced recompilation is not acceptable (sounds like it isn't), then let me implement a TPU-specific top-k using the same logic as in #15736. Will send out a PR shortly.
Yeah, the reason is quite clear to me; I just don't know why I believed this wouldn't trigger recompilation lol.
That will do, thanks a lot!
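For readers following along, here is a minimal sketch of what such a threshold-based, shape-static top-k can look like. This illustrates the general technique, not the exact implementation that landed in #15736; the helper name and the clamping of `k` are this sketch's own choices.

```python
import torch

def top_k_static_shape(logits: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Mask logits outside each row's top-k set using only shape-static ops.

    k is a per-row LongTensor of shape (B,). Every intermediate tensor has
    shape (B, V) or (B, 1) regardless of the values in k, so changing k at
    runtime does not change any traced shapes.
    """
    B, V = logits.shape
    # Full descending sort: output shape (B, V) is independent of k.
    sorted_logits, _ = logits.sort(dim=-1, descending=True)
    # Per-row threshold: the value of the k-th largest logit.
    idx = (k - 1).clamp(min=0, max=V - 1).unsqueeze(-1)   # (B, 1)
    threshold = sorted_logits.gather(-1, idx)             # (B, 1)
    # Keep everything >= threshold (ties with the k-th value survive too).
    return logits.masked_fill(logits < threshold, float("-inf"))
```

Substituting this into the repro loop above should leave the `CompileTime` counter in the metrics report flat across different `k` values, since `topk`'s k-dependent output shape was the only shape that varied.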
This pull request has merge conflicts that must be resolved before it can be merged.
@NickLucche yes, I think I can corroborate this. I ran […]
Just ran this PR for Llama 70B with 8 x v6e TPUs. It achieves 5 reqs/sec instead of the previous 5.1 reqs/sec, so the performance penalty I see is negligible.
Force-pushed from 6dc18ce to 6a6579c.
LGTM! Thanks, Nick!
Signed-off-by: NickLucche <nlucches@redhat.com>
Previously we found that using torch.topk resulted in a significant speedup on TPU. It turns out that's not a viable solution, because the return shape of torch.topk depends on k, which means an XLA recompilation is triggered every time k changes. Additionally, we realized that torch.scatter was the main bottleneck for the original top-k impl on TPU. This PR circumvents both problems by using a threshold-based approach to find the top-k set. The algorithm is nearly identical to that of top-p; see vllm-project#15736 for more details. Signed-off-by: Hyesoo Yang <hyeygit@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Force-pushed from 0530d80 to 71700b5.
LGTM! Just some small comments
Signed-off-by: NickLucche <nlucches@redhat.com>
@NickLucche small request -- could you incorporate the top-k equivalence test from my PR? https://github.com/vllm-project/vllm/pull/15891/files#diff-09d15417fe42d494c51aaa9635ad51536751cc3c0659a7c4ce3b66bd6900eb1f
Co-authored-by: Hyesoo Yang <hyeygit@gmail.com> Signed-off-by: NickLucche <nlucches@redhat.com>
Force-pushed from 98ac301 to ee0cd58.
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Hyesoo Yang <hyeygit@gmail.com> Co-authored-by: Hyesoo Yang <hyeygit@gmail.com> Signed-off-by: Yang Wang <elainewy@meta.com>
Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Hyesoo Yang <hyeygit@gmail.com> Co-authored-by: Hyesoo Yang <hyeygit@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Hyesoo Yang <hyeygit@gmail.com> Co-authored-by: Hyesoo Yang <hyeygit@gmail.com> Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Hyesoo Yang <hyeygit@gmail.com> Co-authored-by: Hyesoo Yang <hyeygit@gmail.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Shouldn't the default top_k pad value be non-zero? 0 triggers the error in […]
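For context, a hypothetical illustration of that failure mode, assuming a threshold-style implementation like the sketch earlier in this thread (the tensor names here are made up): if rows with top-k disabled are padded with `k = 0`, the gather index `k - 1` becomes `-1`, which `torch.gather` rejects; padding with the vocabulary size `V` instead turns the threshold cut into a no-op for those rows.

```python
import torch

B, V = 2, 8
logits = torch.randn(B, V)
sorted_logits, _ = logits.sort(dim=-1, descending=True)

k = torch.tensor([[0], [3]])  # row 0: top-k "disabled", padded with 0
# sorted_logits.gather(-1, k - 1)  # RuntimeError: index -1 is out of bounds

# Pad disabled rows with V instead: their threshold becomes the row minimum,
# so no logit is masked and top-k is effectively a no-op for those rows.
k = torch.where(k == 0, torch.full_like(k, V), k)
threshold = sorted_logits.gather(-1, k - 1)
masked = logits.masked_fill(logits < threshold, float("-inf"))
```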
Enabling the topk optimization that was introduced in #15242.
Currently facing the very issue foreseen by @njhill here: #15242 (comment).
Dumping work for reference, will look into it asap.

Update: for completeness, I've run microbenchmarks and the new impl is slower (but of course correct): […]

cc @hyeygit