[V1][TPU] TPU-optimized top-k implementation (2nd try) #15891
Closed
Previously we found that using torch.topk resulted in a significant
speedup on TPU. It turns out that's not a viable solution, because the
return shape of torch.topk depends on k, which means an XLA recompilation
is triggered every time k changes.
Additionally, we realized that torch.scatter was the main bottleneck for
the original top-k implementation on TPU. This PR circumvents both problems by using
a threshold-based approach to find the top-k set: the output shape stays fixed and no
scatter back into the original layout is needed. The algorithm is nearly
identical to that of top-p; see #15736 for more details. A minimal sketch of the idea is shown below.
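For illustration only, here is a rough sketch of what a threshold-based top-k can look like in PyTorch. The function name, signature, and tie-handling are illustrative assumptions, not the exact code in this PR:

```python
import torch


def apply_top_k_threshold(logits: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Mask logits outside the top-k set using a per-row threshold.

    logits: [num_seqs, vocab_size]
    k:      [num_seqs] integer tensor of per-sequence k values (1 <= k <= vocab_size)
    """
    # Sort once; the output shape does not depend on k, so XLA compiles this only once.
    sorted_logits = logits.sort(dim=-1, descending=True).values
    # The k-th largest value in each row serves as that row's cutoff.
    cutoff = sorted_logits.gather(-1, (k - 1).unsqueeze(-1).to(torch.int64))
    # Keep everything >= cutoff; no torch.scatter back into the original layout is needed.
    # Note: ties at the cutoff may keep slightly more than k entries.
    return logits.masked_fill(logits < cutoff, float("-inf"))


# Example: a batch of 2 sequences with different k values.
logits = torch.randn(2, 8)
k = torch.tensor([3, 5])
masked = apply_top_k_threshold(logits, k)
```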
Benchmark
The sampling microbenchmark yields results similar to those in the top-p PR, with "Running 32 elapsed time" averaging ~5 ms (down from 500 ms pre-optimization).
An end-to-end serving benchmark with both top-k and top-p enabled shows that on TPU (v6e-1) running Llama3.1-8B, the TPU optimization here (along with the top-p PR) yields a 23x speedup.