[Hardware][TPU] Multi-LoRA implementation for the TPU backend #12623
Conversation
vllm/lora/layers.py
if current_platform.is_tpu():
    # Because nan_to_num_ doesn't work with actual -inf values on TPU
    neg_inf = torch.finfo(lora_logits.dtype).min
    pos_inf = torch.finfo(lora_logits.dtype).max
else:
    neg_inf = float("-inf")
    pos_inf = float("inf")
These if-else conditions will make vLLM hard to maintain.
Could you file an issue with torch-xla, or abstract this as part of a utility function?
Abstracting this as part of a utility function sounds good.
Yeah, that sounds good, I can abstract it away. It was only a problem for that nan_to_num() function, though; -inf works properly elsewhere.
Abstracting it away is fine as a short-term solution.
It would be better if we could create an issue in the torch-xla repo as a longer-term solution.
Ok, I've made the issue here: pytorch/xla#8674
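For illustration, here is a minimal sketch of how the workaround could be pulled into a utility function; the helper name, signature, and placement are assumptions rather than code from this PR:

```python
import torch

from vllm.platforms import current_platform


def get_infinity_values(dtype: torch.dtype) -> tuple[float, float]:
    """Return (neg_inf, pos_inf) sentinels safe to use for masking on this platform.

    On TPU, nan_to_num_ does not handle actual -inf values (see pytorch/xla#8674),
    so the dtype's finite extremes are used instead.
    """
    if current_platform.is_tpu():
        return torch.finfo(dtype).min, torch.finfo(dtype).max
    return float("-inf"), float("inf")


# Hypothetical usage at the original call site:
# neg_inf, pos_inf = get_infinity_values(lora_logits.dtype)
```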
vllm/lora/ops/xla_ops/lora_ops.py
@@ -0,0 +1,58 @@
import torch

from ..torch_ops import bgmv_expand, bgmv_expand_slice, bgmv_shrink
It seems the TPU ops are still using PyTorch operators; is it necessary to add the ops below?
The sgmv ops are slightly different here because I'm using repeat_interleave with a static size rather than a dynamic tensor. This reduces the compile time quite a bit, because torch_xla can't lower the dynamic version properly.
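As a toy illustration of the static vs. dynamic repeat_interleave distinction described above (the shapes and variable names are made up, not taken from the PR's sgmv code):

```python
import torch

# Made-up shapes: 4 request groups, hidden size 16, LoRA rank 8.
batched_weights = torch.randn(4, 16, 8)
seq_lens = torch.tensor([3, 1, 2, 2])  # tokens per group

# Dynamic version: repeat counts come from a runtime tensor, which torch_xla
# cannot lower cleanly, blowing up compile time.
expanded_dynamic = torch.repeat_interleave(batched_weights, seq_lens, dim=0)

# Static version: a Python int known at trace time keeps the graph shape-stable
# and compiles much faster on TPU (surplus rows can be masked out afterwards).
MAX_SEQ_LEN = 3
expanded_static = torch.repeat_interleave(batched_weights, MAX_SEQ_LEN, dim=0)
```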
# The platforms that are compatible with the PyTorch-native implementation can
# inherit this class
class PunicaWrapperTPU(PunicaWrapperBase):
Why not directly inherit from PunicaWrapperCPU?
I thought about it, but this code is going to change very soon as I add in the Pallas kernels.
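For reference, a rough sketch of the inheritance being discussed; the class bodies and method signatures below are stand-ins for illustration, not the PR's actual code:

```python
import torch


class PunicaWrapperBase:
    """Stand-in for the real base class: holds LoRA metadata and declares the
    shrink/expand entry points that each backend implements."""

    def add_shrink(self, y: torch.Tensor, x: torch.Tensor,
                   lora_a_stacked: torch.Tensor, scale: float) -> None:
        raise NotImplementedError


class PunicaWrapperTPU(PunicaWrapperBase):
    """TPU wrapper: PyTorch-native ops for now, to be swapped for Pallas kernels."""

    def add_shrink(self, y: torch.Tensor, x: torch.Tensor,
                   lora_a_stacked: torch.Tensor, scale: float) -> None:
        # Torch-native path. This body is exactly what changes once the Pallas
        # kernels land, which is why the class is kept separate from the CPU wrapper.
        ...
```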
cc @lsy323 to take a pass
Switched to draft while I refactor for the v1 implementation.
This pull request has merge conflicts that must be resolved before it can be merged.
Closing in favour of #14238
This PR adds a Multi-LoRA implementation that works on the TPU backend, extending the work done in #11100.
Currently this uses PyTorch operations for the Punica kernels, but I am going to put up a PR with Pallas kernels soon.
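For a sense of what "PyTorch operations for the Punica kernels" means in practice, here is an illustrative torch-native bgmv shrink op; the function signature, shapes, and names are assumptions rather than the PR's exact implementation:

```python
import torch


def bgmv_shrink(inputs: torch.Tensor,          # (num_tokens, hidden_size)
                lora_a_weights: torch.Tensor,  # (num_loras, rank, hidden_size)
                lora_indices: torch.Tensor,    # (num_tokens,) LoRA id per token
                scaling: float = 1.0) -> torch.Tensor:
    """Project each token through its own LoRA-A matrix (the 'shrink' half)."""
    selected = lora_a_weights[lora_indices]               # (num_tokens, rank, hidden_size)
    return scaling * torch.einsum("th,trh->tr", inputs, selected)


# Example: 5 tokens, hidden size 16, rank 8, 3 LoRA adapters.
x = torch.randn(5, 16)
lora_a = torch.randn(3, 8, 16)
ids = torch.tensor([0, 2, 2, 1, 0])
out = bgmv_shrink(x, lora_a, ids)  # shape (5, 8)
```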