[Lora] Load tuned multi-lora kernel configs from json files #26319
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a mechanism to load tuned Triton kernel configurations from JSON files for LoRA operations, replacing hardcoded values. This is a valuable addition that allows for user-specific tuning and performance improvements. The implementation centers around a new utility function, get_v1_op_configs, which dynamically loads these configurations. My review focuses on the correctness and robustness of this new loading mechanism. I've identified some critical issues, including unreachable code paths, bugs in the logic for the fused operation type, and unsafe dictionary access that could lead to runtime errors. Addressing these points will significantly improve the robustness and maintainability of this new feature.
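For orientation, the sketch below shows one way such a loaded configuration could be structured. The nesting order (m → k → n → max_loras → num_slices) and the Triton parameter names at the leaves are assumptions inferred from the lookups discussed in the review comments below, not the PR's exact schema.

```python
# Illustrative only: a nested dict keyed by stringified dimension values,
# with Triton launch parameters at the leaves. The real schema may differ.
example_shrink_config = {
    "16": {                  # m: number of tokens in the batch
        "8192": {            # k: hidden_size
            "16": {          # n: LoRA rank
                "4": {       # max_loras
                    "1": {   # num_slices
                        "BLOCK_M": 32,
                        "BLOCK_N": 16,
                        "BLOCK_K": 256,
                        "num_warps": 4,
                        "num_stages": 2,
                    }
                }
            }
        }
    }
}
```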
out_dim: Optional[int] = None,
add_inputs: Optional[bool] = None) -> dict[str, Optional[int]]:

assert op_type in ["shrink", "expand"]
The assertion assert op_type in ["shrink", "expand"] makes the logic for op_type == "fused" unreachable in the rest of the function (e.g., lines 217-218 and 224-239). This appears to be unintentional, as there is code to handle the fused case. If fused is meant to be supported, this assertion should be updated. Otherwise, the dead code should be removed to avoid confusion and potential bugs.
Suggested change:
- assert op_type in ["shrink", "expand"]
+ assert op_type in ["shrink", "expand", "fused"]
vllm/lora/ops/triton_ops/utils.py
Outdated
else:
    k, n, r = (hidden_size, out_dim, rank)
This else block is intended to handle the fused op_type but is currently unreachable due to the assertion on line 186. It should be an elif to make the logic explicit and correct. Additionally, since out_dim is required for the fused operation, an assertion should be added to ensure it is provided.
Suggested change:
- else:
-     k, n, r = (hidden_size, out_dim, rank)
+ elif op_type == "fused":
+     assert out_dim is not None, "out_dim must be provided for fused op_type"
+     k, n, r = (hidden_size, out_dim, rank)
vllm/lora/ops/triton_ops/utils.py
Outdated
config_data = config_data.get(str(n)) or config_data[min(
    config_data.keys(), key=lambda x: abs(int(x) - r))]
This line appears to contain a copy-paste error. It attempts to look up a key using str(n) but uses r in the fallback logic. This should be slicing by r for both the direct lookup and the fallback to ensure the correct configuration is selected for the r dimension.
Suggested change:
- config_data = config_data.get(str(n)) or config_data[min(
-     config_data.keys(), key=lambda x: abs(int(x) - r))]
+ config_data = config_data.get(str(r)) or config_data[min(
+     config_data.keys(), key=lambda x: abs(int(x) - r))]
vllm/lora/ops/triton_ops/utils.py
Outdated
config_data = config_data.get(str(max_loras)) or config_data[min(
    config_data.keys(), key=lambda x: abs(int(x) - max_loras))]
# slice by num_slices
config_data = config_data[str(num_slices)]
This direct dictionary access for num_slices is unsafe and could raise a KeyError if the key doesn't exist in the loaded configuration. This is inconsistent with the other dimension lookups (e.g., for m, k, n), which safely use .get() with a fallback to the nearest key. For robustness, this should be updated to use the same safe access pattern.
Suggested change:
- config_data = config_data[str(num_slices)]
+ config_data = config_data.get(str(num_slices)) or config_data[min(
+     config_data.keys(), key=lambda x: abs(int(x) - num_slices))]
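For illustration, here is a minimal, self-contained sketch of the nearest-key fallback pattern recommended above. The helper name lookup_nearest and the example dict are hypothetical, not code from the PR.

```python
def lookup_nearest(config_data: dict, value: int) -> dict:
    """Return config_data[str(value)] if present; otherwise return the entry
    whose (stringified integer) key is numerically closest to value."""
    exact = config_data.get(str(value))
    if exact is not None:
        return exact
    nearest_key = min(config_data.keys(), key=lambda x: abs(int(x) - value))
    return config_data[nearest_key]


# Example: slicing by num_slices degrades gracefully instead of raising KeyError.
configs = {"1": {"BLOCK_M": 32}, "2": {"BLOCK_M": 64}}
print(lookup_nearest(configs, 2))  # exact hit -> {"BLOCK_M": 64}
print(lookup_nearest(configs, 3))  # no "3" key -> falls back to the "2" entry
```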
@varun-sundar-rabindranath @jeejeelee Could you help review the PR?
Will look at this PR ASAP.
Yes, as you can see, these configs are quite large, and considering how many different LoRA configurations exist, we don't want to keep them in the repo. However, tuning can indeed improve performance. Our previous idea was to keep only an option for users to load their own tuned configs, together with a doc describing how to do the tuning.
@li2haipeng If you agree or have better ideas, feel free to share your feedback. We can work together to move this forward.
Thanks for the feedback @jeejeelee. Yes, I agree with you. Users usually don't publish their own adapters, so it's not necessary for us to keep those configs; an option that allows users to load their own configs is sufficient. Regarding the tuning doc, where should we save it? I can add the env variable in this PR.
Maybe
@jeejeelee @varun-sundar-rabindranath A simple README is added; let me know if the PR is ready to merge.
@@ -0,0 +1,29 @@
# Multi-LoRA Tuning
Can we add a description of how to do tuning, so that users can perform tuning just by reading this document? We can refer to https://github.com/sgl-project/sglang/blob/main/benchmark/kernels/fused_moe_triton/README.md#tuning-tool
Sorry, I didn't submit my comment. After responding to the above comment, we can consider merging this PR.
After removing the
Purpose
Add support for loading tuned multi-LoRA shrink/expand Triton kernel configs from JSON files.
I understand there was a discussion in #13096 concluding that we don't want to ship those JSON files. I was wondering if we can simply keep the option for users who want to tune the kernels themselves. According to our benchmarking, we see a 12% to 17% ITL gain on Llama 3.3 70B with 16 LoRA adapters at rank=16.
The PR leverages the exact kernel config loading function proposed in #13096.
Edit:
The LoRA config folder should be passed in via
export VLLM_TUNED_CONFIG_FOLDER=/path/to/configs. Without it, the kernels use the default configs.
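As a usage sketch, the snippet below shows how a runtime lookup could locate such a file via VLLM_TUNED_CONFIG_FOLDER. Only the environment variable itself comes from this PR; the file name shrink_config.json and the surrounding fallback logic are illustrative assumptions.

```python
import json
import os

# Hypothetical sketch: only VLLM_TUNED_CONFIG_FOLDER is from the PR; the file
# name and surrounding logic here are illustrative assumptions.
folder = os.environ.get("VLLM_TUNED_CONFIG_FOLDER")

tuned_config = None
if folder is not None:
    path = os.path.join(folder, "shrink_config.json")  # assumed file name
    if os.path.isfile(path):
        with open(path) as f:
            tuned_config = json.load(f)

if tuned_config is None:
    # Without the env variable (or a matching file), fall back to the
    # kernel's built-in default configuration, as the PR description states.
    pass
```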