
Conversation

@li2haipeng
Contributor

@li2haipeng li2haipeng commented Oct 6, 2025

Purpose

Add support for loading tuned multi-LoRA shrink/expand Triton kernel configs from JSON files.
I understand there was a discussion in #13096 concluding that we don't want to save those JSON files in the repo. I was wondering if we could simply keep the option for users who want to tune the kernels themselves. According to our benchmarking, we see a 12% to 17% ITL gain on Llama 3.3 70B with 16 LoRA adapters at rank=16:

[Screenshot: ITL benchmark results, 2025-10-06]

The PR leverages the exact kernel config loading function proposed in #13096.

Edit:
The LoRA config folder is passed in via `export VLLM_TUNED_CONFIG_FOLDER=/path/to/configs`. Without it, the kernels fall back to the default configs.
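A minimal sketch of how such a loader might consult the env var; the helper name, file-name pattern, and fallback behavior below are assumptions for illustration, not the actual vLLM implementation:

```python
import json
import os

def load_tuned_config(op_type: str, default: dict) -> dict:
    """Hypothetical helper: load a tuned kernel config for `op_type`,
    falling back to `default` when VLLM_TUNED_CONFIG_FOLDER is unset
    or the expected JSON file is missing."""
    folder = os.environ.get("VLLM_TUNED_CONFIG_FOLDER")
    if folder is None:
        return default  # env var not set: use default configs
    path = os.path.join(folder, f"{op_type}_config.json")  # file name assumed
    if not os.path.isfile(path):
        return default
    with open(path) as f:
        return json.load(f)
```

With `export VLLM_TUNED_CONFIG_FOLDER=/path/to/configs` set, a call like `load_tuned_config("shrink", default={})` would pick up the tuned JSON; without it, the defaults apply.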

@li2haipeng li2haipeng requested a review from jeejeelee as a code owner October 6, 2025 21:11
@github-actions

github-actions bot commented Oct 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@li2haipeng li2haipeng changed the title from Load tuned multi-lora kernel configs from JSON files to [multi-lora] Load tuned multi-lora kernel configs from JSON files on Oct 6, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a mechanism to load tuned Triton kernel configurations from JSON files for LoRA operations, replacing hardcoded values. This is a valuable addition that allows for user-specific tuning and performance improvements. The implementation centers around a new utility function, get_v1_op_configs, which dynamically loads these configurations. My review focuses on the correctness and robustness of this new loading mechanism. I've identified some critical issues, including unreachable code paths, bugs in the logic for the fused operation type, and unsafe dictionary access that could lead to runtime errors. Addressing these points will significantly improve the maintainability of this new feature.

```python
        out_dim: Optional[int] = None,
        add_inputs: Optional[bool] = None) -> dict[str, Optional[int]]:

    assert op_type in ["shrink", "expand"]
```

critical

The assertion assert op_type in ["shrink", "expand"] makes the logic for op_type == "fused" unreachable in the rest of the function (e.g., lines 217-218 and 224-239). This appears to be unintentional, as there is code to handle the fused case. If fused is meant to be supported, this assertion should be updated. Otherwise, the dead code should be removed to avoid confusion and potential bugs.

Suggested change

```diff
-assert op_type in ["shrink", "expand"]
+assert op_type in ["shrink", "expand", "fused"]
```

Comment on lines 217 to 218

```python
else:
    k, n, r = (hidden_size, out_dim, rank)
```

critical

This else block is intended to handle the fused op_type but is currently unreachable due to the assertion on line 186. It should be an elif to make the logic explicit and correct. Additionally, since out_dim is required for the fused operation, an assertion should be added to ensure it is provided.

Suggested change

```diff
-else:
-    k, n, r = (hidden_size, out_dim, rank)
+elif op_type == "fused":
+    assert out_dim is not None, "out_dim must be provided for fused op_type"
+    k, n, r = (hidden_size, out_dim, rank)
```

Comment on lines 238 to 239

```python
config_data = config_data.get(str(n)) or config_data[min(
    config_data.keys(), key=lambda x: abs(int(x) - r))]
```

critical

This line appears to contain a copy-paste error. It attempts to look up a key using str(n) but uses r in the fallback logic. This should be slicing by r for both the direct lookup and the fallback to ensure the correct configuration is selected for the r dimension.

Suggested change

```diff
-config_data = config_data.get(str(n)) or config_data[min(
-    config_data.keys(), key=lambda x: abs(int(x) - r))]
+config_data = config_data.get(str(r)) or config_data[min(
+    config_data.keys(), key=lambda x: abs(int(x) - r))]
```

```python
config_data = config_data.get(str(max_loras)) or config_data[min(
    config_data.keys(), key=lambda x: abs(int(x) - max_loras))]
# slice by num_slices
config_data = config_data[str(num_slices)]
```

high

This direct dictionary access for num_slices is unsafe and could raise a KeyError if the key doesn't exist in the loaded configuration. This is inconsistent with the other dimension lookups (e.g., for m, k, n), which safely use .get() with a fallback to the nearest key. For robustness, this should be updated to use the same safe access pattern.

Suggested change

```diff
-config_data = config_data[str(num_slices)]
+config_data = config_data.get(str(num_slices)) or config_data[min(
+    config_data.keys(), key=lambda x: abs(int(x) - num_slices))]
```
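For reference, a minimal standalone sketch of the nearest-key fallback pattern that the suggestions above rely on; the sample dict and config fields are made up for illustration:

```python
def lookup_nearest(config_data: dict, key: int) -> dict:
    # Exact match on the stringified key; otherwise pick the entry whose
    # integer key is numerically closest. Note that `or` also falls
    # through to the fallback when an exact match maps to a falsy value.
    return config_data.get(str(key)) or config_data[
        min(config_data.keys(), key=lambda x: abs(int(x) - key))
    ]

# Illustrative only: keys are dimension sizes, values are kernel configs.
configs = {"16": {"BLOCK_M": 16}, "64": {"BLOCK_M": 32}}
print(lookup_nearest(configs, 64))  # exact hit   -> {'BLOCK_M': 32}
print(lookup_nearest(configs, 20))  # nearest: 16 -> {'BLOCK_M': 16}
```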

@li2haipeng
Contributor Author

@varun-sundar-rabindranath @jeejeelee Could you help review the PR?

@li2haipeng li2haipeng changed the title from [multi-lora] Load tuned multi-lora kernel configs from JSON files to [Lora] Load tuned multi-lora kernel configs from JSON files on Oct 6, 2025
@jeejeelee
Collaborator

Will look at this PR ASAP

@jeejeelee
Collaborator

jeejeelee commented Oct 8, 2025

Yes, as you can see, these configs are quite large, and considering how many different LoRA configurations exist, we don't want to keep them in the repo. However, tuning can indeed improve performance. Our previous idea was:

  • Tell users how to tune and save the results through documentation
  • Use the env VLLM_TUNED_CONFIG_FOLDER to allow users to pass in tuned configurations (what this PR does)

@li2haipeng If you agree or have better ideas, feel free to share your feedback. We can work together to move this forward.
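To make this concrete, here is a purely hypothetical sketch of the shape such a tuned config might take, written as the equivalent Python dict. The key nesting (m → k → n → max_loras → num_slices) is only inferred from the lookup code discussed in the review above, and the kernel parameter names are assumptions; the real layout is defined by vLLM's tuning flow:

```python
# Hypothetical tuned-config layout; keys are stringified dimension sizes,
# matching the string-keyed lookups in get_v1_op_configs.
tuned_shrink_config = {
    "16": {                  # m: number of tokens
        "8192": {            # k: hidden size
            "16": {          # n: LoRA rank (shrink output dim)
                "16": {      # max_loras
                    "1": {   # num_slices
                        # Triton launch parameters (names assumed)
                        "BLOCK_M": 32,
                        "BLOCK_N": 64,
                        "BLOCK_K": 128,
                        "num_warps": 4,
                        "num_stages": 2,
                    }
                }
            }
        }
    }
}
```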

@li2haipeng
Contributor Author

li2haipeng commented Oct 8, 2025

Thanks for the feedback @jeejeelee. Yes, I agree with you. Users usually don't publish their own adapters, so there's no need for us to keep those configs; an option that lets users load their own configs is sufficient. Regarding the tuning doc, where should we save it? I can add the env var in this PR.

@jeejeelee
Collaborator

> Regarding the tuning doc, where should we save it? I can add the env var in this PR.

Maybe vllm/vllm/lora/ops/triton_ops/ is a pretty good choice; also cc @varun-sundar-rabindranath.

@li2haipeng
Contributor Author

@jeejeelee @varun-sundar-rabindranath A simple README has been added; let me know if the PR is ready to merge.

```diff
@@ -0,0 +1,29 @@
+# Multi-LoRA Tuning
```
Collaborator
Can we add a description of how to do tuning, so that users can perform tuning by reading this document? We can refer to https://github.com/sgl-project/sglang/blob/main/benchmark/kernels/fused_moe_triton/README.md#tuning-tool

Collaborator

Sorry, I didn't submit my comment. After responding to the above comment, we can consider merging this PR.

@jeejeelee
Collaborator

After removing the config/ folder and fixing the pre-commit issues, LGTM.

li2haipeng and others added 7 commits October 14, 2025 21:03
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
@jeejeelee jeejeelee enabled auto-merge (squash) October 15, 2025 05:39
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge / full CI is needed) on Oct 15, 2025
@jeejeelee jeejeelee merged commit d4d1a60 into vllm-project:main Oct 15, 2025
51 checks passed
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…ect#26319)

Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
mandy-li pushed a commit to mandy-li/vllm that referenced this pull request Oct 16, 2025
…ect#26319)

Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 16, 2025
…ect#26319)

Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…ect#26319)

Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…ect#26319)

Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…ect#26319)

Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…ect#26319)

Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
…ect#26319)

Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Zhathw pushed a commit to Zhathw/vllm that referenced this pull request Nov 12, 2025
…ect#26319)

Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>