[Lora] Load tuned multi-lora kernel configs from json files #26319
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a mechanism to load tuned Triton kernel configurations from JSON files for LoRA operations, replacing hardcoded values. This is a valuable addition that allows for user-specific tuning and performance improvements. The implementation centers around a new utility function, get_v1_op_configs, which dynamically loads these configurations. My review focuses on the correctness and robustness of this new loading mechanism. I've identified some critical issues, including unreachable code paths, bugs in the logic for the fused operation type, and unsafe dictionary access that could lead to runtime errors. Addressing these points will significantly improve the robustness and maintainability of this new feature.
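For orientation, the sketch below shows one way such a loaded configuration could be structured. The nesting order (m → k → n → max_loras → num_slices) and the Triton parameter names at the leaves are assumptions inferred from the lookups discussed in the review comments below, not the PR's exact schema.

```python
# Illustrative only: a nested dict keyed by stringified dimension values,
# with Triton launch parameters at the leaves. The real schema may differ.
example_shrink_config = {
    "16": {                  # m: number of tokens in the batch
        "8192": {            # k: hidden_size
            "16": {          # n: LoRA rank
                "4": {       # max_loras
                    "1": {   # num_slices
                        "BLOCK_M": 32,
                        "BLOCK_N": 16,
                        "BLOCK_K": 256,
                        "num_warps": 4,
                        "num_stages": 2,
                    }
                }
            }
        }
    }
}
```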
out_dim: Optional[int] = None,
add_inputs: Optional[bool] = None) -> dict[str, Optional[int]]:

assert op_type in ["shrink", "expand"]
The assertion assert op_type in ["shrink", "expand"] makes the logic for op_type == "fused" unreachable in the rest of the function (e.g., lines 217-218 and 224-239). This appears to be unintentional, as there is code to handle the fused case. If fused is meant to be supported, this assertion should be updated. Otherwise, the dead code should be removed to avoid confusion and potential bugs.
Suggested change:
- assert op_type in ["shrink", "expand"]
+ assert op_type in ["shrink", "expand", "fused"]
vllm/lora/ops/triton_ops/utils.py
Outdated
else:
    k, n, r = (hidden_size, out_dim, rank)
This else block is intended to handle the fused op_type but is currently unreachable due to the assertion on line 186. It should be an elif to make the logic explicit and correct. Additionally, since out_dim is required for the fused operation, an assertion should be added to ensure it is provided.
Suggested change:
- else:
-     k, n, r = (hidden_size, out_dim, rank)
+ elif op_type == "fused":
+     assert out_dim is not None, "out_dim must be provided for fused op_type"
+     k, n, r = (hidden_size, out_dim, rank)
vllm/lora/ops/triton_ops/utils.py
Outdated
config_data = config_data.get(str(n)) or config_data[min(
    config_data.keys(), key=lambda x: abs(int(x) - r))]
This line appears to contain a copy-paste error. It attempts to look up a key using str(n) but uses r in the fallback logic. This should be slicing by r for both the direct lookup and the fallback to ensure the correct configuration is selected for the r dimension.
Suggested change:
- config_data = config_data.get(str(n)) or config_data[min(
-     config_data.keys(), key=lambda x: abs(int(x) - r))]
+ config_data = config_data.get(str(r)) or config_data[min(
+     config_data.keys(), key=lambda x: abs(int(x) - r))]
vllm/lora/ops/triton_ops/utils.py
Outdated
config_data = config_data.get(str(max_loras)) or config_data[min(
    config_data.keys(), key=lambda x: abs(int(x) - max_loras))]
# slice by num_slices
config_data = config_data[str(num_slices)]
This direct dictionary access for num_slices is unsafe and could raise a KeyError if the key doesn't exist in the loaded configuration. This is inconsistent with the other dimension lookups (e.g., for m, k, n), which safely use .get() with a fallback to the nearest key. For robustness, this should be updated to use the same safe access pattern.
Suggested change:
- config_data = config_data[str(num_slices)]
+ config_data = config_data.get(str(num_slices)) or config_data[min(
+     config_data.keys(), key=lambda x: abs(int(x) - num_slices))]
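For illustration, here is a minimal, self-contained sketch of the nearest-key fallback pattern recommended above. The helper name lookup_nearest and the example dict are hypothetical, not code from the PR.

```python
def lookup_nearest(config_data: dict, value: int) -> dict:
    """Return config_data[str(value)] if present; otherwise return the entry
    whose (stringified integer) key is numerically closest to value."""
    exact = config_data.get(str(value))
    if exact is not None:
        return exact
    nearest_key = min(config_data.keys(), key=lambda x: abs(int(x) - value))
    return config_data[nearest_key]


# Example: slicing by num_slices degrades gracefully instead of raising KeyError.
configs = {"1": {"BLOCK_M": 32}, "2": {"BLOCK_M": 64}}
print(lookup_nearest(configs, 2))  # exact hit -> {"BLOCK_M": 64}
print(lookup_nearest(configs, 3))  # no "3" key -> falls back to the "2" entry
```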
@varun-sundar-rabindranath @jeejeelee Could you help review the PR?
Will look at this PR ASAP.
Yes, as you can see, these configs are quite large, and considering how many different LoRA configurations exist, we don't want to keep them in the repo. However, tuning can indeed improve performance. Our previous idea was to keep only an option for users to load their own tuned configs, together with a doc describing how to do the tuning.
@li2haipeng If you agree or have better ideas, feel free to share your feedback. We can work together to move this forward.
Thanks for the feedback @jeejeelee. Yes, I agree with you. Users usually don't publish their own adapters, so it's not necessary for us to keep those configs; an option that allows users to load their own configs is sufficient. Regarding the tuning doc, where should we save it? I can add the env variable in this PR.
Maybe
@jeejeelee @varun-sundar-rabindranath A simple README is added; let me know if the PR is ready to merge.
@@ -0,0 +1,29 @@
# Multi-LoRA Tuning
Can we add a description of how to do tuning, so that users can perform tuning just by reading this document? We can refer to https://github.com/sgl-project/sglang/blob/main/benchmark/kernels/fused_moe_triton/README.md#tuning-tool
Sorry, I didn't submit my comment. After responding to the above comment, we can consider merging this PR.
After removing the
Purpose
Add support for loading tuned multi-LoRA shrink/expand Triton kernel configs from JSON files.
I understand there was a discussion in #13096 concluding that we don't want to ship those JSON files. I was wondering if we can simply keep the option for users who want to tune the kernels themselves. According to our benchmarking, we see a 12% to 17% ITL gain on Llama 3.3 70B with 16 LoRA adapters at rank=16.
The PR leverages the exact kernel config loading function proposed in #13096.
Edit:
The LoRA config folder should be passed in via
export VLLM_TUNED_CONFIG_FOLDER=/path/to/configs. Without it, the kernels use the default configs.
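As a usage sketch, the snippet below shows how a runtime lookup could locate such a file via VLLM_TUNED_CONFIG_FOLDER. Only the environment variable itself comes from this PR; the file name shrink_config.json and the surrounding fallback logic are illustrative assumptions.

```python
import json
import os

# Hypothetical sketch: only VLLM_TUNED_CONFIG_FOLDER is from the PR; the file
# name and surrounding logic here are illustrative assumptions.
folder = os.environ.get("VLLM_TUNED_CONFIG_FOLDER")

tuned_config = None
if folder is not None:
    path = os.path.join(folder, "shrink_config.json")  # assumed file name
    if os.path.isfile(path):
        with open(path) as f:
            tuned_config = json.load(f)

if tuned_config is None:
    # Without the env variable (or a matching file), fall back to the
    # kernel's built-in default configuration, as the PR description states.
    pass
```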