[Bugfix] Fix: Fix multi loras with tp >=2 and LRU cache #20873
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀 …
Summary of Changes
Hello @charent, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request delivers a crucial bug fix for vLLM's multi-LoRA inference capabilities, particularly when operating in a distributed environment with tensor parallelism and an LRU cache. The core problem involved incorrect model behavior when switching between or reloading LoRA adapters. The solution involves precise control over LoRA module merging and state management, ensuring that the correct LoRA weights are active and properly handled across parallel workers. A dedicated test case has been introduced to guarantee the stability and correctness of this fix.
Highlights
- Bug Fix: Resolved an issue where vLLM produced unexpected output with greedy search when using multiple LoRAs, tensor parallelism (TP >= 2), and an LRU cache, especially after reloading or requesting a different LoRA.
- LoRA Module Management: Implemented conditional logic in _create_merged_loras_inplace within vllm/lora/models.py to prevent the premature removal of replaced LoRA modules when tensor parallelism is enabled and specific LRU cache conditions are met.
- Dynamic LoRA Reloading: Introduced a mechanism in vllm/lora/worker_manager.py to explicitly re-create merged LoRA weights (re_create_merged_loras) when an already loaded LoRA is requested while the LRU cache is active and tensor parallelism is enabled, ensuring correct model state (see the sketch after this list).
- New Test Case: Added a new test file, tests/lora/test_multi_loras_with_tp.py, to specifically reproduce and validate the fix for the multi-LoRA, TP, and LRU cache interaction bug, including scenarios for adding, removing, and reloading LoRAs.
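A rough sketch of the reloading path described in the third highlight above; apart from re_create_merged_loras, which this PR introduces, the names and attributes below are simplifying assumptions rather than the actual vllm/lora/worker_manager.py code.

```python
# Illustrative sketch only: when the merged (packed) LoRA weights get rebuilt
# before activation. Everything except re_create_merged_loras is an assumed,
# simplified interface, not vLLM's exact API.
def activate_lora_sketch(manager, lora_request, tp_size: int) -> bool:
    already_loaded = lora_request.lora_int_id in manager.list_adapters()
    # The LRU cache can hold more adapters than there are active GPU slots.
    lru_cache_in_play = manager.capacity > manager.lora_slots
    if already_loaded and lru_cache_in_play and tp_size >= 2:
        # The adapter is still cached, but its packed weights may be stale
        # after an earlier eviction/replacement, so rebuild them before the
        # adapter is activated on every tensor-parallel worker.
        lora_model = manager.get_adapter(lora_request.lora_int_id)
        manager.re_create_merged_loras(lora_model)
    return manager.activate_adapter(lora_request.lora_int_id)
```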
Code Review
This pull request addresses a bug in multi-LoRA serving when using tensor parallelism with an LRU cache. The core of the fix is to conditionally preserve LoRA module weights during packing and to re-pack them when a cached LoRA is activated. The addition of a new test case is excellent for ensuring this specific scenario is covered going forward.
My review comments focus on improving the readability and maintainability of the new code by renaming variables for clarity and simplifying complex conditional statements.
Review comment on vllm/lora/models.py (outdated):
This comment is helpful but could be more precise. The condition for not removing modules is more complex than just tp >= 2. It would be beneficial to explain why this is needed for better future maintenance.
DO NOT remove the modules that have been replaced when tp >= 2,
cached LoRAs > 1, and max_loras < max_cpu_loras, for the case where
PackedLoRALayerWeights is run again.

Force-pushed from 17bc12d to 7cff672.
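For context, a simplified sketch of the packing step this review thread is discussing. It is not the actual vllm/lora/models.py diff: only PackedLoRALayerWeights and the three-part condition come from the PR, while the surrounding structure and names are assumptions.

```python
# Simplified, illustrative packing step with the guard discussed above; the
# structure is assumed and does not mirror the real _create_merged_loras_inplace.
from vllm.lora.lora import PackedLoRALayerWeights  # module path may vary across vLLM versions


def pack_loras_sketch(lora_model, packed_modules: dict[str, list[str]],
                      tp_size: int, max_loras: int, max_cpu_loras: int) -> None:
    for packed_name, sub_names in packed_modules.items():
        sub_loras = [lora_model.loras.get(name) for name in sub_names]
        lora_model.loras[packed_name] = PackedLoRALayerWeights.pack(sub_loras)
        # DO NOT remove the replaced sub-modules when tp >= 2, more than one
        # adapter can sit in the LRU cache, and max_loras < max_cpu_loras:
        # packing may need to run again when this adapter is re-activated.
        keep_replaced = (tp_size >= 2 and max_cpu_loras > 1
                         and max_loras < max_cpu_loras)
        if not keep_replaced:
            for name in sub_names:
                lora_model.loras.pop(name, None)
```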
Thank you, I will look at this PR ASAP.

Sorry for the delayed response. Could you first explain why this bug occurs?
In our production environment, we deploy multiple text-to-SQL LoRA models. Because a single GPU does not have enough memory, we use tensor parallelism. An example from our case follows.

With the vLLM config tp = 2 and max-cpu-loras = 1, the LoRA generated:

SELECT
    name, mobile
FROM user_info
WHERE name = 'Alice' and user_type = 1

The SQL above was correct. But with the vLLM config tp = 2 and max-cpu-loras = 2, it generated:

SELECT
    name, mobile
FROM user_info
WHERE name_en = 'Alice'

This SQL was NOT correct: it misses the user_type condition and uses the wrong column name.
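To make the scenario above concrete, here is a condensed reproduction sketch using vLLM's offline LLM API with greedy sampling. The base model name, adapter paths, and prompt are placeholders; the authoritative reproduction is tests/lora/test_multi_loras_with_tp.py.

```python
# Condensed reproduction sketch; model name, adapter paths, and prompt are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",  # placeholder base model
    enable_lora=True,
    tensor_parallel_size=2,            # tp >= 2
    max_loras=1,
    max_cpu_loras=2,                   # LRU cache larger than max_loras
)

greedy = SamplingParams(temperature=0, max_tokens=128)
sql_lora_1 = LoRARequest("sql_lora_1", 1, "/path/to/sql_lora_1")
sql_lora_2 = LoRARequest("sql_lora_2", 2, "/path/to/sql_lora_2")
prompt = ["Find the name and mobile of the normal user Alice."]  # placeholder

# Generate with adapter 1, switch to adapter 2 so the LRU cache replaces and
# later re-activates adapter 1, then check the greedy output is unchanged.
before = llm.generate(prompt, greedy, lora_request=sql_lora_1)
_ = llm.generate(prompt, greedy, lora_request=sql_lora_2)
after = llm.generate(prompt, greedy, lora_request=sql_lora_1)
assert before[0].outputs[0].text == after[0].outputs[0].text
```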
Very sorry for keeping this PR in the queue for so long. Could you please test https://github.com/jeejeelee/vllm/blob/fix-lora-slice/vllm/lora/layers.py#L682 to verify whether it fixes this bug?
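For readers following the link above, a generic illustration of per-rank slicing of a LoRA B matrix for a column-parallel layer. This is a simplified sketch under an assumed (lora_rank, output_size) layout, not the vLLM implementation being referenced.

```python
# Generic illustration of tensor-parallel slicing of a LoRA B matrix; the
# (lora_rank, output_size) layout is an assumption made for this sketch.
import torch


def slice_lora_b_sketch(lora_b: torch.Tensor, tp_rank: int,
                        tp_size: int) -> torch.Tensor:
    # Each TP rank keeps only its shard of the output dimension, matching the
    # column-wise partitioning of the base layer's weight matrix.
    shard_size = lora_b.shape[1] // tp_size
    start = tp_rank * shard_size
    return lora_b[:, start:start + shard_size]
```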
Yes, I replaced the slice_lora_b function in the vLLM v0.9.2 source code with the new one; it passed the test cases and fixed my problem, thank you. I found the slicing logic in layers.py quite complex, which is why I proposed a patch like this PR to solve the bug 😂
Would it be convenient for you to finish this PR in that way?

Sorry, what should I do: close this PR, or replace the code with the new slice_lora_b function?
Signed-off-by: charent <19562666+charent@users.noreply.github.com>
Force-pushed from 7cff672 to c20d2f0.
I have updated the code of this PR and added the config for automated testing.
overall LGTM, please fix the pre-commit failure
Signed-off-by: charent <19562666+charent@users.noreply.github.com>
Signed-off-by: charent <19562666+charent@users.noreply.github.com>
Signed-off-by: charent <19562666+charent@users.noreply.github.com>
Thank you for this fix.
…#20873) Signed-off-by: charent <19562666+charent@users.noreply.github.com>
…#20873) Signed-off-by: charent <19562666+charent@users.noreply.github.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
…#20873) Signed-off-by: charent <19562666+charent@users.noreply.github.com> Signed-off-by: Noam Gat <noamgat@gmail.com>
…#20873) Signed-off-by: charent <19562666+charent@users.noreply.github.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
…#20873) Signed-off-by: charent <19562666+charent@users.noreply.github.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
Essential Elements of an Effective PR Description Checklist
… supported_models.md and examples for a new model.

Purpose

This bug only occurs when running vLLM with tensor_parallel_size >= 2, max-cpu-loras >= 2, serving-lora-modules >= 2, and max_loras < max-cpu-loras.

Bug Reproduction
While serving multiple LoRAs with tensor parallelism (tp) >= 2 and LoRA LRU cache, after reloading or requesting a different LoRA, the next LoRA request will produce unexpected output with greedy search.
For more reproduction details, please see the file test_multi_loras_with_tp.py. Before the fix, vLLM v0.9.2 could not pass these test cases.

In our case, multiple LoRA models provide SQL generation services, which require greedy search. This bug caused the LoRA models to fail to generate the correct SQL they were trained to produce.
In addition, serving multiple LoRAs with SGLang works fine.
Test Plan
Before fix
After fix
(Optional) Documentation Update