[Model] LoRA Support for Ultravox model #11253

Merged
merged 21 commits into from
Feb 6, 2025

Conversation

thedebugger
Contributor

@thedebugger thedebugger commented Dec 17, 2024

This should also work with Mistral models, which also use the LlamaForCausalLM architecture (re: here).


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@thedebugger
Contributor Author

thedebugger commented Dec 17, 2024

Hi folks, I'm working with @petersalas on this. The PR is not complete, but I wanted to start the discussion as I have some open questions and need some help from the vLLM community.


mergify bot commented Dec 31, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @thedebugger.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 31, 2024
@thedebugger thedebugger force-pushed the svij-ultravox-lora-dec-16 branch 2 times, most recently from 771484d to 64a664f Compare December 31, 2024 18:49
@mergify mergify bot removed the needs-rebase label Dec 31, 2024
@thedebugger thedebugger changed the title from "WIP: Ultravox Support for LoRA" to "Ultravox Support for LoRA" Dec 31, 2024
@thedebugger thedebugger marked this pull request as ready for review December 31, 2024 18:52
Signed-off-by: Sumit Vij <sumitvij11+github@gmail.com>
WIP: lora tests

Minor tweaks

Moar fixes

Temp changes

Cleanup

Add more debugging logs and packed modules

Signed-off-by: Sumit Vij <sumitvij11+github@gmail.com>
Remove stale comment

Add llama lora modules

Add llama test case

Add test case and log warning on missing lora modules

Rollback unwanted changes and format fixes

Signed-off-by: Sumit Vij <sumitvij11+github@gmail.com>
@thedebugger thedebugger force-pushed the svij-ultravox-lora-dec-16 branch from 64a664f to 3f5996c Compare January 1, 2025 17:31
@jeejeelee
Collaborator

Can you refer to #10022 to minimize the changes?

@thedebugger
Contributor Author

Can you refer to #10022 to minimize the changes?

The changes are in line with #10022, except for the test case and other minor logging changes. Do you have concerns about any particular change?

@jeejeelee
Collaborator

It looks like there are issues with both the added tests and logs. We should only modify the Ultravox script, following the changes made in #10022.

@thedebugger
Contributor Author

What is/are the issue(s)? Maybe I'm missing something, but the tests are passing.

@thedebugger
Contributor Author

@jeejeelee please let me know what your concerns are; happy to address them. Having a test case was super helpful for making sure LoRA works as expected with Llama and Ultravox.

@jeejeelee
Collaborator

@thedebugger We should only modify ultravox.py, please revert other changes. After you revert them, we can merge this PR.

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Collaborator

@jeejeelee jeejeelee left a comment

Let's keep the unit tests as they are for now. I'll look into how to train Ultravox LoRA and update in another PR

@thedebugger
Contributor Author

thedebugger commented Jan 20, 2025

@jeejeelee I'll fix the test tomorrow. I pulled the latest code and the test fails for me with a CUDA OOM error (different from the CI failure). I also ran the test again on 208e662 and it passes, so something likely changed on master that is causing the test to fail.

AFAICT, the failure is not related to LoRA, so training ultravox lora shouldn't have any impact. Let me troubleshoot this tomorrow and figure out what is going on before you spend time on this.

@thedebugger
Contributor Author

thedebugger commented Jan 21, 2025

I spent more time looking at the failure. It happens during init when running Ultravox with dummy data, and I haven't been able to reproduce it locally (though locally the test fails with a CUDA OOM error for Llama on the latest version). I checked the other Ultravox test and it is working fine. I'll look further tomorrow into why CI is seeing a device mismatch when running Whisper.

@jeejeelee
Collaborator

I spent more time looking at the failure. It happens during init when running Ultravox with dummy data, and I haven't been able to reproduce it locally (though locally the test fails with a CUDA OOM error for Llama on the latest version). I checked the other Ultravox test and it is working fine. I'll look further tomorrow into why CI is seeing a device mismatch when running Whisper.

We can first remove the LoRA test-related code and merge this PR. I'll spend some time later training a LoRA model. What do you think?

auto-merge was automatically disabled January 22, 2025 06:29

Head branch was pushed to by a user without write access

@thedebugger thedebugger force-pushed the svij-ultravox-lora-dec-16 branch from d247036 to 1195ad8 Compare January 22, 2025 07:48
@jeejeelee jeejeelee removed the ready label (ONLY add when PR is ready to merge/full CI is needed) Jan 22, 2025
@thedebugger thedebugger force-pushed the svij-ultravox-lora-dec-16 branch 2 times, most recently from d5d023b to 09c8388 Compare January 23, 2025 06:44
@thedebugger
Contributor Author

thedebugger commented Jan 23, 2025

@jeejeelee can you trigger the LoRA test and add me to the Buildkite org (if you have perms) so that I can trigger LoRA tests next time? I made a few tweaks and want to verify that they work.

@jeejeelee
Collaborator

I've already started it. I suggest that if it still doesn't pass, we should remove the lora test. There's no need to waste time here.

"""
# Check if set_default_device fixes the CI failure. Other lora tests set
# device to "cuda", which might be causing device mismatch in CI
torch.set_default_device("cpu")
Collaborator

That's too hacky, don't do it. This significantly increased the CI testing time.

Contributor Author

@thedebugger thedebugger Jan 24, 2025

I checked the time before and after: delta was ~4 mins. Knowing this works, let me check if I can find a better fix for this

Source

Contributor Author

So it is likely a bug in the Whisper implementation in transformers. I've opened a PR to fix that.

The workaround isn't ideal, but the impact is limited IMO given that the passing test takes roughly 2-3 mins anyway. Moreover, the default device is used only when a device is not explicitly passed as a function parameter, which is why we aren't seeing much impact.

I also tried a few other options, like passing the device all the way through from vllm -> ultravox -> whisper, but that is a lot more complicated and requires changes in a bunch of places. So I think it is okay to merge for now, and I can clean up once it is fixed in transformers. Sounds okay?
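
For readers following along, the point about the default device can be illustrated with a minimal sketch. It is not code from this PR: the helper name `make_buffers` is hypothetical, and the `meta` device stands in for `cuda` only so that the snippet runs on a machine without a GPU. It assumes PyTorch 2.0 or later, where `torch.set_default_device` is available.

```python
import torch


def make_buffers():
    """Hypothetical helper: allocate one tensor with and one without an explicit device."""
    implicit = torch.zeros(4)                # no device argument: follows the default device
    explicit = torch.zeros(4, device="cpu")  # explicit device argument: ignores the default
    return implicit.device.type, explicit.device.type


# Assumption: "meta" plays the role of "cuda" here so no GPU is required.
torch.set_default_device("meta")
meta_result = make_buffers()

# Switch the default back to CPU, as the workaround in the test does.
torch.set_default_device("cpu")
cpu_result = make_buffers()

print(meta_result, cpu_result)
```

Here `meta_result` is `("meta", "cpu")` and `cpu_result` is `("cpu", "cpu")`: only the allocation that omits `device` changes with the default, which matches the argument that code passing a device explicitly is unaffected by the workaround.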

Contributor Author

@thedebugger thedebugger Jan 27, 2025

@jeejeelee okay to merge if tests are passing?

Btw, I verified that device bug exists upstream and I'm working on getting that patched up

Contributor Author

Hi @jeejeelee, is set_default_device a blocker? Let me know how you want to move forward.

@thedebugger thedebugger force-pushed the svij-ultravox-lora-dec-16 branch 2 times, most recently from 7f010af to 2b8db17 Compare January 26, 2025 05:19
- Reduce model len and max num seq to reduce memory
- Re-trigger tests

Signed-off-by: Sumit Vij <sumitvij11+github@gmail.com>
@thedebugger thedebugger force-pushed the svij-ultravox-lora-dec-16 branch from 2b8db17 to 1976ee0 Compare January 27, 2025 02:03
@jeejeelee jeejeelee enabled auto-merge (squash) February 6, 2025 01:37
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 6, 2025
@DarkLight1337 DarkLight1337 changed the title from "LoRA Support for Ultravox model" to "[Model] LoRA Support for Ultravox model" Feb 6, 2025
@simon-mo simon-mo merged commit d88506d into vllm-project:main Feb 6, 2025
53 of 57 checks passed
fxmarty-amd pushed a commit to fxmarty-amd/vllm that referenced this pull request Feb 7, 2025
Signed-off-by: Felix Marty <felmarty@amd.com>
AoyuQC pushed a commit to AoyuQC/vllm that referenced this pull request Feb 8, 2025
ShangmingCai pushed a commit to ShangmingCai/vllm that referenced this pull request Feb 10, 2025
panf2333 pushed a commit to yottalabsai/vllm that referenced this pull request Feb 18, 2025
kerthcet pushed a commit to kerthcet/vllm that referenced this pull request Feb 21, 2025
Labels
documentation: Improvements or additions to documentation
ready: ONLY add when PR is ready to merge/full CI is needed
4 participants