
[BUG] Appending-Runtime-LoRA-weights #656

Open
3 tasks done
royallavanya140 opened this issue Oct 16, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@royallavanya140

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Pytorch version

2.3.1

Model

mistral-v0.3-instruct

Describe the bug

I hosted an LLM using FastAPI and accept LoRA weights from users, but sometimes I receive new weights while the model is busy with a generation. Is there any way to swap in the weights without disturbing the current generation?


Reproduction steps

  • Host the LLM with FastAPI; the request payload contains the messages and a LoRA path.
  • Issue two generation calls in parallel, at the same time, with two different LoRA paths.
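The race above can be reproduced in miniature with a stub in place of the real generator. This is a sketch only: the `set_loras`/`generate` names and the error message are modeled on the issue, not on exllamav2's verified API.

```python
import asyncio

class StubGenerator:
    """Stand-in for the real generator: refuses LoRA swaps while jobs run."""
    def __init__(self):
        self.active_jobs = 0
        self.current_lora = None

    def set_loras(self, lora_path):
        if self.active_jobs > 0:
            # Mirrors the error described in the issue.
            raise RuntimeError(
                "LoRAs cannot be updated while there are jobs in the generator queue")
        self.current_lora = lora_path

    async def generate(self, prompt):
        self.active_jobs += 1
        try:
            await asyncio.sleep(0.05)        # simulate a long forward pass
            return (prompt, self.current_lora)
        finally:
            self.active_jobs -= 1

async def handle(gen, prompt, lora_path, errors):
    try:
        gen.set_loras(lora_path)             # per-request weight swap
        return await gen.generate(prompt)
    except RuntimeError as exc:
        errors.append(str(exc))

async def main():
    gen = StubGenerator()
    errors = []
    # Two parallel generation calls with two different LoRA paths:
    await asyncio.gather(
        handle(gen, "hello", "/loras/adapter_a", errors),
        handle(gen, "hello", "/loras/adapter_b", errors),
    )
    return errors

errors = asyncio.run(main())
print(errors)   # the second call fails because the first is mid-generation
```

The second request hits the error because the first is still in flight when it tries to swap weights.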

Expected behavior

LoRAs cannot be updated while there are jobs in the generator queue

Logs

No response

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
@royallavanya140 royallavanya140 added the bug Something isn't working label Oct 16, 2024
@turboderp
Owner

Problem is that you can't batch forward passes with different LoRA settings. Applying a LoRA effectively changes the weights of the model. It's a temporary change via a low-rank overlay, but it's still effectively the same as swapping out the model for a different one. That makes sense as long as there aren't any requests in the queue, but while requests are processing, I don't know how the framework should interpret such a swap.

@psych0v0yager

psych0v0yager commented Nov 14, 2024

@turboderp, to be clear, would it be possible to append LoRA weights if you wait until there are no requests in the queue? For the dynamic generator, is it as simple as calling generator.set_loras(lora) when the queue is empty, or are there additional considerations in play?

As for multiple LoRAs at runtime, it is theoretically possible and has been done by this library (https://github.com/S-LoRA/S-LoRA); it is how vLLM lets you run multiple LoRAs at runtime. However, integrating it into exllama seems like a massive undertaking.
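The wait-until-the-queue-is-empty approach asked about above can be sketched with standard asyncio primitives. Everything here is hypothetical: `StubGen` stands in for the real generator, and the `set_loras`/`current_lora` names are assumptions, not exllamav2's verified API.

```python
import asyncio

class StubGen:
    """Stand-in generator: swapping weights mid-generation is forbidden."""
    def __init__(self):
        self.current_lora = None
        self.active = 0

    def set_loras(self, lora_path):
        assert self.active == 0, "swap attempted while jobs were running"
        self.current_lora = lora_path

    async def generate(self, prompt):
        self.active += 1
        try:
            await asyncio.sleep(0.02)        # simulate the forward pass
            return (prompt, self.current_lora)
        finally:
            self.active -= 1

class LoraSerializedGenerator:
    """Admits a request only when its LoRA matches; otherwise drains first."""
    def __init__(self, gen):
        self.gen = gen
        self.lock = asyncio.Lock()           # serializes the admission step
        self.inflight = 0
        self.idle = asyncio.Event()          # set when no jobs are running
        self.idle.set()

    async def generate(self, prompt, lora_path):
        async with self.lock:
            if lora_path != self.gen.current_lora:
                await self.idle.wait()               # wait for queue to empty
                self.gen.set_loras(lora_path)        # safe: nothing in flight
            self.inflight += 1
            self.idle.clear()
        try:
            return await self.gen.generate(prompt)
        finally:
            self.inflight -= 1
            if self.inflight == 0:
                self.idle.set()

async def main():
    g = LoraSerializedGenerator(StubGen())
    return await asyncio.gather(
        g.generate("p1", "/loras/a"),
        g.generate("p2", "/loras/a"),    # same LoRA: runs alongside p1
        g.generate("p3", "/loras/b"),    # different LoRA: waits for drain
    )

results = asyncio.run(main())
print(results)
```

Requests sharing the current LoRA still batch together; a request with a different LoRA blocks until the in-flight jobs finish, then swaps and proceeds.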
