[misc] LoRA - Skip LoRA kernels when not required #15152
Conversation
vllm/worker/model_runner.py
Changes to this file:
Previously,
lora_mapping = LoRAMapping(**dict(index_mapping=[0] * batch_size,
                                  prompt_mapping=[0] * batch_size,
                                  is_prefill=False))
self.set_active_loras(set(), lora_mapping)
this was sufficient in V0 to capture CUDA graphs with the LoRA kernels. This is because of the is_prefill flag: when it is False, i.e. the decode case, V0 always chooses to run the LoRA kernels, so all captured CUDA graphs record the LoRA kernels.
What changed?
punica_gpu.py is now updated to handle the "no lora" case based on LoRAMapping::index_mapping, and the is_prefill flag is ignored. Due to this change, an index_mapping of [0] * batch_size simply translates to the no_lora_flag_cpu being set to True, and the captured graphs don't include the LoRA kernels.
Fix: we explicitly add and remove the LoRA adapters (similar to what we do during profile runs).
Alternative solution/hack: in punica_gpu.py, when setting no_lora_flag_cpu, we could do:
use_cuda_graphs = False  # default when not running V0
if envs.VLLM_USE_V0:
    use_cuda_graphs = not is_prefill
no_lora_flag_cpu = torch.all(token_lora_mapping == -1) and not use_cuda_graphs
but ^ seems hacky and I'd like to avoid V0/V1 checks.
I prefer the implemented fix where we add and remove the LoRA adapters explicitly.
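For context, the shape of the implemented fix looks roughly like this. This is a minimal sketch, not the exact model_runner.py code: the capture loop and the removal call are hypothetical, while LoRARequest, add_dummy_lora and LORA_WARMUP_RANK mirror the names visible in the diff further down.

```python
from vllm.lora.request import LoRARequest

LORA_WARMUP_RANK = 8  # placeholder; the real constant is defined elsewhere in vLLM


def capture_model_with_lora(runner, batch_sizes):
    """Keep a dummy LoRA adapter active while CUDA graphs are captured, so the
    LoRA kernels are recorded into every graph, then drop it afterwards."""
    dummy_lora_id = 1
    if runner.lora_config is not None:
        dummy_request = LoRARequest(lora_name="warmup",
                                    lora_int_id=dummy_lora_id,
                                    lora_path="/not/a/real/path")
        runner.lora_manager.add_dummy_lora(dummy_request, LORA_WARMUP_RANK)
    try:
        for batch_size in batch_sizes:
            runner.capture_graph(batch_size)  # hypothetical capture helper
    finally:
        if runner.lora_config is not None:
            runner.remove_lora(dummy_lora_id)  # removal call is illustrative
```

The intent is that the mapping used during capture refers to a real (dummy) adapter, so no_lora_flag_cpu is not set and the LoRA kernels end up inside the captured graphs.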
Thank you for the detailed explanation, I like this.
@bnellnm @ProExpertProg @youkaichao can you take a look when you find some time, please? Thanks 🙌
It is a good summarization. We leverage point number 3 to deal with complicated attention operations, and it can be used for LoRA, too.
But if we have 2 code paths, for LoRA and no-LoRA, would it break CUDA graphs?
For CUDA graphs we always capture with LoRA, so there is just 1 path in that case.
lgtm
ditto about item() alternatives.
lora_path='/not/a/real/path')
self.lora_manager.add_dummy_lora(dummy_lora_request,
                                 LORA_WARMUP_RANK)
During the warmup, we added some dummy LoRAs, then removed them. Perhaps we could continue using those dummy LoRAs and remove them after the capture is complete. I think this would reduce redundant code. See: https://github.com/vllm-project/vllm/blob/main/vllm/worker/worker.py#L324
GPUModelRunnerBase::_dummy_run is the one adding and removing the dummy LoRAs. This _dummy_run function is used in 2 places:
- GPUModelRunnerBase::profile_run
- Worker::_warm_up_model
Updating _dummy_run to act differently based on the caller seems cumbersome.
But I agree with your point on redundant code. I have refactored the adding and removing of the dummy LoRAs in commit c05763e. Please take a look.
Note: I considered using a context manager, but that code already has a lot of indentation and I didn't want to add another level.
What do you think?
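For reference, the context-manager alternative mentioned in the note might have looked roughly like this. This is a hedged sketch: dummy_loras is a made-up helper, and the add/remove method names are assumptions; the PR instead factors the add/remove into plain helper calls to avoid the extra indentation level.

```python
from contextlib import contextmanager


@contextmanager
def dummy_loras(lora_manager, lora_requests, rank):
    """Keep dummy LoRA adapters registered for the duration of the with-block."""
    added_ids = []
    try:
        for request in lora_requests:
            # Assumed manager API, mirroring the add_dummy_lora call in the diff.
            lora_manager.add_dummy_lora(request, rank)
            added_ids.append(request.lora_int_id)
        yield added_ids
    finally:
        for lora_id in added_ids:
            lora_manager.remove_adapter(lora_id)  # assumed removal method
```

Both _dummy_run and the graph-capture path could then wrap their bodies in a with dummy_loras(...) block, at the cost of one more indentation level, which is exactly what the comment above wants to avoid.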
In V0 LoRA, we skipped the LoRA kernel launch when all the scheduled requests target the base model.
PR #14685 removed this optimization.
This PR re-introduces this optimization such that it works for both V0 and V1.
Previously, i.e. before #14685, on V0 we had a boolean flag, no_lora, in punica_gpu.py that tracked whether any of the input requests needed LoRA. This works well for V0, but the V1 architecture uses traced torch.compile graphs to execute a forward pass, and this tracing doesn't play well with dynamic control flow. However, the tracing treats torch.ops functions as a black box. This PR therefore moves the no_lora flag inside the lora_expand and lora_shrink torch operations and triggers an early exit from the operation.
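A minimal sketch of the mechanism follows. The op name, signature, fake registration, and the placeholder math are illustrative only, not the actual vLLM lora_shrink/lora_expand implementations; the point is that torch.compile treats a custom op as an opaque call, so a data-dependent early exit inside the op never surfaces as control flow in the traced graph.

```python
import torch


@torch.library.custom_op("sketch::lora_shrink", mutates_args=("out",))
def lora_shrink(x: torch.Tensor,
                lora_a: torch.Tensor,
                out: torch.Tensor,
                no_lora_flag_cpu: torch.Tensor) -> None:
    # Early exit: if no scheduled token targets a LoRA adapter, skip the kernel
    # launch entirely. The branch lives inside the op, so the traced graph is
    # unaffected.
    if no_lora_flag_cpu.item():
        return
    # Placeholder for the real Triton/CUDA shrink kernel.
    out.copy_(x @ lora_a)


@lora_shrink.register_fake
def _(x, lora_a, out, no_lora_flag_cpu) -> None:
    # Fake (meta) implementation so torch.compile can trace shapes without
    # running the kernel.
    return None
```

In the actual PR, the flag is maintained in punica_gpu.py and checked at the top of the real lora_shrink and lora_expand ops.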
Benchmarks:
Server command:
Client command:
Numbers:
The benchmark_serving.py command was run 4 times for every num_prompts value. All mean_ttft_ms measurements are provided below.
Thanks @jeejeelee for flagging this 🙌