Faster init with LoRA #872
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
I wonder if we can use …
Well, I borrowed this idea from the GPTQ code. But I support your idea to avoid monkey patching.
@BenjaminBossan! Now with Llama-2-7B, applying the Guanaco LoRA adapter takes 1.5 seconds vs. 103 seconds originally.
@poedator Great, thanks for experimenting. This looks like a very nice speedup indeed. I'm not 100% sure if we can remove all of that code, but let's see. Could you please run …
I added a couple more commits to pass the device param to the LoRA layer constructors, so that LoRA adapters are initialized right on the target device. Apparently I need to do some more local testing …
Thanks for applying the fixes and providing more context. Unfortunately, the tests are failing right now. The main reason seems to be the code path when … To proceed, would you be up for fixing the failing tests for the code path I mentioned? Edit: We merged #807, so there is a merge conflict now, but it is easy to fix. Please apply your changes to this file: https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py instead of …
@BenjaminBossan, thank you for your guidance on this PR. I updated the PR to catch up with #807 and also limited my changes to just the Linear layer init, to prove the concept. Please be more specific about the failing tests: do you want me to update the test code or to change some LoRA code?
On PR scope: let me know if you want me to include code for all of the other layer types. I wanted to do that initially but got confused by the failing tests and scaled back.
Thanks for the updates and the further investigation. Regarding the failing tests, the idea is that all existing tests should pass after the change. There could be tests that need adjustment, but I'm not aware of any. Anyway, for me, all the tests passed, and as you can see, the CI passes too, so I'm not sure why they failed for you. I think the strategy you suggested, which is to start with one layer type first, is valid. We can do that, and once it has proven to work well, we can tackle the others.
I did some further digging into the code and, IIUC, the weight initialization happening in these two lines is completely unnecessary (src/peft/tuners/lora/layer.py, lines 142 and 151 at commit 0b2f950). It creates a randomly initialized weight that is then replaced by the original weight anyway (src/peft/tuners/lora/model.py, line 213 at commit 0b2f950). Therefore, I think this step could be completely skipped. Since we're talking about the full-sized weight here, not the smaller LoRA weights, this could take quite some time. The fix you proposed would create that weight on the meta device, but I wonder if we can rewrite the code to skip the initialization entirely instead. If you want to explore that, feel free to do so; otherwise, I'll take a look at it later. If you have a benchmark script to test the initialization speed, please share it.
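The two options under discussion, as a minimal sketch on a plain nn.Linear (illustrative only, not the actual PEFT code):

```python
import time
import torch
import torch.nn as nn

in_features = out_features = 4096

# Eager CPU init: reset_parameters() runs kaiming_uniform_ over the full
# 4096x4096 weight matrix, which is wasted work if the weight is
# replaced right afterwards.
t0 = time.perf_counter()
eager = nn.Linear(in_features, out_features)
print(f"eager init: {time.perf_counter() - t0:.4f} sec")

# Meta-device init (the fix proposed in this PR): the weight is just a
# shape/dtype placeholder, so the initialization costs essentially nothing.
t0 = time.perf_counter()
lazy = nn.Linear(in_features, out_features, device="meta")
print(f"meta init:  {time.perf_counter() - t0:.4f} sec")
```

Skipping the initialization entirely, as suggested above, would avoid even constructing the placeholder layer.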
Yes, that's indeed best left for a separate PR.
@BenjaminBossan, let me know if you prefer to proceed with #887 or with this PR. I admit that None is even better than a tensor on the meta device. #872 passed the tests as of today. Apparently there is something in my local setup that causes errors when merging back the adapters for conv2d. I can investigate it further, but could you share the versions of the libs that you use for testing? My env is described above.
Thanks for providing the script @poedator. Could you please paste your measurements for reference? Unfortunately, I still haven't gotten access to Llama 2 (really not sure why it's still pending), so I can't test that model. For the others, I got:
So this looks like a very decent speed-up, ~10x, although in practice a lot of time is spent on loading the model in the first place, so the overall clock time is only 11 vs. 7 sec.
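For reference, a benchmark along these lines (a hypothetical sketch, not the actual script shared in this thread) could look like:

```python
import time
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_id = "bigscience/bloomz-1b7"  # one of the models measured above
model = AutoModelForCausalLM.from_pretrained(model_id)

config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1)
start = time.perf_counter()
model = get_peft_model(model, config)  # the step being optimized here
print(f"LoRA init for {model_id} took {time.perf_counter() - start:.2f} sec")
```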
My impression is that it would be even better not to initialize the weights at all, instead of initializing them on the meta device. Having useless inits just causes confusion. The main advantage of initializing on meta is that the code changes are smaller, so they're less likely to accidentally break something. But from what I can tell, removing the inits should not break anything either. @younesbelkada @pacman100 Do you have an opinion on that? How should we proceed?
Amazing work @poedator @BenjaminBossan, thanks a lot for this! 🚀
The current way of proceeding looks great IMO, as the code change is quite minimal for a nice gain. Also, I think …
Thinking a bit and discussing offline with @BenjaminBossan, there might be one problem with this approach if a user uses …
Thank you @poedator and @BenjaminBossan for making the LoRA init super fast 🚀. I really like both approaches, favouring Benjamin's as it avoids the inits of weights altogether.
Is anything else missing for merging this PR?
Partly resolves #872

Description

After getting faster initialization of the LoRA Linear layer, initialization of Conv2d and Embedding is now sped up.

Implementation

The approach to achieving the speed-up has slightly changed compared to last time. To refresh memory: in #887, we avoided the unnecessary initialization of the full weight matrix by completely skipping nn.Linear.__init__. Although it is possible to do the same for Embedding and Conv2d, we run into some trouble here. The issue is that the __init__ methods of these classes have quite a lot more arguments and some custom logic (i.e. not only self.foo = foo but more on top). If we wanted to skip __init__ entirely, we would basically have to copy all of that into our code. Although that is possible, it is brittle (e.g. the logic could differ between PyTorch versions or change over time).

For that reason, I opted to implement this differently, using a suggestion we had discussed earlier. The approach is to call __init__ of the parent class but enforce empty weights (this is what torch.nn.utils.skip_init does, although we cannot use that function directly). This way, we avoid having to copy the __init__ code while still avoiding expensive initialization of the weights.

I did not change the code for Linear to also use this approach because the logic inside Linear.__init__ is quite simple (at least for now), so we are good with the existing approach there. However, I was curious how changing the approach for Linear would affect the initialization speed, so I ran the script from #872 again, 3 times each.

Current approach:
test 1 with model bert-base took 0.021 sec.
test 1 with model bert-base took 0.020 sec.
test 1 with model bert-base took 0.020 sec.
test 2 with model bloomz-1b7 took 0.030 sec.
test 2 with model bloomz-1b7 took 0.030 sec.
test 2 with model bloomz-1b7 took 0.030 sec.

New approach if applied to Linear:
test 1 with model bert-base took 0.038 sec.
test 1 with model bert-base took 0.039 sec.
test 1 with model bert-base took 0.038 sec.
test 2 with model bloomz-1b7 took 0.072 sec.
test 2 with model bloomz-1b7 took 0.048 sec.
test 2 with model bloomz-1b7 took 0.048 sec.

This shows that the new approach is indeed a bit slower than the existing one, though still a lot faster than what we had before. IMHO, we are safe to leave the code inside Linear as is and benefit from the slightly better performance at the cost of slightly more fragile code. But please let me know if you prefer that:
1. the new approach should also be applied to Linear, or
2. the existing approach should also be applied to Embedding and Conv2d.
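A rough sketch of the technique described above, using plain PyTorch (this mirrors what torch.nn.utils.skip_init does; it is an illustration, not the actual PEFT code):

```python
import torch
import torch.nn as nn

# Constructing a module on the meta device runs its full __init__
# (all the argument handling and custom logic), but the weight tensors
# are only placeholders, so no expensive initialization work happens.
emb = nn.Embedding(32000, 4096, device="meta")
conv = nn.Conv2d(64, 64, kernel_size=3, device="meta")

# torch.nn.utils.skip_init wraps the same idea and then materializes
# empty (uninitialized) storage on the real device:
emb_real = nn.utils.skip_init(nn.Embedding, 32000, 4096)
print(emb_real.weight.device)  # cpu, but never touched by an init function
```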
Purpose: accelerate peft model initialisation with a LoRA config.
See the issue description in #871.
The idea is to avoid costly calls to weight-initialisation functions on the CPU when those weights are not necessary and will be replaced by the original weights anyway.
This PR initially changes only the init for Linear layers.
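The gist of the change, as a minimal standalone sketch (hypothetical helper name, not the actual PEFT implementation):

```python
import torch
import torch.nn as nn

def wrap_linear_with_lora(orig: nn.Linear, r: int = 8):
    """Wrap `orig` with LoRA matrices without re-initializing its weight."""
    # Build the throwaway base layer on the meta device: no CPU-side
    # weight initialisation is performed for the placeholder tensors.
    base = nn.Linear(orig.in_features, orig.out_features,
                     bias=orig.bias is not None, device="meta")
    # Swap in the pretrained parameters instead of the placeholders.
    base.weight = orig.weight
    if orig.bias is not None:
        base.bias = orig.bias
    # Only the small LoRA matrices are actually initialised.
    lora_A = nn.Linear(orig.in_features, r, bias=False)
    lora_B = nn.Linear(r, orig.out_features, bias=False)
    nn.init.zeros_(lora_B.weight)  # standard LoRA init: B starts at zero
    return base, lora_A, lora_B
```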
Potential further improvements: …