can't resume training with quantized base model #9108
Comments
I don't see any calls to
Benjamin, I believe it would be nice to be able to support this because, with models like Flux coming, it will be crucial to support quantized models + PEFT in this way. But I think this is something that needs to be fixed in
It would also be nice to have a dummy checkpoint for
it's that |
That means we lose the benefits of quantization. |
well that happens if you quantise after sending it to the accelerator, too. it's quite fragile currently |
Oh then it seems like the net benefits are zero in terms of memory, no? Not saying that as a negative comment because I think this exploration is critically important. |
i had to reorder things so it goes to the accelerator long after everything is done to the weights |
Could you also then provide the order that provides us with the benefits of quantization (when training is not being resumed)? |
that is in the reproducer |
I'm not familiar with quanto, what type of quantization is applied here? Also, with PEFT, we have different LoRA classes depending on quantization, so e.g. we have a different linear layer for bnb than for GPTQ. Therefore, the model already needs to be quantized before adding the adapter. Not sure if this is any help here, but I wanted to mention that just in case. |
@BenjaminBossan this is weight-only int8 quantization.
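(For reference, a minimal sketch of what weight-only int8 quantization with quanto looks like; the model below is a stand-in, not the actual transformer.)

```python
import torch.nn as nn
from optimum.quanto import quantize, freeze, qint8

# Stand-in model; the same calls apply to a diffusers transformer.
model = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64))

# Weight-only quantization: weights go to int8, activations are untouched.
quantize(model, weights=qint8)

# freeze() materializes the int8 weights so they are stored quantized
# instead of being re-quantized on the fly from the floating-point weights.
freeze(model)

print(model)  # eligible nn.Linear layers are now quanto QLinear modules
```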
I see. So, IIUC, this needs to be implemented at the PEFT level then? I imagine |
same error when we load the adapter after freeze |
Okay, so I checked, as I wasn't sure if quanto is using its own quantization scheme or if they wrap others. But from what I could tell, it's their own. Therefore, for this to work with PEFT, we would need to add support for quanto layers. Hopefully, this shouldn't be too difficult. I added an item to the backlog. In the meantime, if anyone has a prototype, it is now quite easy to add support for new LoRA layer types to PEFT by using dynamic dispatch. So this can be tested quickly without the need to create a PR on PEFT.
Very nice, but the problem is we don't have quanto LoRA layers yet. |
ideally the lora remains unquantised and only the base weights are |
Yes, they have yet to be implemented. What I meant is that if someone has a POC, they can quickly test it with dynamic dispatch, no need to go through a lengthy process of adjusting PEFT code.
Yes, that's always the case with quantization in PEFT: The base weights are quantized but the LoRA weights are not. |
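(As an aside, a toy illustration of that arrangement, not PEFT's actual LoRA implementation: the base layer stays quantized and frozen while the LoRA factors remain in full precision and trainable.)

```python
import torch
import torch.nn as nn

class LoraOverQuantizedBase(nn.Module):
    """Toy illustration only: a frozen (possibly quantized) base layer plus
    unquantized, trainable LoRA factors. Not PEFT's actual implementation."""

    def __init__(self, base_layer: nn.Module, in_features: int, out_features: int,
                 r: int = 16, alpha: int = 16):
        super().__init__()
        self.base_layer = base_layer  # e.g. a quanto QLinear; kept frozen
        for p in self.base_layer.parameters():
            p.requires_grad_(False)
        # The LoRA factors stay in the compute dtype (fp32/bf16), never quantized.
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base_layer(x) + self.lora_B(self.lora_A(x)) * self.scaling

# A plain nn.Linear stands in for the quantized base layer here.
layer = LoraOverQuantizedBase(nn.Linear(64, 64), in_features=64, out_features=64)
out = layer(torch.randn(2, 64))
```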
wasn't sure exactly what was meant by quanto lora layers :D |
Then I am still a little unclear what it would mean to support injecting LoRA layers into a model that is quantized with quanto (which can be detected at runtime). Could you maybe elaborate a little more? @BenjaminBossan
we can run add_adapter on the quantised model; it's just when we try loading the state dict that it really gets upset. the training of the lora also works
What I mean is that if you want to add a new quantization method and have written the layer class as a POC, normally the next step would be to edit PEFT to make use of that class and create a PR on PEFT, e.g. like in this PR for EETQ. With dynamic dispatch, you can instead just do this to use the new class:

```python
config = LoraConfig(...)
custom_module_mapping = {nn.Linear: MyQuantizedLinearClass}
config._register_custom_module(custom_module_mapping)
```

No PEFT source code needs to be altered.
Oh, this is surprising, I'll have to look into this. |
@BenjaminBossan this is likely because of diffusers/src/diffusers/loaders/peft.py, line 111 (at commit 1fcb811)
Regarding using the dispatching system, I think there might be a way. Consider the following example:

```python
from optimum.quanto import quantize, qfloat8, freeze
import torch.nn as nn


class Model(nn.Module):
    def __init__(self, n_dim):
        super().__init__()
        self.linear1 = nn.Linear(n_dim, 1204)

    def forward(self, x):
        return self.linear1(x)


model = Model(64)
quantize(model, weights=qfloat8)
freeze(model)
print(model)
```

It prints:

```
Model(
  (linear1): QLinear(in_features=64, out_features=1204, bias=True)
)
```

Then if I do this (mimicking what we are likely trying to do):

```python
from optimum.quanto import quantize, qfloat8, freeze, requantize
from optimum.quanto.quantize import quantization_map
import torch.nn as nn


class Model(nn.Module):
    def __init__(self, n_dim):
        super().__init__()
        self.linear1 = nn.Linear(n_dim, 1204)

    def forward(self, x):
        return self.linear1(x)


model = Model(64)
quantize(model, weights=qfloat8)
freeze(model)
print(model)

# Serialize the quantization metadata and the quantized state dict ...
qmap = quantization_map(model)
qstate_dict = model.state_dict()

# ... and load them back into a freshly initialized model.
new_model = Model(64)
requantize(new_model, state_dict=qstate_dict, quantization_map=qmap)
print(new_model)
```

It prints:

```
Model(
  (linear1): QLinear(in_features=64, out_features=1204, bias=True)
)
Model(
  (linear1): QLinear(in_features=64, out_features=1204, bias=True)
)
```

So, it appears to me that we can probably use the dispatching system by mapping the
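(The comment above is cut off, but if the idea is to map quanto's QLinear class through PEFT's dynamic dispatch, the registration might look roughly like the sketch below. QuantoLoraLinear is hypothetical here; a real implementation, like the one later proposed in huggingface/peft#2000, has to follow the conventions of PEFT's LoRA layers. The target_modules values are illustrative.)

```python
import torch.nn as nn
from optimum.quanto import QLinear
from peft import LoraConfig

class QuantoLoraLinear(nn.Module):
    """Hypothetical quanto-aware LoRA layer: it would keep the QLinear base
    frozen and add unquantized lora_A/lora_B weights, following the
    conventions of peft.tuners.lora layers."""
    ...

# Illustrative config; target_modules depends on the actual model.
config = LoraConfig(r=16, lora_alpha=16, target_modules=["to_q", "to_k", "to_v"])

# Whenever PEFT encounters a quanto QLinear while injecting adapters, it will
# dispatch to the custom class -- no changes to PEFT's source required.
custom_module_mapping = {QLinear: QuantoLoraLinear}
config._register_custom_module(custom_module_mapping)
```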
Hmm, but this still calls diffusers/src/diffusers/loaders/peft.py, line 146 (at commit 1fcb811), which starts the whole PEFT machinery of checking matching layers and replacing them with
so that can't be the explanation. Possibly, even after quantization, there are some non-quantized layers left that are matched by PEFT, thus preventing PEFT from erroring because it could not match any layers? I'll have to test but am currently not on the right machine to do so :)
Sure, thank you! There might be a way for us to use the dispatcher in the meantime, and I updated my comment accordingly.
Okay, back after some testing. The reason why there is no error when adding a PEFT adapter on top of a quanto model is because the quanto

I did some quick tests and it appears it's actually working out of the box -- almost. What doesn't work for now is merging the LoRA weights into the

Anyway, it looks like support for quanto should not be that difficult to add. In fact, for simple use cases, it may already work correctly. However, I'd not recommend using it until we have run it through our test suite to ensure that we know what does and does not work. Until then, I'd recommend using one of the officially supported quantization methods in PEFT.
i would love to but |
I see, all the more reason to add quanto support soon. I'll open an issue on PEFT to call for contributions. If there is no taker, I'll work on it as soon as I have a bit of time on my hands. Update: The issue: huggingface/peft#1997
I coded something up quickly: huggingface/peft#2000. It's still unfinished but in my limited testing looks like it works. |
I also have this error when I try to load a LoRA weight into the quantized transformer class. I am using this method because it is much faster to load the quantized weights rather than quantizing at runtime. The issue is now that, when I try loading the LoRA weights, I get a similar error.
Here are some relevant parts of my code
Is there any workaround or fix for this?
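(The snippet referred to above isn't included here. Purely as an illustration of the flow being described, i.e. loading a pre-quantized transformer from disk and then LoRA weights on top, a sketch might look like the following; the model ID, file paths, and LoRA weight name are placeholders, not taken from the original code.)

```python
import json
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel
from optimum.quanto import requantize

# Assumed one-off step (not shown): quantize the transformer with quanto, then
# save model.state_dict() and quantization_map(model) to disk.

# At load time, rebuild the transformer and requantize it from the saved
# artifacts, which is faster than quantizing from scratch on every run.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)
state_dict = torch.load("flux_transformer_int8.pt")  # placeholder path
with open("flux_transformer_qmap.json") as f:         # placeholder path
    qmap = json.load(f)
requantize(transformer, state_dict=state_dict, quantization_map=qmap)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
)

# This is the call that raises the KeyError being discussed.
pipe.load_lora_weights("path/to/lora", weight_name="pytorch_lora_weights.safetensors")
```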
@Leommm-byte The issue is that non-strict loading of |
Cc: @bghira for huggingface/optimum-quanto#278. |
Thank you for your fix. It worked really well. My own fix was to comment out these lines
which wasn't very smart in retrospect 😅. I thought it was just for checking for incompatible keys. So the side effect was that the LoRA weight wasn't having any effect on the model's output. Thank you so much once again. I'll continue using your fix till the maintainers decide to use this fix or create another. Cheers. |
you should have seen my face as i read through your comment lmao |
Could someone put up a gist that shows the application of the fix (ideally following the DreamBooth LoRA Flux script we have in the repo)? This is not to deter anyone from fine-tuning tools like SimpleTuner or AI Toolkit, but to have a more self-contained and minimal example. I can do it tomorrow once I get back to my computer. Question to @bghira and @Leommm-byte:
baseline memory goes from 33GB VRAM at bf16 to ~17GB VRAM at int8 for LoRA |
So, if we decrease resolution, it seems like it could be possible to do it on a free Colab? |
no, reducing resolution really only increases the speed. paradoxically i'm training on 2048px with the same vram use but it goes slower. this is on 3x 4090. |
How about coupling that with 8bit Adam or does that impact convergence? |
i think DeepSpeed with a single GPU will be fine for 16GB VRAM. the weights at bf16 are just under 24G on their own. cutting that in half naively will give us just around 12G of memory for the full model weights. and then the optimiser states are like +100% more memory used, with AdamW. Lion fares better since it keeps fewer copies. so something has to be offloaded even with stochastic bf16 + Lion, no autocast, even for LoRA at 16G VRAM. the way kohya is pulling this off is with a modified Adafactor with fused optimiser step + backward pass, and other offload tactics that don't exist in Diffusers |
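(A rough back-of-the-envelope version of that budget, assuming a ~12B-parameter transformer; the numbers below are approximate and the LoRA parameter count is only an illustrative order of magnitude.)

```python
# Rough memory budget, assuming a ~12e9-parameter transformer (approximate).
GB = 1e9
params = 12e9

print(f"base weights, bf16 (2 bytes/param): ~{params * 2 / GB:.0f} GB")  # ~24 GB
print(f"base weights, int8 (1 byte/param):  ~{params * 1 / GB:.0f} GB")  # ~12 GB

# AdamW keeps two extra states per *trainable* parameter, so a full fine-tune
# at least doubles the memory attributable to the trained weights. For a LoRA,
# the trainable fraction is tiny, so the base weights and activations dominate.
lora_params = 50e6  # illustrative order of magnitude for a rank-16 adapter
print(f"AdamW states for LoRA only (fp32):  ~{lora_params * 2 * 4 / GB:.1f} GB")
```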
8bit Adam doesn't help memory pressure much or at all, it needs autocast and fp32 gradients as i understand it. when we don't do that, it just kinda gets stuck and slowly degrades. |
Although my case was running inference rather than training, there's a significant improvement. My rig has an RTX 4090 mobile (it's a laptop, so 16GB VRAM). Inference goes from basically impossible (with unquantized bf16 weights) to about 2 to 2.5 minutes when loading from disk (using the quantized transformer and T5 at fp8). But once it is in memory, it's like 20 to 30 sec. VRAM usage is about 15.5GB.
😭 |
for training speed, since bf16 will OOM, there's really no way to know how slowly bf16 would run on a 3090 vs fp8/int8. but fp8 is slower than int8 on 3090. not sure about 4090. and A100 goes about 1.25 seconds per step at 512x512 whether it's a rank-16+int8 LoRA or a bf16 LoRA.

fully training the non-quantised bf16 weights on 1x A100-80G SXM4 with DeepSpeed ZeRO 2 is about 10 seconds per step at 1024x1024. this is the same speed with 3x A100-80G SXM4, but then you can do nicer batch sizes, and the VRAM used per GPU dropped with more GPUs. increasing the batch size hurts throughput a lot - but not on the H100, which is required to keep throughput scaling with batch size. that dang dispatch layer in the hw is just too good.

i paid to test an MI300X yesterday. it's still in need of a lot of work for scaling with batch sizes, but you just need one and no DeepSpeed to fully tune the model for $3.99/hr. H100 is the same cost, but needs DeepSpeed, using 58,000MiB of VRAM and about 176GB of system memory.

and i implemented the offload techniques from kohya, but their utility is limited to a single GPU; we can't fuse things as deeply on multi-GPU setups. so, for this you'll always need multi-GPU quantised training, or tolerate the slowdown of heavy offloading through DeepSpeed, which is optimised for multi-GPU training.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Describe the bug
We get a KeyError when the state dict is loaded into the transformer.
Reproduction
Logs
System Info
latest git main
Who can help?
@sayakpaul