Prompt-Tuning for text-to-image diffusion models #2085
I'm not an expert on stable diffusion, but AFAIK, there is no special handling required to fine-tune the text encoder when it comes to PEFT itself. You can use LoRA or any of the other techniques that are implemented. In case the text encoder is using OpenClip or a similar architecture, you'll have to work based on the branch from #1324. When it comes to details like datasets and objectives for training the text encoder, this is outside my domain and you'll have a better chance looking at how other folks fine-tune the text encoder.
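For illustration, a minimal sketch of what "use LoRA" could look like on the stable diffusion text encoder; the hyperparameters and target modules below are assumptions for CLIP's attention layers, not recommendations from this thread:

```python
# A hypothetical sketch of applying LoRA to the CLIP text encoder of a
# Stable Diffusion pipeline; r, lora_alpha, and target_modules are assumed
# illustrative values.
from peft import LoraConfig, get_peft_model
from transformers import CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="text_encoder"
)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    # query/value projections of CLIP's attention blocks; adjust for other
    # architectures
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.0,
)
text_encoder = get_peft_model(text_encoder, lora_config)
text_encoder.print_trainable_parameters()  # only the LoRA weights are trainable
```

Note that no task type is passed here; for LoRA-style methods, `get_peft_model` works on a plain `transformers` model without one.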
Thank you for sharing your knowledge/experience and that branch. So, from what I understood, it seems that LoRA is the only method under the PEFT umbrella implemented for the CLIP text encoder in stable diffusion (not prompt-tuning, P-tuning, or prefix-tuning). Please correct me if I'm wrong. Also, do you have any plans, now or in the future, to support the other three PEFT methods (prompt-tuning, P-tuning, and prefix-tuning) for stable diffusion (or, equivalently, for its CLIP text encoder, since those methods operate on the text input prompt), similar to what has already been implemented for LLMs?
You should be able to use prompt learning techniques such as prompt-tuning too. What I meant is that methods not based on prompt learning, such as LoRA, IA³, BOFT, etc., cannot be used on OpenClip-based text encoders without the branch from #1324.
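For reference, PEFT's documented prompt-tuning entry point looks like the following. This is the LLM-oriented usage from the PEFT docs, shown only to illustrate the API surface; the model and task type are illustrative assumptions, and whether this maps onto the SD text encoder is exactly the open question in this thread:

```python
# The standard PEFT prompt-tuning setup for a causal LM (illustrative; "gpt2"
# is a placeholder model, not the SD text encoder).
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model_name = "gpt2"  # hypothetical placeholder
model = AutoModelForCausalLM.from_pretrained(model_name)

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,
    # initialize the virtual tokens from the embeddings of a text snippet
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="A photo of",
    tokenizer_name_or_path=model_name,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the virtual-token embeddings train
```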
Yes, I got it, and I have also read discussion 761; thank you for the great contribution there. However, what matters for me now is this: I want to know whether there is any prompt-tuning implementation (any simple example would suffice) that shows how to use prompt-tuning in the peft library to fine-tune the text encoder in the stable diffusion pipeline (e.g. CompVis/stable-diffusion-v1-4). More specifically, I know that the peft implementation offers several TaskTypes for fine-tuning various types/categories of language models. But, honestly, as I am not an expert in language models, I am not sure which of those TaskTypes the text encoder in the diffusion pipeline (which is CLIP) falls under. So, as I could not find any resources or implementations on this, I am looking for a simple example of fine-tuning the CLIP text encoder of the diffusion pipeline using the existing implementation in the peft library. I hope I have stated my question more clearly now.
Unfortunately, I have also never come across a use case for fine-tuning the LM of an SD model, and there are no examples I'm aware of.
Okay, it is very helpful to know that there is at least a way of doing such a use case with PEFT. Thank you for putting time into this, and I will be waiting for your update. Please also let me know if you need more info or anything else from my side.
Hi @BenjaminBossan, I wanted to kindly ask if there is any update on this issue. Thanks!
As an update: I need to do something similar to the following simple script using the PEFT library, but I'm not sure what task type to use and what other changes need to be made in this script:
Note that you don't need to indicate a task type if the task you're training does not correspond to any of the existing ones. As for the rest, it really depends on the data you have, the training objective, etc. If you have an existing example that you want to modify to use PEFT, you can share it here and I can take a look.
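If no existing task type fits, one way to experiment is to implement prompt tuning by hand: freeze the CLIP text encoder and train only a small set of virtual-token embeddings prepended to every prompt. The sketch below is a hypothetical illustration written under assumptions (class name, hook trick, hyperparameters are all made up here), not an official PEFT recipe. Because `CLIPTextModel` has no `inputs_embeds` argument, it prepends dummy token ids and overwrites their embeddings via a forward hook:

```python
# Hand-rolled prompt tuning for the SD text encoder, without PEFT task types.
# Everything here is an illustrative assumption, not an established recipe.
import torch
import torch.nn as nn
from transformers import CLIPTextModel, CLIPTokenizer


class PromptTunedCLIPText(nn.Module):
    """Freezes a CLIP text encoder; learns only `num_virtual_tokens` embeddings."""

    def __init__(self, text_encoder: CLIPTextModel, num_virtual_tokens: int = 8):
        super().__init__()
        self.text_encoder = text_encoder
        self.n = num_virtual_tokens
        for p in self.text_encoder.parameters():
            p.requires_grad = False  # only the virtual tokens will be trained
        embed = text_encoder.get_input_embeddings()
        # initialize the virtual tokens from random vocabulary embeddings
        init_ids = torch.randint(0, embed.num_embeddings, (self.n,))
        self.virtual_tokens = nn.Parameter(embed.weight[init_ids].detach().clone())

        # overwrite the embeddings of the first n (dummy) positions with the
        # learnable virtual tokens; position embeddings, the causal mask, and
        # the rest of the encoder are left untouched
        def swap_prefix(module, inputs, output):
            out = output.clone()
            out[:, : self.n, :] = self.virtual_tokens.to(out.dtype)
            return out

        embed.register_forward_hook(swap_prefix)

    def forward(self, input_ids, attention_mask=None):
        batch = input_ids.shape[0]
        dummy = input_ids.new_zeros(batch, self.n)  # ids irrelevant, swapped above
        input_ids = torch.cat([dummy, input_ids], dim=1)
        if attention_mask is not None:
            attention_mask = torch.cat(
                [attention_mask.new_ones(batch, self.n), attention_mask], dim=1
            )
        return self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)


tokenizer = CLIPTokenizer.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="tokenizer"
)
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="text_encoder"
)
model = PromptTunedCLIPText(text_encoder, num_virtual_tokens=8)

# reserve room for the virtual tokens within CLIP's 77-token limit
tokens = tokenizer(
    ["a photo of a cat"],
    padding="max_length",
    max_length=tokenizer.model_max_length - 8,
    truncation=True,
    return_tensors="pt",
)
out = model(**tokens)
# out.last_hidden_state is what the SD UNet consumes via cross-attention
```

With this setup only `model.virtual_tokens` needs to be optimized and saved; the training loop, data, and objective would follow whatever SD fine-tuning recipe you are adapting.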
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Hi, I have been looking for a simple example/script that shows how to use the prompt-tuning technique in the PEFT library to fine-tune the text encoder of a stable diffusion model, but I could not find any. Could you please point me to one if it already exists? If there is no implementation, I would appreciate any help or available resources for fine-tuning the text encoder, with or without the PEFT library. Thanks!