[IP-Adapter] Support multiple IP-Adapters #6573
Conversation
If you use multiple images for style, does it work? With just 4 images and the "wonder woman" prompt, you should get something like this: for style I usually use the normal "not plus" adapter. Also, the face adapter is really strong; it could be changing the style. Just looking at the face, even 0.3 seems too strong; the face should be more "animated", like this one: your result looks more real, more like if I set the scale to 1.0 and used a mask: to me it looks like the second adapter is overwriting the first one.
Thanks @asomoza. I also saw in InvokeAI's code there are fields
IMO that shouldn't matter, because they do it in the pipeline: they check the percentage, convert it to steps, and simply set the scale to 0.0 before and after. It's a nice feature to have, though, and it can easily be added later. I'm still struggling to understand the diffusers implementation; maybe I'm missing some of the diffusers coding practices, but the
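The percent-to-steps conversion described above can be sketched roughly like this (a hypothetical helper for illustration, not the InvokeAI or diffusers code):

```python
# Hedged sketch: convert a begin/end percentage range into discrete step
# indices, then build a per-step scale schedule that is 0.0 outside the
# active range, as described in the comment above.
def scale_schedule(num_steps, scale, begin_pct, end_pct):
    begin_step = int(round(begin_pct * num_steps))
    end_step = int(round(end_pct * num_steps))
    return [
        scale if begin_step <= step < end_step else 0.0
        for step in range(num_steps)
    ]
```

For example, `scale_schedule(10, 0.6, 0.2, 0.8)` keeps the adapter active only for steps 2 through 7 of a 10-step run.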
@asomoza
We did it that way to be consistent with the UNet design in the rest of our codebase. I think it makes less sense now with multi-image support. We are considering refactoring our UNet to separate these projection layers as well.
Oh I see, thanks for the response. I'm also currently implementing the multiple IP-Adapters myself; if I find something useful I'll let you know.
Super excited you're looking at adding these features. I've found chained IP-Adapters very useful in ComfyUI for generating new backgrounds for an existing photo subject. ComfyUI also has a Conditioning (Set Mask) node that allows you to set a drawing mask for the text prompt, as well as a Conditioning (Combine) node to bring foreground and background prompt/ControlNet conditioning together into a single positive prompt, which then gets passed into the sampler. I've been researching furiously but can't find the equivalent of that conditioning masking and combining capability in Diffusers. Are there equivalents in Diffusers?
@russmaschmeyer I did bring up masking for IP-Adapters, but as I understood from the comments, most people don't use it; they use diffusers as a fast and simple way of generation, so masking is delegated to community pipelines. This is somewhat related to this PR but mostly not, so I think it would be better to ask this question in the Discussions tab. IMO what you're looking for is a combination of regional prompting with masks for ControlNet and IP-Adapters, which is the best combination for image composition. I need to build this sooner or later, and maybe I'll have time to port it to a diffusers pipeline if no one has done it by then.
@yiyixuxu I'm done with my implementation and I got it working; this is the result:

Reading your code, I think your problem is in the init of the IP-Adapter attention processor. I use this:

```python
self.to_k_ip = nn.ModuleList(
    [nn.Linear(cross_attention_dim, hidden_size, bias=False) for _ in range(len(num_tokens))]
)
self.to_v_ip = nn.ModuleList(
    [nn.Linear(cross_attention_dim, hidden_size, bias=False) for _ in range(len(num_tokens))]
)
```
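For illustration, here is a self-contained toy of that per-adapter projection layout: one K/V linear pair per IP-Adapter, each applied to that adapter's own image embeddings (dimensions and variable names are made up for the example, not the actual diffusers processor):

```python
import torch
import torch.nn as nn

cross_attention_dim, hidden_size = 8, 16
num_tokens = [4, 4]  # two adapters, 4 image tokens each (assumed)

# one K projection and one V projection per adapter
to_k_ip = nn.ModuleList(
    [nn.Linear(cross_attention_dim, hidden_size, bias=False) for _ in range(len(num_tokens))]
)
to_v_ip = nn.ModuleList(
    [nn.Linear(cross_attention_dim, hidden_size, bias=False) for _ in range(len(num_tokens))]
)

# one embedding tensor per adapter: (batch, tokens, cross_attention_dim)
ip_hidden_states = [torch.randn(1, n, cross_attention_dim) for n in num_tokens]

# each adapter's embeddings go through its own projections
keys = [to_k(h) for to_k, h in zip(to_k_ip, ip_hidden_states)]
values = [to_v(h) for to_v, h in zip(to_v_ip, ip_hidden_states)]
```

The point of the `ModuleList` is that adapter `i` never shares projection weights with adapter `j`, which is why a single shared `to_k_ip`/`to_v_ip` would let one adapter overwrite the other.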
Co-authored-by: Alvaro Somoza <somoza.alvaro@gmail.com>
@asomoza
Glad to be of help! I've been doing tests with multiple combinations and the results are the same as ComfyUI's. Now I'll wait for the final refactor to see how much I need to deviate from the diffusers code (hopefully not that much). I still need to add the start/end % and masking options afterwards.
YES! That nails it. Will research regional prompting solutions. FWIW I have found IP Adapter attention masking to be VERY powerful, so I'm happy to add another voice to the "it would be amazing if that were supported in Diffusers!" vote. In ComfyUI I've found that regional prompting alone (with masks) doesn't get you there. You need both IP attention masking AND conditioning masking. Thanks for the tip @asomoza! |
```diff
@@ -763,28 +768,14 @@ def _convert_ip_adapter_image_proj_to_diffusers(self, state_dict):
         image_projection.load_state_dict(updated_state_dict)
         return image_projection

-    def _load_ip_adapter_weights(self, state_dict):
+    def _convert_ip_adapter_attn_to_diffusers(self, state_dicts):
```
100 percent the right choice!
```python
from ..models.attention_processor import (
    AttnProcessor,
    AttnProcessor2_0,
    IPAdapterAttnProcessor,
    IPAdapterAttnProcessor2_0,
)
```
I think we can move the imports to the top, no?
I was following the same pattern in this file: https://github.com/huggingface/diffusers/blob/87a92f779c5ba9c180aec4b90c38149eb108d888/src/diffusers/loaders/unet.py#L449
I thought it was to avoid a circular import or something, but I'm not really sure. I can look into it, maybe in a separate PR, to see if we can move all the imports to the top.
```python
if not isinstance(state_dicts, list):
    state_dicts = [state_dicts]
```
Is it to ensure BW compatibility? I don't see how `state_dicts` could not be a list.
yes it is
```python
# Set encoder_hid_proj after loading ip_adapter weights,
# because `IPAdapterPlusImageProjection` also has `attn_processors`.
self.encoder_hid_proj = None
```
Sorry I don't understand the comment fully. Could you elaborate?
I can't 😛 I assume it's a note from the contributor of IP-Adapter Plus.
```python
image_projection_layers = []
for state_dict in state_dicts:
    image_projection_layer = self._convert_ip_adapter_image_proj_to_diffusers(state_dict["image_proj"])
    image_projection_layer.to(device=self.device, dtype=self.dtype)
    image_projection_layers.append(image_projection_layer)
```
Very nice delegation. First convert to attention procs and then handle the rest of the stuff like projection with a dedicated method.
```python
residual = hidden_states

# separate ip_hidden_states from encoder_hidden_states
if encoder_hidden_states is not None:
    if isinstance(encoder_hidden_states, tuple):
```
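A minimal sketch of what that backward-compatible separation might look like (the helper name and tuple layout are assumed for illustration, not the actual diffusers code): older callers pass a plain tensor of text embeddings, while the multi-adapter path passes a tuple of text embeddings plus a list of per-adapter image embeddings.

```python
# Hedged sketch: unpack the (text_embeds, ip_embeds) tuple when present,
# otherwise fall back to the old single-tensor behavior.
def split_encoder_hidden_states(encoder_hidden_states):
    if isinstance(encoder_hidden_states, tuple):
        encoder_hidden_states, ip_hidden_states = encoder_hidden_states
    else:
        ip_hidden_states = None  # legacy callers: no IP-Adapter embeddings
    return encoder_hidden_states, ip_hidden_states
```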
BW compatibility?
Looking solid! I only left a handful of comments.
(Not merge-blocking.) Do we need to update all those pipelines in this PR? Maybe we could open it up to the community?
---------
Co-authored-by: yiyixuxu <yixu310@gmail.com>
Co-authored-by: Alvaro Somoza <somoza.alvaro@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Any plans to add FaceID? I'd love to use Face Plus with FaceID but can't with the community pipeline.
Also, is there a `from_single_file` equivalent for the `load_ip_adapter` function?
I don't think there's any need, as we directly load the original single-file checkpoint of IP-Adapter from the get-go.
@sayakpaul What if I have my own trained IP-Adapter? What if I want to use models from two different users?
If they follow the original IP-Adapter format (example), then
Yes, but only for one. You can't use multiple IP-Adapters from different sources with the current implementation.
Help us with a reproducible snippet in a new thread.
I was only thinking of using ip_adapter_face_plus with ip_adapter_face_id, which aren't in the same directory. It seems like it would be a pain to implement with the current design.
@alexblattner you should be able to use it like:
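A hedged sketch of what loading two adapters as lists looks like (the original snippet in this comment was not preserved; the checkpoint and weight names below are examples from the `h94/IP-Adapter` repository, and the base model is illustrative):

```python
import torch
from diffusers import AutoPipelineForText2Image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# load_ip_adapter accepts lists, one entry per adapter; entries may point
# at different repositories or subfolders
pipeline.load_ip_adapter(
    ["h94/IP-Adapter", "h94/IP-Adapter"],
    subfolder=["models", "models"],
    weight_name=[
        "ip-adapter-plus_sd15.safetensors",
        "ip-adapter-plus-face_sd15.safetensors",
    ],
)

# one scale per loaded adapter
pipeline.set_ip_adapter_scale([0.6, 0.5])
```

Whether a FaceID checkpoint can be mixed in this way depends on its format, which is what the follow-up comments below are discussing.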
@okaris it doesn't work with FaceID.
@alexblattner can you send sample code, please?
initial draft for multiple IP-Adapter support
for #6318
see the discussion thread here #6544
to-do
working now! thanks to @asomoza
testing multi-adapter and multi-image
testing script
image inputs
face image
style images
output
slow test
Testing the previous API with a single IP-Adapter and a single image input. The test script below generates identical results on main and the PR branch.
testing batch generation
The current implementation does not work with batches: it won't work with multiple prompts or `num_images_per_prompt > 1`. This is fixed in this PR.

Testing script