
[IP-Adapter] Support multiple IP-Adapters #6573

Merged
merged 40 commits into main on Jan 31, 2024

Conversation

@yiyixuxu (Collaborator) commented Jan 15, 2024

initial draft for multiple IP-Adapter support

for #6318
see the discussion thread here #6544

to-do

  • add multi-adapter support
  • add multi-image support
  • refactor
  • doc and tests

working now! thanks to @asomoza

testing multi-adapter and multi-image

testing script

# yiyi testing script for multi-ipadapter: face + style folder
import torch
from diffusers import AutoPipelineForText2Image, DDIMScheduler
from transformers import CLIPVisionModelWithProjection
from diffusers.utils import load_image

noise_scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
    steps_offset=1
)

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", 
    subfolder="models/image_encoder",
    torch_dtype=torch.float16,
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    scheduler=noise_scheduler,
    image_encoder=image_encoder,
)

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"])
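# one scale per loaded adapter, in load order: 0.7 for style ("plus"), 0.3 for face ("plus-face")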
pipeline.set_ip_adapter_scale([0.7, 0.3])

pipeline.enable_model_cpu_offload()

face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png")

style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy"
style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)]

generator = torch.Generator(device="cpu").manual_seed(0)

image = pipeline(
    prompt="wonderwoman",
    ip_adapter_image=[style_images, face_image],
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", 
    num_inference_steps=50, num_images_per_prompt=1,
    generator=generator,
).images[0]
image.save("yiyi_test_out.png")

image inputs

face image: women_input
style images: style_grid (grid of the 10 style_ziggy images)

output

yiyi_test_7_out

slow test

testing the previous API with a single IP-Adapter and a single image input. the script below generates identical results on the main and PR branches

from diffusers import AutoPipelineForText2Image
import torch
from diffusers.utils import load_image


pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
# ip-adapter

ip_adapter_weights = {
    "ip-adapter": "ip-adapter_sd15.bin",
    "ip-adapter-plus": "ip-adapter-plus_sd15.bin",
    "ip-adapter-full-face": "ip-adapter-full-face_sd15.bin",
}

for adapter_name, weight_name in ip_adapter_weights.items():
    pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name=weight_name)
    pipeline.set_ip_adapter_scale(0.6)
    image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/load_neg_embed.png")
    generator = torch.Generator(device="cpu").manual_seed(33)
    images = pipeline(
        prompt='best quality, high quality, wearing sunglasses',
        ip_adapter_image=image,
        negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", 
        num_inference_steps=50,
        num_images_per_prompt=1,
        generator=generator,
    ).images
    images[0].save(f"yiyi_test_2_out_{adapter_name}.png")
outputs (left to right): ip-adapter, ip-adapter-full-face, ip-adapter-plus
yiyi_test_2_out_ip-adapter (PR) | yiyi_test_2_out_ip-adapter-full-face (PR) | yiyi_test_2_out_ip-adapter-plus

testing batch generation

The implementation currently on main does not work with batches - it fails with multiple prompts or with num_images_per_prompt > 1. This is fixed in this PR.

testing script

# yiyi testing script for multi-ipadapter: face + style folder
import torch
from diffusers import AutoPipelineForText2Image, DDIMScheduler
from transformers import CLIPVisionModelWithProjection
from diffusers.utils import load_image, make_image_grid

noise_scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
    steps_offset=1
)

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", 
    subfolder="models/image_encoder",
    torch_dtype=torch.float16,
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    scheduler=noise_scheduler,
    image_encoder=image_encoder,
)

pipeline.load_ip_adapter(["h94/IP-Adapter"], subfolder=["sdxl_models"], weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"])
pipeline.set_ip_adapter_scale([0.7, 0.3])

pipeline.enable_model_cpu_offload()

face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png")

style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy"
style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)]

generator = torch.Generator(device="cpu").manual_seed(33)

images = pipeline(
    prompt=["batman", "wonderwoman"],
    ip_adapter_image=[style_images, face_image],
    negative_prompt=["monochrome, lowres, bad anatomy, worst quality, low quality"] * 2, 
    num_inference_steps=50, num_images_per_prompt=2,
    generator=generator,
).images

make_image_grid(images, rows=2, cols=2).save("yiyi_test_out.png")

output: yiyi_test_7_out (2×2 image grid)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@asomoza (Member) commented Jan 17, 2024

if you use multiple images for the style, does it work? with just 4 images and the "wonder woman" prompt, you should get something like this:

Screenshot 2024-01-17 080446

for style I usually use the normal ("not plus") adapter. Also, the face adapter is really strong, so it could be changing the style. Just looking at the face, it looks like even 0.3 is still too strong; the face should be more "animated", like this one:

ComfyUI_00145_

your result looks more realistic, more like what I get if I set the scale to 1.0 and use a mask:

ComfyUI_00151_

to me it looks like the second adapter is overwriting the first one.

@yiyixuxu (Collaborator, Author)

thanks @asomoza
super valuable insights! yeah my results are off, either I missed something or had a bug somewhere. Going to dig into it more.

I also saw in InvokeAI's code there are fields begin_step_percent and end_step_percent - I did not implement that feature. Do you think it could be that?

@asomoza (Member) commented Jan 17, 2024

IMO that shouldn't matter, because they do it in the pipeline: they check the percentage, convert it to steps, and simply set the scale to 0.0 before and after the window. It's a nice feature to have, though, and it can easily be added later.
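
For illustration, a minimal sketch of that approach on top of the testing script above (assuming a diffusers version with the callback_on_step_end hook; the 0.25-0.75 window and the 0.6 scale are arbitrary example values, not InvokeAI's defaults):

# hypothetical begin/end window, expressed as fractions of the schedule
begin_step_percent, end_step_percent = 0.25, 0.75

def toggle_ip_adapter_scale(pipe, step_index, timestep, callback_kwargs):
    # zero out the IP-Adapter influence outside the chosen window
    progress = step_index / pipe.num_timesteps
    in_window = begin_step_percent <= progress <= end_step_percent
    pipe.set_ip_adapter_scale(0.6 if in_window else 0.0)
    return callback_kwargs

image = pipeline(
    prompt="wonderwoman",
    ip_adapter_image=face_image,
    callback_on_step_end=toggle_ip_adapter_scale,
).images[0]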

I'm still struggling to understand the diffusers implementation - maybe I'm missing some diffusers coding practices, but shouldn't the image_embeds always be passed in cross_attention_kwargs, as you're doing in this PR? It would be even better to take the ImageProjection out of the unet forward; at least, I don't see the need to run it on every step when it can be done once with the image encoding. I'm not 100% sure, since I don't use the default diffusers code.
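
As a rough sketch of the "do it once with the image encoding" idea (illustrative only - this is not the code path the PR takes; it reuses image_encoder and face_image from the testing script above and assumes CLIPImageProcessor defaults):

import torch
from transformers import CLIPImageProcessor

feature_extractor = CLIPImageProcessor()
pixel_values = feature_extractor(images=face_image, return_tensors="pt").pixel_values

# run the image encoder (and, in principle, the projection layers) once up
# front, then reuse the embeddings at every denoising step instead of
# recomputing them inside the unet forward
with torch.no_grad():
    image_embeds = image_encoder(
        pixel_values.to(image_encoder.device, dtype=image_encoder.dtype)
    ).image_embeds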

@yiyixuxu (Collaborator, Author)

@asomoza
we are going to refactor the design once we get correct results :)

It would be even better to take the ImageProjection out of the unet forward; at least, I don't see the need to run it on every step when it can be done once with the image encoding. I'm not 100% sure, since I don't use the default diffusers code.

We did it that way to be consistent with the unet design in the rest of our codebase. I think it makes less sense now with multi-image support; we are considering refactoring our unet to separate these projection layers as well.

@asomoza (Member) commented Jan 17, 2024

oh I see, thanks for the response. I'm also currently implementing multiple IP-Adapters myself; if I find something useful I'll let you know.

@russmaschmeyer

Super excited you're looking at adding these features. I've found chained IP Adapters very useful in ComfyUI for generating new backgrounds for an existing photo subject. ComfyUI also has a Conditioning (Set Mask) node that allows you to set a drawing mask for the text prompt as well as a Conditioning (Combine) node to bring foreground and background prompt/controlnet conditioning together into a single positive prompt which then gets passed into the sampler.

Been researching furiously but can't find the equivalent of that conditioning masking and combining capability in Diffusers. Does an equivalent exist?

@asomoza (Member) commented Jan 18, 2024

@russmaschmeyer I did bring up masking for IP-Adapters, but as I understood from the comments, most people don't use it - they use diffusers as a fast and simple way of generating, so masking is delegated to community pipelines. This is somewhat related to this PR but mostly not, so I think it would be better to ask this question in the discussions tab.

IMO what you're looking for is a combination of regional prompting with masks for ControlNet and IP-Adapters, which is the best combination for image composition. I need to build this sooner or later anyway, and maybe I'll have time to port it to a diffusers pipeline if no one has done it by then.

@asomoza (Member) commented Jan 18, 2024

@yiyixuxu I'm done with my implementation and got it working; this is the result:

face + style (11 images): 20240118042242 | face + style PLUS (11 images): 20240118042625

Reading your code, I think your problem is in the init of the IP-Adapter attention processor. I use this:

self.to_k_ip = nn.ModuleList([nn.Linear(cross_attention_dim, hidden_size, bias=False) for _ in range(len(num_tokens))])
self.to_v_ip = nn.ModuleList([nn.Linear(cross_attention_dim, hidden_size, bias=False) for _ in range(len(num_tokens))])
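
For context, a minimal sketch of where those lines would sit - a hypothetical processor class, assuming num_tokens carries one entry per loaded adapter, as the PR does:

import torch.nn as nn

class IPAdapterAttnProcessorSketch(nn.Module):
    # one K/V projection pair per loaded IP-Adapter, indexed in load order
    def __init__(self, hidden_size, cross_attention_dim, num_tokens=(4,), scale=1.0):
        super().__init__()
        num_adapters = len(num_tokens)
        self.num_tokens = num_tokens
        # accept either one shared scale or one scale per adapter
        self.scale = scale if isinstance(scale, list) else [scale] * num_adapters
        self.to_k_ip = nn.ModuleList(
            [nn.Linear(cross_attention_dim, hidden_size, bias=False) for _ in range(num_adapters)]
        )
        self.to_v_ip = nn.ModuleList(
            [nn.Linear(cross_attention_dim, hidden_size, bias=False) for _ in range(num_adapters)]
        )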

@yiyixuxu (Collaborator, Author)

@asomoza
thank you! you're right!! I had a bug there. Thanks so much for looking into this :) ❤️

@asomoza (Member) commented Jan 18, 2024

Glad to be of help. I've been running tests with multiple combinations and the results are the same as ComfyUI's. Now I'll wait for the final refactor to see how much I need to deviate from the diffusers code (hopefully not much). I still need to add the start/end percentages and the masking options afterwards.

@russmaschmeyer

IMO what you're looking for is a combination of regional prompting with masks for ControlNet and IP-Adapters, which is the best combination for image composition. I need to build this sooner or later anyway, and maybe I'll have time to port it to a diffusers pipeline if no one has done it by then.

YES! That nails it. I'll research regional prompting solutions. FWIW, I have found IP-Adapter attention masking to be VERY powerful, so I'm happy to add another voice to the "it would be amazing if that were supported in Diffusers!" vote. In ComfyUI I've found that regional prompting alone (with masks) doesn't get you there - you need both IP attention masking AND conditioning masking.

Thanks for the tip @asomoza!

@@ -763,28 +768,14 @@ def _convert_ip_adapter_image_proj_to_diffusers(self, state_dict):
        image_projection.load_state_dict(updated_state_dict)
        return image_projection

-    def _load_ip_adapter_weights(self, state_dict):
+    def _convert_ip_adapter_attn_to_diffusers(self, state_dicts):
Member:

100 percent the right choice!

Comment on lines 772 to 777
from ..models.attention_processor import (
    AttnProcessor,
    AttnProcessor2_0,
    IPAdapterAttnProcessor,
    IPAdapterAttnProcessor2_0,
)
Member:

I think we can move the imports to the top, no?

Collaborator Author:

I was following the pattern in this file: https://github.com/huggingface/diffusers/blob/87a92f779c5ba9c180aec4b90c38149eb108d888/src/diffusers/loaders/unet.py#L449

I thought it was to avoid a circular import or something, but I'm not really sure. I can look into it, maybe in a separate PR, to see if we can move all the imports to the top.

Comment on lines +832 to +833
if not isinstance(state_dicts, list):
    state_dicts = [state_dicts]
Member:

Is it to ensure BW compatibility? I don't see how state_dicts could not be a list.

Collaborator Author:

yes, it is - a plain (non-list) state dict still works for the old single-adapter path

Comment on lines +834 to +836
# Set encoder_hid_proj after loading ip_adapter weights,
# because `IPAdapterPlusImageProjection` also has `attn_processors`.
self.encoder_hid_proj = None
Member:

Sorry I don't understand the comment fully. Could you elaborate?

Collaborator Author:

I can't 😛 I assume it's a note from the contributor of IP-Adapter Plus

Comment on lines +842 to +846
image_projection_layers = []
for state_dict in state_dicts:
    image_projection_layer = self._convert_ip_adapter_image_proj_to_diffusers(state_dict["image_proj"])
    image_projection_layer.to(device=self.device, dtype=self.dtype)
    image_projection_layers.append(image_projection_layer)
Member:

Very nice delegation. First convert to attention processors, then handle the rest, like the projection, with a dedicated method.

residual = hidden_states

# separate ip_hidden_states from encoder_hidden_states
if encoder_hidden_states is not None:
    if isinstance(encoder_hidden_states, tuple):
Member:

BW compatibility?

@sayakpaul (Member) left a comment

Looking solid! I only left a handful of comments.

(not merge-blocking): Do we need to update all those pipelines in this PR? Maybe we could open it up for the community?

@yiyixuxu yiyixuxu merged commit 2e8d18e into main Jan 31, 2024
16 checks passed
@yiyixuxu yiyixuxu deleted the multi-ipadapter branch February 1, 2024 16:57
@okotaku mentioned this pull request Feb 2, 2024
dg845 pushed a commit to dg845/diffusers that referenced this pull request Feb 2, 2024
@alexblattner

any plans to add face id? I'd love to use face plus with face id but can't with the community pipeline

@alexblattner

also, is there a from_single_file equivalent for the load_ip_adapter function?

@sayakpaul (Member)

also, is there a from_single_file equivalent for the load_ip_adapter function?

I don't think there's any need as we directly load the original single-file checkpoint of IP Adapter from the get-go.

@alexblattner

@sayakpaul what if I have my own trained ip adapter? What if I want to use models from 2 different users?

@sayakpaul
Copy link
Member

If they follow the original IP-Adapter format (example), then the load_ip_adapter() method already works.
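
For instance (hypothetical repo and file names - any checkpoint that keeps the original "image_proj" + "ip_adapter" layout loads the same way):

pipeline.load_ip_adapter(
    "your-account/my-ip-adapter",  # hypothetical custom repo
    subfolder="models",
    weight_name="my_ip_adapter.safetensors",
)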

@alexblattner

yes, but only for one. You can't use multiple IP-Adapters from different sources with the current implementation

@sayakpaul (Member)

Help us with a reproducible snippet in a new thread.

@alexblattner

I was only thinking of using ip_adapter_face_plus with ip_adapter_face_id, which aren't in the same directory. It seems like it would be a pain to implement with the current design.

@okaris commented Apr 24, 2024

@alexblattner you should be able to use it like:

pipeline.load_ip_adapter(
    ["your-hf-account/IP-Adapter-1", "other-hf-account/IP-Adapter-2"],
    subfolder=["sdxl_models", "sdxl_models"],
    weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"],
)
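
With both adapters loaded, set_ip_adapter_scale then takes one value per adapter, e.g. pipeline.set_ip_adapter_scale([0.7, 0.3]).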

AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
@alexblattner

@okaris doesn't work with faceid

@okaris commented May 1, 2024

@alexblattner can you send a sample code please?
