[Community Pipeline] UnCLIP image / text interpolations #1869
Comments
Hi @patrickvonplaten and @williamberman, are you working on this or can I pick this up? |
@Abhinay1997 feel free to pick it up! We're more than happy to help if needed |
This makes sense at the top level. But just so that my understanding is correct: we want a community pipeline that can interpolate between prompts/images, like the StableDiffusionInterpolation pipeline, but using the UnCLIPPipeline (a.k.a. DALL-E 2). The interpolation also makes sense: we generate the embeddings for the two prompts, say p1 and p2. They would correspond to x_0 and x_N of the interpolation sequence, and using slerp I would interpolate between them for N outputs in total for a pair of prompts/images. |
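For concreteness, here is a rough sketch of the slerp step described above (a generic helper for illustration, not the exact function the community pipeline ends up using):

```python
import torch

def slerp(v0: torch.Tensor, v1: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between two 1-d embedding vectors."""
    v0_unit = v0 / v0.norm()
    v1_unit = v1 / v1.norm()
    dot = torch.clamp(torch.dot(v0_unit, v1_unit), -1.0 + eps, 1.0 - eps)
    theta = torch.acos(dot)  # angle between the two embeddings
    return (torch.sin((1.0 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

# x_0 and x_N are the embeddings of p1 and p2; the intermediate points are
# slerp(x_0, x_N, i / N) for i = 1 .. N - 1.
```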
That's exactly right @Abhinay1997 :-) |
@patrickvonplaten sorry it took so long. The UnCLIPTextInterpolation pipeline is actually more straightforward, imo. For the UnCLIPImageInterpolation pipeline, would we not need the CLIP model that was used for training the UnCLIPPipeline? I see the CLIP model in the original codebase but not in the Hugging Face Hub model. I am working on the text interpolation right now. ETA: 24th Jan. |
cc @williamberman here |
Thanks for your work @Abhinay1997! We do still use CLIP for the unCLIP pipeline(s); note that we just import it from transformers. In the text-to-image pipeline, we use the text encoder for encoding the prompt: diffusers/src/diffusers/pipelines/unclip/pipeline_unclip.py, Lines 70 to 71 in ac3fc64
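Roughly, that prompt-encoding step looks like this with the CLIP classes from transformers (a sketch only; the checkpoint ID here is just an example, not the one bundled with the unCLIP weights):

```python
from transformers import CLIPTextModelWithProjection, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of an astronaut riding a horse"], padding="max_length", return_tensors="pt")
outputs = text_encoder(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)

text_embeds = outputs.text_embeds              # pooled, projected prompt embedding
last_hidden_state = outputs.last_hidden_state  # per-token hidden states
```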
In the image variation pipeline, we use the image encoder for encoding the input image and the text encoder for encoding an empty prompt. Note that we also optionally allow passing image embeddings directly to the pipeline to skip encoding an input image: diffusers/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py, Lines 75 to 78 in ac3fc64
The image interpolation pipeline should work similarly to the image variation pipeline in that it should be passed either the two sets of images (which would be encoded via CLIP) or the two sets of pre-encoded latents. Note that, similarly to the image variation pipeline, we would have to encode the empty text prompt for the image interpolation pipeline. LMK if that makes sense! |
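A rough sketch of that branching (helper and variable names are illustrative, not the actual pipeline code):

```python
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Example checkpoint; the actual unCLIP repos ship their own image encoder weights.
feature_extractor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
image_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def encode_image_or_passthrough(image=None, image_embeddings=None):
    # If pre-computed CLIP image embeddings are passed, skip the encoder entirely.
    if image_embeddings is not None:
        return image_embeddings
    pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
    return image_encoder(pixel_values).image_embeds
```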
Hi @williamberman, thank you for the details. It makes sense. I was under the impression that I needed to use the actual CLIP checkpoint that the UnCLIP model learns to invert its decoder over, so I got confused. Will share the work-in-progress notebooks soon. |
Hi @williamberman, can you review the UnCLIPTextInterpolation notebook when you have time? Questions:- |
Great work so far @Abhinay1997!
That's a good point, and I don't know off the top of my head or from googling around. I would recommend for now just using the mask of the longer prompt. cc @patrickvonplaten
We actually want all pipelines to be completely independent so please do not inherit from the UnCLIPPipeline :) |
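If it helps, the "use the longer mask" suggestion amounts to an elementwise max over the two prompts' attention masks (mask_1/mask_2 are placeholder names):

```python
import torch

mask_1 = torch.tensor([1, 1, 1, 0, 0])       # attention mask of the shorter prompt
mask_2 = torch.tensor([1, 1, 1, 1, 0])       # attention mask of the longer prompt
interp_mask = torch.maximum(mask_1, mask_2)  # keep every position either prompt attends to
```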
@williamberman, just to clarify, can I still import UnCLIPPipeline inside the methods and use it for generation? |
Nope! We want all pipelines to be as self-contained as possible. If any methods are exactly the same, we have the `# Copied from` mechanism. We've articulated some of our rationale in the philosophy doc: https://github.com/huggingface/diffusers/blob/main/docs/source/en/conceptual/philosophy.mdx#pipelines |
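For illustration, the convention looks roughly like this (the method path and signature below are only an example of the comment format, not necessarily what the interpolation pipeline needs to copy):

```python
# Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline._encode_prompt
def _encode_prompt(self, prompt, device, num_images_per_prompt, do_classifier_free_guidance):
    ...  # body kept identical to the original pipeline's method
```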
Thanks for the clarification @williamberman. Will update and make the PR for it soon. |
@williamberman @patrickvonplaten Please find the PR for UnCLIPTextInterpolation: #2257. Also, what about the interpolation attention_masks? Any other thoughts on it? Using max for now, as suggested. P.S.: Planning to complete UnCLIPImageInterpolation this week |
With the text interpolation pipeline merged, the image interpolation pipeline is still up for grabs! |
@williamberman I was planning to start on image interpolation too. Would that be okay? |
Yes please! You're on a roll :) |
UnCLIP Text Interpolation Space: https://huggingface.co/spaces/NagaSaiAbhinay/unclip_text_interpolation_demo |
So, I was re-reading the DALL-E 2 paper and found that their text interpolation is a little more involved: they interpolate on image embeddings using a normalised difference of the text embeddings of the two prompts. This produces much better results than my implementation. I'll update the text interpolation pipeline once the image interpolation is done. |
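As a sketch of that idea (following the paper's text-diff description as read here; the helper names are made up):

```python
import torch

def slerp(v0, v1, t, eps=1e-7):
    # Spherical interpolation, as used elsewhere in this thread.
    v0_u, v1_u = v0 / v0.norm(), v1 / v1.norm()
    theta = torch.acos(torch.clamp(torch.dot(v0_u, v1_u), -1.0 + eps, 1.0 - eps))
    return (torch.sin((1 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

def text_diff_interpolation(z_img, z_txt_start, z_txt_target, theta):
    # Rotate the *image* embedding toward the normalised difference of the two
    # text embeddings, instead of interpolating the text embeddings directly.
    z_diff = z_txt_target - z_txt_start
    z_diff = z_diff / z_diff.norm()
    return slerp(z_img, z_diff, theta)
```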
Image Interpolation is looking good. I'm getting results in line with Dall-e 2. Notebook: https://colab.research.google.com/drive/1eN-oy3N6amFT48hhxvv02Ad5798FDvHd?usp=sharing Will open a PR tomorrow. |
Wow that's super cool 🔥 |
Very much looking forward to the PR! Let's maybe also try to make a cool Space about this @apolinario @AK391 @osanseviero |
Hi @Abhinay1997, awesome work on the community pipeline! I opened a request for a community Space in our Discord for the community pipeline: https://discord.com/channels/879548962464493619/1075849794519572490/1075849794519572490. You can join here: https://discord.gg/pEYnj5ZW, check out the event by taking the role #collaborate, and write under one of the paper posts in the #making-demos forum |
Opened the PR for UnCLIPImageInterpolation: #2400 |
While #2400 is under review, I wanted to share the basic outline for the UnCLIP text diff flow: |
UnCLIP Image Interpolation demo space is up and running at https://huggingface.co/spaces/NagaSaiAbhinay/UnCLIP_Image_Interpolation_Demo. Do check it out! |
Very cool Space 🔥 |
Super cool space @Abhinay1997 - shared it on Reddit as well :-) |
Thanks @patrickvonplaten, @osanseviero ! |
Can we close this issue now? |
There is the unCLIP text diff that is part of the interpolations, @sayakpaul. Let me work on the PR for this. We can close it this week 🙂 |
I didn't follow it, sorry. We can do it one by one if you want to since you're already tackling #2195. Totally fine with that. |
I meant there is one more interpolation that we can work on, and the flow is all planned out, so there should be no surprises and we can close it this week. As for #2195, I fixed the issue; see my comments there. I need to update the branch to resolve conflicts with main. |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Team, the pipeline is in the works. Will make the PR this week :) |
Hi, thanks for this very interesting work! @Abhinay1997 @patrickvonplaten Are UnCLIPImageInterpolation and UnCLIPTextInterpolationPipeline directly available from diffusers? Also, has the text diff method been implemented? |
@tikitong Thanks for the interest in this! You can use them like this: |
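For reference, a minimal sketch of loading the two community pipelines via the custom_pipeline argument (the pipeline names and checkpoint IDs here are assumptions, so double-check them against the community pipeline docs):

```python
import torch
from diffusers import DiffusionPipeline

# Text interpolation (community pipeline name and checkpoint are assumptions).
text_pipe = DiffusionPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha",
    custom_pipeline="unclip_text_interpolation",
    torch_dtype=torch.float16,
)

# Image interpolation (likewise an assumption).
image_pipe = DiffusionPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha-image-variations",
    custom_pipeline="unclip_image_interpolation",
    torch_dtype=torch.float16,
)
```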
There are Spaces for both UnCLIPTextInterpolation and UnCLIPImageInterpolation: https://huggingface.co/spaces/NagaSaiAbhinay/UnCLIP_Image_Interpolation_Demo. I'm working on something else, so I didn't have time for the text diff, but I'll update here once I'm done. Hopefully this weekend :) |
@Abhinay1997 thank you very much for your quick answer! thanks also for the two commands with the custom_pipeline argument. I'm looking foward for the text diff ! If I understand correctly, the DDIM inversion can be done just before this line right ?
by doing alpha_prod_t, alpha_prod_t_prev = alpha_prod_t_prev, alpha_prod_t
In the two notebooks you put (UnCLIPTextInterpolation and UnCLIPImageInterpolation) it could be done just before the calculation of or in fact by doing |
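For reference, a minimal sketch of the swap being described, applied to the standard deterministic DDIM update (eta = 0); this is illustrative only, not the scheduler's actual code:

```python
def ddim_inversion_step(model_output, sample, alpha_prod_t, alpha_prod_t_prev):
    # Swapping the two alpha products makes the deterministic DDIM update move
    # from the less noisy timestep toward the noisier one, i.e. it inverts the step.
    alpha_prod_t, alpha_prod_t_prev = alpha_prod_t_prev, alpha_prod_t
    # Standard deterministic DDIM update (eta = 0) after the swap.
    pred_x0 = (sample - (1 - alpha_prod_t) ** 0.5 * model_output) / alpha_prod_t ** 0.5
    return alpha_prod_t_prev ** 0.5 * pred_x0 + (1 - alpha_prod_t_prev) ** 0.5 * model_output
```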
Hugging Face has a DDIM inversion scheduler, so I was actually planning to use it similarly to how the SDPix2PixZero pipeline does DDIM inversion to get the inverted latents (the original noise). But I'll have to look into it in more detail. You can take a look at a pipeline that uses DDIM inversion here. Maybe I'll be able to better answer your question once I get down into the details. @patrickvonplaten, what do you think? |
Sorry, what exactly is the question here? 😅 I think the community pipelines work very well, no? |
@Abhinay1997 I looked for the DDIM inversion scheduler: DDIMInverseScheduler. Yes, it should work with it. |
@Abhinay1997 thanks to your link and to this one, I was able to get closer to the solution. I would be very interested to have your opinion on my results! The results below are for the following features: I set the interp_value as follows, and I calculated z_txt_diff = norm_diff(z_txt_start, z_txt_target). Starting from left to right, the first image is directly z_img_0; the second is the reconstruction of z_img_0 with xT fixed, called noise. As you can see, it seems to work, but the adult lion never turns into a real lion cub, even at 0.65 (in the DALL-E 2 paper it is 0.5). I have several questions about the results. Secondly, is it normal that the generated image of z_img_0 (with random noise) is different from the second one (with fixed xT)? I expected to get exactly the same image. My parameters are: num_steps = 10. For the generation of each interp_value, as for the generation of z_img_0 and noise, I left the timesteps at 1000; I get [901 801 701 601 501 401 301 201 101 1]. For the xT generation I set the timesteps to 50 and invert it; I get [1]. With that, I noticed that the results vary much more. |
About the timesteps: if you're using the DDIMInverseScheduler, they would go from 0 to 1000, I'd assume. See here, there is no reversal of the timesteps. Can you confirm you're using spherical interpolation to calculate the interpolation? As for reproducing the original image from the inverted noise, the paper mentions it. If there's a colab notebook, I'd be happy to have a look. :) |
@Abhinay1997 thank you for these details, and sorry for the delay; I have to finish another piece of work. |
@tikitong sorry for the late reply. I tried it out and am getting similar results. My implementation fails to reconstruct the original image too, i.e. no difference between using random noise and the original noise. I'm trying to figure out where the issue lies. Will update here if I make any progress :) |
Model/Pipeline/Scheduler description
Copied from #1858:
I think we could create a super cool community pipeline. The pipeline could automatically create interpolations between two text prompts, and similarly we could create one to do interpolations between two images.
In terms of design, to stay as efficient as possible the following would make sense:
- The user can pass a `num_interpolations` input.
- `num_interpolations` interpolated embeddings x_1, x_2, ... x_N-1 are created using the `slerp` function.
- This gives `num_interpolations` + 2 text embeddings that should be passed in a batch through the model to create a nice interpolation of images (see the sketch below).
- The pipeline should make use of `enable_cpu_offload()` to save memory.
It's probably easier to start with the `UnCLIPImageInterpolationPipeline`, since image embeddings are just a single 1-d vector, whereas for text embeddings two latent vectors are used.
Would be more than happy to help if someone is interested in giving this a try - think it'll make for some super cool demos.
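As a sketch of the batching idea above (a hypothetical helper; interp_fn would typically be the slerp function sketched earlier in this thread):

```python
import torch

def build_interpolation_batch(emb_start, emb_end, num_interpolations, interp_fn):
    """Stack x_0, the num_interpolations intermediate embeddings, and x_N into one
    batch so a single forward pass produces the whole interpolation sequence."""
    ts = torch.linspace(0.0, 1.0, num_interpolations + 2)
    return torch.stack([interp_fn(emb_start, emb_end, float(t)) for t in ts])
```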