[Community Pipeline] UnCLIP image / text interpolations #1869
Comments
Hi @patrickvonplaten and @williamberman, are you working on this or can I pick this up? |
@Abhinay1997 feel free to pick it up! We're more than happy to help if needed |
This makes sense at the top level. But just so that my understanding is correct: we want a community pipeline that can interpolate between prompts/images, like the StableDiffusionInterpolation pipeline, but using the UnCLIPPipeline (a.k.a. DALL-E 2). The interpolation also makes sense: we generate the embeddings for the two prompts, say p1 and p2. They would correspond to x_0 and x_N of the interpolation sequence, and using slerp I would interpolate between them for N outputs in total for a pair of prompts/images. |
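For concreteness, here is a rough sketch of the slerp step described above (a generic helper for illustration, not the exact function the community pipeline ends up using):

```python
import torch

def slerp(v0: torch.Tensor, v1: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between two 1-d embedding vectors."""
    v0_unit = v0 / v0.norm()
    v1_unit = v1 / v1.norm()
    dot = torch.clamp(torch.dot(v0_unit, v1_unit), -1.0 + eps, 1.0 - eps)
    theta = torch.acos(dot)  # angle between the two embeddings
    return (torch.sin((1.0 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

# x_0 and x_N are the embeddings of p1 and p2; the intermediate points are
# slerp(x_0, x_N, i / N) for i = 1 .. N - 1.
```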
That's exactly right @Abhinay1997 :-) |
@patrickvonplaten sorry it took so long. The UnCLIPTextInterpolation pipeline is actually more straightforward, imo. For the UnCLIPImageInterpolation pipeline, would we not need the CLIP model that was used for training the UnCLIPPipeline? I see the CLIP model in the original codebase but not in the Hugging Face Hub model. I am working on the text interpolation right now. ETA: 24th Jan. |
cc @williamberman here |
Thanks for your work @Abhinay1997! We do still use CLIP for the unCLIP pipeline(s); note that we just import it from transformers. In the text-to-image pipeline, we use the text encoder for encoding the prompt: diffusers/src/diffusers/pipelines/unclip/pipeline_unclip.py, Lines 70 to 71 in ac3fc64
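Roughly, that prompt-encoding step looks like this with the CLIP classes from transformers (a sketch only; the checkpoint ID here is just an example, not the one bundled with the unCLIP weights):

```python
from transformers import CLIPTextModelWithProjection, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of an astronaut riding a horse"], padding="max_length", return_tensors="pt")
outputs = text_encoder(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)

text_embeds = outputs.text_embeds              # pooled, projected prompt embedding
last_hidden_state = outputs.last_hidden_state  # per-token hidden states
```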
In the image variation pipeline, we use the image encoder for encoding the input image and the text encoder for encoding an empty prompt. Note that we also optionally allow passing image embeddings directly to the pipeline to skip encoding an input image: diffusers/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py, Lines 75 to 78 in ac3fc64
The image interpolation pipeline should work similarly to the image variation pipeline in that it should be passed either the two sets of images (which would be encoded via CLIP) or the two sets of pre-encoded latents. Note that, similarly to the image variation pipeline, we would have to encode the empty text prompt for the image interpolation pipeline. LMK if that makes sense! |
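A rough sketch of that branching (helper and variable names are illustrative, not the actual pipeline code):

```python
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Example checkpoint; the actual unCLIP repos ship their own image encoder weights.
feature_extractor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
image_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def encode_image_or_passthrough(image=None, image_embeddings=None):
    # If pre-computed CLIP image embeddings are passed, skip the encoder entirely.
    if image_embeddings is not None:
        return image_embeddings
    pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
    return image_encoder(pixel_values).image_embeds
```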
Hi @williamberman, thank you for the details. It makes sense. I was under the impression that I needed to use the actual CLIP checkpoint that the UnCLIP model learns to invert its decoder over, so I got confused. Will share the work-in-progress notebooks soon. |
Hi @williamberman, can you review the UnCLIPTextInterpolation notebook when you have time? Questions:- |
Great work so far @Abhinay1997!
That's a good point, and I don't know off the top of my head or from googling around. I would recommend for now just using the mask of the longer prompt. cc @patrickvonplaten
We actually want all pipelines to be completely independent so please do not inherit from the UnCLIPPipeline :) |
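If it helps, the "use the longer mask" suggestion amounts to an elementwise max over the two prompts' attention masks (mask_1/mask_2 are placeholder names):

```python
import torch

mask_1 = torch.tensor([1, 1, 1, 0, 0])       # attention mask of the shorter prompt
mask_2 = torch.tensor([1, 1, 1, 1, 0])       # attention mask of the longer prompt
interp_mask = torch.maximum(mask_1, mask_2)  # keep every position either prompt attends to
```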
@williamberman, just to clarify, can I still import UnCLIPPipeline inside the methods and use it for generation? |
Nope! We want all pipelines to be as self-contained as possible. If any methods are exactly the same, we have the `# Copied from` mechanism. We've articulated some of our rationale in the philosophy doc: https://github.com/huggingface/diffusers/blob/main/docs/source/en/conceptual/philosophy.mdx#pipelines |
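For illustration, the convention looks roughly like this (the method path and signature below are only an example of the comment format, not necessarily what the interpolation pipeline needs to copy):

```python
# Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline._encode_prompt
def _encode_prompt(self, prompt, device, num_images_per_prompt, do_classifier_free_guidance):
    ...  # body kept identical to the original pipeline's method
```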
Thanks for the clarification @williamberman. Will update and make the PR for it soon. |
@williamberman @patrickvonplaten Please find the PR for UnCLIPTextInterpolation: #2257. Also, what about the interpolation attention_masks? Any other thoughts on it? Using max for now, as suggested. P.S.: Planning to complete UnCLIPImageInterpolation this week |
With the text interpolation pipeline merged, the image interpolation pipeline is still up for grabs! |
@williamberman I was planning to start on image interpolation too. Would that be okay? |
Yes please! You're on a roll :) |
UnCLIP Text Interpolation Space: https://huggingface.co/spaces/NagaSaiAbhinay/unclip_text_interpolation_demo |
So, I was re-reading the DALL-E 2 paper and found that their text interpolation is a little more involved: they interpolate on image embeddings using a normalised difference of the text embeddings of the two prompts. This produces much better results than my implementation. I'll update the text interpolation pipeline once the image interpolation is done. |
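As a sketch of that idea (following the paper's text-diff description as read here; the helper names are made up):

```python
import torch

def slerp(v0, v1, t, eps=1e-7):
    # Spherical interpolation, as used elsewhere in this thread.
    v0_u, v1_u = v0 / v0.norm(), v1 / v1.norm()
    theta = torch.acos(torch.clamp(torch.dot(v0_u, v1_u), -1.0 + eps, 1.0 - eps))
    return (torch.sin((1 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

def text_diff_interpolation(z_img, z_txt_start, z_txt_target, theta):
    # Rotate the *image* embedding toward the normalised difference of the two
    # text embeddings, instead of interpolating the text embeddings directly.
    z_diff = z_txt_target - z_txt_start
    z_diff = z_diff / z_diff.norm()
    return slerp(z_img, z_diff, theta)
```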
Image Interpolation is looking good. I'm getting results in line with Dall-e 2. Notebook: https://colab.research.google.com/drive/1eN-oy3N6amFT48hhxvv02Ad5798FDvHd?usp=sharing Will open a PR tomorrow. |
Wow that's super cool 🔥 |
Very much looking forward to the PR! Let's maybe also try to make a cool Space about this @apolinario @AK391 @osanseviero |
Hi @Abhinay1997, awesome work on the community pipeline! I opened a request for a community Space in our Discord for the community pipeline: https://discord.com/channels/879548962464493619/1075849794519572490/1075849794519572490. You can join here: https://discord.gg/pEYnj5ZW, check out the event by taking the role #collaborate, and write under one of the paper posts in the #making-demos forum |
Opened the PR for UnCLIPImageInterpolation: #2400 |
While #2400 is under review, I wanted to share the basic outline for the UnCLIP text diff flow: |
UnCLIP Image Interpolation demo space is up and running at https://huggingface.co/spaces/NagaSaiAbhinay/UnCLIP_Image_Interpolation_Demo. Do check it out! |
Very cool Space 🔥 |
Super cool space @Abhinay1997 - shared it on Reddit as well :-) |
Thanks @patrickvonplaten, @osanseviero ! |
Can we close this issue now? |
There is the unCLIP text diff that is part of the interpolations, @sayakpaul. Let me work on the PR for this. We can close it this week 🙂 |
I didn't follow it, sorry. We can do it one by one if you want to since you're already tackling #2195. Totally fine with that. |
I meant there is one more interpolation that we can work on, and the flow is all planned out, so there should be no surprises and we can close it this week. As for #2195, I fixed the issue; see my comments there. I need to update the branch to resolve conflicts with main. |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Team, the pipeline is in the works. Will make the PR this week :) |
Hi, thanks for this very interesting work! @Abhinay1997 @patrickvonplaten Are UnCLIPImageInterpolation and UnCLIPTextInterpolationPipeline directly available from diffusers? Also, has the text diff method been implemented? |
@tikitong Thanks for the interest in this! You can use them like this: |
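For reference, a minimal sketch of loading the two community pipelines via the custom_pipeline argument (the pipeline names and checkpoint IDs here are assumptions, so double-check them against the community pipeline docs):

```python
import torch
from diffusers import DiffusionPipeline

# Text interpolation (community pipeline name and checkpoint are assumptions).
text_pipe = DiffusionPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha",
    custom_pipeline="unclip_text_interpolation",
    torch_dtype=torch.float16,
)

# Image interpolation (likewise an assumption).
image_pipe = DiffusionPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha-image-variations",
    custom_pipeline="unclip_image_interpolation",
    torch_dtype=torch.float16,
)
```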
There are Spaces for both UnCLIPTextInterpolation and UnCLIPImageInterpolation: https://huggingface.co/spaces/NagaSaiAbhinay/UnCLIP_Image_Interpolation_Demo. I'm working on something else, so I didn't have time for the text diff, but I'll update here once I'm done. Hopefully this weekend :) |
@Abhinay1997 thank you very much for your quick answer! thanks also for the two commands with the custom_pipeline argument. I'm looking foward for the text diff ! If I understand correctly, the DDIM inversion can be done just before this line right ?
by doing alpha_prod_t, alpha_prod_t_prev = alpha_prod_t_prev, alpha_prod_t
In the two notebooks you put (UnCLIPTextInterpolation and UnCLIPImageInterpolation) it could be done just before the calculation of or in fact by doing |
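For reference, a minimal sketch of the swap being described, applied to the standard deterministic DDIM update (eta = 0); this is illustrative only, not the scheduler's actual code:

```python
def ddim_inversion_step(model_output, sample, alpha_prod_t, alpha_prod_t_prev):
    # Swapping the two alpha products makes the deterministic DDIM update move
    # from the less noisy timestep toward the noisier one, i.e. it inverts the step.
    alpha_prod_t, alpha_prod_t_prev = alpha_prod_t_prev, alpha_prod_t
    # Standard deterministic DDIM update (eta = 0) after the swap.
    pred_x0 = (sample - (1 - alpha_prod_t) ** 0.5 * model_output) / alpha_prod_t ** 0.5
    return alpha_prod_t_prev ** 0.5 * pred_x0 + (1 - alpha_prod_t_prev) ** 0.5 * model_output
```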
Hugging Face has a DDIM inversion scheduler, so I was actually planning to use it similarly to how the SDPix2PixZero pipeline does DDIM inversion to get the inverted latents (the original noise). But I'll have to look into it in more detail. You can take a look at a pipeline that uses DDIM inversion here. Maybe I'll be able to better answer your question once I get down into the details. @patrickvonplaten, what do you think? |
Sorry, what exactly is the question here? 😅 I think the community pipelines work very well, no? |
@Abhinay1997 I looked for the DDIM inversion scheduler: DDIMInverseScheduler. Yes, it should work with it. |
@Abhinay1997 thanks to your link and to this one, I was able to get closer to the solution. I would be very interested to have your opinion on my results! The results below are for the following features: I set the interp_value as follows, and I calculated z_txt_diff = norm_diff(z_txt_start, z_txt_target). Starting from left to right, the first image is directly z_img_0; the second is the reconstruction of z_img_0 with xT fixed, called noise. As you can see, it seems to work, but the adult lion never turns into a real lion cub, even at 0.65 (in the DALL-E 2 paper it is 0.5). I have several questions about the results. Secondly, is it normal that the generated image of z_img_0 (with random noise) is different from the second one (with fixed xT)? I expected to get exactly the same image. My parameters are: num_steps = 10. For the generation of each interp_value, as for the generation of z_img_0 and noise, I left the timesteps at 1000; I get [901 801 701 601 501 401 301 201 101 1]. For the xT generation I set the timesteps to 50 and invert it; I get [1]. With that, I noticed that the results vary much more. |
About the timesteps: if you're using the DDIMInverseScheduler, they would go from 0 to 1000, I'd assume. See here, there is no reversal of the timesteps. Can you confirm you're using spherical interpolation to calculate the interpolation? As for reproducing the original image from the inverted noise, the paper mentions it. If there's a colab notebook, I'd be happy to have a look. :) |
@Abhinay1997 thank you for these details, and sorry for the delay; I have to finish another piece of work. |
@tikitong sorry for the late reply. I tried it out and am getting similar results. My implementation fails to reconstruct the original image too, i.e. no difference between using random noise and the original noise. I'm trying to figure out where the issue lies. Will update here if I make any progress :) |
Model/Pipeline/Scheduler description
Copied from #1858:
I think we could create a super cool community pipeline. The pipeline could automatically create interpolations between two text prompts, and similarly we could create one to do interpolations between two images.
In terms of design, to stay as efficient as possible the following would make sense:
- The user can pass a `num_interpolations` input.
- `num_interpolations` interpolated embeddings x_1, x_2, ... x_N-1 are created using the `slerp` function.
- This gives `num_interpolations` + 2 text embeddings that should be passed in a batch through the model to create a nice interpolation of images (see the sketch below).
- The pipeline should make use of `enable_cpu_offload()` to save memory.
It's probably easier to start with the `UnCLIPImageInterpolationPipeline`, since image embeddings are just a single 1-d vector, whereas for text embeddings two latent vectors are used.
Would be more than happy to help if someone is interested in giving this a try - think it'll make for some super cool demos.
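As a sketch of the batching idea above (a hypothetical helper; interp_fn would typically be the slerp function sketched earlier in this thread):

```python
import torch

def build_interpolation_batch(emb_start, emb_end, num_interpolations, interp_fn):
    """Stack x_0, the num_interpolations intermediate embeddings, and x_N into one
    batch so a single forward pass produces the whole interpolation sequence."""
    ts = torch.linspace(0.0, 1.0, num_interpolations + 2)
    return torch.stack([interp_fn(emb_start, emb_end, float(t)) for t in ts])
```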