
VQ-diffusion #658

Merged

Conversation

williamberman
Contributor

@williamberman williamberman commented Sep 27, 2022

VQ-diffusion

original issue: #319

So far, this PR only focuses on the VQ-diffusion model for the ITHQ dataset. It is possible that the models for the other datasets require additional changes.

The commits are broken out by the model component they port. See the individual commit messages for more in-depth write-ups on the changes.

Example usage

See the linked notebook for how to produce the pretrained model for diffusers.

```python
from diffusers.pipelines import VQDiffusionPipeline
import torch
from PIL import Image

OUTPUT_FILE = "<File to save image grid>"

torch.manual_seed(0)

def image_grid(imgs, rows, cols):
    # Paste `rows * cols` PIL images into a single grid image.
    assert len(imgs) == rows * cols
    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid

device = "cuda"

pipeline = VQDiffusionPipeline.from_pretrained("microsoft/vq-diffusion-ithq").to(device)

# Generate four samples for the prompt and arrange them in a 2x2 grid.
images = pipeline("teddy bear playing in the pool", truncation_rate=0.86, num_images_per_prompt=4)

image_grid(images.images, 2, 2).save(OUTPUT_FILE)
```

(Output: a 2×2 grid of generated images.)
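For intuition on the `truncation_rate` argument: VQ-diffusion samples each latent code from a truncated distribution that keeps only the highest-probability codes until their cumulative probability reaches the given rate. A minimal sketch of the idea (illustrative only, not the pipeline's exact implementation):

```python
import torch

def truncate(probs: torch.Tensor, truncation_rate: float = 0.86) -> torch.Tensor:
    """Keep the most probable codes up to `truncation_rate` of the total
    probability mass, zero out the rest, and renormalize."""
    sorted_probs, indices = torch.sort(probs, dim=-1, descending=True)
    keep = torch.cumsum(sorted_probs, dim=-1) < truncation_rate
    keep[..., 0] = True  # always keep the single most probable code
    mask = torch.zeros_like(probs, dtype=torch.bool).scatter(-1, indices, keep)
    truncated = probs * mask
    return truncated / truncated.sum(dim=-1, keepdim=True)
```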

Progress
  • Port VQVAE
  • Port transformer
  • VQDiffusionTransformer documentation
  • DalleMaskImageEmbedding documentation and clean up
  • Port text embeddings (CLIP)
  • Port scheduler
  • Scheduler documentation
  • Use torch.Generator
  • VQDiffusion Pipeline docs
  • General clean up
  • Updating notebook
  • Tests?
Comparing against original model

I put together a Python notebook that compares the currently ported components against the original implementation. It can be run on Google Colab and will handle installing dependencies and model weights. See its README for more details.

https://github.com/williamberman/vq-diffusion-notebook

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Sep 27, 2022

The documentation is not available anymore as the PR was closed or merged.

@williamberman williamberman mentioned this pull request Sep 27, 2022
@patrickvonplaten
Contributor

Wooow - this looks amazing! @patil-suraj mind taking a look here?

@williamberman
Contributor Author

> Wooow - this looks amazing! @patil-suraj mind taking a look here?

Thanks! I'd love to know whether the minimum unit to merge is the completed pipeline, or whether smaller chunks like this are acceptable to merge.

@patrickvonplaten
Contributor

patrickvonplaten commented Oct 7, 2022

Hey @williamberman and @345ishaan,

Sorry for the delay - I have more time going forward and think we can merge this model by next week!

It's a super nice PR. Extremely easy to understand and to follow - thanks a bunch!

I've fiddled with the PR a bit to make it more lightweight :-) We don't really need a new attention layer: 1×1 Conv2D layers used for attention act just like linear layers, for which an attention class already exists.

That's why I removed the Conv2DAttention class and changed the conversion script slightly so that your script still works as expected. I'm getting visually identical reconstruction images, so I think we're good with the linear attention layer that was already implemented (could you maybe double-check?).

Now I think as a next step we can implement the U-Net and scheduler for the forward pass, no? Do you need help/guidance here?
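For reference, the equivalence described above: a 1×1 Conv2d applies the same affine map at every spatial position, which is exactly what a linear layer does per token, so the conv attention weights can be loaded into the existing linear attention class by reshaping. A minimal sketch of the weight transfer (illustrative, not the actual conversion script):

```python
import torch
from torch import nn

# A 1x1 Conv2d over (B, C, H, W) and a Linear layer applied per spatial
# position compute the same map; transfer weights by dropping the 1x1 dims.
conv = nn.Conv2d(256, 256, kernel_size=1)
linear = nn.Linear(256, 256)
linear.weight.data = conv.weight.data.view(256, 256)  # (out, in, 1, 1) -> (out, in)
linear.bias.data = conv.bias.data

x = torch.randn(1, 256, 32, 32)
out_conv = conv(x)
out_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
assert torch.allclose(out_conv, out_linear, atol=1e-5)
```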

@williamberman
Contributor Author

williamberman commented Oct 7, 2022

> Hey @williamberman and @345ishaan,
>
> Sorry for the delay - I have more time going forward and think we can merge this model by next week!
>
> It's a super nice PR. Extremely easy to understand and to follow - thanks a bunch!
>
> I've fiddled with the PR a bit to make it more lightweight :-) We don't really need a new attention layer: 1×1 Conv2D layers used for attention act just like linear layers, for which an attention class already exists.
>
> That's why I removed the Conv2DAttention class and changed the conversion script slightly so that your script still works as expected. I'm getting visually identical reconstruction images, so I think we're good with the linear attention layer that was already implemented (could you maybe double-check?).
>
> Now I think as a next step we can implement the U-Net and scheduler for the forward pass, no? Do you need help/guidance here?

Thanks! Pinged in Discord as well, but the model has a transformer (just the encoder, IIRC) for the reverse diffusion process instead of a UNet. I have the transformer ported on another branch. I think the open question is: would you prefer that on this PR, or to merge this PR first and then merge the transformer in a separate PR?

@patrickvonplaten
Contributor

> > Hey @williamberman and @345ishaan,
> >
> > Sorry for the delay - I have more time going forward and think we can merge this model by next week!
> >
> > It's a super nice PR. Extremely easy to understand and to follow - thanks a bunch!
> >
> > I've fiddled with the PR a bit to make it more lightweight :-) We don't really need a new attention layer: 1×1 Conv2D layers used for attention act just like linear layers, for which an attention class already exists.
> >
> > That's why I removed the Conv2DAttention class and changed the conversion script slightly so that your script still works as expected. I'm getting visually identical reconstruction images, so I think we're good with the linear attention layer that was already implemented (could you maybe double-check?).
> >
> > Now I think as a next step we can implement the U-Net and scheduler for the forward pass, no? Do you need help/guidance here?
>
> Thanks! Pinged in Discord as well, but the model has a transformer (just the encoder, IIRC) for the reverse diffusion process instead of a UNet. I have the transformer ported on another branch. I think the open question is: would you prefer that on this PR, or to merge this PR first and then merge the transformer in a separate PR?

Hey @williamberman,

Great to hear that you already have it on a branch. Could you maybe add it directly to this PR? As a next step we could verify that a forward pass through the transformer (the replacement for the U-Net) gives results identical to the official implementation. If that works, we can integrate the scheduler and then, in a last step, make the whole denoising process work :-)

Overall, everything should ideally be in this PR. Since VQ-diffusion will be one of our first community pipeline contributions, there are lots of new things in this PR, and I'm more than happy to help you with it (don't hesitate to ping me :-))

@williamberman
Contributor Author

williamberman commented Oct 10, 2022

> Great to hear that you already have it on a branch. Could you maybe add it directly to this PR? As a next step we could verify that a forward pass through the transformer (the replacement for the U-Net) gives results identical to the official implementation. If that works, we can integrate the scheduler and then, in a last step, make the whole denoising process work :-)
>
> Overall, everything should ideally be in this PR. Since VQ-diffusion will be one of our first community pipeline contributions, there are lots of new things in this PR, and I'm more than happy to help you with it (don't hesitate to ping me :-))

SG!

Follow up:

  1. Merge transformer into this branch.
  2. Add the script I've been using to test the transformer to the PR description.
  3. Merge CLIP in the pipeline/conversion script for text embeddings into this branch.
  4. Add the script for testing CLIP to the PR description.
  5. Add the initial skeleton for the scheduler/pipeline to this branch (I also have this on the branch with the transformer).

@williamberman williamberman deleted the vq-diffusion-ithq-vqvae branch October 10, 2022 23:08
@williamberman williamberman restored the vq-diffusion-ithq-vqvae branch October 10, 2022 23:09
@williamberman williamberman reopened this Oct 10, 2022
@williamberman williamberman changed the title Port VQ-diffusion VQVAE for the ITHQ dataset to diffusers [WIP] VQ-diffusion Oct 10, 2022
@williamberman williamberman force-pushed the vq-diffusion-ithq-vqvae branch 3 times, most recently from c564b04 to 496524c Compare October 11, 2022 06:54
@patrickvonplaten
Contributor

> > Great to hear that you already have it on a branch. Could you maybe add it directly to this PR? As a next step we could verify that a forward pass through the transformer (the replacement for the U-Net) gives results identical to the official implementation. If that works, we can integrate the scheduler and then, in a last step, make the whole denoising process work :-)
> >
> > Overall, everything should ideally be in this PR. Since VQ-diffusion will be one of our first community pipeline contributions, there are lots of new things in this PR, and I'm more than happy to help you with it (don't hesitate to ping me :-))
>
> SG!
>
> Follow up:
>
> 1. Merge transformer into this branch.
> 2. Add the script I've been using to test the transformer to the PR description.
> 3. Merge CLIP in the pipeline/conversion script for text embeddings into this branch.
> 4. Add the script for testing CLIP to the PR description.
> 5. Add the initial skeleton for the scheduler/pipeline to this branch (I also have this on the branch with the transformer).

This sounds like a great plan!

@williamberman
Contributor Author

The PR description has been updated to reflect progress on merging the full model. The notebook that compares outputs from the autoencoder, transformer, and text embedder is linked in the PR description: https://github.com/williamberman/vq-diffusion-notebook. Once the scheduler is complete, I'll add it to the notebook as well :)

@patrickvonplaten
Contributor

@williamberman let me know if you'd like me to review now, or whether it's better to wait until the scheduler is integrated as well :-)

@patrickvonplaten
Contributor

Great progress!

@williamberman
Contributor Author

> @williamberman let me know if you'd like me to review now, or whether it's better to wait until the scheduler is integrated as well :-)

Let's wait until the scheduler is integrated. I cleaned up some non-scheduler components while working on it that I'd like to add to this branch first :)

@williamberman williamberman force-pushed the vq-diffusion-ithq-vqvae branch 4 times, most recently from 9d33a33 to 6b2bc69 Compare October 20, 2022 10:15
@williamberman williamberman requested review from patrickvonplaten and removed request for patil-suraj November 2, 2022 07:58
```python
torch.cuda.empty_cache()

def test_vq_diffusion(self):
    expected_image = load_image(
```
Contributor

added a test here @williamberman - pipeline works like a charm and fits nicely with the existing API :-)

Contributor Author

Great, exciting!

@patrickvonplaten
Contributor

Hey @williamberman,

You've really done an amazing job here! Very impressed by how you were able to fit such a complex new model into the existing API!

The conversion script and your notebook are very easy to follow :-)

I've uploaded the ITHQ model to the microsoft org here: https://huggingface.co/microsoft/vq-diffusion-ithq and added a slow test that makes sure the model works as expected.
Besides that, I've made some minor naming changes.

The failing tests are unrelated, and we can merge this PR more or less as is. If that's OK with you, I'll make some final renaming changes tomorrow morning (Paris time) to make it fit a bit better with existing configuration names (I'll have to sync with @patil-suraj, @anton-l, and @pcuenca), and then we can merge this in time for the next release IMO.

@patil-suraj @pcuenca @anton-l could you maybe already review this PR? IMO, besides some renaming it's ready! I've also made sure that all existing slow tests are passing!

@williamberman, if you're interested, we could do the following follow-up projects to promote this model a bit more:

Obviously no need to do any of the above if you don't want to or are too busy! Regarding this PR, I think we can merge it tomorrow morning!

Really great job here 🚀

@williamberman
Contributor Author

> Hey @williamberman,
>
> You've really done an amazing job here! Very impressed by how you were able to fit such a complex new model into the existing API!
>
> The conversion script and your notebook are very easy to follow :-)
>
> I've uploaded the ITHQ model to the microsoft org here: https://huggingface.co/microsoft/vq-diffusion-ithq and added a slow test that makes sure the model works as expected. Besides that, I've made some minor naming changes.
>
> The failing tests are unrelated, and we can merge this PR more or less as is. If that's OK with you, I'll make some final renaming changes tomorrow morning (Paris time) to make it fit a bit better with existing configuration names (I'll have to sync with @patil-suraj, @anton-l, and @pcuenca), and then we can merge this in time for the next release IMO.
>
> @patil-suraj @pcuenca @anton-l could you maybe already review this PR? IMO, besides some renaming it's ready! I've also made sure that all existing slow tests are passing!
>
> @williamberman, if you're interested, we could do the following follow-up projects to promote this model a bit more:
>
> Obviously no need to do any of the above if you don't want to or are too busy! Regarding this PR, I think we can merge it tomorrow morning!
>
> Really great job here 🚀

Awesome, all sounds good!

I think a blog post sounds great, sent you an email :)

@345ishaan

@williamberman I am wondering: did you try training it, or did you just verify it in inference mode?

@williamberman
Contributor Author

> @williamberman I am wondering: did you try training it, or did you just verify it in inference mode?

Just inference, using the weights Microsoft published. Training would have been a good amount more work 😅

Member

@anton-l anton-l left a comment

Awesome integration @williamberman, thank you for working on it!

patrickvonplaten and others added 2 commits November 3, 2022 12:53
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
Member

@pcuenca pcuenca left a comment

Amazing work!

@patrickvonplaten patrickvonplaten merged commit ef2ea33 into huggingface:main Nov 3, 2022
PhaneeshB pushed a commit to nod-ai/diffusers that referenced this pull request Mar 1, 2023
Just happened to have this card on my Windows machine and verified that the SD demo works on it.

```
Average step time: 144.26142692565918ms/it
Clip Inference Avg time (ms) = (205.001 + 44.000) / 2 = 124.501
VAE Inference time (ms): 281.001

Total image generation time: 7.856997728347778sec
```

I'd love to add an API upstream to derive compiler tuning flags from a host device.
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
* Changes for VQ-diffusion VQVAE

Allow specifying the dimension of embeddings in `VQModel`: by default, `VQModel` sets the embedding dimension to the number of latent channels. The VQ-diffusion VQVAE has a smaller embedding dimension (128) than its number of latent channels (256).

Add AttnDownEncoderBlock2D and AttnUpDecoderBlock2D to the up and down UNet block helpers. VQ-diffusion's VQVAE uses those two block types.
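As a rough illustration of what the embedding-dimension change enables (a hypothetical configuration: the `vq_embed_dim` keyword follows later diffusers releases, and the values here are illustrative rather than the exact ITHQ config):

```python
from diffusers import VQModel

# Hypothetical config: 256 latent channels but a 128-dimensional codebook,
# using the new attention encoder/decoder blocks. Values are illustrative.
vqvae = VQModel(
    in_channels=3,
    out_channels=3,
    down_block_types=("DownEncoderBlock2D", "AttnDownEncoderBlock2D"),
    up_block_types=("AttnUpDecoderBlock2D", "UpDecoderBlock2D"),
    block_out_channels=(128, 256),
    latent_channels=256,
    vq_embed_dim=128,  # smaller than latent_channels, as in VQ-diffusion
)
```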

* Changes for VQ-diffusion transformer

Modify attention.py so SpatialTransformer can be used for
VQ-diffusion's transformer.

SpatialTransformer:
- Can now operate over discrete inputs (classes of vector embeddings) as well as continuous.
- `in_channels` was made optional in the constructor, so two call sites where it was passed as a positional arg were changed to pass it as a kwarg
- modified forward pass to take optional timestep embeddings

ImagePositionalEmbeddings:
- added to provide positional embeddings to discrete inputs for latent pixels

BasicTransformerBlock:
- norm layers were made configurable so that VQ-diffusion could use AdaLayerNorm with timestep embeddings
- modified forward pass to take optional timestep embeddings

CrossAttention:
- now may optionally take a bias parameter for its query, key, and value linear layers

FeedForward:
- Internal layers are now configurable

ApproximateGELU:
- Activation function in VQ-diffusion's feedforward layer

AdaLayerNorm:
- Norm layer modified to incorporate timestep embeddings
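Minimal sketches of two of the layers described above, assuming `(batch, seq, dim)` activations (illustrative, not the exact library code):

```python
import torch
from torch import nn

class ApproximateGELU(nn.Module):
    """The activation at the core of VQ-diffusion's feedforward layer:
    GELU approximated as x * sigmoid(1.702 * x)."""
    def forward(self, x):
        return x * torch.sigmoid(1.702 * x)

class AdaLayerNorm(nn.Module):
    """LayerNorm whose scale and shift are predicted from the timestep."""
    def __init__(self, embedding_dim, num_embeddings):
        super().__init__()
        self.emb = nn.Embedding(num_embeddings, embedding_dim)
        self.silu = nn.SiLU()
        self.linear = nn.Linear(embedding_dim, embedding_dim * 2)
        self.norm = nn.LayerNorm(embedding_dim, elementwise_affine=False)

    def forward(self, x, timestep):
        # (batch,) integer timesteps -> per-sample scale and shift.
        emb = self.linear(self.silu(self.emb(timestep)))
        scale, shift = torch.chunk(emb, 2, dim=-1)
        return self.norm(x) * (1 + scale[:, None, :]) + shift[:, None, :]
```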

* Add VQ-diffusion scheduler

* Add VQ-diffusion pipeline

* Add VQ-diffusion convert script to diffusers

* Add VQ-diffusion dummy objects

* Add VQ-diffusion markdown docs

* Add VQ-diffusion tests

* some renaming

* some fixes

* more renaming

* correct

* fix typo

* correct weights

* finalize

* fix tests

* Apply suggestions from code review

Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* finish

* finish

* up

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>