
CogVideoX-5b-I2V support #9418
Merged (34 commits) on Sep 16, 2024

Conversation

@zRzRzRzRzRzRzR (Contributor) commented Sep 11, 2024

The purpose of this PR is to adapt our upcoming CogVideoX-5B-I2V model to the diffusers framework:

  1. The model takes an image and text as input and outputs a video.
  2. The number of input channels has been changed to 32, while the rest of the model structure is similar to the 5B T2V model.
  3. A new pipeline, CogVideoXImage2Video, has been created, and the documentation has been updated accordingly.
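For context, a minimal usage sketch of the new pipeline (the class is exposed in diffusers as CogVideoXImageToVideoPipeline; the input image URL below is a placeholder, and the checkpoint only resolves on the Hub once the model is released, as discussed near the end of this thread):

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Load the I2V pipeline; the checkpoint id matches the PR but is only
# available on the Hub after the model release discussed below.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("https://example.com/first_frame.jpg")  # placeholder URL
video = pipe(
    image=image,  # conditioning image used as the first frame
    prompt="A panda strumming a guitar in a bamboo forest",
    num_frames=49,  # the model's default clip length
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```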

@a-r-r-o-w @zRzRzRzRzRzRzR

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul (Member) left a comment

I left a few comments, but all of them are very minor. Basically, this PR looks solid to me and shouldn't take much time to merge.

Off to @yiyixuxu.

Comment on lines 468 to 470
# Note: we use `-1` instead of `channels`:
# - It is okay to use for CogVideoX-2b and CogVideoX-5b (number of input channels is equal to output channels)
# - However, for CogVideoX-5b-I2V, image latents are concatenated to the input (the number of input channels is twice the number of output channels)
Member:

I think this is sufficiently documented by the comment; it should be fine!
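A toy sketch of why `-1` is the safe choice here (shapes are illustrative, not the model's real dimensions):

```python
import torch

batch, frames, height, width = 1, 4, 8, 8
out_channels = 16  # channels the transformer predicts (video latents)

# The projected output always has `out_channels` channels, even though the
# I2V input had 2 * out_channels (video latents + image latents concatenated).
x = torch.randn(batch, frames * height * width, out_channels)

# Reshaping with -1 works for both T2V (in == out) and I2V (in == 2 * out),
# because only the output tensor is being unpatchified here.
video = x.reshape(batch, frames, height, width, -1)
assert video.shape[-1] == out_channels
```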

image_rotary_emb=image_rotary_emb,
return_dict=False,
)[0]
noise_pred = noise_pred.float()
Member:

This seems interesting. Why do we have to manually perform the upcasting here?

@a-r-r-o-w (Member) commented Sep 13, 2024

I think @yiyixuxu would be better able to answer this, since it was copied over from the other Cog pipelines. IIRC, the original codebase had an upcast here too, which is why we kept it.
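For intuition, a toy illustration of the precision issue an upcast like this typically guards against (an assumption about the motivation, not this pipeline's documented failure mode):

```python
import torch

# fp16 has ~10 bits of mantissa, so small denoising updates near 1.0 can be
# swallowed entirely; fp32 preserves them.
x16 = torch.tensor([1.0], dtype=torch.float16)
x32 = x16.float()  # the same kind of upcast as `noise_pred.float()`

print(x16 + 1e-4)  # tensor([1.], dtype=torch.float16) -- update lost
print(x32 + 1e-4)  # tensor([1.0001]) -- update retained
```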

docs/source/en/api/pipelines/cogvideox.md (thread resolved)
tests/pipelines/cogvideo/test_cogvideox_image2video.py (outdated, resolved)
@yiyixuxu (Collaborator) left a comment

thanks! left some minor comments, feel free to merge once addressed!

latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

latent_image_input = torch.cat([image_latents] * 2) if do_classifier_free_guidance else image_latents
latent_model_input = torch.cat([latent_model_input, latent_image_input], dim=2)
Collaborator:

interesting, they don't add noise to the image
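A shape-level sketch of what the snippet above does (all shapes are illustrative): the clean, VAE-encoded image latents are concatenated along the channel dim, with no noise added to them:

```python
import torch

# Illustrative latent shapes: [batch, frames, channels, height, width].
latents = torch.randn(1, 13, 16, 60, 90)        # noisy video latents
image_latents = torch.randn(1, 13, 16, 60, 90)  # clean image latents (no noise added)

do_classifier_free_guidance = True
latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
latent_image_input = torch.cat([image_latents] * 2) if do_classifier_free_guidance else image_latents

# Channel-wise concat yields the 32 input channels from the PR description.
latent_model_input = torch.cat([latent_model_input, latent_image_input], dim=2)
assert latent_model_input.shape == (2, 13, 32, 60, 90)
```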

@@ -78,6 +84,7 @@ def replace_up_keys_inplace(key: str, state_dict: Dict[str, Any]):
"mixins.final_layer.norm_final": "norm_out.norm",
"mixins.final_layer.linear": "proj_out",
"mixins.final_layer.adaLN_modulation.1": "norm_out.linear",
"mixins.pos_embed.pos_embedding": "patch_embed.pos_embedding", # Specific to CogVideoX-5b-I2V
Member:

Should we have an if/else to guard this accordingly?

Member:

This layer is actually absent in the T2V models. It's called positional_embedding in T2V (which is just a sincos PE), while it's pos_embedding here. I think it's safe, but I'm going to verify now.

Member:

Yep, this is safe and should not affect the T2V checkpoints since they follow different layer naming conventions
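A minimal sketch of why no guard is needed (an assumption about how substring-based key renaming behaves, not the conversion script's exact code):

```python
# Hypothetical rename helper: a mapping only fires for keys that actually occur.
RENAME_MAP = {
    "mixins.pos_embed.pos_embedding": "patch_embed.pos_embedding",  # I2V only
}

def rename_keys(state_dict: dict) -> dict:
    renamed = {}
    for key, value in state_dict.items():
        for old, new in RENAME_MAP.items():
            key = key.replace(old, new)  # no-op when `old` is absent
        renamed[key] = value
    return renamed

# T2V checkpoints name this layer `positional_embedding` (sincos, per the
# discussion above), so the I2V-specific mapping never matches them.
```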

if self.use_positional_embeddings or self.use_learned_positional_embeddings:
if self.use_learned_positional_embeddings and (self.sample_width != width or self.sample_height != height):
raise ValueError(
"It is currently not possible to generate videos at a different resolution that the defaults. This should only be the case with 'THUDM/CogVideoX-5b-I2V'."
Member:

In other words, the 2b variant supports it?

Member:

Yes, we had some success with multi-resolution inference quality on the 2B T2V. The reason for allowing this is to not confine LoRA training to 720x480 videos on the 2B model. The 5B T2V will skip this entire branch. The 5B I2V uses positional embeddings that were learned, so we can't generate them on the fly like the sincos embeddings for the 2B T2V model.
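A sketch of the distinction (names and sizes are illustrative): sincos embeddings are a pure function of position and can be regenerated for any grid, while learned embeddings are fixed parameters tied to the training resolution:

```python
import math
import torch

def sincos_pe(num_positions: int, dim: int) -> torch.Tensor:
    # Standard sincos positional embedding: computable for any grid size.
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe_480p = sincos_pe(60 * 90, 128)   # regenerate on the fly for one grid...
pe_720p = sincos_pe(90 * 160, 128)  # ...or any other, with no stored weights

# A learned PE is a parameter of fixed shape: it only exists for the
# resolution it was trained at, hence the hard error above.
learned_pe = torch.nn.Parameter(torch.zeros(1, 60 * 90, 128))
```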

Comment on lines +779 to +781
self._guidance_scale = 1 + guidance_scale * (
(1 - math.cos(math.pi * ((num_inference_steps - t.item()) / num_inference_steps) ** 5.0)) / 2
)
Member:

(can revisit later)

This can introduce graph breaks because we are combining non-torch operations with torch tensors. .item() is a data-dependent call and can also lead to performance issues.

Just noting so that we can revisit if needs be.
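For later reference, one possible workaround (an illustrative assumption, not what this PR does): iterate the timesteps as plain Python numbers, so the schedule never reads a tensor element inside the loop:

```python
import math
import torch

guidance_scale, num_inference_steps = 6.0, 50
timesteps = torch.linspace(999, 0, num_inference_steps)

# .tolist() happens once, outside any compiled region; the per-step math
# below then uses only Python floats, avoiding data-dependent .item() calls.
for t in timesteps.tolist():
    scale = 1 + guidance_scale * (
        (1 - math.cos(math.pi * ((num_inference_steps - t) / num_inference_steps) ** 5.0)) / 2
    )
```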

@sayakpaul (Member) left a comment

Looks good. My comments are minor, not blockers at all.

@zRzRzRzRzRzRzR changed the title from "Cogvideox 5b i2v draft" to "CogVideoX-5b-I2V support" on Sep 14, 2024
@a-r-r-o-w (Member):
Will be merging after CI turns green. Will take up any changes in follow-up PRs

@a-r-r-o-w merged commit 8336405 into huggingface:main on Sep 16, 2024 (14 of 15 checks passed)
@tin2tin commented Sep 16, 2024

OSError: THUDM/CogVideoX-5b-I2V is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

@a-r-r-o-w (Member):

The planned date for the model release is some time in the next few days, when the CogVideoX team is ready. Until then, we will be preparing a Diffusers patch release to ship the pipeline.

@zRzRzRzRzRzRzR (Contributor, Author):

Thank you for your support! We expect to open-source the project next week. If the patch release can be published before then, it would be a great help to us.


a-r-r-o-w added a commit that referenced this pull request Sep 17, 2024
* draft Init

* draft

* vae encode image

* make style

* image latents preparation

* remove image encoder from conversion script

* fix minor bugs

* make pipeline work

* make style

* remove debug prints

* fix imports

* update example

* make fix-copies

* add fast tests

* fix import

* update vae

* update docs

* update image link

* apply suggestions from review

* apply suggestions from review

* add slow test

* make use of learned positional embeddings

* apply suggestions from review

* doc change

* Update convert_cogvideox_to_diffusers.py

* make style

* final changes

* make style

* fix tests

---------

Co-authored-by: Aryan <aryan@huggingface.co>
leisuzz pushed the same commit to leisuzz/diffusers, referencing this pull request, on Oct 11, 2024