Add SkyReels V2: Infinite-Length Film Generative Model #11518

base: main

Conversation
It's about time. Thanks.
Replaces custom attention implementations with `SkyReelsV2AttnProcessor2_0` and the standard `Attention` module. Updates `WanAttentionBlock` to use `FP32LayerNorm` and `FeedForward`. Removes the `model_type` parameter, simplifying model architecture and attention block initialization.
Introduces new classes `SkyReelsV2ImageEmbedding` and `SkyReelsV2TimeTextImageEmbedding` for enhanced image and time-text processing. Refactors the `SkyReelsV2Transformer3DModel` to integrate these embeddings, updating the constructor parameters for better clarity and functionality. Removes unused classes and methods to streamline the codebase.
…ds and begin reorganizing the forward pass.
…hod, integrating rotary embeddings and improving attention handling. Removes the deprecated `rope_apply` function and streamlines the attention mechanism for better integration and clarity.
…ethod by updating parameter names for clarity, integrating attention masks, and improving the handling of encoder hidden states.
…ethod by enhancing the handling of time embeddings and encoder hidden states. Updates parameter names for clarity and integrates rotary embeddings, ensuring better compatibility with the model's architecture.
…ing components and streamline the text-to-video generation process. Updates class documentation and adjusts parameter handling for improved clarity and functionality.
…parameter handling and improving integration.
…to streamline video generation process.
…ipeline` for proper timestep management in video generation. Refactor latent variable preparation and update handling for better clarity.
…for improved memory efficiency during training.
… Update tensor handling in `SkyReelsV2Transformer3DModel` for improved dimensionality management. Clean up imports in `pipeline_skyreels_v2_diffusion_forcing.py` by removing `tqdm`.
…lsV2Transformer3DModel`. Clean up code for improved clarity and maintainability.
Mid-PR questions:
Thank you @tolgacangoz @a-r-r-o-w. Could you take a look, please?
Hi @nitinmukesh @tin2tin. You can test and review this PR just as you have done in other PRs, if you want.
Thank you @tolgacangoz for making the feature available in diffusers. I will test it now.
Introduced a new markdown file detailing the SkyReelsV2Transformer3DModel, including usage instructions and model output specifications.
- Adjusted `in_channels` from 36 to 16 in `test_skyreels_v2_df_image_to_video.py`.
- Added new parameters: `overlap_history`, `num_frames`, and `base_num_frames` in `test_skyreels_v2_df_video_to_video.py`.
- Updated the expected output shape in video tests from (17, 3, 16, 16) to (41, 3, 16, 16).
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thank you for the awesome work here @tolgacangoz! The PR looks great. I just have some nits and changes that will help keep implementations consistent across our other models/processors, and clean up the pipelines a bit.
It is a massive PR to review, but that's not why it took me so long. I'll have to admit the idea of diffusion forcing was new to me, and I couldn't fully wrap my head around it until going through some different implementations. Don't know how you did it so fast :)
Also great work on figuring out the numerical precision matching!
Regarding hosting the models, we will try to establish contact with SkyReels team (if not already) and see if they can host the weights.
- [SkyReels-V2 DF 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers)
- [SkyReels-V2 DF 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-540P-Diffusers)
- [SkyReels-V2 DF 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-720P-Diffusers)
- [SkyReels-V2 T2V 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-540P-Diffusers)
- [SkyReels-V2 T2V 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-720P-Diffusers)
- [SkyReels-V2 I2V 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-I2V-1.3B-540P-Diffusers)
- [SkyReels-V2 I2V 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-540P-Diffusers)
- [SkyReels-V2 I2V 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-720P-Diffusers)
cc @yiyixuxu Do we have contact with the SkyReels team, and do we know if they would be okay with hosting the weights? If it's not possible, we could maintain a `skyreels-community` org, similar to hunyuan.
I think so, let me check
## Notes
- SkyReels-V2 supports LoRAs with [`~loaders.WanLoraLoaderMixin.load_lora_weights`]. |
Since we have a completely new transformer implementation (in the sense that it is a new file, though similar to Wan), let's create a new lora loader mixin.
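For illustration, a minimal sketch of what such a mixin might look like, assuming it can largely reuse the Wan loader (the class body here is an assumption, not the PR's final code):

```python
# Hypothetical sketch: reuse WanLoraLoaderMixin's logic for SkyReels-V2,
# since the transformer is architecturally close to Wan's.
from diffusers.loaders.lora_pipeline import WanLoraLoaderMixin


class SkyReelsV2LoraLoaderMixin(WanLoraLoaderMixin):
    # LoRA weights are loaded into the transformer component.
    _lora_loadable_modules = ["transformer"]
    transformer_name = "transformer"
```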
```python
video = pipe(**inputs).frames
generated_video = video[0]

self.assertEqual(generated_video.shape, (21, 3, 16, 16))
```
I don't fully understand the total num_frames logic (we set 9 above but expect 21 here). Could you explain it a bit and provide a small example?
```python
pass

# TODO: Is this FLF2V test necessary, because the original repo doesn't seem to have this functionality for this pipeline?
```
From reading through the code paths, I don't think there is anything that could break easily when handling the last image. If you think there is, we can keep the test. Otherwise, let's just add a simple extension, `test_inference_with_last_image`, to the above test suite for minimal testing.
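A rough sketch of what that minimal extension could look like (the `get_dummy_components`/`get_dummy_inputs` helpers are assumed from the existing pipeline test suite, and the expected shape would need to match the other tests):

```python
def test_inference_with_last_image(self):
    # Minimal check that supplying a last image doesn't break inference.
    device = "cpu"
    components = self.get_dummy_components()
    pipe = self.pipeline_class(**components)
    pipe.to(device)
    pipe.set_progress_bar_config(disable=None)

    inputs = self.get_dummy_inputs(device)
    inputs["last_image"] = inputs["image"]  # reuse the dummy first frame as the last frame
    generated_video = pipe(**inputs).frames[0]

    self.assertEqual(generated_video.shape[1:], (3, 16, 16))
```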
src/diffusers/pipelines/skyreels_v2/pipeline_skyreels_v2_diffusion_forcing_i2v.py (outdated)
```diff
@@ -0,0 +1,1109 @@
+# Copyright 2025 The SkyReels-V2 Team, The Wan Team and The HuggingFace Team. All rights reserved.
```
Same comments as for the diffusion forcing pipeline above. We could probably wrap the repeated logic into a helper function and call that directly.
```diff
@@ -0,0 +1,962 @@
+# Copyright 2025 The SkyReels-V2 Team, The Wan Team and The HuggingFace Team. All rights reserved.
```
Same comments as above about `shift` and a helper function for the repeated logic.
Co-authored-by: Aryan <contact.aryanvs@gmail.com>
…ch parameters to simplify
Co-authored-by: Aryan <contact.aryanvs@gmail.com>
- Changed `flag_df` to `enable_diffusion_forcing` for clarity in the SkyReelsV2Transformer3DModel and associated pipelines.
- Updated all relevant method calls to reflect the new parameter name.
thanks so much for working on this! I left some comments
docs/source/en/_toctree.yml (outdated)

```diff
@@ -30,7 +30,8 @@
   - local: using-diffusers/push_to_hub
     title: Push files to the Hub
   title: Load pipelines and adapters
-  - sections:
+  - isExpanded: false
```
Why this change?
IIRC, `make style` and/or `make quality` did this (as suggested by the "style" commit). But when I try to reproduce it now, it doesn't happen 🤔; thus reverting.
```python
        hidden_states = attn.to_out[1](hidden_states)
        return hidden_states

    def set_ar_attention(self):
```
Can you tell me a bit about when a user would use ar_attention? I think the only difference is passing the attention_mask here, no? What would be the quality/performance difference with and without the mask?
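For readers following along, here is a rough, illustrative sketch of the kind of block-causal mask this discussion is about, where each latent frame may attend to itself and to earlier frames only (the block grouping and token layout are assumptions, not the PR's exact code):

```python
import torch


def block_causal_mask(num_frames: int, tokens_per_frame: int, block_size: int) -> torch.Tensor:
    """Boolean mask: a token attends to all tokens in its own frame block and earlier blocks."""
    frame_block = torch.arange(num_frames) // block_size           # block id per frame
    token_block = frame_block.repeat_interleave(tokens_per_frame)  # block id per token
    return token_block[:, None] >= token_block[None, :]


# e.g. 6 latent frames, 4 tokens each, causal in blocks of 2 frames
mask = block_causal_mask(num_frames=6, tokens_per_frame=4, block_size=2)
```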
```python
encoder_hidden_states,
timestep_proj,
rotary_emb,
causal_mask if self.config.flag_causal_attention else None,
```
This config `flag_causal_attention` seems to be redundant with `set_ar_attention`, did I miss something?
```python
if XLA_AVAILABLE:
    xm.mark_step()
else:
```
umm, we have two denoising loops here in the same pipeline?
```python
    causal_mask if self.config.flag_causal_attention else None,
)

if temb.dim() == 2:
```
Can we make notes (in comments) to explain? For example, for which model/checkpoint `temb` will be 2D or 3D, and their respective shapes.
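One way such a comment might read (the shape attributions below are assumptions inferred from the diffusion-forcing design, to be verified against the actual checkpoints):

```python
# temb is 2D, (batch_size, inner_dim), for the standard T2V/I2V pipelines,
# which use a single timestep per sample.
# temb is 3D, (batch_size, num_latent_frames, inner_dim), for the diffusion
# forcing pipelines, which assign a separate timestep to each latent frame,
# so the modulation below must broadcast per frame.
if temb.dim() == 2:
    ...
```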
```python
fps_emb = self.fps_embedding(fps).float()
timestep_proj = timestep_proj.to(fps_emb.dtype)
self.fps_projection.to(fps_emb.dtype)
```
Suggested removal:

```python
self.fps_projection.to(fps_emb.dtype)
```

I think we can just remove this line; PyTorch will automatically handle the dtype conversion, so there is no need to convert explicitly (in this case, it would upcast the weight). But +1 on testing it to see whether float32 is needed anyway.
Co-authored-by: YiYi Xu <yixu310@gmail.com>
- Ensured proper handling of hidden states during both gradient checkpointing and standard processing.
…mline imports
- Removed the mention of the ~13GB VRAM requirement for the SkyReels-V2 model.
- Simplified import statements by removing the unused `load_image` import.
Thanks for the opportunity to fix #11374!
Original Work
Original repo: https://github.com/SkyworkAI/SkyReels-V2
Paper: https://huggingface.co/papers/2504.13074
TODOs:
- ✅ `FlowMatchUniPCMultistepScheduler`: just copy-pasted from the original repo
- ✅ `SkyReelsV2Transformer3DModel`: ~90% the same as `WanTransformer3DModel`
- ✅ `SkyReelsV2DiffusionForcingPipeline`
- ✅ `SkyReelsV2DiffusionForcingImageToVideoPipeline`: includes FLF2V
- ✅ `SkyReelsV2DiffusionForcingVideoToVideoPipeline`: extends a given video
- ✅ `SkyReelsV2Pipeline`
- ✅ `SkyReelsV2ImageToVideoPipeline`
- ✅ `scripts/convert_skyreelsv2_to_diffusers.py` (tolgacangoz/SkyReels-V2-Diffusers)
- ⏳ Did you make sure to update the documentation with your changes? Did you write any new necessary tests?: We will construct these during review.
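To make the scope concrete, here is a minimal usage sketch for the diffusion forcing text-to-video pipeline (the exact call signature and defaults are assumptions modeled on the existing Wan pipelines and may change during review):

```python
import torch
from diffusers import SkyReelsV2DiffusionForcingPipeline
from diffusers.utils import export_to_video

# Model id from the checkpoint list above; dtype/device choices are illustrative.
pipe = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

video = pipe(
    prompt="A penguin dances.",
    num_frames=97,
    num_inference_steps=30,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```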
T2V with Diffusion Forcing (OLD)

| original | `diffusers` integration |
| --- | --- |
| original_0_short.mp4 | diffusers_0_short.mp4 |
| original_37_short.mp4 | diffusers_37_short.mp4 |
| original_0_long.mp4 | diffusers_0_long.mp4 |
| original_37_long.mp4 | diffusers_37_long.mp4 |
I2V with Diffusion Forcing (OLD)

`prompt` = "A penguin dances."

`diffusers` integration: i2v-short.mp4
FLF2V with Diffusion Forcing (OLD)
Now, Houston, we have a problem.
I have been unable to produce good results with this task. I tried many hyperparameter combinations with the original code.
The first frame's latent (`torch.Size([1, 16, 1, 68, 120])`) is overwritten onto the first of the 25 frame latents in `latents` (`torch.Size([1, 16, 25, 68, 120])`). Then the last frame's latent is concatenated, so `latents` becomes `torch.Size([1, 16, 26, 68, 120])`. After the denoising process, the appended last-frame latent is discarded and the result is decoded by the VAE. I also tried not concatenating the last frame but instead overwriting it onto the last frame of `latents` (and not discarding the last frame latent at the end), but still got bad results. Here are some results:

0.mp4
1.mp4
2.mp4
3.mp4
4.mp4
5.mp4
6.mp4
7.mp4
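For reference, a compact, runnable restatement of the latent handling described above (shapes from the post; the denoising loop itself is elided):

```python
import torch

latents = torch.randn(1, 16, 25, 68, 120)  # noisy video latents
first = torch.randn(1, 16, 1, 68, 120)     # first frame's encoded latent
last = torch.randn(1, 16, 1, 68, 120)      # last frame's encoded latent

latents[:, :, :1] = first                    # overwrite onto the first frame latent
latents = torch.cat([latents, last], dim=2)  # -> (1, 16, 26, 68, 120)
# ... denoising loop runs on the 26-frame latent ...
latents = latents[:, :, :-1]                 # discard the appended last-frame latent before VAE decode
```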
V2V with Diffusion Forcing (OLD)

This pipeline extends a given video.

`diffusers` integration: video1.mp4 → v2v.mp4
Firstly, I want to congratulate you on this great work, and thanks for open-sourcing it, SkyReels Team! This PR proposes an integration of your model.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@yiyixuxu @a-r-r-o-w @linoytsaban @yjp999 @Howe2018 @RoseRollZhu @pftq @Langdx @guibinchen @qiudi0127 @nitinmukesh @tin2tin @ukaprch @okaris