Add SkyReels V2: Infinite-Length Film Generative Model #11518

base: main

Conversation
It's about time. Thanks.
Replaces custom attention implementations with `SkyReelsV2AttnProcessor2_0` and the standard `Attention` module. Updates `WanAttentionBlock` to use `FP32LayerNorm` and `FeedForward`. Removes the `model_type` parameter, simplifying model architecture and attention block initialization.
Introduces new classes `SkyReelsV2ImageEmbedding` and `SkyReelsV2TimeTextImageEmbedding` for enhanced image and time-text processing. Refactors the `SkyReelsV2Transformer3DModel` to integrate these embeddings, updating the constructor parameters for better clarity and functionality. Removes unused classes and methods to streamline the codebase.
…ds and begin reorganizing the forward pass.
…hod, integrating rotary embeddings and improving attention handling. Removes the deprecated `rope_apply` function and streamlines the attention mechanism for better integration and clarity.
…ethod by updating parameter names for clarity, integrating attention masks, and improving the handling of encoder hidden states.
…ethod by enhancing the handling of time embeddings and encoder hidden states. Updates parameter names for clarity and integrates rotary embeddings, ensuring better compatibility with the model's architecture.
…ing components and streamline the text-to-video generation process. Updates class documentation and adjusts parameter handling for improved clarity and functionality.
…parameter handling and improving integration.
…to streamline video generation process.
…ipeline` for proper timestep management in video generation. Refactor latent variable preparation and update handling for better clarity.
…for improved memory efficiency during training.
… Update tensor handling in `SkyReelsV2Transformer3DModel` for improved dimensionality management. Clean up imports in `pipeline_skyreels_v2_diffusion_forcing.py` by removing `tqdm`.
…lsV2Transformer3DModel`. Clean up code for improved clarity and maintainability.
Mid-PR questions:
Thank you @tolgacangoz @a-r-r-o-w. Could you take a look, please?
Hi @nitinmukesh @tin2tin. You can test and review this PR just as you have done in other PRs, if you want.
Thank you @tolgacangoz for making the feature available in diffusers. I will test it now.
Introduced a new markdown file detailing the SkyReelsV2Transformer3DModel, including usage instructions and model output specifications.
- Adjusted `in_channels` from 36 to 16 in `test_skyreels_v2_df_image_to_video.py`.
- Added new parameters: `overlap_history`, `num_frames`, and `base_num_frames` in `test_skyreels_v2_df_video_to_video.py`.
- Updated the expected output shape in video tests from (17, 3, 16, 16) to (41, 3, 16, 16).
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thank you for the awesome work here @tolgacangoz! The PR looks great. I just have some nits and changes that will help keep implementations consistent across our other models/processors, and clean up the pipelines a bit.
It is a massive PR to review, but that's not why it took me so long. I'll have to admit the idea of diffusion forcing was new to me, and I couldn't fully wrap my head around it until going through some different implementations. Don't know how you did it so fast :)
Also great work on figuring out the numerical precision matching!
Regarding hosting the models, we will try to establish contact with SkyReels team (if not already) and see if they can host the weights.
- [SkyReels-V2 DF 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers)
- [SkyReels-V2 DF 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-540P-Diffusers)
- [SkyReels-V2 DF 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-720P-Diffusers)
- [SkyReels-V2 T2V 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-540P-Diffusers)
- [SkyReels-V2 T2V 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-720P-Diffusers)
- [SkyReels-V2 I2V 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-I2V-1.3B-540P-Diffusers)
- [SkyReels-V2 I2V 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-540P-Diffusers)
- [SkyReels-V2 I2V 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-720P-Diffusers)
cc @yiyixuxu Do we have contact with the SkyReels team, and do we know if they would be okay with hosting the weights? If it's not possible, we could maintain a `skyreels-community` org, similar to hunyuan.
I think so, let me check
## Notes
- SkyReels-V2 supports LoRAs with [`~loaders.WanLoraLoaderMixin.load_lora_weights`]. |
Since we have a completely new transformer implementation (in the sense that it is a new file, though similar to Wan), let's create a new lora loader mixin.
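For illustration, a minimal sketch of what such a mixin might look like, assuming it can largely reuse the Wan loader (the class body here is an assumption, not the PR's final code):

```python
# Hypothetical sketch: reuse WanLoraLoaderMixin's logic for SkyReels-V2,
# since the transformer is architecturally close to Wan's.
from diffusers.loaders.lora_pipeline import WanLoraLoaderMixin


class SkyReelsV2LoraLoaderMixin(WanLoraLoaderMixin):
    # LoRA weights are loaded into the transformer component.
    _lora_loadable_modules = ["transformer"]
    transformer_name = "transformer"
```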
```python
video = pipe(**inputs).frames
generated_video = video[0]

self.assertEqual(generated_video.shape, (21, 3, 16, 16))
```
I don't fully understand the total num_frames logic (we set 9 above but expect 21 here). Could you explain it a bit and provide a small example?
```python
pass

# TODO: Is this FLF2V test necessary, because the original repo doesn't seem to have this functionality for this pipeline?
```
From reading through the code paths, I don't think there is anything that could break easily when handling the last image. If you think there is, we can keep the test. Otherwise, let's just add a simple extension, `test_inference_with_last_image`, to the above test suite for minimal testing.
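A rough sketch of what that minimal extension could look like (the `get_dummy_components`/`get_dummy_inputs` helpers are assumed from the existing pipeline test suite, and the expected shape would need to match the other tests):

```python
def test_inference_with_last_image(self):
    # Minimal check that supplying a last image doesn't break inference.
    device = "cpu"
    components = self.get_dummy_components()
    pipe = self.pipeline_class(**components)
    pipe.to(device)
    pipe.set_progress_bar_config(disable=None)

    inputs = self.get_dummy_inputs(device)
    inputs["last_image"] = inputs["image"]  # reuse the dummy first frame as the last frame
    generated_video = pipe(**inputs).frames[0]

    self.assertEqual(generated_video.shape[1:], (3, 16, 16))
```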
src/diffusers/pipelines/skyreels_v2/pipeline_skyreels_v2_diffusion_forcing_i2v.py (outdated)
```diff
@@ -0,0 +1,1109 @@
+# Copyright 2025 The SkyReels-V2 Team, The Wan Team and The HuggingFace Team. All rights reserved.
```
Same comments as for the diffusion forcing pipeline above. We could probably wrap the repeated logic into a helper function and call that directly.
```diff
@@ -0,0 +1,962 @@
+# Copyright 2025 The SkyReels-V2 Team, The Wan Team and The HuggingFace Team. All rights reserved.
```
Same comments as above about `shift` and a helper function for the repeated logic.
Co-authored-by: Aryan <contact.aryanvs@gmail.com>
…ch parameters to simplify
Co-authored-by: Aryan <contact.aryanvs@gmail.com>
- Changed `flag_df` to `enable_diffusion_forcing` for clarity in the SkyReelsV2Transformer3DModel and associated pipelines.
- Updated all relevant method calls to reflect the new parameter name.
thanks so much for working on this! I left some comments
docs/source/en/_toctree.yml (outdated)

```diff
@@ -30,7 +30,8 @@
   - local: using-diffusers/push_to_hub
     title: Push files to the Hub
   title: Load pipelines and adapters
-  - sections:
+  - isExpanded: false
```
Why this change?
IIRC, `make style` and/or `make quality` did this (as suggested by the "style" commit). But when I try to reproduce it now, it doesn't happen 🤔; thus reverting.
```python
        hidden_states = attn.to_out[1](hidden_states)
        return hidden_states

    def set_ar_attention(self):
```
Can you tell me a bit about when a user would use ar_attention? I think the only difference is passing the attention_mask here, no? What would be the quality/performance difference with and without the mask?
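For readers following along, here is a rough, illustrative sketch of the kind of block-causal mask this discussion is about, where each latent frame may attend to itself and to earlier frames only (the block grouping and token layout are assumptions, not the PR's exact code):

```python
import torch


def block_causal_mask(num_frames: int, tokens_per_frame: int, block_size: int) -> torch.Tensor:
    """Boolean mask: a token attends to all tokens in its own frame block and earlier blocks."""
    frame_block = torch.arange(num_frames) // block_size           # block id per frame
    token_block = frame_block.repeat_interleave(tokens_per_frame)  # block id per token
    return token_block[:, None] >= token_block[None, :]


# e.g. 6 latent frames, 4 tokens each, causal in blocks of 2 frames
mask = block_causal_mask(num_frames=6, tokens_per_frame=4, block_size=2)
```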
```python
encoder_hidden_states,
timestep_proj,
rotary_emb,
causal_mask if self.config.flag_causal_attention else None,
```
This config `flag_causal_attention` seems to be redundant with `set_ar_attention`, did I miss something?
```python
if XLA_AVAILABLE:
    xm.mark_step()
else:
```
umm, we have two denoising loops here in the same pipeline?
```python
    causal_mask if self.config.flag_causal_attention else None,
)

if temb.dim() == 2:
```
Can we make notes (in comments) to explain? For example, for which model/checkpoint `temb` will be 2D or 3D, and their respective shapes.
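One way such a comment might read (the shape attributions below are assumptions inferred from the diffusion-forcing design, to be verified against the actual checkpoints):

```python
# temb is 2D, (batch_size, inner_dim), for the standard T2V/I2V pipelines,
# which use a single timestep per sample.
# temb is 3D, (batch_size, num_latent_frames, inner_dim), for the diffusion
# forcing pipelines, which assign a separate timestep to each latent frame,
# so the modulation below must broadcast per frame.
if temb.dim() == 2:
    ...
```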
```python
fps_emb = self.fps_embedding(fps).float()
timestep_proj = timestep_proj.to(fps_emb.dtype)
self.fps_projection.to(fps_emb.dtype)
```
Suggested removal:

```python
self.fps_projection.to(fps_emb.dtype)
```

I think we can just remove this line; PyTorch will automatically handle the dtype conversion, so there is no need to convert explicitly (in this case, it would upcast the weight). But +1 on testing it to see whether float32 is needed anyway.
Co-authored-by: YiYi Xu <yixu310@gmail.com>
- Ensured proper handling of hidden states during both gradient checkpointing and standard processing.
…mline imports
- Removed the mention of the ~13GB VRAM requirement for the SkyReels-V2 model.
- Simplified import statements by removing the unused `load_image` import.
Thanks for the opportunity to fix #11374!
Original Work
Original repo: https://github.com/SkyworkAI/SkyReels-V2
Paper: https://huggingface.co/papers/2504.13074
TODOs:
- ✅ `FlowMatchUniPCMultistepScheduler`: just copy-pasted from the original repo
- ✅ `SkyReelsV2Transformer3DModel`: ~90% the same as `WanTransformer3DModel`
- ✅ `SkyReelsV2DiffusionForcingPipeline`
- ✅ `SkyReelsV2DiffusionForcingImageToVideoPipeline`: includes FLF2V
- ✅ `SkyReelsV2DiffusionForcingVideoToVideoPipeline`: extends a given video
- ✅ `SkyReelsV2Pipeline`
- ✅ `SkyReelsV2ImageToVideoPipeline`
- ✅ `scripts/convert_skyreelsv2_to_diffusers.py` (tolgacangoz/SkyReels-V2-Diffusers)
- ⏳ Did you make sure to update the documentation with your changes? Did you write any new necessary tests?: We will construct these during review.
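To make the scope concrete, here is a minimal usage sketch for the diffusion forcing text-to-video pipeline (the exact call signature and defaults are assumptions modeled on the existing Wan pipelines and may change during review):

```python
import torch
from diffusers import SkyReelsV2DiffusionForcingPipeline
from diffusers.utils import export_to_video

# Model id from the checkpoint list above; dtype/device choices are illustrative.
pipe = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

video = pipe(
    prompt="A penguin dances.",
    num_frames=97,
    num_inference_steps=30,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```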
T2V with Diffusion Forcing (OLD)

| original | `diffusers` integration |
| --- | --- |
| original_0_short.mp4 | diffusers_0_short.mp4 |
| original_37_short.mp4 | diffusers_37_short.mp4 |
| original_0_long.mp4 | diffusers_0_long.mp4 |
| original_37_long.mp4 | diffusers_37_long.mp4 |
I2V with Diffusion Forcing (OLD)

`prompt` = "A penguin dances."

`diffusers` integration: i2v-short.mp4
FLF2V with Diffusion Forcing (OLD)
Now, Houston, we have a problem.
I have been unable to produce good results with this task. I tried many hyperparameter combinations with the original code.
The first frame's latent (`torch.Size([1, 16, 1, 68, 120])`) is overwritten onto the first of the 25 frame latents in `latents` (`torch.Size([1, 16, 25, 68, 120])`). Then the last frame's latent is concatenated, so `latents` becomes `torch.Size([1, 16, 26, 68, 120])`. After the denoising process, the appended last-frame latent is discarded and the result is decoded by the VAE. I also tried not concatenating the last frame but instead overwriting it onto the last frame of `latents` (and not discarding the last frame latent at the end), but still got bad results. Here are some results:

0.mp4
1.mp4
2.mp4
3.mp4
4.mp4
5.mp4
6.mp4
7.mp4
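For reference, a compact, runnable restatement of the latent handling described above (shapes from the post; the denoising loop itself is elided):

```python
import torch

latents = torch.randn(1, 16, 25, 68, 120)  # noisy video latents
first = torch.randn(1, 16, 1, 68, 120)     # first frame's encoded latent
last = torch.randn(1, 16, 1, 68, 120)      # last frame's encoded latent

latents[:, :, :1] = first                    # overwrite onto the first frame latent
latents = torch.cat([latents, last], dim=2)  # -> (1, 16, 26, 68, 120)
# ... denoising loop runs on the 26-frame latent ...
latents = latents[:, :, :-1]                 # discard the appended last-frame latent before VAE decode
```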
V2V with Diffusion Forcing (OLD)

This pipeline extends a given video.

`diffusers` integration: video1.mp4 → v2v.mp4
Firstly, I want to congratulate you on this great work, and thanks for open-sourcing it, SkyReels Team! This PR proposes an integration of your model.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@yiyixuxu @a-r-r-o-w @linoytsaban @yjp999 @Howe2018 @RoseRollZhu @pftq @Langdx @guibinchen @qiudi0127 @nitinmukesh @tin2tin @ukaprch @okaris