Conversation

@leffff commented Oct 13, 2025

What does this PR do?

This PR adds Kandinsky5T2VPipeline and Kandinsky5Transformer3DModel, as well as several layer classes needed for the Kandinsky 5.0 Lite T2V model.

@sayakpaul Please review

@sayakpaul requested review from DN6 and yiyixuxu on October 14, 2025
@sayakpaul (Member)

Could you please update the PR with test code and some example outputs?

@leffff (Author) commented Oct 14, 2025

Sure!

@leffff (Author) commented Oct 14, 2025

Dear @sayakpaul @yiyixuxu @DN6
What should the test code and example outputs look like?

@leffff (Author) commented Oct 14, 2025

import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

pipe = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

negative_prompt = [
    "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards",
]
prompt = [
    "A cat and a dog baking a cake together in a kitchen.",
]

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=512,
    width=768,
    num_frames=121,
    num_inference_steps=50,
    guidance_scale=5.0,
    num_videos_per_prompt=1,
    # torch.Generator takes a device, not a seed; seed it via manual_seed
    generator=torch.Generator(device="cuda").manual_seed(42),
)
# Save the first generated video, assuming the pipeline returns a .frames
# attribute like other diffusers video pipelines (fps=24 is an assumption).
export_to_video(output.frames[0], "output.mp4", fps=24)
[video: output.10.mp4]

prompt = [
    "A monkey riding a skateboard",
]

[video: output.10.mp4]

prompt = [
    "Several giant wooly mammoths threading through the meadow",
]

[video: output.10.mp4]

@sayakpaul (Member)

Great, thanks for providing the examples! Does the model also do realistic generations? 👀

@linoytsaban @apolinario @asomoza in case you wanna test it?

@leffff (Author) commented Oct 14, 2025

Yes, of course!

A stylish woman struts confidently down a rain-drenched Tokyo street, where vibrant neon signs flicker and pulse with electric color. She wears a sleek black leather jacket over a flowing red dress, paired with polished black boots and a matching black purse. Her sunglasses reflect the glowing cityscape as she moves with a calm, assured demeanor, red lipstick adding a bold contrast to her look. The wet pavement mirrors the dazzling lights, doubling the intensity of the urban glow around her. Pedestrians bustle along the sidewalks, their silhouettes blending into the dynamic, cinematic atmosphere of the neon-lit metropolis.

[video: output.10.mp4]

A cinematic movie trailer unfolds with a 30-year-old space man traversing a vast salt desert beneath a brilliant blue sky. He wears a uniquely styled red wool knitted motorcycle helmet, adding an eccentric yet rugged charm to his spacefaring look. As he rides a retro-futuristic vehicle across the shimmering white terrain, the wind kicks up clouds of glittering salt, creating a surreal atmosphere. The scene is captured in a vivid, cinematic style, shot on 35mm film to enhance the nostalgic and dramatic grain. Explosions of color and dynamic camera movements highlight the space man's daring escape from a collapsing alien base in the distance.

[video: output.11.mp4]

@asomoza (Member) left a comment
Thanks, looks cool! I left some suggestions for unused imports.

"""
A 3D Diffusion Transformer model for video-like data.
"""

A contributor commented:

Suggested change:

_repeated_blocks = [
    "Kandinsky5TransformerEncoderBlock",
    "Kandinsky5TransformerDecoderBlock",
]

Should we declare repeated blocks, @sayakpaul?

A collaborator replied:

Yes, let's add that.
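
For context, a minimal sketch of what declaring the repeated blocks could look like. The class body is abbreviated here, and the comment's claim about usage assumes the _repeated_blocks attribute feeds diffusers' regional compilation helper (ModelMixin.compile_repeated_blocks):

from diffusers.configuration_utils import ConfigMixin
from diffusers.models.modeling_utils import ModelMixin


class Kandinsky5Transformer3DModel(ModelMixin, ConfigMixin):
    # Block class names that repeat through the network; diffusers can then
    # compile one instance of each and reuse the compiled graph for the rest.
    _repeated_blocks = [
        "Kandinsky5TransformerEncoderBlock",
        "Kandinsky5TransformerDecoderBlock",
    ]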

@leffff (Author) commented Oct 16, 2025

@yiyixuxu
I've made lots of corrections; please review them. I've gone through all of the feedback, tackling every issue!


key = apply_rotary(key, rope).type_as(key)

if sparse_params is not None:
    out = self.nabla(query, key, value, sparse_params=sparse_params)
A collaborator commented:

Can we look into refactoring the attention to use dispatch_attention_fn instead, so that we can use different attention implementations (flex or others) out of the box?

see this PR #11916
reference code: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_flux.py#L118
doc: https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends
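
As an illustration, here is a rough sketch of a processor routed through dispatch_attention_fn, adapted from the Flux reference above. The attn module attributes (to_q/to_k/to_v/to_out, heads) and the apply_rotary helper are assumptions based on this PR's code, not the final implementation:

from diffusers.models.attention_dispatch import dispatch_attention_fn


class Kandinsky5AttnProcessor:
    _attention_backend = None  # set by diffusers' attention-backend machinery

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, rotary_emb=None):
        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states

        # Project and reshape to (batch, seq_len, heads, head_dim),
        # the layout dispatch_attention_fn expects.
        query = attn.to_q(hidden_states).unflatten(-1, (attn.heads, -1))
        key = attn.to_k(encoder_hidden_states).unflatten(-1, (attn.heads, -1))
        value = attn.to_v(encoder_hidden_states).unflatten(-1, (attn.heads, -1))

        if rotary_emb is not None:
            # apply_rotary is the RoPE helper from this PR (assumed in scope)
            query = apply_rotary(query, rotary_emb).type_as(query)
            key = apply_rotary(key, rotary_emb).type_as(key)

        # SDPA by default; flash/flex/etc. when a backend is configured.
        hidden_states = dispatch_attention_fn(
            query,
            key,
            value,
            attn_mask=attention_mask,
            backend=self._attention_backend,
        )
        hidden_states = hidden_states.flatten(2, 3)
        return attn.to_out[0](hidden_states)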

@leffff (Author) replied:

We want to move forward with the integration as soon as possible.
Can we contribute the code as-is, and add support for the 10 sec models later?

@leffff (Author) added:

We did indeed propose a new attention algorithm.
So does that require implementing it as a new attention backend?

A collaborator replied:

OK, so I think by default we still use SDPA, no? The flex stuff is only optional, and the user has to configure attention_type to be flex in order to use it. If that's the case, I think the fastest way to get this PR in is to remove all the flex-related stuff for now, and add it back in a follow-up PR using dispatch_attention_fn.

@leffff (Author) replied:

Ok, I see!
You want to have a KandinskyAttnProcessor with several backends. Okay.

A collaborator replied:

Yes, you should be able to structure KandinskyAttnProcessor to use dispatch_attention_fn so that it works with different backends out of the box, instead of having to handle them manually like the current code does.
But if that will take too much time, we can just support the default one and make it work with dispatch_attention_fn in a follow-up PR.
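
For reference, a hedged usage sketch of what backend switching could look like once the processor goes through dispatch_attention_fn, reusing the pipe from the example above; backend availability beyond the SDPA default depends on the installed kernels:

from diffusers.models.attention_dispatch import attention_backend

# Persistent switch on the model (SDPA remains the default when unset):
pipe.transformer.set_attention_backend("flex")

# Or scoped to a single call via the context manager:
with attention_backend("flash"):
    output = pipe(prompt=prompt, negative_prompt=negative_prompt)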

@yiyixuxu (Collaborator) commented Oct 17, 2025

@leffff
I refactored the Kandinsky 5 attention in this commit: acabbc0
You can cherry-pick that commit or just do something similar to what I did there.

I tested the default SDPA backend; I did not test flex, but the code/logic should look roughly like that.
