
Prompting with video #210

Open
antragoudaras opened this issue Dec 11, 2024 · 4 comments

Comments

@antragoudaras

Hey, great work!

I was wondering, since this is an autoregressive model, can we prompt it with a short video consisting of a few frames? In the codebase I noticed that the Causal VAE allows for encoding/decoding multiple frames, and generate_i2v has a parameter num_images_per_prompt, but I am not sure how to use it correctly.

@feifeiobama
Collaborator

Generalizing autoregressive video generation to video extension is a very natural idea. We haven't supported it in our codebase yet, but technically it's similar to I2V: encode the existing frames and perform autoregressive prediction on top of them.
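
Roughly, the loop would look like the sketch below. The helper names here are placeholders to illustrate the idea, not actual functions from our codebase, and the resolution is just an example.

```python
import torch

# Placeholder for the Causal VAE encode step: 16 latent channels, 8x spatial
# downsampling, and 8k+1 input frames mapped to k+1 latent frames.
def encode_with_causal_vae(video: torch.Tensor) -> torch.Tensor:
    b, _, t, h, w = video.shape
    t_latent = (t - 1) // 8 + 1
    return torch.randn(b, 16, t_latent, h // 8, w // 8)

# Placeholder for the DiT's autoregressive step: given the latent history and
# a text prompt, predict one new latent frame.
def predict_next_latent(history_latents: torch.Tensor, prompt: str) -> torch.Tensor:
    b, c, _, h, w = history_latents.shape
    return torch.randn(b, c, 1, h, w)

prefix_video = torch.randn(1, 3, 41, 384, 640)   # 41 = 8*5 + 1 conditioning frames
latents = encode_with_causal_vae(prefix_video)   # (1, 16, 6, 48, 80)

for _ in range(10):                               # extend by 10 latent frames
    new_latent = predict_next_latent(latents, "a ping-pong ball falling")
    latents = torch.cat([latents, new_latent], dim=2)

print(latents.shape)                              # torch.Size([1, 16, 16, 48, 80])
```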

@antragoudaras
Author

Thank you for your response, @feifeiobama. I have implemented i2v inference using multiple consecutive frames (40), following the same regime as the t2v training process in train/train_pyramid_flow.py. Specifically, I observed that you first extract the latents using tools/extract_video_vae_latents.py, where the video is encoded with the VAE.

In this setup, your video tensor has dimensions (Batch, 3, num_of_frames, H, W). After being encoded with the VAE, the resulting latent tensor is compressed both spatially and temporally by a factor of 8, as described in your paper. This results in a latent tensor of size (1, 16, num_of_frames/8, H/8, W/8).

Similarly, for the i2v inference using multiple frames, I encoded a video tensor of size (1, 3, 40, H, W) into a latent tensor of size (1, 16, 5, H/8, W/8) using the VAE. However, I noticed that this approach often causes the generated video to exhibit unusual behavior, particularly with object movements. This issue frequently leads to the video collapsing, especially in dynamic scenarios (moving objects).
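
In pseudocode, the shape bookkeeping of my setup looks roughly like the snippet below; a dummy tensor stands in for the real VAE output, the function name is a placeholder rather than anything from the repo, and the resolution is just an example.

```python
import torch

def encode_video_placeholder(video: torch.Tensor) -> torch.Tensor:
    """Dummy stand-in for the Causal VAE encode: 16 latent channels, 8x spatial
    downsampling, and the temporal compression I observed (40 frames -> 5 latent frames)."""
    b, _, t, h, w = video.shape
    return torch.randn(b, 16, t // 8, h // 8, w // 8)

video = torch.randn(1, 3, 40, 384, 640)    # (Batch, 3, num_of_frames, H, W)
latents = encode_video_placeholder(video)
print(latents.shape)                        # torch.Size([1, 16, 5, 48, 80])
```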

For instance, when using 40 frames from a video of a single ping-pong ball in free fall, along with the text prompt 'Orange ping-pong ball falling down and making impact with the table surface below,' the generated output looks quite accurate for the first 2 seconds (at 24 FPS, which makes sense since most of the frames we feed into the model are derived from diffusion sampling and then reconstructed). However, for subsequent frames without guidance, anomalies start to occur.

When running the same inference script, I observe outcomes such as the ball initially falling normally but suddenly disappearing or bouncing mid-air. Then, nothing happens for a few seconds until the ball reappears, seemingly levitating. In some cases, the ball is rendered as a barely visible spinning grey object, or multiple balls spawn after the disappearance of the original ball. Despite these inconsistencies, the first 40 guidance-driven frames perform well.

@feifeiobama
Collaborator

Wow, thank you for trying this so quickly. I have two comments that may be helpful:

  • The VAE should encode 8k+1 frames into k+1 latent frames. For example, in your case, the VAE should take 41 frames as input and produce 6 latent frames (see the quick check after this list).
  • Guidance is indeed an important technique. We tried adding CFG for history frames during training, but it didn't work. It should be a very important technique if implemented properly.
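
For the first point, a quick way to sanity-check the expected latent frame count (plain arithmetic, not code from our repo):

```python
def latent_frames(num_input_frames: int, temporal_stride: int = 8) -> int:
    """Causal VAE frame count: 8k+1 input frames map to k+1 latent frames."""
    assert (num_input_frames - 1) % temporal_stride == 0, "use 8k+1 input frames"
    return (num_input_frames - 1) // temporal_stride + 1

print(latent_frames(41))   # 6
print(latent_frames(17))   # 3
```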

@antragoudaras
Author

antragoudaras commented Dec 14, 2024

@feifeiobama thanks for your feedback and useful comments! Since I opened this issue, I've been working on it over the past three days, which is why I was able to provide detailed observations in my previous comments.

Regarding the VAE, I did confirm that it encodes 8k+1 frames into k+1 latent frames. I tested with 41 input frames, resulting in 6 latent frames. Once again, the trajectory of the falling ping-pong ball covered by the 41 frames (almost the first 2 seconds) is accurately generated by the model (DiT backbone, then decoded by the VAE), but the subsequent frames without guidance have similar problems. Mainly, the ball disappears for the next 2-3 seconds, and then one or more balls spawn at a random position in the frame during the last moments of the video.

As for classifier-free guidance (CFG), I understand its potential to improve performance noticeably, if the models had been trained to leverage it.
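
For reference, the generic CFG combination I have in mind is the standard one below; this is not from the Pyramid Flow codebase, and how (or whether) to apply it to history frames is exactly the open question.

```python
import torch

def cfg_combine(uncond_pred: torch.Tensor, cond_pred: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    """Standard classifier-free guidance: push the model output away from the
    unconditional prediction and toward the conditional one."""
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)

# Toy example with dummy predictions.
uncond = torch.zeros(1, 16, 1, 48, 80)
cond = torch.ones(1, 16, 1, 48, 80)
print(cfg_combine(uncond, cond, guidance_scale=7.0).mean())  # tensor(7.)
```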

Any other suggestions are more than welcome.
