
Prompting with video #210

Open
antragoudaras opened this issue Dec 11, 2024 · 4 comments

Comments

@antragoudaras

Hey, great work!

I was wondering, since this is an autoregressive model, can we prompt it with a short video consisting of a few frames? In the codebase I noticed that the Causal VAE allows for encoding/decoding multiple frames, and generate_i2v has a parameter num_images_per_prompt, but I am not sure how to use it correctly.

@feifeiobama
Collaborator

Generalizing autoregressive video generation to video extension is a very natural idea. We haven't supported it in our codebase yet, but technically it's similar to I2V: encode the existing frames and perform autoregressive prediction on top of them.
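
Roughly, the loop would look like the sketch below. The helper names here are placeholders to illustrate the idea, not actual functions from our codebase, and the resolution is just an example.

```python
import torch

# Placeholder for the Causal VAE encode step: 16 latent channels, 8x spatial
# downsampling, and 8k+1 input frames mapped to k+1 latent frames.
def encode_with_causal_vae(video: torch.Tensor) -> torch.Tensor:
    b, _, t, h, w = video.shape
    t_latent = (t - 1) // 8 + 1
    return torch.randn(b, 16, t_latent, h // 8, w // 8)

# Placeholder for the DiT's autoregressive step: given the latent history and
# a text prompt, predict one new latent frame.
def predict_next_latent(history_latents: torch.Tensor, prompt: str) -> torch.Tensor:
    b, c, _, h, w = history_latents.shape
    return torch.randn(b, c, 1, h, w)

prefix_video = torch.randn(1, 3, 41, 384, 640)   # 41 = 8*5 + 1 conditioning frames
latents = encode_with_causal_vae(prefix_video)   # (1, 16, 6, 48, 80)

for _ in range(10):                               # extend by 10 latent frames
    new_latent = predict_next_latent(latents, "a ping-pong ball falling")
    latents = torch.cat([latents, new_latent], dim=2)

print(latents.shape)                              # torch.Size([1, 16, 16, 48, 80])
```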

@antragoudaras
Author

Thank you for your response, @feifeiobama. I have implemented i2v inference using multiple consecutive frames (40), following the same regime as the t2v training process in train/train_pyramid_flow.py. Specifically, I observed that you first extract the latents using tools/extract_video_vae_latents.py, where the video is encoded with the VAE.

In this setup, your video tensor has dimensions (Batch, 3, num_of_frames, H, W). After being encoded with the VAE, the resulting latent tensor is compressed both spatially and temporally by a factor of 8, as described in your paper. This results in a latent tensor of size (1, 16, num_of_frames/8, H/8, W/8).

Similarly, for the i2v inference using multiple frames, I encoded a video tensor of size (1, 3, 40, H, W) into a latent tensor of size (1, 16, 5, H/8, W/8) using the VAE. However, I noticed that this approach often causes the generated video to exhibit unusual behavior, particularly with object movements. This issue frequently leads to the video collapsing, especially in dynamic scenarios (moving objects).
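
In pseudocode, the shape bookkeeping of my setup looks roughly like the snippet below; a dummy tensor stands in for the real VAE output, the function name is a placeholder rather than anything from the repo, and the resolution is just an example.

```python
import torch

def encode_video_placeholder(video: torch.Tensor) -> torch.Tensor:
    """Dummy stand-in for the Causal VAE encode: 16 latent channels, 8x spatial
    downsampling, and the temporal compression I observed (40 frames -> 5 latent frames)."""
    b, _, t, h, w = video.shape
    return torch.randn(b, 16, t // 8, h // 8, w // 8)

video = torch.randn(1, 3, 40, 384, 640)    # (Batch, 3, num_of_frames, H, W)
latents = encode_video_placeholder(video)
print(latents.shape)                        # torch.Size([1, 16, 5, 48, 80])
```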

For instance, when using 40 frames from a video of a single ping-pong ball in free fall, along with the text prompt 'Orange ping-pong ball falling down and making impact with the table surface below,' the generated output looks quite accurate for the first 2 seconds (at 24 FPS, which makes sense since most of the frames we feed into the model are derived from diffusion sampling and then reconstructed). However, for subsequent frames without guidance, anomalies start to occur.

When running the same inference script, I observe outcomes such as the ball initially falling normally but suddenly disappearing or bouncing mid-air. Then, nothing happens for a few seconds until the ball reappears, seemingly levitating. In some cases, the ball is rendered as a barely visible spinning grey object, or multiple balls spawn after the disappearance of the original ball. Despite these inconsistencies, the first 40 guidance-driven frames perform well.

@feifeiobama
Collaborator

Wow, thank you for trying this so quickly. I have two comments that may be helpful:

  • The VAE should encode 8k+1 frames into k+1 latent frames. For example, in your case, the VAE should take 41 frames as input and produce 6 latent frames (see the quick check after this list).
  • Guidance is indeed an important technique. We tried adding CFG for history frames during training, but it didn't work. It should be a very important technique if implemented properly.
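
For the first point, a quick way to sanity-check the expected latent frame count (plain arithmetic, not code from our repo):

```python
def latent_frames(num_input_frames: int, temporal_stride: int = 8) -> int:
    """Causal VAE frame count: 8k+1 input frames map to k+1 latent frames."""
    assert (num_input_frames - 1) % temporal_stride == 0, "use 8k+1 input frames"
    return (num_input_frames - 1) // temporal_stride + 1

print(latent_frames(41))   # 6
print(latent_frames(17))   # 3
```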

@antragoudaras
Author

antragoudaras commented Dec 14, 2024

@feifeiobama thanks for your feedback and useful comments! Since I opened this issue, I've been working on it over the past three days, which is why I was able to provide detailed observations in my previous comments.

Regarding the VAE, I did confirm that it encodes 8k+1 frames into k+1 latent frames. I tested with 41 input frames, resulting in 6 latent frames. Once again, the trajectory of the falling ping-pong ball covered by the 41 frames (almost the first 2 seconds) is accurately generated by the model (DiT backbone, then decoded by the VAE), but the subsequent frames without guidance have similar problems. Mainly, the ball disappears for the next 2-3 seconds, and then one or more balls spawn at a random position in the frame during the last moments of the video.

As for classifier-free guidance (CFG), I understand its potential to improve performance noticeably, if the models had been trained to leverage it.
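
For reference, the generic CFG combination I have in mind is the standard one below; this is not from the Pyramid Flow codebase, and how (or whether) to apply it to history frames is exactly the open question.

```python
import torch

def cfg_combine(uncond_pred: torch.Tensor, cond_pred: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    """Standard classifier-free guidance: push the model output away from the
    unconditional prediction and toward the conditional one."""
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)

# Toy example with dummy predictions.
uncond = torch.zeros(1, 16, 1, 48, 80)
cond = torch.ones(1, 16, 1, 48, 80)
print(cfg_combine(uncond, cond, guidance_scale=7.0).mean())  # tensor(7.)
```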

Any other suggestions are more than welcome.
