VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (COLM 2024)

Official implementation of VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation.

Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal

Code coming soon!



Illustration of our two-stage framework for long, multi-scene video generation from text:

  • In the first stage, we employ the LLM as a video planner to craft a video plan, which provides an overarching plot for videos with multiple scenes, guiding the downstream video generation process. The video plan consists of scene-level text descriptions, a list of the entities and background involved in each scene, frame-by-frame entity layouts (bounding boxes), and consistency groupings for entities and backgrounds.
  • In the second stage, we utilize Layout2Vid, a grounded video generation module, to render videos based on the video plan generated in the first stage. This module uses the same image and text embeddings to represent identical entities and backgrounds from the video plan, and allows for spatial control over entity layouts through the Guided 2D Attention in the spatial attention block.
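
As a rough illustration of the four components of the video plan listed above, the sketch below models a plan as plain Python dataclasses and checks it for basic well-formedness. All names here (`Scene`, `VideoPlan`, `validate`), the normalized `(x0, y0, x1, y1)` box convention, and the example values are assumptions made for illustration; they are not the format used by the released code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical box convention: normalized (x0, y0, x1, y1) in [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class Scene:
    description: str                 # scene-level text description
    entities: List[str]              # entities that appear in this scene
    background: str                  # background of this scene
    # One dict per frame, mapping entity name -> bounding box.
    frame_layouts: List[Dict[str, Box]] = field(default_factory=list)


@dataclass
class VideoPlan:
    scenes: List[Scene]
    # Entity / background name -> indices of scenes where it should look identical.
    entity_consistency: Dict[str, List[int]] = field(default_factory=dict)
    background_consistency: Dict[str, List[int]] = field(default_factory=dict)


def validate(plan: VideoPlan) -> List[str]:
    """Return a list of problems; an empty list means the plan is well-formed."""
    problems = []
    for s_idx, scene in enumerate(plan.scenes):
        for f_idx, layout in enumerate(scene.frame_layouts):
            for name, (x0, y0, x1, y1) in layout.items():
                if name not in scene.entities:
                    problems.append(f"scene {s_idx} frame {f_idx}: unknown entity {name!r}")
                if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
                    problems.append(f"scene {s_idx} frame {f_idx}: bad box for {name!r}")
    for name, scene_ids in plan.entity_consistency.items():
        for s in scene_ids:
            if not 0 <= s < len(plan.scenes):
                problems.append(f"consistency group {name!r}: no scene {s}")
            elif name not in plan.scenes[s].entities:
                problems.append(f"consistency group {name!r}: not in scene {s}")
    return problems


# Illustrative two-scene plan; the boxes are made-up values.
plan = VideoPlan(
    scenes=[
        Scene(
            description="mouse is holding a book and makes a happy face.",
            entities=["mouse", "book"],
            background="room",
            frame_layouts=[
                {"mouse": (0.2, 0.2, 0.7, 0.9), "book": (0.4, 0.5, 0.6, 0.7)},
                {"mouse": (0.2, 0.2, 0.7, 0.9), "book": (0.4, 0.45, 0.6, 0.65)},
            ],
        ),
        Scene(
            description="he is holding a flower by his right paw.",
            entities=["mouse", "flower"],
            background="room",
            frame_layouts=[
                {"mouse": (0.3, 0.2, 0.8, 0.9), "flower": (0.6, 0.3, 0.75, 0.6)},
            ],
        ),
    ],
    entity_consistency={"mouse": [0, 1]},
    background_consistency={"room": [0, 1]},
)

print(validate(plan))
```

Keeping layouts per frame and consistency groups per plan mirrors the description above: layouts drive spatial control within a scene, while the groupings say which entities and backgrounds must match across scenes.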

Generated Examples

Single-Scene Videos

Our proposed VideoDirectorGPT framework substantially improves layout and movement control compared with ModelScopeT2V.

Video comparisons (ModelScopeT2V vs. VideoDirectorGPT (Ours)) for the prompts:

  • "pushing stuffed animal from left to right"
  • "pushing pear from right to left"
  • "a pizza is to the left of an elephant"
  • "four frisbees"

Multi-Scene Videos

Single Text Prompt ➡ Multi-Scene Video

Video comparisons (ModelScopeT2V vs. VideoDirectorGPT (Ours)) for the prompts:

  • "make caraway cakes"
  • "make peach melba"

Our model is able to generate a detailed video plan that properly expands the original text prompt to show the process, has accurate object bounding box locations (overlaid), and maintains the consistency of the person across the scenes. ModelScopeT2V only generates the final food (caraway cake/peach melba) and that food is not consistent between scenes.

List of Scene Descriptions ➡ Multi-Scene Video

Scene 1: mouse is holding a book and makes a happy face.
Scene 2: he looks happy and talks.
Scene 3: he is pulling petals off the flower.
Scene 4: he is ripping a petal from the flower.
Scene 5: he is holding a flower by his right paw.
Scene 6: one paw pulls the last petal off the flower.
Scene 7: he is smiling and talking while holding a flower on his right paw.

Video comparison: ModelScopeT2V vs. VideoDirectorGPT (Ours)

Our video plan's object layouts (overlaid) can guide the Layout2Vid module to generate the same mouse across scenes consistently, whereas ModelScopeT2V loses track of the mouse right after the first scene.
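
To make the idea of reusing one representation for the same entity across scenes concrete, here is a minimal sketch under stated assumptions. `toy_embedding` is a deterministic stand-in for a real text or image encoder, and `assign_embeddings` along with the `groups` mapping format are hypothetical; none of this is the Layout2Vid implementation.

```python
import hashlib
from typing import Dict, List, Tuple


def toy_embedding(text: str, dim: int = 4) -> List[float]:
    """Deterministic placeholder for a real encoder (assumption, not the real model)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def assign_embeddings(
    scene_mentions: List[List[str]],
    groups: Dict[Tuple[int, str], str],
) -> List[Dict[str, List[float]]]:
    """Give every mention in a consistency group the same embedding.

    scene_mentions: entity mentions per scene.
    groups: (scene index, mention) -> shared group name. Mentions that are
            not grouped get an embedding that is private to their scene.
    """
    cache: Dict[str, List[float]] = {}
    result = []
    for s_idx, mentions in enumerate(scene_mentions):
        scene_out = {}
        for mention in mentions:
            key = groups.get((s_idx, mention), f"{mention}#scene{s_idx}")
            if key not in cache:
                cache[key] = toy_embedding(key)
            scene_out[mention] = cache[key]
        result.append(scene_out)
    return result


# "mouse" in scene 0 and "he" in scenes 1 and 2 refer to the same character.
scene_mentions = [["mouse", "book"], ["he"], ["he", "flower"]]
groups = {(0, "mouse"): "mouse", (1, "he"): "mouse", (2, "he"): "mouse"}

embeddings = assign_embeddings(scene_mentions, groups)
print(embeddings[0]["mouse"] == embeddings[2]["he"])
```

Because grouped mentions resolve to a single cached vector, later scenes that only say "he" are conditioned on the same representation as the first scene's "mouse".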

User-Provided Input Image ➡ Video

Users can flexibly provide either text-only or image+text descriptions to place custom entities when generating videos with VideoDirectorGPT. In both the text-based and the image+text-based entity grounding examples, the identities of the provided entities are well preserved across multiple scenes.

Scene descriptions used for every input type:

Scene 1: a <S> then gets up from a plush beige bed.
Scene 2: a <S> goes to the cream-colored kitchen and eats a can of gourmet cat snack.
Scene 3: a <S> sits next to a large floor-to-ceiling window.

Input types shown:

  • Text-Only Input: <S> = "white cat"
  • Image+Text Input (three examples; the input images are not shown here)
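
For the text-only case, filling the entity placeholder amounts to simple string substitution. The snippet below is a small sketch of that step; the helper name `fill_placeholder` and the exact token spelling `<S>` are assumptions for illustration, and the image+text case (which needs an image encoder) is not covered.

```python
from typing import List

SCENES = [
    "a <S> then gets up from a plush beige bed.",
    "a <S> goes to the cream-colored kitchen and eats a can of gourmet cat snack.",
    "a <S> sits next to a large floor-to-ceiling window.",
]


def fill_placeholder(scenes: List[str], entity: str, token: str = "<S>") -> List[str]:
    """Replace the placeholder token in each scene description with the entity text."""
    return [scene.replace(token, entity) for scene in scenes]


prompts = fill_placeholder(SCENES, "white cat")
for i, prompt in enumerate(prompts, start=1):
    print(f"Scene {i}: {prompt}")
```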

Citation

If you find our project useful in your research, please cite the following paper:

@article{Lin2023VideoDirectorGPT,
        author = {Han Lin and Abhay Zala and Jaemin Cho and Mohit Bansal},
        title = {VideoDirectorGPT: Consistent Multi-Scene Video Generation via LLM-Guided Planning},
        year = {2023},
}