Video Processor as a separate class #33504

@zucchini-nlp

Feature request

Since we now have more and more VLMs that support both images and videos, and videos are not always processed the same way as images, I want to add a VideoProcessor class that inherits from ImageProcessingMixin. That way we can have two separate classes for processing visuals, each with its own set of attributes and methods. We can also save a separate config for each to avoid issues like #33484. The VideoProcessor will mainly use the same transform methods as the slow image processors, iterating over each frame and stacking the results. Some additional helper functions can be added, like load_video and make_list_of_videos. The main input name will be videos and the output name will be pixel_values_videos.
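
To make the idea concrete, here is a rough sketch of the per-frame approach. It is not the proposed implementation: the class name SketchVideoProcessor, the rescale/normalize defaults, and the channels-first output layout are all assumptions for illustration, and it deliberately does not inherit from ImageProcessingMixin so that it runs with NumPy alone. It also assumes all videos in a batch share the same shape.

```python
import numpy as np


def make_list_of_videos(videos):
    """Normalize the input to a list of videos, each shaped (T, H, W, C).

    Hypothetical helper: accepts one video (4D array), a batch (5D array),
    a list of frames, or a list of videos.
    """
    if isinstance(videos, np.ndarray):
        if videos.ndim == 4:
            return [videos]
        if videos.ndim == 5:
            return list(videos)
        raise ValueError(f"Expected a 4D or 5D array, got {videos.ndim}D")
    videos = list(videos)
    if np.asarray(videos[0]).ndim == 3:
        # A flat list of frames is treated as a single video.
        return [np.stack([np.asarray(f) for f in videos])]
    return [np.asarray(v) for v in videos]


class SketchVideoProcessor:
    """Illustrative only: applies image-style transforms frame by frame."""

    def __init__(self, rescale_factor=1 / 255, image_mean=(0.5, 0.5, 0.5), image_std=(0.5, 0.5, 0.5)):
        self.rescale_factor = rescale_factor
        self.image_mean = np.array(image_mean, dtype=np.float32)
        self.image_std = np.array(image_std, dtype=np.float32)

    def _process_frame(self, frame):
        # Same steps a slow image processor would run on one image.
        frame = frame.astype(np.float32) * self.rescale_factor
        frame = (frame - self.image_mean) / self.image_std
        return frame.transpose(2, 0, 1)  # (H, W, C) -> (C, H, W)

    def __call__(self, videos):
        videos = make_list_of_videos(videos)
        # Iterate over each frame, then stack frames back into a video.
        processed = [np.stack([self._process_frame(f) for f in video]) for video in videos]
        # Assumes every video has the same number of frames and resolution.
        return {"pixel_values_videos": np.stack(processed)}
```

Calling `SketchVideoProcessor()(video)` on a single `(T, H, W, C)` array returns a dict whose `pixel_values_videos` entry has shape `(1, T, C, H, W)`.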

For load_video we can probably rely on av, but I find it super slow compared to other video decoders. I'll try to put together a small comparison benchmark for that. Unfortunately decord can't be used, as it had problems with models on CUDA.
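
A load_video helper built on av (PyAV) could look roughly like the sketch below. The signature, the num_frames argument, and the uniform subsampling in frames_to_video are assumptions, not something decided in this issue. The sketch decodes every frame before subsampling, which is simple but not fast.

```python
import numpy as np


def frames_to_video(frames, num_frames=None):
    """Stack decoded frames, each (H, W, 3), into one (T, H, W, 3) array.

    Hypothetical helper: if num_frames is given and smaller than the number
    of decoded frames, keep that many uniformly spaced frames.
    """
    frames = list(frames)
    if num_frames is not None and num_frames < len(frames):
        indices = np.linspace(0, len(frames) - 1, num_frames).round().astype(int)
        frames = [frames[i] for i in indices]
    return np.stack(frames)


def load_video(path, num_frames=None):
    """Decode a video file with PyAV into a (T, H, W, 3) uint8 array."""
    import av  # PyAV, imported lazily so the helper above works without it

    with av.open(path) as container:
        # Decode the first video stream and convert each frame to RGB.
        decoded = (frame.to_ndarray(format="rgb24") for frame in container.decode(video=0))
        return frames_to_video(decoded, num_frames=num_frames)
```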

In the long term we might consider adding video transforms where each video is transformed in one call instead of frame by frame, similar to fast image processing with torchvision.
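
As a small illustration of the difference, the sketch below compares a frame-by-frame transform with one that handles the whole video in a single call. It uses NumPy broadcasting as a stand-in for torchvision, and the function names and the 0.5 mean/std values are made up for the example.

```python
import numpy as np

# Illustrative normalization constants, not taken from any real model config.
MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)


def transform_per_frame(video):
    """Rescale and normalize each (H, W, C) frame separately, then stack."""
    return np.stack([(frame.astype(np.float32) / 255.0 - MEAN) / STD for frame in video])


def transform_whole_video(video):
    """Rescale and normalize the full (T, H, W, C) video in one call."""
    return (video.astype(np.float32) / 255.0 - MEAN) / STD
```

Both functions return the same values; the second just avoids the Python-level loop over frames.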

To Do:

  • Add the VideoProcessor class and integrate it with llava-next-video, which is one of the models that processes images and videos differently.

  • After the changes are approved and merged, the following models will be easy to modify:

    • Video-LLaVa
    • Qwen2-VL
    • LLaVA-OneVision
  • Instructblip-Video might need deprecation, as it currently accepts images as its main arg and returns pixel_values. TBH, it is a video-only model so we can skip changing it, the same way we won't touch VIVIT and other video-only models.

Motivation

Easier integration of multimodal LLMs

Your contribution

@amyeroberts WDYT about this suggestion? Would love to hear your opinion 🤗
