Video Processor as a separate class

### Feature request

We now have more and more VLMs that support both images and videos, and videos are not always processed the same way as images. I want to add a `VideoProcessor` class that inherits from `ImageProcessingMixin`, so that we have two separate classes for processing visuals, each with its own set of attributes and methods. We can also save a separate config for each, which avoids issues such as #33484. The `VideoProcessor` will mainly use the same transform methods as the slow image processors, iterating over each frame and stacking the results. Some additional helper functions can be added, like `load_video` and `make_list_of_videos`. The main input name will be `videos` and the output variable name will be `pixel_values_videos`.
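To make the proposal concrete, here is a minimal sketch of the frame-by-frame approach described above. It is not the final design: the class name `SketchVideoProcessor`, the rescale and normalization defaults, and the accepted input shapes are assumptions for illustration, and the sketch skips inheriting from `ImageProcessingMixin` so that it only depends on NumPy. It also assumes all videos in a batch have the same number of frames and the same resolution.

```python
import numpy as np


def make_list_of_videos(videos):
    """Normalize the input to a list of videos, each a list of HWC frames.

    Hypothetical helper: accepts one video (4D array or list of 3D frames)
    or a batch of videos (5D array or list of videos).
    """
    if isinstance(videos, np.ndarray):
        if videos.ndim == 4:  # (frames, height, width, channels)
            return [list(videos)]
        if videos.ndim == 5:  # (batch, frames, height, width, channels)
            return [list(video) for video in videos]
        raise ValueError(f"Expected a 4D or 5D array, got {videos.ndim}D")
    if isinstance(videos, (list, tuple)) and len(videos) > 0:
        first = videos[0]
        # A flat list of 3D frames is a single video.
        if isinstance(first, np.ndarray) and first.ndim == 3:
            return [list(videos)]
        # Otherwise treat it as a list of videos.
        return [list(video) for video in videos]
    raise ValueError("Could not interpret the input as one or more videos")


class SketchVideoProcessor:
    """Illustrative only: applies image-style transforms to every frame."""

    def __init__(self, rescale_factor=1 / 255, image_mean=0.5, image_std=0.5):
        # Assumed defaults, not taken from any specific model config.
        self.rescale_factor = rescale_factor
        self.image_mean = np.float32(image_mean)
        self.image_std = np.float32(image_std)

    def _process_frame(self, frame):
        # Same steps a slow image processor would run on a single image.
        frame = frame.astype(np.float32) * self.rescale_factor
        frame = (frame - self.image_mean) / self.image_std
        return frame.transpose(2, 0, 1)  # HWC -> CHW

    def __call__(self, videos):
        videos = make_list_of_videos(videos)
        # Iterate over each frame, then stack frames and videos.
        processed = [
            np.stack([self._process_frame(frame) for frame in video])
            for video in videos
        ]
        return {"pixel_values_videos": np.stack(processed)}
```

Calling `SketchVideoProcessor()(video)` on a `(frames, height, width, channels)` array returns a dict whose `pixel_values_videos` entry has shape `(1, frames, channels, height, width)`.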

For `load_video` we can probably rely on `av`, but I find it very slow compared to other video decoders. I'll try to put together a small comparison benchmark for that. Unfortunately `decord` can't be used, as it had problems with models on CUDA.

In the long term we might consider adding video transforms where each video is transformed in one call, instead of each video frame, similar to fast image processing with `torchvision`. 
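As a rough illustration of the difference, the sketch below contrasts a per-frame loop with a single call over the whole `(frames, height, width, channels)` array. It uses NumPy as a stand-in for `torchvision`, and the function names and normalization constants are hypothetical.

```python
import numpy as np


def transform_video(video, rescale_factor=1 / 255, mean=0.5, std=0.5):
    """Rescale, normalize and move channels first for all frames at once."""
    video = video.astype(np.float32) * rescale_factor
    video = (video - np.float32(mean)) / np.float32(std)
    return video.transpose(0, 3, 1, 2)  # (T, H, W, C) -> (T, C, H, W)


def transform_per_frame(video, **kwargs):
    """Baseline: run the same transform one frame at a time, then stack."""
    return np.stack([transform_video(frame[None], **kwargs)[0] for frame in video])
```

Both functions return the same values; the first one simply avoids the Python-level loop over frames, which is the main idea behind transforming a video in one call.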

To Do:
- [ ] Add the `VideoProcessor` class and integrate it with LLaVA-NeXT-Video, which is one of the models with different processing for images and videos.
- [ ] After the changes are approved and merged, the following models will be easy to modify:
    - [ ] Video-LLaVa
    - [ ] Qwen2-VL
    - [ ] LLaVA-OneVision

- [ ] InstructBLIP-Video might need deprecation, as it currently accepts `images` as the main argument and returns `pixel_values`. That said, it is a video-only model, so we can leave it unchanged, the same way we won't touch ViViT and other video-only models.

### Motivation

Easier integration of multimodal LLMs

### Your contribution

@amyeroberts WDYT about this suggestion? Would love to hear your opinion 🤗 

Video Processor as a separate class #33504
