
Conversation

@zucchini-nlp
Member

@zucchini-nlp zucchini-nlp commented Jun 18, 2025

What does this PR do?

As per the title, this adds a small utility for loading videos with torchcodec. Note that we don't use torchcodec to its fullest yet, e.g. loading directly to device or streaming. Loading to device would incur high memory usage because we load the whole video and only sample afterwards. For streaming and other features, let's go one step at a time and see how they fit into the codebase.

For now we just deprecate read_video_torchvision, whose torchvision backend is itself deprecated and scheduled for removal within the next two minor releases. Users are nudged to use torchcodec instead.

I also noticed a high GPU memory spike with long videos, because we moved the whole video to GPU before processing. This PR moves device placement to after sampling, so only the sampled frames end up on device.
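For illustration, here is a minimal sketch of the decode -> sample -> device flow described above, assuming torchcodec's VideoDecoder; the file name, frame count, and uniform-sampling choice are placeholders, not the exact utility added here:

```python
import torch
from torchcodec.decoders import VideoDecoder

# Decode lazily: frames are only materialized when requested.
decoder = VideoDecoder("video.mp4", seek_mode="exact")
total_frames = decoder.metadata.num_frames

# Uniformly sample a handful of frames instead of decoding everything.
indices = torch.linspace(0, total_frames - 1, steps=8).long().tolist()
frames = decoder.get_frames_at(indices=indices).data  # uint8 tensor, shape (8, C, H, W)

# Device placement happens only after sampling, so just these frames hit the GPU.
frames = frames.to("cuda")
```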

The next PR will use torchcodec to load audio from video files; it seems to be better than librosa and supports more formats, though I still need to test. Ideally, making torchcodec the default would be the final goal as we test and iterate.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp requested a review from qubvel June 18, 2025 10:38
@zucchini-nlp zucchini-nlp changed the title [video processor] supporrt torchcodec and decrease cuda memory usage [video processor] support torchcodec and decrease cuda memory usage Jun 18, 2025
Contributor

@qubvel qubvel left a comment

Thanks for adding a new backend! 🔥 hope it's gonna be lightning fast ⚡

Comment on lines +489 to +490
# Interestingly `exact` mode takes less than approximate when we load the whole video
seek_mode="exact",
Contributor

do we have to load the whole video? I suppose the entire idea is to avoid loading long videos -> save on RAM and increase decoding speed, no?

Member Author

Ideally yes, we should get only the necessary frames. This is a result of previous modifications, which moved the video-sampling logic inside the video processors (more intuitive than having it only in apply_chat_template).

I hadn't thought about RAM usage at that time, and now I see that it's not very efficient. It seems the best flow for videos would be load -> sample with the decoder -> optionally cast to torch -> transforms. Here I am facing an issue, because loading media is decoupled from the processors' call. We can load media only for instruct models when a conversation history is defined; for base models, users are expected to pre-load all images/videos themselves.

Do you think we should start allowing users to pass a url/path to the processor's call directly (like Pixtral already does)? I want to keep the sampling code in each model's processing file, to make it explicit for users/contributors.

Member Author

Looks like working with videos is in general less efficient than images...

Contributor

> Do you think we should start allowing users to pass a url/path to the processor's call directly (like Pixtral already does)? I want to keep the sampling code in each model's processing file, to make it explicit for users/contributors

You mean Processor.__call__ (not the image processor), right? I don't have a strong opinion on it, it looks okay to me. Now that we allow users to pass fps/num frames, it seems like a logical next step to allow reading only the required frames if the backend supports this.
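A rough sketch of that idea (the helper name and numbers below are hypothetical, just to show how an fps request could be turned into decoder indices):

```python
def sample_frame_indices(total_frames: int, video_fps: float, target_fps: float) -> list[int]:
    """Pick evenly spaced frame indices so the output approximates `target_fps`."""
    step = video_fps / target_fps
    num_samples = max(1, int(total_frames / step))
    return [min(int(i * step), total_frames - 1) for i in range(num_samples)]

# A 10 s clip recorded at 30 fps, sampled down to 2 fps -> 20 frame indices.
indices = sample_frame_indices(total_frames=300, video_fps=30.0, target_fps=2.0)
# A backend that supports indexed access can then decode only these frames.
```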

Member Author

Yeah, no objections from me either. The only concern I have is that we might be bloating the video processors. If we end up accepting "url/path" in ModelProcessor.__call__, I will think about abstracting it (given we have tons of decoders with many options) (cc @yonigozlan, we talked last year about allowing urls for __call__).

Probably by default we won't let users configure decoder-related stuff.

Contributor

IMO, it's also OK to provide just basic usage, such as loading with the default settings, and clearly let users know that for an advanced setup they can read and sample the video themselves.

@zucchini-nlp
Member Author

Btw, in my small experiments with around 200-500 videos of various lengths, torchcodec wasn't the fastest option. PyAV and OpenCV still remain the top two options for speed when we just load the video (no processing or sampling, just decoding) 🙃

I will do better evals to see whether video length or any other factors influence it, and talk to the torchcodec maintainers.

@qubvel
Contributor

qubvel commented Jun 19, 2025

Interesting observation 😄 It was claimed to be the opposite!

zucchini-nlp and others added 4 commits June 20, 2025 12:31
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
@zucchini-nlp
Member Author

zucchini-nlp commented Jun 20, 2025

Updating on the time measurements: I found that torchcodec is fast if we know beforehand which indices to sample and pass them to decoder.get_frames_at(). However, if we don't sample and thus have to decode all frames, the OpenCV decoder is faster. Below is an example from the benchmark using the torchcodec repo's code and a super long video (~2 min).

I will play more with video lengths to see whether that influences performance, and will probably add the results somewhere in our docs so users can choose wisely which decoder to use.

[benchmark results screenshot]
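Not the benchmark script itself, but a minimal sketch of the two access patterns being compared (the path and sampling stride are placeholders):

```python
import time
from torchcodec.decoders import VideoDecoder

path = "long_video.mp4"  # placeholder

# Pattern 1: indices are known beforehand, so only those frames get decoded.
start = time.perf_counter()
decoder = VideoDecoder(path)
indices = list(range(0, decoder.metadata.num_frames, 30))  # e.g. ~1 fps at a 30 fps source
_ = decoder.get_frames_at(indices=indices)
print("indexed decode:", time.perf_counter() - start)

# Pattern 2: every frame is decoded (the case where OpenCV came out faster).
start = time.perf_counter()
decoder = VideoDecoder(path)
_ = decoder[:]  # decodes the whole video
print("full decode:", time.perf_counter() - start)
```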

@zucchini-nlp
Member Author

@qubvel the comments are addressed, I think we can merge this one and I will work on moving video loading into the processors' __call__ by default. It will default to torchvision, and thus be BC without extra dependencies, with a plan to switch the default to torchcodec after 5-6 more releases.

@zucchini-nlp zucchini-nlp requested a review from qubvel June 20, 2025 10:46
Contributor

@qubvel qubvel left a comment

Great! thanks for benchmarking 👍

P.S. do we have load_video somewhere in the docs?

@zucchini-nlp
Member Author

> do we have load_video somewhere in the docs?

I don't think so, but it should be added as a doc page of its own. Maybe it could be a small tutorial page once we have more quantitative results to show. I will add one in the next PRs.

@zucchini-nlp
Member Author

Will merge; this has to be in the next release since the last update on main increases GPU memory by a lot. For RAM, I will fix it in upcoming releases. Off for a short vacation.

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) June 25, 2025 08:12
@zucchini-nlp zucchini-nlp merged commit e212ff9 into huggingface:main Jun 25, 2025
20 checks passed