
Conversation

@zucchini-nlp
Member

@zucchini-nlp zucchini-nlp commented Jun 18, 2025

What does this PR do?

As per the title, this adds a small utility for loading videos with torchcodec. Note that we don't use torchcodec to its fullest yet, e.g. loading directly to device or streaming. Loading to device would incur high memory usage because we load the whole video and only sample afterwards. For streaming and other features, let's go one step at a time and see how they fit into the codebase.

For now we just deprecate read_video_torchvision, whose torchvision backend is itself deprecated and scheduled for removal within the next two minor releases. Users are nudged to use torchcodec instead.

I also noticed a high GPU memory spike with long videos, because we moved the whole video to GPU before processing. This PR moves device placement to after sampling, so only the sampled frames end up on device.
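For illustration, here is a minimal sketch of the decode -> sample -> device flow described above, assuming torchcodec's VideoDecoder; the file name, frame count, and uniform-sampling choice are placeholders, not the exact utility added here:

```python
import torch
from torchcodec.decoders import VideoDecoder

# Decode lazily: frames are only materialized when requested.
decoder = VideoDecoder("video.mp4", seek_mode="exact")
total_frames = decoder.metadata.num_frames

# Uniformly sample a handful of frames instead of decoding everything.
indices = torch.linspace(0, total_frames - 1, steps=8).long().tolist()
frames = decoder.get_frames_at(indices=indices).data  # uint8 tensor, shape (8, C, H, W)

# Device placement happens only after sampling, so just these frames hit the GPU.
frames = frames.to("cuda")
```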

The next PR will use torchcodec to load audio from video files; it seems to be better than librosa and supports more formats, though I still need to test. Ideally, making torchcodec the default would be the final goal as we test and iterate.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp requested a review from qubvel June 18, 2025 10:38
@zucchini-nlp zucchini-nlp changed the title [video processor] supporrt torchcodec and decrease cuda memory usage [video processor] support torchcodec and decrease cuda memory usage Jun 18, 2025
Contributor

@qubvel qubvel left a comment

Thanks for adding a new backend! 🔥 hope it's gonna be lightning fast ⚡

Comment on lines +489 to +490
# Interestingly `exact` mode takes less than approximate when we load the whole video
seek_mode="exact",
Contributor

do we have to load the whole video? I suppose the entire idea is to avoid loading long videos -> save on RAM and increase decoding speed, no?

Member Author

Ideally yes, we should get only the necessary frames. This is a result of previous modifications, which moved the video-sampling logic inside the video processors (more intuitive than having it only in apply_chat_template).

I hadn't thought about RAM usage at that time, and now I see that it's not very efficient. It seems the best flow for videos would be load -> sample with the decoder -> optionally cast to torch -> transforms. Here I am facing an issue, because loading media is decoupled from the processors' call. We can load media only for instruct models when a conversation history is defined; for base models, users are expected to pre-load all images/videos themselves.

Do you think we should start allowing users to pass a url/path to the processor's call directly (like Pixtral already does)? I want to keep the sampling code in each model's processing file, to make it explicit for users/contributors.

Member Author

Looks like working with videos is in general less efficient than images...

Contributor

> Do you think we should start allowing users to pass a url/path to the processor's call directly (like Pixtral already does)? I want to keep the sampling code in each model's processing file, to make it explicit for users/contributors

You mean Processor.__call__ (not the image processor), right? I don't have a strong opinion on it, it looks okay to me. Now that we allow users to pass fps/num frames, it seems like a logical next step to allow reading only the required frames if the backend supports this.
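A rough sketch of that idea (the helper name and numbers below are hypothetical, just to show how an fps request could be turned into decoder indices):

```python
def sample_frame_indices(total_frames: int, video_fps: float, target_fps: float) -> list[int]:
    """Pick evenly spaced frame indices so the output approximates `target_fps`."""
    step = video_fps / target_fps
    num_samples = max(1, int(total_frames / step))
    return [min(int(i * step), total_frames - 1) for i in range(num_samples)]

# A 10 s clip recorded at 30 fps, sampled down to 2 fps -> 20 frame indices.
indices = sample_frame_indices(total_frames=300, video_fps=30.0, target_fps=2.0)
# A backend that supports indexed access can then decode only these frames.
```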

Member Author

Yeah, no objections from me either. The only concern I have is that we might be bloating the video processors. If we end up accepting "url/path" in ModelProcessor.__call__, I will think about abstracting it (given we have tons of decoders with many options) (cc @yonigozlan, we talked last year about allowing urls for __call__).

Probably by default we won't let users configure decoder-related stuff.

Contributor

IMO, it's also OK to provide just basic usage, such as loading with the default settings, and clearly let users know that for an advanced setup they can read and sample the video themselves.

@zucchini-nlp
Member Author

Btw, in my small experiments with around 200-500 videos of various lengths, torchcodec wasn't the fastest option. PyAV and OpenCV still remain the top two options for speed when we just load the video (no processing or sampling, just decoding) 🙃

I will do better evals to see whether video length or any other factors influence it, and talk to the torchcodec maintainers.

@qubvel
Contributor

qubvel commented Jun 19, 2025

Interesting observation 😄 It was claimed to be the opposite!

zucchini-nlp and others added 4 commits June 20, 2025 12:31
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
@zucchini-nlp
Member Author

zucchini-nlp commented Jun 20, 2025

Updating on the time measurements: I found that torchcodec is fast if we know beforehand which indices to sample and pass them to decoder.get_frames_at(). However, if we don't sample and thus have to decode all frames, the OpenCV decoder is faster. Below is an example from the benchmark using the torchcodec repo's code and a super long video (~2 min).

I will play more with video lengths to see whether that influences performance, and will probably add the results somewhere in our docs so users can choose wisely which decoder to use.

[benchmark results screenshot]
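Not the benchmark script itself, but a minimal sketch of the two access patterns being compared (the path and sampling stride are placeholders):

```python
import time
from torchcodec.decoders import VideoDecoder

path = "long_video.mp4"  # placeholder

# Pattern 1: indices are known beforehand, so only those frames get decoded.
start = time.perf_counter()
decoder = VideoDecoder(path)
indices = list(range(0, decoder.metadata.num_frames, 30))  # e.g. ~1 fps at a 30 fps source
_ = decoder.get_frames_at(indices=indices)
print("indexed decode:", time.perf_counter() - start)

# Pattern 2: every frame is decoded (the case where OpenCV came out faster).
start = time.perf_counter()
decoder = VideoDecoder(path)
_ = decoder[:]  # decodes the whole video
print("full decode:", time.perf_counter() - start)
```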

@zucchini-nlp
Member Author

@qubvel the comments are addressed, I think we can merge this one and I will work on moving video loading into the processors' __call__ by default. It will default to torchvision, and thus be BC without extra dependencies, with a plan to switch the default to torchcodec after 5-6 more releases.

@zucchini-nlp zucchini-nlp requested a review from qubvel June 20, 2025 10:46
Contributor

@qubvel qubvel left a comment

Great! thanks for benchmarking 👍

P.S. do we have load_video somewhere in the docs?

@zucchini-nlp
Member Author

> do we have load_video somewhere in the docs?

I don't think so, but it should be added as a doc page of its own. Maybe it could be a small tutorial page once we have more quantitative results to show. I will add one in the next PRs.

@zucchini-nlp
Member Author

Will merge; this has to be in the next release since the last update on main increases GPU memory by a lot. For RAM, I will fix it in upcoming releases. Off for a short vacation.

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) June 25, 2025 08:12
@zucchini-nlp zucchini-nlp merged commit e212ff9 into huggingface:main Jun 25, 2025
20 checks passed