[video processor] support torchcodec and decrease cuda memory usage #38880
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for adding a new backend! 🔥 hope it's gonna be lightning fast ⚡
# Interestingly `exact` mode takes less than approximate when we load the whole video
seek_mode="exact",
do we have to load the whole video? I suppose the entire idea is to avoid loading long videos -> save on RAM and increase decoding speed, no?
Ideally yeah, we should get only the necessary frames. This is a result of previous modifications, which moved the video_sample logic inside video processors (more intuitive than having it only in apply_templates).
I hadn't thought about RAM usage at that time, and now I see that it's not very efficient. Seems like the best way for videos would be to load -> sample-with-decoder -> optionally cast to torch -> transforms. Here I'm facing an issue because "load_media" is decoupled from the processors' call. We can load media only for instruct models when a conversation history is defined, and for base models users are expected to pre-load all images/videos themselves.
Do you think we should start allowing users to pass a url/path to the processor's call directly (like Pixtral already does)? I want to keep the sampling code in each model's processing file, to make it explicit for users/contributors.
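For illustration, a rough sketch of the load -> sample-with-decoder idea, assuming torchcodec's `VideoDecoder` with its `metadata` and `get_frames_at` API (the helper name and sampling policy here are made up):

```python
from torchcodec.decoders import VideoDecoder

def sample_frames(path, num_frames=16):
    # Only the requested frames are decoded, so the full video never sits in RAM.
    decoder = VideoDecoder(path, seek_mode="approximate")
    total = decoder.metadata.num_frames
    step = (total - 1) / max(num_frames - 1, 1)
    indices = [round(i * step) for i in range(num_frames)]
    batch = decoder.get_frames_at(indices=indices)
    return batch.data  # uint8 tensor of shape (num_frames, C, H, W)
```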
Looks like working with videos is in general less efficient than images...
> Do you think we should start allowing users to pass a url/path to the processor's call directly (like Pixtral already does)? I want to keep the sampling code in each model's processing file, to make it explicit for users/contributors.
You mean `Processor.__call__` (not the image processor), right? I don't have a strong opinion on it; it looks okay to me. Now that we allow users to pass fps/num frames, it seems like a logical next step to allow reading only the required frames if the backend supports this.
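Purely as an illustration of that next step, a hypothetical call (passing a path/URL to the processor is not supported yet; the kwargs below are assumptions):

```python
# Hypothetical: the processor accepts a path/URL and samples while decoding,
# instead of requiring pre-decoded frames.
inputs = processor(
    text="<video> What is happening in this clip?",
    videos="https://example.com/clip.mp4",  # path or URL, decoded lazily
    fps=1,                                  # read only ~1 frame per second
    return_tensors="pt",
)
```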
Yeah, no objections from me either. The only concern I have is that we might be bloating video processors. If we finally accept "url/path" in `ModelProcessor.__call__`, I will think about abstracting it (given we have tons of decoders with many options) (cc @yonigozlan, we talked last year about allowing urls for `__call__`).
Probably by default we won't let users configure decoder-related stuff.
IMO, it's also OK to provide just basic usage, such as loading with the default settings, and clearly let users know that for an advanced setup they can read and sample the video themselves.
Btw, my small experiments were with around 200-500 videos of various lengths. I will do better evals to see if video length or any other factor influences it, and talk to the torchcodec maintainers.
Interesting observation 😄 It was claimed to be the opposite!
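A minimal timing sketch one could use to compare the two seek modes, assuming torchcodec's `VideoDecoder` API; proper evals would of course cover many videos and lengths:

```python
import time
from torchcodec.decoders import VideoDecoder

def time_decode(path, seek_mode, num_samples=16):
    # Time metadata parsing plus decoding of a handful of evenly spaced frames.
    start = time.perf_counter()
    decoder = VideoDecoder(path, seek_mode=seek_mode)
    total = decoder.metadata.num_frames
    step = max(total // num_samples, 1)
    decoder.get_frames_at(indices=list(range(0, total, step))[:num_samples])
    return time.perf_counter() - start

for mode in ("exact", "approximate"):
    print(mode, time_decode("video.mp4", mode))
```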
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Updating on the time measurements: I will play more with video lengths and see if that influences the performance, and probably add it somewhere in our docs so users can choose wisely which decoder to use.
@qubvel the comments are addressed, I think we can merge this one and I will work on moving the
Great! thanks for benchmarking 👍
P.S. do we have load_video somewhere in the docs?
I don't think so, but it should be added as a doc page of its own. Maybe it could be a small tutorial page when we have more quantitative results to show. I will add one in the next PRs.
Will merge; this has to be in the next release since the last update on main increases GPU memory by a lot. For RAM, I will fix it in the next releases. Off for a short vacation.
What does this PR do?
As per the title, this adds a small utility for loading videos with torchcodec. Note that we don't use torchcodec to its fullest, i.e. loading to device or streaming. Loading to device incurs high memory usage because we load the whole video and sample only after that. For streaming and other features, let's do one thing at a time gradually and see how it fits in the codebase.
For now we just deprecate `read_video_torchvision`, which is anyway deprecated in torchvision for the next 2 minor releases. Users are nudged to use `torchcodec` instead.
Also, I noticed that there was a high GPU memory spike with long videos, because we moved the whole video to GPU before processing. This PR moves the device placement after sampling, so only the sampled frames are on device.
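The memory fix amounts to something like the sketch below (illustrative only; the random tensor stands in for a fully decoded clip):

```python
import torch

# The full decoded video stays on CPU; only the sampled frames are moved to the GPU.
video = torch.randint(0, 256, (600, 3, 360, 640), dtype=torch.uint8)  # stand-in for a decoded clip
indices = torch.linspace(0, video.shape[0] - 1, steps=16).round().long()
frames = video[indices].to("cuda")  # ~16 frames on device instead of 600
```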
The next PR will be about using torchcodec to load audio from video files; it seems to be better than librosa and supports more formats. I will still need to test. Ideally, making torchcodec the default would be the final goal, as we test and iterate.