[video processor] support torchcodec and decrease cuda memory usage #38880
Merged (+129 −9)

13 commits:
- 1a5b599: don't move the whole video to GPU (zucchini-nlp)
- e4b3478: add torchcodec (zucchini-nlp)
- 6bfcce8: add tests (zucchini-nlp)
- 01f287b: merge main (zucchini-nlp)
- 30f0b4b: make style (zucchini-nlp)
- 8cd0501: instrucblip as well (zucchini-nlp)
- 1907bfc: consistency (zucchini-nlp)
- e1b106b: Update src/transformers/utils/import_utils.py (zucchini-nlp)
- e33b3a8: Update src/transformers/utils/import_utils.py (zucchini-nlp)
- f55e9b3: Update src/transformers/video_utils.py (zucchini-nlp)
- ab2c4c9: Merge remote-tracking branch 'upstream/main' into video_processor (zucchini-nlp)
- 137a9b3: Merge branch 'main' into video_processor (zucchini-nlp)
- cb2bcfc: Merge branch 'main' into video_processor (zucchini-nlp)
Conversations
Do we have to load the whole video? I suppose the entire idea is to avoid loading long videos, to save RAM and increase decoding speed, no?
Ideally yeah, we should get only the necessary frames. This is a result of previous modifications, which moved the `video_sample` logic inside video processors (more intuitive than having it only in `apply_templates`). I hadn't thought about RAM usage at that time, and now I see that it's not very efficient. It seems like the best way for videos would be: load -> sample-with-decoder -> optionally cast to torch -> transforms. Here I am facing an issue, because `load_media` is decoupled from the processors' call. We can load media only for instruct models when a conversation history is defined, and for base models users are expected to pre-load all images/videos themselves.

Do you think we should start allowing users to pass a `url/path` to the processor's call directly (like Pixtral already does)? I want to keep the sampling code in each model's processing file, to make it explicit for users/contributors.
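The sample-with-decoder step in the pipeline above boils down to choosing frame indices up front so the backend only decodes those frames. A minimal sketch in plain Python (illustrative only, not the transformers implementation; the function name and parameters are assumptions):

```python
def sample_frame_indices(total_frames, video_fps, target_fps=None, num_frames=None):
    """Pick evenly spaced frame indices so only those frames need decoding.

    Either `num_frames` or `target_fps` drives the sampling; with neither,
    every frame is kept.
    """
    if num_frames is None:
        if target_fps is None:
            return list(range(total_frames))  # no sampling requested
        duration = total_frames / video_fps  # clip length in seconds
        num_frames = max(1, round(duration * target_fps))
    num_frames = min(num_frames, total_frames)  # can't sample more than exist
    step = total_frames / num_frames
    return [min(total_frames - 1, int(i * step)) for i in range(num_frames)]
```

With such a list in hand, an index-addressable decoder can fetch just those frames instead of materializing the whole clip in memory.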
Looks like working with videos is in general less efficient than working with images...
You mean `Processor.__call__` (not the image processor), right? I don't have a strong opinion on it; it looks okay to me. Now that we allow users to pass fps/num frames, it seems like a logical next step to allow reading only the required frames if the backend supports this.
Yeah, no objections from me as well. The only concern I have is that we might be bloating the video processors. If we finally accept `url/path` in `ModelProcessor.__call__`, then I will think of abstracting it (given we have tons of decoders with many options) (cc @yonigozlan, we talked last year about allowing urls for `__call__`). Probably by default we won't let users configure decoder-related stuff.
IMO, it's also OK to provide just basic usage, such as loading with the default settings, and clearly let users know that for an advanced setup they can read and sample the frames themselves.
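The "basic usage by default, advanced users pre-decode themselves" split could be as simple as a type check on the input. A minimal sketch (illustrative only; `load_fn` stands in for whatever default-settings loader the processor would use, and none of these names are the transformers API):

```python
import os


def resolve_video(video, load_fn):
    """If `video` is a path/URL string, load it with default decoder
    settings via `load_fn`; already-decoded frames pass through untouched,
    so advanced users keep full control over reading and sampling."""
    if isinstance(video, str):
        if video.startswith(("http://", "https://")) or os.path.exists(video):
            return load_fn(video)
        raise ValueError(f"Cannot resolve video source: {video!r}")
    return video  # assume pre-decoded frames
```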