Description
🚀 Feature
We're proposing to add a lower-level, more flexible, and equally robust API than the one currently existing in torchvision.
It would be implemented in C++ and be compatible with TorchScript. Following the merge of #2596, it would also be installable via pip or conda.
Motivation
Currently, our API supports returning a tensor of shape (T, C, H, W) via the `read_video` abstraction (see here). This can be prohibitive if a user wants to get a single frame or perform operations on a per-frame basis. For example, I've run into multiple issues where I wanted to return a single frame, iterate over frames, or (for example in the EPIC Kitchens dataset) reduce memory usage by transforming the elements before saving them to output tensors.
Pitch
We propose the following style of API:
First, we'd have a constructor that would be part of torch's registered C++ classes and would take some basic inputs.
```python
import torch.classes.torchvision as tvcls

vid = tvcls.Video(path, "stream:stream_id")
```
Returning a frame is as simple as calling `next` on the container (optionally specifying the stream from which we'd like to return the next frame). What a frame is will largely depend on the encoding: for a video stream it is almost always an RGB image, whilst for audio it might be a 1024-point sample. In most cases, the same temporal span is covered by a variable number of frames (1 s of media might contain 30 video frames and 40 audio frames), so returning the presentation timestamp alongside the returned frame allows more precise control over the resulting clip.
```python
frame, timestamp = vid.next("stream:stream_id")  # stream argument is optional
```
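To make the per-frame semantics concrete, here is a minimal pure-Python stand-in for the proposed container. This is only an illustrative sketch (the real API would be a TorchScript-compatible C++ class); `MockVideo`, its stream names, and payloads are all hypothetical.

```python
class MockVideo:
    """One second of media: a 30 fps video stream and a 40 fps audio stream."""

    def __init__(self):
        # Each stream holds (frame_payload, pts_in_seconds) pairs.
        self.streams = {
            "video:0": [(f"rgb_frame_{i}", i / 30.0) for i in range(30)],
            "audio:0": [(f"audio_block_{i}", i / 40.0) for i in range(40)],
        }
        self.cursor = {name: 0 for name in self.streams}

    def next(self, stream="video:0"):
        """Return (frame, pts) for the next frame of the given stream."""
        frames = self.streams[stream]
        i = self.cursor[stream]
        if i >= len(frames):
            raise StopIteration(stream)
        self.cursor[stream] += 1
        return frames[i]


vid = MockVideo()
frame, ts = vid.next("video:0")    # first video frame, pts 0.0
_, audio_ts = vid.next("audio:0")  # the audio cursor advances independently
```

Note how the two streams cover the same 1 s span with different frame counts, which is exactly why `next` returns the presentation timestamp along with the frame.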
To get the exact frame that we want, a seek function can be exposed (with an optional stream definition). Seeking is done either to the closest keyframe before the requested timestamp, or to the exact frame if possible.
```python
vid.seek(ts_in_seconds, any_frame=True)
```
For example, if we seek to 5 s into a video container, the following call to `next()` will return either: 1) the last keyframe before 5 s (if `any_frame=False`); 2a) the frame with pts = 5.0 (if `any_frame=True` and a frame at exactly 5 s exists); or 2b) the first frame after 5 s, e.g. with pts 5.03 (if `any_frame=True` and no frame at exactly 5 s exists).
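The keyframe-vs-exact-frame distinction can be sketched in pure Python. This is not the proposed C++ implementation, just an illustration of the seek semantics on a hypothetical 30 fps stream with one keyframe per second; `seek_index` and its signature are assumptions.

```python
def seek_index(frames, ts, any_frame):
    """frames: list of (pts, is_keyframe) pairs in pts order.
    Return the index that the following next() call would decode from."""
    if any_frame:
        # Exact frame if present, otherwise the first frame at or after ts.
        for i, (pts, _) in enumerate(frames):
            if pts >= ts:
                return i
        return len(frames) - 1
    # Otherwise: the closest keyframe at or before ts.
    best = 0
    for i, (pts, is_key) in enumerate(frames):
        if is_key and pts <= ts:
            best = i
    return best


# 30 fps stream, keyframe every 30 frames (one per second).
frames = [(i / 30.0, i % 30 == 0) for i in range(300)]

i_exact = seek_index(frames, 5.5, any_frame=True)   # frame with pts 5.5
i_key = seek_index(frames, 5.5, any_frame=False)    # keyframe with pts 5.0
```

Seeking to a keyframe is what a decoder can always do cheaply; landing on an arbitrary frame (`any_frame=True`) generally requires decoding forward from the previous keyframe, which is the tradeoff the flag exposes.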
We plan to expose metadata getters, and add additional functionality down the line.
Alternatives
In the end, every video decoder library is a tradeoff between speed and flexibility. Libraries that support batch decoding such as decord offer greater speed (due to multithreaded Loader objects and/or GPU decoding) at the expense of dataloader compatibility, robustness (in terms of available formats), or flexibility. Other libraries that offer greater flexibility such as pyav, opencv, or decord (in sequential reading mode) can sacrifice either speed or ease of use.
We're aiming for this API to be as close in flexibility to pyav as possible, with the same (or better) per-frame decoding speed, all while being TorchScript-scriptable.
Additional context
Whilst technically this would mean deprecating our current `read_video` API, during a transition period we would continue to support it through a simple function that mimics the current implementation of `read_video`, with minimal to no performance impact.
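A `read_video`-style helper layers naturally on top of a seek/next interface. The sketch below shows the shape of such a shim under stated assumptions: `SimpleVideo` is a hypothetical stand-in decoder (not the proposed C++ class), and `read_video_like` is an illustrative name, not the actual transition function.

```python
class SimpleVideo:
    """Stand-in decoder: a 10 s, 30 fps stream of labelled frames."""

    def __init__(self, fps=30.0, duration=10.0):
        self._pts = [i / fps for i in range(int(fps * duration))]
        self._i = 0

    def seek(self, ts, any_frame=True):
        # Position the cursor at the first frame with pts >= ts.
        self._i = next((i for i, p in enumerate(self._pts) if p >= ts),
                       len(self._pts))

    def next(self):
        if self._i >= len(self._pts):
            raise StopIteration
        pts = self._pts[self._i]
        self._i += 1
        return f"frame@{pts:.3f}", pts


def read_video_like(vid, start_ts, end_ts):
    """Collect every frame with start_ts <= pts <= end_ts, read_video-style."""
    vid.seek(start_ts, any_frame=True)
    frames = []
    try:
        while True:
            frame, pts = vid.next()
            if pts > end_ts:
                break
            frames.append(frame)
    except StopIteration:
        pass
    return frames


clip = read_video_like(SimpleVideo(), 1.0, 2.0)  # 31 frames, pts 1.0 .. 2.0
```

Because the shim is a plain loop over the per-frame API, the per-clip cost stays essentially the decode cost itself, which is why the transition should carry minimal overhead.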
cc @bjuncek