
New video API Proposal #2660

Open

Description

@bjuncek

🚀 Feature

We're proposing to add a lower-level, more flexible, and equally robust API than the one that currently exists in torchvision.
It would be implemented in C++ and be compatible with TorchScript. Following the merge of #2596, it would also be installable via pip or conda.

Motivation

Currently, our API supports returning a (T x C x H x W) tensor via the read_video abstraction (see here). This can be prohibitive if a user wants to get a single frame or perform operations on a per-frame basis. For example, I've run into multiple issues where I wanted to return a single frame, iterate over frames, or (for example, in the EPIC Kitchens dataset) reduce memory usage by transforming elements before saving them to output tensors.
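To illustrate the memory argument, here is a minimal, dependency-free sketch of the per-frame pattern the current read_video API makes awkward: transform each frame as it is decoded instead of materializing the full (T, C, H, W) tensor first. decode_frames and downscale are hypothetical stand-ins, not part of any existing torchvision API; nested lists stand in for frame tensors.

```python
# Hypothetical stand-in for a streaming decoder: yields one "frame" (a nested
# list of shape (h, w)) at a time instead of decoding the whole video at once.
def decode_frames(num_frames, h, w):
    for _ in range(num_frames):
        yield [[0] * w for _ in range(h)]

def downscale(frame):
    # Toy transform: keep every other row and every other column.
    return [row[::2] for row in frame[::2]]

# Transform before accumulating, so only the reduced frames are ever stored.
clip = [downscale(f) for f in decode_frames(num_frames=4, h=8, w=8)]

print(len(clip), len(clip[0]), len(clip[0][0]))  # 4 4 4
```

With the current read_video, the full-resolution (T, C, H, W) tensor has to exist in memory before any transform runs; the streaming version above only ever holds one full-size frame at a time.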

Pitch

We propose the following style of API:
First we'd have a constructor that would be a part of torch registered C++ classes, and would take some basic inputs.

import torch.classes.torchvision as tvcls
vid = tvcls.Video(path, "stream:stream_id")

Returning a frame is as simple as calling next on the container (optionally, we can define the stream from which we'd like to return the next frame). What a frame is will largely depend on the encoding of the video: for video streams it is almost always an RGB image, whilst for audio it might be a 1024-point sample. In most cases, the same timespan is covered by a variable number of frames (1s of video might contain 30 video frames and 40 audio frames), so returning the presentation timestamp (pts) of each frame allows more precise control over the resulting clip.

frame, timestamp = vid.next(optional: "stream:stream_id")

To get the exact frame that we want, a seek function can be exposed (with an optional stream definition). Seeking is done either to the closest keyframe before the requested timestamp, or to the exact frame if possible.

vid.seek(ts_in_seconds, any_frame=True)

For example, if we seek to 5s into a video container, a following call to next() will return either 1) the last keyframe before 5s in the video (if any_frame=False), 2a) the frame with pts=5.0 (if any_frame=True and a frame at 5s exists), or 2b) the first frame after 5s, e.g. with pts=5.03 (if any_frame=True and a frame at 5s doesn't exist).
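The cases above can be modeled in a few lines of pure Python (the real implementation would live in C++; seek_target is a hypothetical helper written only to make the semantics concrete). Each frame carries a presentation timestamp and a keyframe flag:

```python
def seek_target(frames, ts, any_frame):
    """Return the pts a subsequent next() would yield after seek(ts, any_frame).

    frames: list of (pts, is_keyframe) tuples, sorted by pts.
    """
    if any_frame:
        # Exact frame at ts if it exists, otherwise the first frame after ts.
        for pts, _ in frames:
            if pts >= ts:
                return pts
        return None
    # any_frame=False: the last keyframe at or before ts.
    candidates = [pts for pts, key in frames if key and pts <= ts]
    return candidates[-1] if candidates else None

# A 30 fps stream, 10 seconds long, with a keyframe at the start of each second.
frames = [(round(i / 30, 2), i % 30 == 0) for i in range(300)]

print(seek_target(frames, 4.5, any_frame=False))  # 4.0  (last keyframe before 4.5s)
print(seek_target(frames, 5.0, any_frame=True))   # 5.0  (exact frame exists)
print(seek_target(frames, 5.01, any_frame=True))  # 5.03 (first frame after 5.01s)
```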

We plan to expose metadata getters, and add additional functionality down the line.

Alternatives

In the end, every video decoder library is a tradeoff between speed and flexibility. Libraries that support batch decoding such as decord offer greater speed (due to multithreaded Loader objects and/or GPU decoding) at the expense of dataloader compatibility, robustness (in terms of available formats), or flexibility. Other libraries that offer greater flexibility such as pyav, opencv, or decord (in sequential reading mode) can sacrifice either speed or ease of use.

We're aiming for this API to be as close in flexibility to pyav as possible, with the same (or better) per-frame decoding speed, all while remaining TorchScript-scriptable.

Additional context

Whilst this would technically mean deprecating our current read_video API, during a transition period we would continue to support it through a simple function that mimics the current read_video implementation, with minimal to no performance impact.
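A hedged sketch of what that transition shim could look like, built on the proposed seek()/next() surface. Since the C++ class is not available here, FakeVideo is a minimal Python stand-in over a list of (frame, pts) pairs, and the end-of-stream sentinel (None, None) is an assumption of this sketch, not part of the proposal:

```python
class FakeVideo:
    """Stand-in for the proposed container, exposing seek() and next()."""

    def __init__(self, frames):
        self._frames = frames  # list of (frame, pts), sorted by pts
        self._pos = 0

    def seek(self, ts, any_frame=True):
        # Simplified: jump to the first frame with pts >= ts.
        self._pos = next(
            (i for i, (_, pts) in enumerate(self._frames) if pts >= ts),
            len(self._frames),
        )

    def next(self):
        if self._pos >= len(self._frames):
            return None, None  # assumed end-of-stream sentinel
        frame, pts = self._frames[self._pos]
        self._pos += 1
        return frame, pts

def read_video(vid, start_pts, end_pts):
    """Mimic the current read_video: collect all frames in [start_pts, end_pts]."""
    vid.seek(start_pts, any_frame=True)
    frames = []
    frame, pts = vid.next()
    while pts is not None and pts <= end_pts:
        frames.append(frame)
        frame, pts = vid.next()
    return frames

# 10 frames at 10 fps; the window [0.2, 0.5] covers pts 0.2, 0.3, 0.4, 0.5.
vid = FakeVideo([(f"frame{i}", i / 10) for i in range(10)])
print(len(read_video(vid, 0.2, 0.5)))  # 4
```

The real shim would only swap FakeVideo for the registered C++ class and stack the collected frames into the familiar (T x C x H x W) tensor.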

cc @bjuncek
