Description
🚀 Feature
We're proposing to add a lower-level, more flexible, and equally robust API than the one currently existing in torchvision.
It would be implemented in C++ and be compatible with TorchScript. Following the merge of #2596, it would also be installable via pip or conda.
Motivation
Currently, our API supports returning a tensor of shape (T, C, H, W) via the `read_video` abstraction (see here). This can be prohibitive if a user wants to get a single frame or perform operations on a per-frame basis. For example, I've run into multiple issues where I wanted to return a single frame, iterate over frames, or (for example in the EPIC Kitchens dataset) reduce memory usage by transforming the elements before saving them to output tensors.
Pitch
We propose the following style of API:
First, we'd have a constructor that would be part of torch's registered C++ classes and would take some basic inputs.
```python
import torch.classes.torchvision as tvcls

vid = tvcls.Video(path, "stream:stream_id")
```
Returning a frame is as simple as calling `next` on the container (optionally specifying the stream from which we'd like to return the next frame). What a frame is will largely depend on the encoding: for a video stream it is almost always an RGB image, whilst for audio it might be a 1024-point sample. In most cases, the same temporal span is covered by a variable number of frames (1 s of media might contain 30 video frames and 40 audio frames), so returning the presentation timestamp alongside the returned frame allows more precise control over the resulting clip.
```python
frame, timestamp = vid.next("stream:stream_id")  # stream argument is optional
```
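To make the per-frame semantics concrete, here is a minimal pure-Python stand-in for the proposed container. This is only an illustrative sketch (the real API would be a TorchScript-compatible C++ class); `MockVideo`, its stream names, and payloads are all hypothetical.

```python
class MockVideo:
    """One second of media: a 30 fps video stream and a 40 fps audio stream."""

    def __init__(self):
        # Each stream holds (frame_payload, pts_in_seconds) pairs.
        self.streams = {
            "video:0": [(f"rgb_frame_{i}", i / 30.0) for i in range(30)],
            "audio:0": [(f"audio_block_{i}", i / 40.0) for i in range(40)],
        }
        self.cursor = {name: 0 for name in self.streams}

    def next(self, stream="video:0"):
        """Return (frame, pts) for the next frame of the given stream."""
        frames = self.streams[stream]
        i = self.cursor[stream]
        if i >= len(frames):
            raise StopIteration(stream)
        self.cursor[stream] += 1
        return frames[i]


vid = MockVideo()
frame, ts = vid.next("video:0")    # first video frame, pts 0.0
_, audio_ts = vid.next("audio:0")  # the audio cursor advances independently
```

Note how the two streams cover the same 1 s span with different frame counts, which is exactly why `next` returns the presentation timestamp along with the frame.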
To get the exact frame that we want, a seek function can be exposed (with an optional stream definition). Seeking is done either to the closest keyframe before the requested timestamp, or to the exact frame if possible.
```python
vid.seek(ts_in_seconds, any_frame=True)
```
For example, if we seek to 5 s into a video container, the following call to `next()` will return either: 1) the last keyframe before 5 s (if `any_frame=False`); 2a) the frame with pts = 5.0 (if `any_frame=True` and a frame at exactly 5 s exists); or 2b) the first frame after 5 s, e.g. with pts 5.03 (if `any_frame=True` and no frame at exactly 5 s exists).
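The keyframe-vs-exact-frame distinction can be sketched in pure Python. This is not the proposed C++ implementation, just an illustration of the seek semantics on a hypothetical 30 fps stream with one keyframe per second; `seek_index` and its signature are assumptions.

```python
def seek_index(frames, ts, any_frame):
    """frames: list of (pts, is_keyframe) pairs in pts order.
    Return the index that the following next() call would decode from."""
    if any_frame:
        # Exact frame if present, otherwise the first frame at or after ts.
        for i, (pts, _) in enumerate(frames):
            if pts >= ts:
                return i
        return len(frames) - 1
    # Otherwise: the closest keyframe at or before ts.
    best = 0
    for i, (pts, is_key) in enumerate(frames):
        if is_key and pts <= ts:
            best = i
    return best


# 30 fps stream, keyframe every 30 frames (one per second).
frames = [(i / 30.0, i % 30 == 0) for i in range(300)]

i_exact = seek_index(frames, 5.5, any_frame=True)   # frame with pts 5.5
i_key = seek_index(frames, 5.5, any_frame=False)    # keyframe with pts 5.0
```

Seeking to a keyframe is what a decoder can always do cheaply; landing on an arbitrary frame (`any_frame=True`) generally requires decoding forward from the previous keyframe, which is the tradeoff the flag exposes.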
We plan to expose metadata getters, and add additional functionality down the line.
Alternatives
In the end, every video decoder library is a tradeoff between speed and flexibility. Libraries that support batch decoding such as decord offer greater speed (due to multithreaded Loader objects and/or GPU decoding) at the expense of dataloader compatibility, robustness (in terms of available formats), or flexibility. Other libraries that offer greater flexibility such as pyav, opencv, or decord (in sequential reading mode) can sacrifice either speed or ease of use.
We're aiming for this API to be as close in flexibility to pyav as possible, with the same (or better) per-frame decoding speed, all while being TorchScript-scriptable.
Additional context
Whilst technically this would mean deprecating our current `read_video` API, during a transition period we would continue to support it through a simple function that mimics the current implementation of `read_video`, with minimal to no performance impact.
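A `read_video`-style helper layers naturally on top of a seek/next interface. The sketch below shows the shape of such a shim under stated assumptions: `SimpleVideo` is a hypothetical stand-in decoder (not the proposed C++ class), and `read_video_like` is an illustrative name, not the actual transition function.

```python
class SimpleVideo:
    """Stand-in decoder: a 10 s, 30 fps stream of labelled frames."""

    def __init__(self, fps=30.0, duration=10.0):
        self._pts = [i / fps for i in range(int(fps * duration))]
        self._i = 0

    def seek(self, ts, any_frame=True):
        # Position the cursor at the first frame with pts >= ts.
        self._i = next((i for i, p in enumerate(self._pts) if p >= ts),
                       len(self._pts))

    def next(self):
        if self._i >= len(self._pts):
            raise StopIteration
        pts = self._pts[self._i]
        self._i += 1
        return f"frame@{pts:.3f}", pts


def read_video_like(vid, start_ts, end_ts):
    """Collect every frame with start_ts <= pts <= end_ts, read_video-style."""
    vid.seek(start_ts, any_frame=True)
    frames = []
    try:
        while True:
            frame, pts = vid.next()
            if pts > end_ts:
                break
            frames.append(frame)
    except StopIteration:
        pass
    return frames


clip = read_video_like(SimpleVideo(), 1.0, 2.0)  # 31 frames, pts 1.0 .. 2.0
```

Because the shim is a plain loop over the per-frame API, the per-clip cost stays essentially the decode cost itself, which is why the transition should carry minimal overhead.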
cc @bjuncek