
Pass pre-allocated tensors in batch APIs to avoid copies #266

Merged: 17 commits into pytorch:main from pass_preallocated_tensors, Oct 18, 2024

Conversation

@NicolasHug (Member) commented Oct 16, 2024

First follow-up to #260

This PR modifies the low-level decoding entrypoints to accept an optional pre-allocated output tensor. If one is provided, it is used to store the decoded output.

This allows the batch APIs that call into those low-level decoding entrypoints to save a tensor copy.
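
For illustration, here is a minimal sketch of the pattern (the names decodeNextFrame / decodeBatch and the hardcoded 480x640 frame size are hypothetical, not the actual torchcodec signatures): the low-level entrypoint writes into the caller's buffer when one is provided, so a batch API can pass views into its pre-allocated batch tensor and skip the per-frame copy.

#include <optional>
#include <torch/torch.h>

// Hypothetical low-level entrypoint: decode into the caller's buffer when
// one is provided, otherwise allocate a fresh output tensor.
torch::Tensor decodeNextFrame(
    std::optional<torch::Tensor> preAllocatedOutputTensor = std::nullopt) {
  torch::Tensor output = preAllocatedOutputTensor.has_value()
      ? preAllocatedOutputTensor.value()
      : torch::empty({480, 640, 3}, torch::kUInt8);
  // ... decode the next frame directly into output's memory (e.g. by
  // handing output.data_ptr() to libswscale) ...
  return output;
}

// Hypothetical batch API: frames[i] is a view into the batch tensor, so the
// decoder writes into the batch directly and no per-frame copy is needed.
torch::Tensor decodeBatch(int64_t numFrames) {
  torch::Tensor frames = torch::empty({numFrames, 480, 640, 3}, torch::kUInt8);
  for (int64_t i = 0; i < numFrames; ++i) {
    decodeNextFrame(frames[i]);
  }
  return frames;
}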

Note:

  • This PR has no effect when filtergraph is used; we can only save a copy with libswscale.
  • This PR allows pre-allocated output tensors to be passed. It does not pre-allocate any tensor itself; it doesn't need to, since those tensors were already pre-allocated in the existing code.

Follow-ups for other PRs:

  • Working on this PR made me realize that we have different tensor allocation logic in different places, potentially using different sources of info (I need to double-check). I'll follow up with an issue to clarify and try to improve the situation. That is however orthogonal to this PR.
  • Maybe fix #189 (Improve the way we allocate and use memory for GPU batch decoding), if in scope
  • Comments / docs on the pre-allocation logic

@facebook-github-bot added the CLA Signed label Oct 16, 2024
@NicolasHug changed the title from "Pre-allocate tensors when possible to avoid copies" to "Pass pre-allocated tensors in batch APIs to avoid copies" Oct 16, 2024
} else if (streamInfo.options.device.type() == torch::kCUDA) {
// TODO: handle pre-allocated output tensor
NicolasHug (Member Author):

This PR is a no-op for CUDA devices. I'm leaving out CUDA pre-allocation because it is strongly tied to #189 and can be treated separately.

@NicolasHug NicolasHug marked this pull request as ready for review October 16, 2024 12:49
Comment on lines 218 to 219
DecodedOutput getNextDecodedOutputNoDemux(
torch::Tensor& preAllocatedOutputTensor);
NicolasHug (Member Author):

Note: I had to overload this one (only) because it's called in a ton of places in the C++ tests; forcing all call-sites to pass an empty tensor would be quite noisy.
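
Roughly, the overload could look like this (a sketch only, assuming the std::optional-based parameter used elsewhere in this diff; the exact signatures in the repo may differ):

// The optional-taking form, used by the batch APIs to pass a pre-allocated buffer:
DecodedOutput getNextDecodedOutputNoDemux(
    std::optional<torch::Tensor> preAllocatedOutputTensor);

// Convenience overload so the many existing call-sites keep compiling unchanged:
DecodedOutput getNextDecodedOutputNoDemux() {
  return getNextDecodedOutputNoDemux(std::nullopt);
}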

@ahmadsharif1 (Contributor):

I will review the code in detail in a bit, but since this is a rather large C++ change, can you run the perf benchmarks to make sure we don't regress?

@ahmadsharif1 (Contributor) left a review:

This change looks a lot better now. Thanks @NicolasHug

I think we should fail loudly if the shapes don't match up.

{height, width, 3}, torch::TensorOptions().dtype({torch::kUInt8}));
torch::Tensor tensor;
if (preAllocatedOutputTensor.has_value()) {
// TODO: check shape of preAllocatedOutputTensor?
ahmadsharif1 (Contributor):

I think we should TORCH_CHECK for height, width, shape, etc. here.
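
For concreteness, such a check could look roughly like this (a sketch using the variable names from the surrounding diff; the reply below explains why a naive HWC-only check turns out to be insufficient):

if (preAllocatedOutputTensor.has_value()) {
  auto shape = preAllocatedOutputTensor.value().sizes();
  TORCH_CHECK(
      shape.size() == 3 && shape[0] == height && shape[1] == width &&
          shape[2] == 3,
      "Expected pre-allocated tensor of shape {", height, ", ", width,
      ", 3}, got ", shape);
}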

NicolasHug (Member Author):

I gave this a try, thinking it would be a simple assert like

assert `shape[-3:] == (H, W, 3)`

But it turns out it's not that simple. Some tensors come as HWC while others come as CHW. This is because the pre-allocated batched tensors are allocated like this:

if (options.dimensionOrder == "NHWC") {
  frames = torch::empty(
      {numFrames,
       options.height.value_or(*metadata.height),
       options.width.value_or(*metadata.width),
       3},
      {torch::kUInt8});
} else if (options.dimensionOrder == "NCHW") {
  frames = torch::empty(
      {numFrames,
       3,
       options.height.value_or(*metadata.height),
       options.width.value_or(*metadata.width)},
      torch::TensorOptions()
          .memory_format(torch::MemoryFormat::ChannelsLast)
          .dtype({torch::kUInt8}));
}

That made me realize that everything works, but it's pretty magical. We end up doing the .permute() calls in different places, but I think it would be a lot cleaner to allocate the batched output only as NHWC, and then permute the entire NHWC tensor in one go. What we do right now is permute each of the N HWC tensors individually, which is probably not as efficient (or as clean); see the sketch below.

I want to fix this as an immediate follow-up if that's OK. I gave it a try here, but it's not trivial, and it might be preferable not to overcomplicate this PR.
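
A sketch of what that follow-up could look like (illustrative only, not code from this PR; names follow the allocation snippet above):

// Always allocate the batched output as NHWC and decode into it...
torch::Tensor frames = torch::empty(
    {numFrames,
     options.height.value_or(*metadata.height),
     options.width.value_or(*metadata.width),
     3},
    {torch::kUInt8});
// ... then, if NCHW was requested, permute the whole batch in one go
// (a single view-level permute) instead of permuting each frame:
if (options.dimensionOrder == "NCHW") {
  frames = frames.permute({0, 3, 1, 2});
}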

Contributor:

I'm in favor of @NicolasHug's suggestion. The logic he points out is legacy from way back when, and it wasn't necessarily thought through in terms of long-term maintenance and code health. Always doing it one way, and then permuting as needed on the way out, sounds easier and cleaner.

Contributor:

Sounds good to me

@NicolasHug NicolasHug marked this pull request as draft October 17, 2024 13:34
@ahmadsharif1 (Contributor):

Don't forget to run the benchmark. Sometimes small changes can cause regressions.

@NicolasHug (Member Author):

Benchmark results look... good, but also fishy. The variance on the main branch is crazy high. I can reproduce the same results consistently. If anything, this PR reduces the variance drastically (and the decoding time too). But I would take those results with a pinch of salt. On my laptop, main and this PR give the same timings, within stddev.

PR (devvm)
get_frames_in_range, swscale
med = 67.56ms +- 4.25
get_frames_by_pts_in_range, swscale
med = 78.34ms +- 4.12
get_frames_in_range, filtergraph
med = 161.04ms +- 97.86
get_frames_by_pts_in_range, filtergraph
med = 199.17ms +- 114.61

main (devvm)

get_frames_in_range, swscale
med = 537.86ms +- 133.63
get_frames_by_pts_in_range, swscale
med = 594.93ms +- 149.99
get_frames_in_range, filtergraph
med = 410.08ms +- 99.76
get_frames_by_pts_in_range, filtergraph
med = 308.44ms +- 160.11

Benchmark code:

import torch
from time import perf_counter_ns

from torchcodec.decoders._core import (
    _add_video_stream,
    create_from_file,
    get_frames_by_pts_in_range,
    get_frames_in_range,
    scan_all_streams_to_update_metadata,
)


def bench(f, color_conversion_library, num_exp=100):
    times = []
    for _ in range(num_exp):
        VIDEO_PATH = "./test/resources/nasa_13013.mp4"
        decoder = create_from_file(VIDEO_PATH)
        _add_video_stream(
            decoder,
            color_conversion_library=color_conversion_library,
        )
        scan_all_streams_to_update_metadata(decoder)

        start = perf_counter_ns()
        f(decoder)
        end = perf_counter_ns()
        times.append(end - start)
    return torch.tensor(times).float()

def report_stats(times, unit="ms"):
    mul = {
        "ns": 1,
        "µs": 1e-3,
        "ms": 1e-6,
        "s": 1e-9,
    }[unit]
    times = times * mul
    std = times.std().item()
    med = times.median().item()
    print(f"{med = :.2f}{unit} +- {std:.2f}")
    return med



NUM_EXP = 100
stream_index = 3

def _get_frames_in_range(decoder):
    get_frames_in_range(decoder=decoder, stream_index=stream_index, start=0, stop=100)

def _get_frames_by_pts_in_range(decoder):
    get_frames_by_pts_in_range(decoder=decoder, stream_index=stream_index, start_seconds=0, stop_seconds=4)

for color_conversion_library in ("swscale", "filtergraph"):
    print(f"get_frames_in_range, {color_conversion_library}")
    times = bench(_get_frames_in_range, color_conversion_library=color_conversion_library, num_exp=NUM_EXP)
    report_stats(times)
    print(f"get_frames_by_pts_in_range, {color_conversion_library}")
    times = bench(_get_frames_by_pts_in_range, color_conversion_library=color_conversion_library, num_exp=NUM_EXP)
    report_stats(times)

@NicolasHug NicolasHug marked this pull request as ready for review October 18, 2024 09:39
@NicolasHug NicolasHug merged commit d6cbee5 into pytorch:main Oct 18, 2024
24 checks passed
@NicolasHug NicolasHug deleted the pass_preallocated_tensors branch October 18, 2024 09:42
Labels
CLA Signed
Development

Successfully merging this pull request may close these issues.

Improve the way we allocate and use memory for GPU batch decoding #189
4 participants