
Pass pre-allocated tensors in batch APIs to avoid copies #266

Merged: 17 commits into pytorch:main from pass_preallocated_tensors, Oct 18, 2024

Conversation

@NicolasHug (Member) commented Oct 16, 2024

First follow-up to #260

This PR modifies the low-level decoding entrypoints to accept an optional pre-allocated output tensor. If one is provided, it is used to store the decoded output.

This allows the batch APIs that call into those low-level decoding entrypoints to save a tensor copy.
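
For illustration, here is a minimal sketch of the pattern (the names decodeNextFrame / decodeBatch and the hardcoded 480x640 frame size are hypothetical, not the actual torchcodec signatures): the low-level entrypoint writes into the caller's buffer when one is provided, so a batch API can pass views into its pre-allocated batch tensor and skip the per-frame copy.

#include <optional>
#include <torch/torch.h>

// Hypothetical low-level entrypoint: decode into the caller's buffer when
// one is provided, otherwise allocate a fresh output tensor.
torch::Tensor decodeNextFrame(
    std::optional<torch::Tensor> preAllocatedOutputTensor = std::nullopt) {
  torch::Tensor output = preAllocatedOutputTensor.has_value()
      ? preAllocatedOutputTensor.value()
      : torch::empty({480, 640, 3}, torch::kUInt8);
  // ... decode the next frame directly into output's memory (e.g. by
  // handing output.data_ptr() to libswscale) ...
  return output;
}

// Hypothetical batch API: frames[i] is a view into the batch tensor, so the
// decoder writes into the batch directly and no per-frame copy is needed.
torch::Tensor decodeBatch(int64_t numFrames) {
  torch::Tensor frames = torch::empty({numFrames, 480, 640, 3}, torch::kUInt8);
  for (int64_t i = 0; i < numFrames; ++i) {
    decodeNextFrame(frames[i]);
  }
  return frames;
}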

Note:

  • This PR has no effect when filtergraph is used; we can only save a copy with libswscale.
  • This PR allows pre-allocated output tensors to be passed. It does not pre-allocate any tensor itself; it doesn't need to, since those tensors were already pre-allocated in the existing code.

Follow-ups for other PRs:

  • Working on this PR made me realize that we have different tensor allocation logic in different places, potentially using different sources of info (I need to double-check). I'll follow up with an issue to clarify and try to improve the situation. That is however orthogonal to this PR.
  • Maybe fix #189 (Improve the way we allocate and use memory for GPU batch decoding), if in scope
  • Comments / docs on the pre-allocation logic

@facebook-github-bot added the CLA Signed label Oct 16, 2024
@NicolasHug changed the title from "Pre-allocate tensors when possible to avoid copies" to "Pass pre-allocated tensors in batch APIs to avoid copies" Oct 16, 2024
} else if (streamInfo.options.device.type() == torch::kCUDA) {
// TODO: handle pre-allocated output tensor
NicolasHug (Member Author):

This PR is a no-op for CUDA devices. I'm leaving out CUDA pre-allocation because it is strongly tied to #189 and can be treated separately.

@NicolasHug NicolasHug marked this pull request as ready for review October 16, 2024 12:49
Comment on lines 218 to 219
DecodedOutput getNextDecodedOutputNoDemux(
torch::Tensor& preAllocatedOutputTensor);
NicolasHug (Member Author):

Note: I had to overload this one (only) because it's called in a ton of places in the C++ tests; forcing all call-sites to pass an empty tensor would be quite noisy.
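
Roughly, the overload could look like this (a sketch only, assuming the std::optional-based parameter used elsewhere in this diff; the exact signatures in the repo may differ):

// The optional-taking form, used by the batch APIs to pass a pre-allocated buffer:
DecodedOutput getNextDecodedOutputNoDemux(
    std::optional<torch::Tensor> preAllocatedOutputTensor);

// Convenience overload so the many existing call-sites keep compiling unchanged:
DecodedOutput getNextDecodedOutputNoDemux() {
  return getNextDecodedOutputNoDemux(std::nullopt);
}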

@ahmadsharif1 (Contributor):

I will review the code in detail in a bit, but since this is a rather large C++ change, can you run the perf benchmarks to make sure we don't regress?

@ahmadsharif1 (Contributor) left a review:

This change looks a lot better now. Thanks @NicolasHug

I think we should fail loudly if the shapes don't match up.

{height, width, 3}, torch::TensorOptions().dtype({torch::kUInt8}));
torch::Tensor tensor;
if (preAllocatedOutputTensor.has_value()) {
// TODO: check shape of preAllocatedOutputTensor?
ahmadsharif1 (Contributor):

I think we should TORCH_CHECK for height, width, shape, etc. here.
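
For concreteness, such a check could look roughly like this (a sketch using the variable names from the surrounding diff; the reply below explains why a naive HWC-only check turns out to be insufficient):

if (preAllocatedOutputTensor.has_value()) {
  auto shape = preAllocatedOutputTensor.value().sizes();
  TORCH_CHECK(
      shape.size() == 3 && shape[0] == height && shape[1] == width &&
          shape[2] == 3,
      "Expected pre-allocated tensor of shape {", height, ", ", width,
      ", 3}, got ", shape);
}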

NicolasHug (Member Author):

I gave this a try, thinking it would be a simple assert like

assert `shape[-3:] == (H, W, 3)`

But it turns out it's not that simple. Some tensors come as HWC while others come as CHW. This is because the pre-allocated batched tensors are allocated like this:

if (options.dimensionOrder == "NHWC") {
  frames = torch::empty(
      {numFrames,
       options.height.value_or(*metadata.height),
       options.width.value_or(*metadata.width),
       3},
      {torch::kUInt8});
} else if (options.dimensionOrder == "NCHW") {
  frames = torch::empty(
      {numFrames,
       3,
       options.height.value_or(*metadata.height),
       options.width.value_or(*metadata.width)},
      torch::TensorOptions()
          .memory_format(torch::MemoryFormat::ChannelsLast)
          .dtype({torch::kUInt8}));
}

That made me realize that everything works, but it's pretty magical. We end up doing the .permute() calls in different places, but I think it would be a lot cleaner to allocate the batched output only as NHWC, and then permute the entire NHWC tensor in one go. What we do right now is permute each of the N HWC tensors individually, which is probably not as efficient (or as clean); see the sketch below.

I want to fix this as an immediate follow-up if that's OK. I gave it a try here, but it's not trivial, and it might be preferable not to overcomplicate this PR.
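
A sketch of what that follow-up could look like (illustrative only, not code from this PR; names follow the allocation snippet above):

// Always allocate the batched output as NHWC and decode into it...
torch::Tensor frames = torch::empty(
    {numFrames,
     options.height.value_or(*metadata.height),
     options.width.value_or(*metadata.width),
     3},
    {torch::kUInt8});
// ... then, if NCHW was requested, permute the whole batch in one go
// (a single view-level permute) instead of permuting each frame:
if (options.dimensionOrder == "NCHW") {
  frames = frames.permute({0, 3, 1, 2});
}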

Contributor:

I'm in favor of @NicolasHug's suggestion. The logic he points out is legacy from way back when, and it wasn't necessarily thought through in terms of long-term maintenance and code health. Always doing it one way, and then permuting as needed on the way out, sounds easier and cleaner.

Contributor:

Sounds good to me

@NicolasHug NicolasHug marked this pull request as draft October 17, 2024 13:34
@ahmadsharif1 (Contributor):

Don't forget to run the benchmark. Sometimes small changes can cause regressions.

@NicolasHug (Member Author):

Benchmark results look... good, but also fishy. The variance on the main branch is crazy high. I can reproduce the same results consistently. If anything, this PR reduces the variance drastically (and the decoding time too). But I would take those results with a pinch of salt. On my laptop, main and this PR give the same timings, within stddev.

PR (devvm)
get_frames_in_range, swscale
med = 67.56ms +- 4.25
get_frames_by_pts_in_range, swscale
med = 78.34ms +- 4.12
get_frames_in_range, filtergraph
med = 161.04ms +- 97.86
get_frames_by_pts_in_range, filtergraph
med = 199.17ms +- 114.61

main (devvm)

get_frames_in_range, swscale
med = 537.86ms +- 133.63
get_frames_by_pts_in_range, swscale
med = 594.93ms +- 149.99
get_frames_in_range, filtergraph
med = 410.08ms +- 99.76
get_frames_by_pts_in_range, filtergraph
med = 308.44ms +- 160.11

Benchmark code:

import torch
from time import perf_counter_ns

from torchcodec.decoders._core import (
    _add_video_stream,
    create_from_file,
    get_frames_by_pts_in_range,
    get_frames_in_range,
    scan_all_streams_to_update_metadata,
)


def bench(f, color_conversion_library, num_exp=100):
    times = []
    for _ in range(num_exp):
        VIDEO_PATH = "./test/resources/nasa_13013.mp4"
        decoder = create_from_file(VIDEO_PATH)
        _add_video_stream(
            decoder,
            color_conversion_library=color_conversion_library,
        )
        scan_all_streams_to_update_metadata(decoder)

        start = perf_counter_ns()
        f(decoder)
        end = perf_counter_ns()
        times.append(end - start)
    return torch.tensor(times).float()

def report_stats(times, unit="ms"):
    mul = {
        "ns": 1,
        "µs": 1e-3,
        "ms": 1e-6,
        "s": 1e-9,
    }[unit]
    times = times * mul
    std = times.std().item()
    med = times.median().item()
    print(f"{med = :.2f}{unit} +- {std:.2f}")
    return med



NUM_EXP = 100
stream_index = 3

def _get_frames_in_range(decoder):
    get_frames_in_range(decoder=decoder, stream_index=stream_index, start=0, stop=100)

def _get_frames_by_pts_in_range(decoder):
    get_frames_by_pts_in_range(decoder=decoder, stream_index=stream_index, start_seconds=0, stop_seconds=4)

for color_conversion_library in ("swscale", "filtergraph"):
    print(f"get_frames_in_range, {color_conversion_library}")
    times = bench(_get_frames_in_range, color_conversion_library=color_conversion_library, num_exp=NUM_EXP)
    report_stats(times)
    print(f"get_frames_by_pts_in_range, {color_conversion_library}")
    times = bench(_get_frames_by_pts_in_range, color_conversion_library=color_conversion_library, num_exp=NUM_EXP)
    report_stats(times)

@NicolasHug NicolasHug marked this pull request as ready for review October 18, 2024 09:39
@NicolasHug NicolasHug merged commit d6cbee5 into pytorch:main Oct 18, 2024
24 checks passed
@NicolasHug NicolasHug deleted the pass_preallocated_tensors branch October 18, 2024 09:42
Labels
CLA Signed
Development

Successfully merging this pull request may close these issues.

Improve the way we allocate and use memory for GPU batch decoding #189
4 participants