Pass pre-allocated tensors in batch APIs to avoid copies #266
Conversation
} else if (streamInfo.options.device.type() == torch::kCUDA) {
  // TODO: handle pre-allocated output tensor
This PR is a no-op for CUDA devices. I'm leaving out CUDA pre-allocation because it is strongly tied to #189 and can be treated separately.
DecodedOutput getNextDecodedOutputNoDemux(
    torch::Tensor& preAllocatedOutputTensor);
Note: I had to overload this one (and only this one) because it's called in a ton of places in the C++ tests. Forcing every call site to pass an empty tensor would be quite noisy.
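For illustration, the overload can be a thin forwarding shim. This is a minimal sketch, not the exact PR code, assuming the single-argument signature shown in the diff above:

DecodedOutput VideoDecoder::getNextDecodedOutputNoDemux() {
  // A default-constructed (undefined) tensor signals "allocate internally",
  // so existing call sites in the C++ tests compile unchanged.
  torch::Tensor noPreAllocatedTensor;
  return getNextDecodedOutputNoDemux(noPreAllocatedTensor);
}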
I will review the code in detail in a bit, but since this is a rather large C++ change, can you run the perf benchmarks to make sure we don't regress?
This change looks a lot better now. Thanks @NicolasHug
I think we should fail loudly if the shapes don't match up
    {height, width, 3}, torch::TensorOptions().dtype({torch::kUInt8}));
torch::Tensor tensor;
if (preAllocatedOutputTensor.has_value()) {
  // TODO: check shape of preAllocatedOutputTensor?
I think we should TORCH_CHECK for height, width, shape, etc. here.
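Something along these lines would fail loudly. This is a sketch only, assuming `height` and `width` from the surrounding snippet and an HWC uint8 layout:

if (preAllocatedOutputTensor.has_value()) {
  auto sizes = preAllocatedOutputTensor.value().sizes();
  // Validate the caller's tensor before writing into it.
  TORCH_CHECK(
      sizes.size() == 3 && sizes[0] == height && sizes[1] == width &&
          sizes[2] == 3,
      "Expected pre-allocated tensor of shape ", height, "x", width,
      "x3, got ", sizes);
}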
I gave this a try thinking it would be a simple assert like `shape[-3:] == (H, W, 3)`. But it turns out it's not as simple: some tensors come as HWC while some others come as CHW. This is because the pre-allocated batched tensors are allocated as such:
torchcodec/src/torchcodec/decoders/_core/VideoDecoder.cpp
Lines 171 to 186 in c6a0a5a
if (options.dimensionOrder == "NHWC") {
  frames = torch::empty(
      {numFrames,
       options.height.value_or(*metadata.height),
       options.width.value_or(*metadata.width),
       3},
      {torch::kUInt8});
} else if (options.dimensionOrder == "NCHW") {
  frames = torch::empty(
      {numFrames,
       3,
       options.height.value_or(*metadata.height),
       options.width.value_or(*metadata.width)},
      torch::TensorOptions()
          .memory_format(torch::MemoryFormat::ChannelsLast)
          .dtype({torch::kUInt8}));
It then made me realize that everything works, but it's pretty magical. We end up doing the .permute() calls in different places, but I think it would be a lot cleaner if we allocated the batched output only as NHWC, and then permuted this entire NHWC tensor in one go. What we do right now is permute all of the N individual HWC tensors, and that's probably not as efficient (or as clean).
I want to fix this as an immediate follow-up if that's OK. I gave it a try here, but it's not trivial and it might be preferable not to overcomplicate this PR.
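To make the proposed follow-up concrete, here is a rough sketch (illustrative, not the PR's code): allocate the batch as NHWC unconditionally, then convert the whole batch with a single permute on the way out:

// Always allocate the batched output as NHWC.
torch::Tensor frames = torch::empty(
    {numFrames,
     options.height.value_or(*metadata.height),
     options.width.value_or(*metadata.width),
     3},
    {torch::kUInt8});
// ... decode every frame directly into frames[i] as HWC ...
// Then permute the entire batch in one go if NCHW was requested.
if (options.dimensionOrder == "NCHW") {
  frames = frames.permute({0, 3, 1, 2});  // NHWC -> NCHW
}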
I'm in favor of @NicolasHug's suggestion. The logic he points out is legacy from way back when, and it wasn't necessarily thought through in terms of long-term maintenance and code health. Always doing it one way, and then permuting as needed on the way out, sounds easier and cleaner.
Sounds good to me
Don't forget to run the benchmark. Sometimes small changes can cause regressions.
Benchmark results look... good, but also fishy. The variance on the
Benchmark code:

import torch
from time import perf_counter_ns

from torchcodec.decoders._core import (
    _add_video_stream,
    create_from_file,
    get_frames_by_pts_in_range,
    get_frames_in_range,
    scan_all_streams_to_update_metadata,
)


def bench(f, color_conversion_library, num_exp=100):
    times = []
    for _ in range(num_exp):
        VIDEO_PATH = "./test/resources/nasa_13013.mp4"
        decoder = create_from_file(VIDEO_PATH)
        _add_video_stream(
            decoder,
            color_conversion_library=color_conversion_library,
        )
        scan_all_streams_to_update_metadata(decoder)

        start = perf_counter_ns()
        f(decoder)
        end = perf_counter_ns()
        times.append(end - start)
    return torch.tensor(times).float()


def report_stats(times, unit="ms"):
    mul = {
        "ns": 1,
        "µs": 1e-3,
        "ms": 1e-6,
        "s": 1e-9,
    }[unit]
    times = times * mul
    std = times.std().item()
    med = times.median().item()
    print(f"{med = :.2f}{unit} +- {std:.2f}")
    return med


NUM_EXP = 100
stream_index = 3


def _get_frames_in_range(decoder):
    get_frames_in_range(decoder=decoder, stream_index=stream_index, start=0, stop=100)


def _get_frames_by_pts_in_range(decoder):
    get_frames_by_pts_in_range(
        decoder=decoder, stream_index=stream_index, start_seconds=0, stop_seconds=4
    )


for color_conversion_library in ("swscale", "filtergraph"):
    print(f"get_frames_in_range, {color_conversion_library}")
    times = bench(_get_frames_in_range, color_conversion_library=color_conversion_library, num_exp=NUM_EXP)
    report_stats(times)

    print(f"get_frames_by_pts_in_range, {color_conversion_library}")
    times = bench(_get_frames_by_pts_in_range, color_conversion_library=color_conversion_library, num_exp=NUM_EXP)
    report_stats(times)
First follow-up to #260
This PR modifies the low-level decoding entrypoints to accept an optional pre-allocated output tensor. If such a tensor is provided, the decoded output is written directly into it.
This lets the batch APIs that call into those low-level decoding entrypoints avoid a tensor copy.
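For illustration, the pattern looks roughly like this (assumed names, not the PR's exact code): the batch API hands each low-level call a view into the pre-allocated batch tensor, so frames are decoded in place:

// `frames` is the pre-allocated NHWC batch tensor of shape {numFrames, H, W, 3}.
for (int64_t i = 0; i < numFrames; ++i) {
  torch::Tensor frameView = frames[i];  // a view into the batch, no copy
  // Decode directly into the view instead of allocating and copying.
  auto output = decoder->getNextDecodedOutputNoDemux(frameView);
}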
Note:
Follow-ups for other PRs: