
Conversation

Contributor

@NicolasHug NicolasHug commented Oct 18, 2024

This PR simplifies the HWC -> CHW dimension conversion and restricts the expected input/output dimension order of some functions.

  • Batched output tensors are now always created as NHWC. They are converted to NCHW in one single step, instead of converting HWC sub-tensors N times.
  • As a consequence, the expected shape of input/allocated tensors within convertAVFrameToDecodedOutputOnCPU is now always HWC, and we can now enforce it (we couldn't before).
  • convertFrameToTensorUsingFilterGraph now always returns a HWC tensor. Similarly, convertFrameToBufferUsingSwsScale now always expects (a pointer to) a HWC tensor.
  • The [N]HWC -> [N]CHW permutation is now centralized in a new helper function (a rough sketch follows below).
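
For illustration, here is a rough sketch of what such a centralized helper could look like. The real helper in this PR is called MaybeHWC2CHW and takes the stream options, so treat the name maybePermuteHWC2CHW and the wantCHW flag below as stand-ins, not the PR's exact code:

    #include <torch/torch.h>

    // Sketch only: permutes an HWC (3-D) or NHWC (4-D) tensor to CHW / NCHW.
    // `wantCHW` stands in for whatever the stream options actually expose.
    torch::Tensor maybePermuteHWC2CHW(torch::Tensor hwcTensor, bool wantCHW) {
      TORCH_CHECK(
          hwcTensor.dim() == 3 || hwcTensor.dim() == 4,
          "Expected a 3-D or 4-D tensor, got ", hwcTensor.dim(), " dims");
      if (!wantCHW) {
        return hwcTensor;
      }
      return (hwcTensor.dim() == 3)
          ? hwcTensor.permute({2, 0, 1})      // HWC  -> CHW
          : hwcTensor.permute({0, 3, 1, 2});  // NHWC -> NCHW
    }

Note that permute only changes the dimension order and returns a view; the underlying memory stays in HWC order, which ties into the memory-format discussion further down.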

@facebook-github-bot facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Oct 18, 2024
// batch NHWC tensors to be permuted only once, instead of permuting HWC
// tensors N times.
output.frame = MaybeHWC2CHW(streamInfo.options, output.frame);
}
Contributor Author

I still think there's some smell to this. Whether the tensor was pre-allocated and whether it should be permuted should be orthogonal concepts.
I think it would make sense for all the low-level decoding capabilities (including this function, convertAVFrameToDecodedOutputOnCPU) to only ever accept and return HWC tensors.

And it should be up to the higher-level decoding entry-points (basically the moral equivalent of the public methods) to do the conversion. It's not trivial because getFrameAtIndex is both an entry point and a sub-function of other entry-points. Maybe that also means we should already let all entry-points do their own allocation and always pass pre-allocated tensors?

Contributor

@scotts scotts Oct 18, 2024

Agreed on your reasoning and the principles.

One way to square the circle is to split the public facing part of getFrameAtIndex from the actual work being done. We would have a public member function (getFrameAtIndex) and a private member function (getFrameAtIndexInternal, or something like that).

  • getFrameAtIndex would call getFrameAtIndexInternal for the actual work, and afterwards it would do the conversion check and the actual conversion if needed.
  • getFrameAtIndexInternal would just assume HWC tensors and do the real work.

Then all of the internal calls to getFrameAtIndex would become calls to getFrameAtIndexInternal. I've implemented variants of this pattern several times before. We would repeat this for any public entry points that also need to do work for other public entry points, as needed.
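
A minimal sketch of that split, with hypothetical names and a bare-bones DecodedOutput struct standing in for the real types:

    #include <torch/torch.h>

    struct DecodedOutput {
      torch::Tensor frame;  // always HWC inside the decoder, per the discussion above
    };

    class VideoDecoder {
     public:
      // Public entry point: delegates the work, then applies the HWC -> CHW
      // conversion only if the caller asked for CHW output.
      DecodedOutput getFrameAtIndex(int streamIndex, int64_t frameIndex) {
        DecodedOutput output = getFrameAtIndexInternal(streamIndex, frameIndex);
        if (wantCHW_) {
          output.frame = output.frame.permute({2, 0, 1});
        }
        return output;
      }

     private:
      // Does the real work and always returns HWC; other entry points call
      // this directly instead of the public method.
      DecodedOutput getFrameAtIndexInternal(int streamIndex, int64_t frameIndex);
      bool wantCHW_ = true;
    };

Other public entry points that currently call getFrameAtIndex would call getFrameAtIndexInternal instead, so the permutation never runs more than once per request.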

Contributor

I am fine with low-level functions only dealing with HWC. AFAICT, most (all?) low-level code deals with HWC because it has better performance.

Contributor Author

Yeah, that sounds good. I'll try to implement that in a follow-up PR

Contributor Author

AFAICT, most (all?) low-level code deals with HWC

That's not quite the case. In main, convertAVFrameToDecodedOutputOnCPU accepts both HWC and CHW - this is actually what this PR is fixing.

It may still return both HWC and CHW (instead of just HWC), and this is what I want to fix as a follow-up

"x",
width,
"x3, got ",
shape);
Contributor Author

Any idea how to make this single call shorter 🤔 ?

Contributor

I think you can do:

TORCH_CHECK(
    (shape.size() == 3) && shape.equals({height, width, 3}),
    "Expected tensor of shape ",
    height,
    "x",
    width,
    "x3, got ",
    shape);

But I'm not sure. The main thing I'm not sure about is if an array literal will auto-convert into the corresponding ArrayRef: https://pytorch.org/cppdocs/api/classc10_1_1_array_ref.html#_CPPv4NK3c108ArrayRef6equalsE8ArrayRef
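
For what it's worth, c10::ArrayRef does appear to have an std::initializer_list constructor, so the braced list should convert. A small standalone check (illustrative only, with made-up sizes, not the PR's code):

    #include <torch/torch.h>
    #include <iostream>

    int main() {
      int64_t height = 270, width = 480;
      torch::Tensor frame = torch::empty({height, width, 3}, torch::kUInt8);

      // sizes() returns an IntArrayRef; the braced list below goes through
      // ArrayRef's std::initializer_list constructor.
      TORCH_CHECK(
          frame.sizes().equals({height, width, 3}),
          "Expected tensor of shape ", height, "x", width, "x3, got ",
          frame.sizes());

      std::cout << "shape check passed" << std::endl;
      return 0;
    }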

Contributor Author

Ah, I was mainly hoping to avoid the

            height,
            "x",
            width,
            "x3, got ",
            shape);

stack :p

@NicolasHug NicolasHug marked this pull request as ready for review October 18, 2024 12:05
auto numDimensions = hwcTensor.dim();
auto shape = hwcTensor.sizes();
if (numDimensions == 3) {
TORCH_CHECK(shape[2] == 3, "Not a HWC tensor: ", shape);
Contributor

Is this robust if the width/height is 3?

Contributor Author

@NicolasHug NicolasHug Oct 18, 2024

It will have a false positive for the extremely rare (and probably degenerate) case where a video's width is 3.

This check is the very very best we can do at this stage. The alternative is to not check anything.


options.height.value_or(*metadata.height),
options.width.value_or(*metadata.width)},
torch::TensorOptions()
.memory_format(torch::MemoryFormat::ChannelsLast)
Contributor

I am wondering if using this is identical to permuting a NHWC tensor at the end. I am not 100% sure. Do you know?

Contributor

I meant in terms of performance, is it identical? Specifically this is a bit concerning:

Uploading image.png…

We really should have a batch benchmark because the whole point of batch is to do things in a performant way for many frames.

Contributor Author

Sorry I don't understand what you mean. What image did you mean to link to?

Note that this PR should be strictly more efficient:

Batched output tensors are now always created as NHWC. They are converted to NCHW in one single step, instead of converting HWC sub-tensors N times.

Contributor

@ahmadsharif1 ahmadsharif1 Oct 18, 2024

Sorry the image wasn't uploaded properly:

[image: screenshot from the channels-last tutorial linked below]

I am not sure about the performance implications of doing a permute instead of .to(channels_last)

From this page: https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html#:~:text=What%20is%20Channels%20Last,pixel%2Dper%2Dpixel).

Contributor Author

Sorry I still don't really understand where you're coming from. Can you please share a link? Is this relevant for this PR?

Again, the change involved in this PR is:

Batched output tensors are now always created as NHWC. They are converted to NCHW in one single step, instead of converting HWC sub-tensors N times

Contributor Author

Thanks for the link.

We're not concerned about memory format (contiguous vs channels-last) in this PR. That's a related but distinct concern from the dimension order.

Contributor

Link in the edited comment above. It's from the channels-last page.

I am not 100% sure if creating an NHWC tensor and permuting it is the same as creating an NCHW tensor with the channels-last memory format and working with that. The code that you deleted was doing the latter.

A benchmark may show a difference -- or not. Do you know?
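
As an editorial aside on the question above, the two approaches should describe the same memory layout; a small standalone comparison (illustrative only, not code from the PR, and with made-up sizes) makes that concrete:

    #include <torch/torch.h>
    #include <iostream>

    int main() {
      int64_t N = 4, H = 270, W = 480;

      // Approach A (reportedly what main did): allocate NCHW directly in
      // channels-last memory format.
      torch::Tensor a = torch::empty(
          {N, 3, H, W},
          torch::TensorOptions()
              .dtype(torch::kUInt8)
              .memory_format(torch::MemoryFormat::ChannelsLast));

      // Approach B (what this PR's description says): allocate NHWC
      // (contiguous) and permute to NCHW once at the end.
      torch::Tensor b =
          torch::empty({N, H, W, 3}, torch::kUInt8).permute({0, 3, 1, 2});

      // Both end up with NCHW sizes and identical (channels-last) strides,
      // i.e. the same underlying memory layout.
      std::cout << a.sizes() << " / " << a.strides() << std::endl;
      std::cout << b.sizes() << " / " << b.strides() << std::endl;
      return 0;
    }

Whether there is a measurable performance difference between the two allocation paths is exactly what the batch benchmark suggested above would settle.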

@NicolasHug NicolasHug merged commit b82ea81 into meta-pytorch:main Oct 18, 2024
24 checks passed
@NicolasHug NicolasHug deleted the dim_conversion_cleanup branch October 18, 2024 14:59
