Rework HWC / CHW dimension order conversions #277
Conversation
```cpp
  // batch NHWC tensors to be permuted only once, instead of permuting HWC
  // tensors N times.
  output.frame = MaybeHWC2CHW(streamInfo.options, output.frame);
}
```
I still think there's some smell to this. Whether the tensor was pre-allocated and whether it should be permuted should be orthogonal concepts.
I think it would make sense for all the low-level decoding capabilities (including this function, `convertAVFrameToDecodedOutputOnCPU`) to only ever accept and return HWC tensors.
And it should be up to the higher-level decoding entry points (basically the moral equivalent of the public methods) to do the conversion. That's not trivial, because `getFrameAtIndex` is both an entry point and a sub-function of other entry points. Maybe that also means we should let all entry points do their own allocation and always pass pre-allocated tensors?
Agreed on your reasoning and the principles.
One way to square the circle is to split the public-facing part of `getFrameAtIndex` from the actual work being done. We would have a public member function (`getFrameAtIndex`) and a private member function (`getFrameAtIndexInternal`, or something like that).
`getFrameAtIndex` would call `getFrameAtIndexInternal` for the actual work, and afterwards it would do the conversion check and the actual conversion if needed. `getFrameAtIndexInternal` would just assume HWC tensors and do the real work.
Then all of the internal calls to `getFrameAtIndex` would become calls to `getFrameAtIndexInternal`. I've implemented variants of this pattern several times before. We would repeat this for any public entry points that also need to do work for other public entry points, as needed.
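A minimal sketch of that split (class layout, member names, and signatures here are guesses for illustration, not the actual torchcodec API; only `MaybeHWC2CHW` comes from this PR):

```cpp
// Public entry point: does the work in HWC, converts at the boundary.
DecodedOutput VideoDecoder::getFrameAtIndex(
    int streamIndex, int64_t frameIndex) {
  DecodedOutput output = getFrameAtIndexInternal(streamIndex, frameIndex);
  // The HWC -> CHW conversion now lives in exactly one place.
  output.frame = MaybeHWC2CHW(streams_[streamIndex].options, output.frame);
  return output;
}

// Private worker: only ever sees and returns HWC tensors. Other entry
// points call this directly and do their own conversion once at the end.
DecodedOutput VideoDecoder::getFrameAtIndexInternal(
    int streamIndex, int64_t frameIndex) {
  DecodedOutput output;
  // ... actual decoding into output.frame, all in HWC ...
  return output;
}
```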
I am fine with low-level functions only dealing with HWC. AFAICT, most (all?) low-level code deals with HWC because it has better performance.
Yeah, that sounds good. I'll try to implement that in a follow-up PR.
> AFAICT, most (all?) low-level code deals with HWC

That's not quite the case. In `main`, `convertAVFrameToDecodedOutputOnCPU` accepts both HWC and CHW; this is actually what this PR is fixing.
It may still return either HWC or CHW (instead of just HWC), and that's what I want to fix as a follow-up.
| "x", | ||
| width, | ||
| "x3, got ", | ||
| shape); |
Any idea how to make this single call shorter? 🤔
I think you can do:

```cpp
TORCH_CHECK(
    (shape.size() == 3) && shape.equals({height, width, 3}),
    "Expected tensor of shape ",
    height,
    "x",
    width,
    "x3, got ",
    shape);
```

But I'm not sure. The main thing I'm not sure about is whether an array literal will auto-convert into the corresponding ArrayRef: https://pytorch.org/cppdocs/api/classc10_1_1_array_ref.html#_CPPv4NK3c108ArrayRef6equalsE8ArrayRef
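For what it's worth, `c10::ArrayRef` has an implicit constructor from `std::initializer_list`, so the braced list should convert at the call site. A small sketch (the helper function is hypothetical, not from this PR):

```cpp
#include <ATen/ATen.h>

// {height, width, 3} materializes a temporary std::initializer_list<int64_t>,
// which ArrayRef can wrap; the temporary lives until the end of the full
// expression, which is long enough for equals() to run.
bool hasExpectedHWCShape(const at::Tensor& t, int64_t height, int64_t width) {
  at::IntArrayRef shape = t.sizes();
  return shape.size() == 3 && shape.equals({height, width, 3});
}
```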
Ah, I was mainly hoping to avoid the

```cpp
    height,
    "x",
    width,
    "x3, got ",
    shape);
```

stack :p
```cpp
auto numDimensions = hwcTensor.dim();
auto shape = hwcTensor.sizes();
if (numDimensions == 3) {
  TORCH_CHECK(shape[2] == 3, "Not a HWC tensor: ", shape);
```
Is this robust if the width/height is 3?
It will give a false positive in the extremely rare (and probably degenerate) case where a video's width is 3.
This check is the very best we can do at this stage; the alternative is to not check anything.
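To spell out the failure mode (illustrative shapes, not from the PR):

```cpp
// HWC frame of a 1280x720 video:    sizes() == {720, 1280, 3}
//   -> shape[2] == 3, the check passes as intended.
// CHW frame of a 3-pixel-wide video: sizes() == {3, 720, 3}
//   -> shape[2] == 3 as well, so this CHW tensor would slip through.
```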
```cpp
    options.height.value_or(*metadata.height),
    options.width.value_or(*metadata.width)},
    torch::TensorOptions()
        .memory_format(torch::MemoryFormat::ChannelsLast)
```
I am wondering whether using this is identical to permuting a NHWC tensor at the end. I am not 100% sure. Do you know?
[reply was an image that did not load]
Sorry, I don't understand what you mean. What image did you mean to link to?
Note that this PR should be strictly more efficient:
Batched output tensors are now always created as NHWC. They are converted to NCHW in one single step, instead of converting HWC sub-tensors N times.
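A sketch of what "one single step" means here (the sizes, names, and decode helper are hypothetical, for illustration only):

```cpp
#include <ATen/ATen.h>

// Hypothetical: returns one decoded frame as a {height, width, 3} tensor.
at::Tensor decodeFrameHWC(int64_t frameIndex);

// Allocate the whole batch as NHWC up front, fill each HWC slot, then
// permute the batch once at the end; previously each HWC frame was
// permuted on its own, i.e. N permutes instead of one.
at::Tensor decodeBatchNCHW(int64_t numFrames, int64_t height, int64_t width) {
  at::Tensor batch =
      at::empty({numFrames, height, width, 3}, at::kByte); // NHWC
  for (int64_t i = 0; i < numFrames; ++i) {
    batch[i].copy_(decodeFrameHWC(i));
  }
  return batch.permute({0, 3, 1, 2}); // one NHWC -> NCHW conversion
}
```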
Sorry, the image wasn't uploaded properly:
I am not sure about the performance implications of doing a `permute` instead of `.to(channels_last)`.
From this page: https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html#:~:text=What%20is%20Channels%20Last,pixel%2Dper%2Dpixel).
Sorry, I still don't really understand where you're coming from. Can you please share a link? Is this relevant to this PR?
Again, the change involved in this PR is:

> Batched output tensors are now always created as NHWC. They are converted to NCHW in one single step, instead of converting HWC sub-tensors N times.
Thanks for the link.
We're not concerned about memory format (contiguous vs. channels-last) in this PR. That is a concern related to, but distinct from, the dimension order.
Link in the edited comment above. It's from the channels-last page.
I am not 100% sure whether creating a NHWC tensor and permuting it is the same as creating a NCHW tensor with channels-last memory format and working with that. The code that you deleted was doing the latter.
A benchmark may show a difference -- or not. Do you know?
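For reference, a small sketch of the two constructions under discussion (shapes hypothetical); if the strides match, the underlying memory layout is identical and only the construction path differs:

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  int64_t N = 4, H = 720, W = 1280;

  // (a) Allocate NHWC, then permute the logical dims to NCHW.
  at::Tensor a = at::empty({N, H, W, 3}).permute({0, 3, 1, 2});

  // (b) Allocate NCHW directly in channels-last memory format
  //     (what the deleted code did).
  at::Tensor b = at::empty(
      {N, 3, H, W},
      at::TensorOptions().memory_format(at::MemoryFormat::ChannelsLast));

  // Both should report logical shape {N, 3, H, W} with NHWC strides
  // {3*H*W, 1, 3*W, 3}; printing the strides is one benchmark-free way
  // to check whether the two really are equivalent.
  std::cout << a.strides() << "\n" << b.strides() << "\n";
  return 0;
}
```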
This PR simplifies the HWC -> CHW dimension conversion, and restricts the expected input/output dimension order of some functions.

- `convertAVFrameToDecodedOutputOnCPU` is now always HWC. And we can now enforce it (we couldn't before).
- `convertFrameToTensorUsingFilterGraph` now always returns a HWC tensor. Similarly, `convertFrameToBufferUsingSwsScale` now always expects (a pointer to) a HWC tensor.