Review of C++ decoding entry-points #260


Closed
NicolasHug opened this issue Oct 14, 2024 · 2 comments



NicolasHug commented Oct 14, 2024

We currently have 4 public C++ entry points for decoding, plus 1 private outlier. Below are their high-level call graphs:

  • getFrameAtIndex -> idx2pts -> getDecodedOutputWithFilter -> convertAVFrameToDecodedOutput

  • getFramesInRange -> getFrameAtIndex (in a for loop)

  • getFrameDisplayedAtTimestampNoDemux -> getDecodedOutputWithFilter -> convertAVFrameToDecodedOutput

  • getFramesDisplayedByTimestampInRange -> pts2idx -> getFrameAtIndex (in a for loop)

Note 1: pts2idx and idx2pts aren't existing functions; they're just inline conversion logic.
Note 2: convertAVFrameToDecodedOutput is responsible for calling either libavfilter or swscale to convert a single frame to RGB.

Then there is a private entry point, which is only exposed as a core API. It is used in the private/deprecated samplers and also used internally:

  • getFramesAtIndices -> idx2pts -> getDecodedOutputWithFilter -> "batched libavfilter/swscale" logic

Note 3: this "batched libavfilter/swscale" logic is basically the same as what happens in convertAVFrameToDecodedOutput, but with a slight optimization for batched output that avoids an extra copy.
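To make Note 3's copy-avoidance concrete, here is a minimal sketch using std::vector as a stand-in for a tensor; all names (convertFrameInto, convertSingle, convertBatch) are illustrative, not the actual torchcodec code. The batched path converts each frame directly into its slice of a pre-allocated N*H*W*C buffer, whereas the single-frame path converts into a fresh buffer that the caller may then have to copy into the batch:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical conversion: fill `dst` (frameSize bytes) with "RGB" data.
void convertFrameInto(uint8_t* dst, size_t frameSize, uint8_t value) {
  std::fill(dst, dst + frameSize, value);
}

// Single-frame path: allocate, convert, return (the caller may copy again
// when assembling a batch).
std::vector<uint8_t> convertSingle(size_t frameSize, uint8_t value) {
  std::vector<uint8_t> out(frameSize);
  convertFrameInto(out.data(), frameSize, value);
  return out;
}

// Batched path: convert each frame straight into the batch buffer,
// skipping the intermediate per-frame allocation and copy.
std::vector<uint8_t> convertBatch(size_t numFrames, size_t frameSize) {
  std::vector<uint8_t> batch(numFrames * frameSize);
  for (size_t i = 0; i < numFrames; ++i) {
    convertFrameInto(batch.data() + i * frameSize, frameSize,
                     static_cast<uint8_t>(i));
  }
  return batch;
}
```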

Some issues

The current state of things is the result of many locally-optimal and sensible decisions. In aggregate, however, it's pretty confusing, especially for someone like me who didn't write most of this code. In particular we end up with the following quirks, in increasing order of importance:

  • getFramesInRange calls into getFrameAtIndex but getFramesDisplayedByTimestampInRange doesn't call into getFrameDisplayedAtTimestampNoDemux

  • getFramesDisplayedByTimestampInRange converts pts to idx and then idx back to pts

  • Naively, it should make sense for getFramesInRange to call into getFramesAtIndices. It's not immediately obvious why that isn't the case.

  • It's not clear why getFramesAtIndices exists at all considering it's not used by any public API.

  • [maintenance] The "libavfilter/swscale" logic is implemented twice: in convertAVFrameToDecodedOutput for single frames, and in getFramesAtIndices with a more efficient path for batched frames.

  • [perf] The public entry points do not benefit from that efficient batched "libavfilter/swscale" path since it only exists in getFramesAtIndices.

Some more issues

For the samplers we have to implement 2 variants of getFramesAtIndices (potentially by extending it): we need "sort and dedup" logic for the frame indices. This has to exist for both indices and pts, noting that the pts will have to be converted to indices so that the dedup can happen (see #256).

This is likely to lead to further fragmentation of the decoding entry points if we don't do anything about it.
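The "sort and dedup" step mentioned above could look like the following sketch: decode each unique index once, in increasing order (which is cheapest for seeking). The function name is illustrative, not an actual torchcodec API:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sort the requested frame indices and drop duplicates, so each frame is
// decoded exactly once and seeks only ever move forward.
std::vector<int64_t> sortedUniqueIndices(std::vector<int64_t> indices) {
  std::sort(indices.begin(), indices.end());
  indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
  return indices;
}
```

A real implementation would also need to remember the original positions so the decoded frames can be scattered back into the caller's requested order.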


Before we move forward with the implementation of these new/extended entry-points, do we:

  • pay some debt right now and address some of the quirks mentioned above ([perf], [maintenance])
  • yolo it, let's just add more stuff and refactor later, maybe

scotts commented Oct 15, 2024

My understanding of some of what's going on:

  • getFramesAtIndices was indeed implemented for the deprecated samplers. Either it, or something like it, should be exposed publicly in VideoDecoder to support the new samplers or the cases where our users want to implement their own sampling logic.
  • getFramesInRange accepts the range as start, stop and step values. getFramesAtIndices accepts the indices through a vector. Making getFramesInRange call getFramesAtIndices would require creating a vector of indices, which is inefficient. Rather than combining the two, getFramesAtIndices should call getFrameAtIndex. That should eliminate duplicate logic.
  • It seems like it would be logical for there to be some rhyming between indices and timestamps in terms of their range and single entry points. Why they don't:
    • This is the quirk that getFramesDisplayedByTimestampInRange does not call getFrameDisplayedAtTimestampNoDemux.
    • The reason that getFramesDisplayedByTimestampInRange instead calls into getFrameAtIndex is because it's easier and more efficient for us to convert from pts-space to index-space to compute and retrieve the timestamp ranges. Once we've done that, we might as well just call the index-based frame retrieval, rather than converting back to pts and calling timestamp based frame retrieval.
    • As quirky as this is, I don't see this changing unless we change the algorithm for getFramesDisplayedByTimestampInRange. We would have to do something where we call getFrameDisplayedAtTimestampNoDemux on startSeconds, then call getNextDecodedOutputNoDemux in a loop, always evaluating if we have decoded past stopSeconds. That was the approach I was going to take originally, but what we currently have seemed better.
    • Basically, if you're going to pre-compute the frames you want, you're going to need to convert to index-space. I believe we have encountered something similar in the sampler?
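The structure proposed in the first two bullets can be sketched as follows: both batch entry points funnel into a single getFrameAtIndex, and the range variant iterates start/stop/step directly instead of materializing an index vector. The Decoder and Frame types here are placeholders, not the real VideoDecoder signatures:

```cpp
#include <cstdint>
#include <vector>

struct Frame {
  int64_t index = -1;  // stand-in for the decoded frame data
};

struct Decoder {
  // Single source of truth for per-frame decoding.
  Frame getFrameAtIndex(int64_t i) { return Frame{i}; }

  // Vector-of-indices entry point, rewritten to delegate.
  std::vector<Frame> getFramesAtIndices(const std::vector<int64_t>& indices) {
    std::vector<Frame> out;
    out.reserve(indices.size());
    for (int64_t i : indices) out.push_back(getFrameAtIndex(i));
    return out;
  }

  // Range entry point: no intermediate index vector is ever built.
  std::vector<Frame> getFramesInRange(int64_t start, int64_t stop, int64_t step) {
    std::vector<Frame> out;
    for (int64_t i = start; i < stop; i += step) out.push_back(getFrameAtIndex(i));
    return out;
  }
};
```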

Some refactoring I would like to do is to never store pts as a double in seconds. We have to expose pts as a double in seconds, but we should never store it as such. This matters for our seek request:

```cpp
void VideoDecoder::setCursorPtsInSeconds(double seconds) {
  // ...
}
```

It makes sense for us to expose that as an API. But as an implementation, we store the value as-is, as a double. This is our only seek function, so we call it everywhere. This then leads us to convert the int-based pts values we get directly from FFmpeg to seconds as doubles in many places, which risks precision loss. We should instead store the pts as an int, so that internally we can seek without converting to a double. I'm happy to do this refactoring.
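A minimal sketch of that refactoring, assuming a TimeBase struct mirroring FFmpeg's AVRational (the Cursor type and member names are illustrative): pts is stored as the int64_t value FFmpeg hands us, in stream time_base units, and the double-seconds conversion happens only at the public API boundary.

```cpp
#include <cstdint>

struct TimeBase {
  int num;
  int den;  // e.g. {1, 90000} is a common stream time base
};

struct Cursor {
  int64_t pts = 0;  // stored losslessly, in time_base units
  TimeBase timeBase{1, 90000};

  // Public API still takes seconds, but converts to integer pts once,
  // up front, instead of carrying a double around internally.
  void setCursorPtsInSeconds(double seconds) {
    pts = static_cast<int64_t>(seconds * timeBase.den / timeBase.num);
  }

  // Seconds are computed on demand at the boundary; the stored value
  // never round-trips through a double.
  double ptsSeconds() const {
    return static_cast<double>(pts) * timeBase.num / timeBase.den;
  }
};
```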

@NicolasHug

After discussing this we decided to:

  • rewrite getFramesAtIndices to rely on getFrameAtIndex
  • let getFrameAtIndex (and downstream logic) accept an optional pre-allocated output tensor. This will allow batched APIs to pass a 3D view of their already-allocated 4D tensors, and save a copy.

These 2 combined should address both the [maintenance] and [perf] points above.
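A sketch of what the agreed design could look like, with std::vector and raw pointers standing in for the 4D batch tensor and its 3D per-frame views (all names and the kFrameSize constant are illustrative): the core getFrameAtIndex writes into caller-provided memory, and the batched API hands it slices of the batch buffer so no per-frame copy is needed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kFrameSize = 12;  // stand-in for H*W*C

// Core version: decodes frame `i` into caller-provided memory
// (kFrameSize bytes). A view into pre-allocated storage goes here.
void getFrameAtIndex(int64_t i, uint8_t* out) {
  for (size_t j = 0; j < kFrameSize; ++j) {
    out[j] = static_cast<uint8_t>(i);  // placeholder for real decoding
  }
}

// Convenience overload: allocates its own output when no buffer is given,
// preserving the existing single-frame behavior.
std::vector<uint8_t> getFrameAtIndex(int64_t i) {
  std::vector<uint8_t> out(kFrameSize);
  getFrameAtIndex(i, out.data());
  return out;
}

// Batched API: allocates the "4D" batch once, then passes per-frame
// slices down to getFrameAtIndex; no extra per-frame copy.
std::vector<uint8_t> getFramesAtIndices(const std::vector<int64_t>& indices) {
  std::vector<uint8_t> batch(indices.size() * kFrameSize);
  for (size_t n = 0; n < indices.size(); ++n) {
    getFrameAtIndex(indices[n], batch.data() + n * kFrameSize);
  }
  return batch;
}
```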
