Media elements should support a rational time value for seek() #609

Open
jernoble opened this issue Feb 2, 2016 · 8 comments

Comments


jernoble commented Feb 2, 2016

Using floating point time values when seeking is inherently imprecise. Specifically, authors attempting to precisely seek to the beginning of a specific video frame will often find that they have seeked to the end of the previous video frame instead. This is due to rounding errors when converting from the double-precision floating point values used by JavaScript to the rational, integer-based time values used by media file formats.

Double-precision floating point values can exactly represent integers up to 2^53. So a rational time value can be represented by two JavaScript numbers so long as both numbers are smaller than 2^53.
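
To illustrate the problem (a worked example, not spec text): the 30th frame boundary of 29.97 fps video is exactly 30030/30000 seconds, but the nearest double is slightly below that, so truncating back to media-timescale ticks lands inside the previous frame. Carrying the value as two integers avoids this entirely.

```javascript
// 30030/30000 s cannot be represented exactly as a double; the nearest
// double is slightly *below* 1.001, so converting back to ticks by
// truncation lands one tick short, inside the previous frame.
const asDouble = 30030 / 30000;            // ~1.0009999999999999
console.log(Math.floor(asDouble * 30000)); // 30029, not 30030

// The same time carried as two integers stays exact:
const value = 30 * 1001; // 30030 ticks
const timeScale = 30000; // ticks per second
console.log(Number.isSafeInteger(value * timeScale)); // true, well below 2^53
```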

Building off of @foolip's work in issue #553, rational time seeking could be supported by adding an optional timeScale parameter to the SeekOptions, which defaults to 1 if absent.

An author with a 29.97 fps video file could then accurately seek to the 30th frame by issuing:

video.seek(30 * 1001, { mode: "precise", timeScale: 30000 });

Left unspecced for now is how the author would determine the correct time scale for the movie or track. For the current proposal, the time scale could be provided out of band.


foolip commented Feb 2, 2016

There is an old Bugzilla bug for this, which I'll close and redirect here:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=23493


foolip commented Feb 2, 2016

If the timeScale used doesn't match that of the media resource, would the results be sane? Does Apple's media framework support seeking to any rational number, or just integers relative to the scale?

At least the WebM container stores a TimecodeScale on which all other times are based, so using another scale wouldn't make much more sense than using a double.

Basically, would it make sense to expose the resource's timeScale and let that be implicit in every other bit of API, or is it useful to be able to seek to 10/3 or similar regardless of the internal scale?


jernoble commented Feb 2, 2016

Apple's media frameworks support seeking to any rational time value. Generally, the final seek value is created by finding the least common denominator between the input value and the media's time scale. If a final value can't be created without losing precision, the value is marked as having been rounded.

See CMTime and -[AVPlayer seekToTime:] for examples of platform support.
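
The rescaling behavior described above could be sketched roughly as follows (an illustration of the described semantics, not Apple's implementation; `rescale` and `mediaTimeScale` are hypothetical names):

```javascript
// Convert value/timeScale seconds into the media's own timescale,
// flagging any precision loss. Integer math is exact while the
// intermediate product stays below 2^53.
function rescale(value, timeScale, mediaTimeScale) {
  const numerator = value * mediaTimeScale;
  const rescaled = Math.round(numerator / timeScale);
  const rounded = rescaled * timeScale !== numerator;
  return { value: rescaled, timeScale: mediaTimeScale, rounded };
}

// 30 frames of 29.97 fps video into a 90 kHz MPEG timescale: exact.
console.log(rescale(30 * 1001, 30000, 90000));
// → { value: 90090, timeScale: 90000, rounded: false }

// Frame 1 of the same video into a 1 ms (WebM-style) timescale: lossy.
console.log(rescale(1001, 30000, 1000));
// → { value: 33, timeScale: 1000, rounded: true }
```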

But keep in mind that for MPEG containers, tracks can have different time scales from one another, and different from the movie as a whole. I suspect WebM is similar (though I can't tell from the documentation whether TrackTimecodeScale is a multiplier of the segment TimecodeScale, or an independent value).

The Web platform can generally only support a single video track per media element.


foolip commented Feb 3, 2016

For WebM, the muxer guidelines say "The TimecodeScale element SHOULD be set to a default of 1.000.000 nanoseconds" and TrackTimecodeScale is listed as unsupported, so it looks like it's expected that all times in all tracks are expressed in milliseconds, which seems strange.

@tomfinegan @vigneshvg @jzern, I see that you are the main recent contributors to libwebm, can any of you summarize how frame- or sample-accurate seeking in WebM must work, and what the structure of the metadata is? (global? per track? can vary between clusters?)

@tomfinegan

For WebM, the muxer guidelines say "The TimecodeScale element SHOULD be set to a default of 1.000.000 nanoseconds" and TrackTimecodeScale is listed as unsupported, so it looks like it's expected that all times in all tracks are expressed in milliseconds, which seems strange.

The reasoning is listed just below the quote you provided:

Reason: Allows every cluster to have blocks with positive values up to 32.767 seconds.

It could go on to say "because block timecodes are expressed as signed 16-bit integers relative to the cluster timecode" to make things clearer.
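
For concreteness, recovering a block's absolute time then looks like this (a sketch with made-up values; only the signed-16-bit-relative-to-cluster layout comes from the discussion above):

```javascript
// Computing a block's absolute time in a Matroska/WebM file.
// The cluster timecode is an unsigned integer; block timecodes are
// signed 16-bit offsets relative to it, all in TimecodeScale units.
const timecodeScale = 1000000;  // ns per tick, the recommended default
const clusterTimecode = 120000; // ticks (hypothetical value)
const blockRelative = 767;      // signed 16-bit, so within -32768..32767

const absoluteNs = (clusterTimecode + blockRelative) * timecodeScale;
console.log(absoluteNs / 1e9); // 120.767 seconds
```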

Frame accurate seeking:

  • Find cluster with largest timecode <= seek point
  • Walk its blocks and find keyframe (block with marker set) with largest rel timecode <= seek point
  • If you actually need to play (aka display/render) the data
    • Decode until seek point is reached
    • Display/render starting from seek point
  • If just analyzing the data (or something else...)
    • Walk through cluster directly to desired frame
    • Do something with frame data
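
The search steps above could be sketched like this, over a hypothetical pre-parsed index (the `clusters` shape and `findSeekBlock` are illustrative names, not libwebm's API):

```javascript
// clusters: sorted array of { timecode, blocks }, where each block is
// { relTimecode, keyframe } with relTimecode relative to its cluster.
function findSeekBlock(clusters, seekTicks) {
  // 1. Cluster with the largest timecode <= the seek point.
  let cluster = null;
  for (const c of clusters) {
    if (c.timecode <= seekTicks) cluster = c;
    else break;
  }
  if (!cluster) return null;

  // 2. Keyframe with the largest absolute timecode <= the seek point.
  let best = null;
  for (const b of cluster.blocks) {
    const abs = cluster.timecode + b.relTimecode;
    if (b.keyframe && abs <= seekTicks) best = { abs, block: b };
  }
  return best;
}

const clusters = [
  { timecode: 0,   blocks: [{ relTimecode: 0, keyframe: true },
                            { relTimecode: 40, keyframe: false }] },
  { timecode: 100, blocks: [{ relTimecode: 0, keyframe: true },
                            { relTimecode: 40, keyframe: true }] },
];
console.log(findSeekBlock(clusters, 150)); // keyframe at absolute tick 140
```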

I'm assuming same-accurate is a typo/autocorrect mishap and it should be sample. It's essentially the same; it's just that for audio the keyframe marker is set on all blocks, so you can get a little closer to exactly what you want without the pre-roll decoding you end up doing to reach a non-keyframe video frame. (Though one needs to be careful about CodecDelay and DiscardPadding when handling Opus audio.)

Not sure about your metadata question. Are you asking whether things like the video frame rate and audio sample rate are non-constant? The video frame rate definitely can be (e.g. a webcam feed run through a live encoder is rarely constant frame rate). I don't think a non-constant audio sample rate would work in any player, but I've been wrong before.


foolip commented Feb 3, 2016

Thanks, @tomfinegan!

Reason: Allows every cluster to have blocks with positive values up to 32.767 seconds.

I saw this, but didn't realize that blocks and clusters used different representations for the timecodes. I still can't tell from the documentation, but see in libwebm that blocks use short m_timecode while clusters use long long m_timecode.

I'm assuming same-accurate is a typo/autocorrect mishap and it should be sample.

Oops, edited in place.

Not sure about your metadata question.

In essence, is there only a single timescale (recommended to be 1.000.000) across a whole WebM file? It seems so from the documentation, and if so, that's good; something like chained Ogg would not have this property, I think.

Anyway, if the timecode scale is constant across the file, how can one seek to a specific audio sample if the sample rate isn't a divisor of 1.000.000? Or indeed to the start of a video frame if the source material was 29.97 fps or some such?

I'm guessing that times are simply rounded to the closest possible value. If so, it seems tricky for a precise seeking API like the one we're discussing to know which rational number corresponds to a specific frame, even assuming one knows the (constant) framerate of the source material.
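
To make the guess concrete (an illustration of assumed muxer rounding, not taken from the WebM spec): 29.97 fps frame start times snapped to a 1 ms timecode grid already disagree with the exact rational values at frame 1.

```javascript
// Frame n of 30000/1001 fps video starts at n * 1001/30 ms exactly;
// storing that on a 1 ms grid forces rounding.
const fpsNum = 30000, fpsDen = 1001;
for (let n = 0; n < 4; n++) {
  const exactMs = (n * fpsDen * 1000) / fpsNum;
  console.log(n, exactMs.toFixed(4), Math.round(exactMs));
}
// Frame 1 starts at 33.3667 ms but is stored as 33 ms, so even an
// exact rational seek target can't be recovered from the stored timecode.
```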

@jernoble, how does this work for MPEG? Are audio and video frame time offsets stored as rational numbers, or is it also always converted to a file-wide timescale?

@tomfinegan

I saw this, but didn't realize that blocks and clusters used different representations for the timecodes. I still can't tell from the documentation, but see in libwebm that blocks use short m_timecode while clusters use long long m_timecode.

It's kind of buried on https://www.matroska.org/technical/specs/index.html

You'll find the relevant bit in the second row of the block header structure table:
https://www.matroska.org/technical/specs/index.html#block_structure

It took me a couple minutes to find it again, and I already knew it was there. :)

Abusing a quote from the WebM guidelines again:

Muxer Guidelines
Muxers should treat all guidelines marked SHOULD in this section as MUST. This will foster consistency across WebM files in the real world.

Basically, the timecode scale of 1.000.000 is a requirement.

I'm guessing that times are simply rounded to the closest possible value. If so, it seems tricky for a precise seeking API like the one we're discussing to know which rational number corresponds to a specific frame, even assuming one knows the (constant) framerate of the source material.

Yes, there is rounding of the timestamp values (and the durations) when converting from time-in-{audio-samples|video-frames} to time-in-milliseconds. It is going to be tricky to seek to a precise frame or sample as a result. TBH I don't think frame- or sample-accurate seeking was a primary concern at the time the guidelines were written. Being off by a frame or a sample usually isn't a big deal in terms of playback, except for A/V sync when the error spans entire frames, with sensitive viewers, or at very low frame rates.

Anyway, when dealing with combined playback of A/V streams, most playback solutions I've seen treat the PTS as informative but rely on the audio hardware clock for syncing with the video (i.e. read back the samples actually played by the hardware, and use that to determine when it's time to render a video frame).


foolip commented Feb 3, 2016

Very interesting, thanks again! I don't know what that means for the HTMLMediaElement feature under consideration here, but it's a safe bet that it's not trivial to get this right across container formats, or even a single one.
