
Support sample accurate audio splicing using timestampOffset/appendWindowStart/appendWindowEnd #37

Open
Melatonin64 opened this issue Nov 9, 2015 · 14 comments
Labels
feature request · needs author input · TPAC-2022-discussion (Marked for discussion at TPAC 2022 Media WG meeting)

Comments

@Melatonin64

One of the use cases for sample-accurate-audio-splicing is gapless audio playback (by removing the excess front/back padding added by most audio codecs).

Step 9 of the for loop in the Coded Frame Processing algorithm states:
If frame end timestamp is greater than appendWindowEnd, then set the need random access point flag to true, drop the coded frame, and jump to the top of the loop to start processing the next coded frame.

For audio, this means that a complete coded frame (e.g., 1024 audio samples for AAC) will be dropped even if some of its audio samples would have fallen within the append window.
This coded-frame granularity is not sufficient to achieve gapless audio playback.
It would be great if, instead, we could keep the frame and mark a range of its samples to be discarded, so that once the frame is decoded only the samples that fall within the append window are used.

It's possible that this change alone is not sufficient to support sample-accurate-audio-splicing...

A test webpage can be found here (based on Dale Curtis' article).
Further discussion can be found here.
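To make the use case concrete, here is a minimal sketch of the app-side arithmetic behind the gapless approach in Dale Curtis' article: computing `timestampOffset`, `appendWindowStart`, and `appendWindowEnd` from a track's gapless metadata so the codec's front/back padding is trimmed at the splice. The helper name, input structure, and the specific sample counts are illustrative assumptions, not anything defined by the spec.

```javascript
// Hypothetical helper (names are illustrative): given gapless metadata for
// one track, compute the MSE attribute values an app would set on the
// SourceBuffer before appending that track's media segments.
function gaplessWindow({ frontPaddingSamples, endPaddingSamples,
                         totalSamples, sampleRate, spliceTime }) {
  const audibleDuration =
    (totalSamples - frontPaddingSamples - endPaddingSamples) / sampleRate;
  return {
    // Shift media time so the first *audible* sample lands at spliceTime.
    timestampOffset: spliceTime - frontPaddingSamples / sampleRate,
    // Trim encoder priming at the front of the track...
    appendWindowStart: spliceTime,
    // ...and encoder padding at the back.
    appendWindowEnd: spliceTime + audibleDuration,
  };
}

// Example: a 10-second 44.1 kHz AAC track with typical priming (2112
// samples) and some end padding (576 samples here), spliced at t = 0.
const w = gaplessWindow({
  frontPaddingSamples: 2112, endPaddingSamples: 576,
  totalSamples: 441000 + 2112 + 576, sampleRate: 44100, spliceTime: 0,
});
// An app would then set, per track:
//   sourceBuffer.timestampOffset   = w.timestampOffset;
//   sourceBuffer.appendWindowStart = w.appendWindowStart;
//   sourceBuffer.appendWindowEnd   = w.appendWindowEnd;
```

The issue is that the priming (2112 samples) is not a whole number of 1024-sample AAC frames, so the append window boundaries necessarily cut through the middle of coded frames - exactly the case step 9 handles by dropping the whole frame.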

@wolenetz
Member

Does the suggested change not fall within the non-normative note already in the spec? (pasted below)
NOTE
Some implementations may choose to collect some of these coded frames that are outside the append window and use them to generate a splice at the first coded frame that has a presentation timestamp greater than or equal to appendWindowStart even if that frame is not a random access point. Supporting this requires multiple decoders or faster than real-time decoding so for now this behavior will not be a normative requirement.

@wolenetz
Member

On reread, this looks like a request to make gapless behavior (as is in Chrome) normative, not non-normative. For v1, this remains a quality-of-implementation issue. @Melatonin64 do you have a better non-normative note that we could put into v1?

Per triage process, marking V1NonBlocking to resolve any non-normative note fixes.
Gapless support, per reasons in the existing non-normative note after step 8, is non-normative for V1.

@wolenetz wolenetz added this to the V1NonBlocking milestone Mar 16, 2016
@Melatonin64
Author

Thanks for your comments.
You're correct - this is basically a request to make gapless audio playback normative.

I'm not entirely sure why this requires multiple decoders (for audio codecs, where every frame is a random access point AFAIK).
It seems to me this could be easily implemented by including the coded frames that sit on the append window boundaries, while retaining some metadata for the splice.
Once the audio samples have been decoded, the excess samples (those that fall outside the append window) can just be discarded.
I might be missing something here though...

Otherwise, I don't have anything to add to the note.
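The post-decode discard being proposed here can be sketched as a small pure function: keep the boundary coded frame, decode it, then drop the PCM samples that fall outside the append window. This is only an illustration of the idea (function and parameter names are hypothetical), not how any browser's pipeline is actually structured.

```javascript
// Illustrative sketch of the proposed post-decode discard: given one
// decoded frame's PCM samples (one channel), its presentation start time,
// and the append window, return only the samples inside the window.
function trimDecodedFrame(pcm, frameStartTime, sampleRate,
                          appendWindowStart, appendWindowEnd) {
  // Index of the first sample at or after appendWindowStart.
  const first = Math.max(0,
    Math.ceil((appendWindowStart - frameStartTime) * sampleRate));
  // Index one past the last sample before appendWindowEnd.
  const last = Math.min(pcm.length,
    Math.floor((appendWindowEnd - frameStartTime) * sampleRate));
  return pcm.subarray(first, Math.max(first, last)); // kept samples only
}

// Example: a 1024-sample AAC frame starting at t = 0.99 s, with
// appendWindowEnd = 1.0 s. Step 9 as written drops all 1024 samples;
// sample-accurate trimming keeps the 480 samples before the boundary.
const frame = new Float32Array(1024);
const kept = trimDecodedFrame(frame, 0.99, 48000, 0, 1.0);
// kept.length === 480
```

This is the "easily implemented" part; the spec note's concern is the decoder-side cost of producing those boundary frames' output in time, not the trimming itself.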

@jdsmith3000
Contributor

It's my understanding that this would be implemented post-decode. I'm not sure either why the note mentions multiple decoders, but it is not a simple change.

@wolenetz
Member

It is not a simple change. Chrome does this post-decode, with parser-time marking of the encoded frames to assist the post-decode step. With respect to multiple decoders: at minimum, a faster-than-realtime decoder would probably be necessary if the splicing for gapless is done post-decode at playback time, because some of the decoder output from before and after the splice is extra decoder work relative to the decoded samples actually kept. Some implementations may not be able to do this faster-than-realtime decode without more than one decoder (this is my educated guess as to why that non-normative note is phrased the way it is).

@Melatonin64
Author

Ok, thanks for your comments.

I think it's a shame that implementers are not required to implement this.
Also, since Step 9 in Coded Frame Processing explicitly instructs implementers to:
drop the coded frame ... If frame end timestamp is greater than appendWindowEnd,
implementing this might not even occur to some.

Are there any plans to make this behavior normative in subsequent versions of the spec?

@wolenetz
Member

@Melatonin64 at the moment we are focused on getting MSE v1 spec across the line. A feature request like this imposes constraints that may be too much for some implementations, especially at this point in the spec process. I propose we move this to VNext.
If we keep this in V1, the change is substantive, so the milestone shouldn't be V1NonBlocking.

@wolenetz wolenetz modified the milestones: VNext, V1NonBlocking May 17, 2016
@Melatonin64
Author

@wolenetz Ok, thanks.

@wolenetz wolenetz removed this from the VNext milestone Jun 9, 2020
@mwatson2
Contributor

mwatson2 commented Sep 8, 2020

IIUC, this is asking for PCM sample-accurate audio splicing, which is definitely a desirable feature.

@Melatonin64
Author

Yup, correct!

@mwatson2 mwatson2 added this to the V2 milestone Sep 21, 2020
@wolenetz
Member

See also #165

@mwatson2
Contributor

This feature needs to cover the various ways the audio could be spliced after decoding and after applying the append window in the decoded domain. For example, if overlap remains between the old and new audio, what are the options for browsers? Cross-fading should remain an option; this would allow sites to specify the exact position of the start of the cross-fade within an audio frame.
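For illustration, the cross-fade option over a decoded-domain overlap can be sketched as a sample-accurate linear fade between the old and new PCM. This is one possible rendering of the overlap, not specified behavior, and the function name is hypothetical.

```javascript
// Sketch of one overlap-rendering option: a sample-accurate linear
// cross-fade between the tail of the old audio and the head of the new
// audio (one channel each), over their overlapping region.
function crossFade(oldPcm, newPcm) {
  const n = Math.min(oldPcm.length, newPcm.length); // overlap length
  const out = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    const t = n > 1 ? i / (n - 1) : 1;            // fade position in [0, 1]
    out[i] = oldPcm[i] * (1 - t) + newPcm[i] * t; // fade old out, new in
  }
  return out;
}

// Example: fading a constant 1.0 signal into silence over 5 samples.
const faded = crossFade(new Float32Array(5).fill(1), new Float32Array(5));
// faded ramps from 1.0 down to 0.0 across the overlap
```

Since the overlap position is determined by the append window values, a site could place the start of the cross-fade at an arbitrary sample within an audio frame, as suggested above.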

@wolenetz
Member

wolenetz commented Aug 10, 2021

This was discussed on today's Media Workgroup call.

Since not all implementations may be capable of supporting sample-accurate splicing, perhaps the best approach is a normative feature-detection mechanism, so apps can adapt appropriately, plus normative behavior defining what an implementation must do to support interoperable gapless/sample-accurate splicing. This would enable interop tests, and ideally promote more consistent interpretation of bytestream-specific metadata such as negative timestamps, edit lists, decoder preroll, and decoder delay - all to improve interoperability of gapless/sample-accurate implementations in particular.

Note that cross-fade at whatever audio splice points result may depend on implementation capability for doing that (for example, currently Chrome doesn't cross-fade, yet does do sample-accurate audio splicing). See #165 for related issue around splice rendering behavior.

@wolenetz wolenetz added the TPAC-2022-discussion Marked for discussion at TPAC 2022 Media WG meeting Sep 16 label Sep 16, 2022
@chrisn
Member

chrisn commented Dec 7, 2022

Minutes from 8 Nov 2022 Media WG meeting: https://www.w3.org/2022/11/08-mediawg-minutes.html#t03


5 participants