
AudioBufferSourceNode: Allow for (small) negative offsets in Start() when subsampling #2047

Closed
collares opened this issue Aug 25, 2019 · 6 comments


collares commented Aug 25, 2019

This refers to the discussion at #2032 (comment). In my opinion, @karlt makes a very good point in his comment: due to subsampling, "interpolating" before the first sample can, in some cases, provide meaningful audio content to be played. Consider the case, extracted from buffer-resampling.html, where the buffer's sample rate is 8000 Hz and the context's sample rate is 48000 Hz. Interpolating, even linearly, before the first sample would give us 5 non-silent samples, which improve the audio quality in stitching cases.
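For concreteness, here is a minimal sketch of that arithmetic (the helper below is hypothetical, purely to illustrate where the 5 samples come from):

```js
// At 48000/8000 = 6 output frames per buffer frame, five output frames fall
// strictly between the implicit silent "sample -1" and buffer sample 0.
const ratio = 48000 / 8000; // 6

function leadingFrames(firstSample) {
  const frames = [];
  for (let i = 1; i < ratio; i++) {
    const pos = -1 + i / ratio; // position in buffer-frame units, in (-1, 0)
    frames.push((pos + 1) * firstSample); // linear ramp from silence to buffer[0]
  }
  return frames; // 5 non-silent values whenever firstSample != 0
}
```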

For backwards compatibility, the AudioBufferSourceNode can't start playing before its start time, so I am suggesting an opt-in alternative: allowing negative offsets (at least those falling between "buffer sample -1", which is silence, and buffer sample 0) would have the desired outcome and address @karlt's audio quality concern. In this way, buffer stitching (as in sub-sample-buffer-stitching.html) could be achieved in a robust manner without linear extrapolation at the endpoints. The spec already allows for offsets a bit after the last buffer sample (see #2032; this behavior is tested in buffer-resampling.html), so this would merely make playback symmetric with respect to the endpoints.

rtoy (Member) commented Aug 26, 2019

I'm not sure how this would work in general and be self-consistent. Let's say you called start(t), and t just happens to be the same as the current time. How do you "insert" the interpolated samples before the start time? Either you don't or you delay the output a few samples.

But let's say t is a little bit later than the current time. Then there is time to do this interpolation. But now your output depends on some magic relationship between t and the current time, and the output can change depending on that relationship.

This also raises the question of what start(t) really means if there is non-zero output before t.

collares (Author) commented Aug 26, 2019

I understand your point (I had the same objections in #2032 (comment)), but my proposal is more conservative than that. Here's an example that might help: suppose buffer.sampleRate = 8 kHz and context.sampleRate = 48 kHz. Denote by EPS the time between two buffer samples (six context ticks, 0.000125 s). In this setup, the last buffer sample corresponds to time buffer.duration - EPS. If playbackRate == -1, the user can (currently!) choose between source.start(t, buffer.duration), which plays the interpolated samples first, and source.start(t, buffer.duration - EPS), which starts on the last buffer sample immediately.
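In code, that existing reverse-playback choice looks something like this (a sketch assuming ctx is an AudioContext at 48 kHz, buffer is the 8 kHz AudioBuffer, and t is the desired start time):

```js
const EPS = 1 / 8000; // one buffer-frame period, 0.000125 s
const source = new AudioBufferSourceNode(ctx, { buffer, playbackRate: -1 });
source.connect(ctx.destination);
// Valid today: begin with the interpolated samples "past" the last frame...
source.start(t, buffer.duration);
// ...or start exactly on the last buffer frame instead:
// source.start(t, buffer.duration - EPS);
```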

Similarly, my proposal is to allow source.start(t, -EPS) so the user can choose to get the interpolated samples for positive playbackRates. source.start(t) wouldn't have its behavior changed: it would not insert interpolated samples, regardless of whether t is in the present or in the future. The main application of this is stitching buffers, inspired by the WPT test. With this, you could do

source1.start(now) // without interpolation before the first sample
source2.start(now + buffer1.duration - EPS, -EPS) // smooth stitching

I'm not saying this is worth the trouble, but it would at least make the algorithm symmetric with respect to playback direction.
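Put together, the proposed stitching might look like this sketch (the -EPS offset in the second start() call is the proposed extension, not current spec behavior; buffer1 and buffer2 are assumed to be 8 kHz AudioBuffers):

```js
const ctx = new AudioContext({ sampleRate: 48000 });
const EPS = 1 / 8000; // one buffer-frame period at 8 kHz

const source1 = new AudioBufferSourceNode(ctx, { buffer: buffer1 });
const source2 = new AudioBufferSourceNode(ctx, { buffer: buffer2 });
source1.connect(ctx.destination);
source2.connect(ctx.destination);

const now = ctx.currentTime;
source1.start(now); // plays buffer1 without interpolation before its first sample
// Proposed: -EPS asks for the interpolated ramp from "sample -1" (silence)
// into buffer2's first sample, so the two buffers stitch smoothly.
source2.start(now + buffer1.duration - EPS, -EPS);
```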

karlt (Contributor) commented Aug 27, 2019

I suspect there may be another way to address this.

This also raises the question of what start(t) really means if there is non-zero output before t.

I think that is a key part of the issue here.

Half the energy of the first sample in the buffer is before the time of the first sample and half is after, so centering the first buffer sample on time t always places half the energy of the first sample before time t.

When the sample rate of the buffer matches that of the AudioContext and t = n/sampleRate for integer n, we don't notice any complications because we interpret the first ABSN output sample with its energy centered around the time t. The differences in interpretation show up when "Resampling of the buffer may be performed arbitrarily by the UA at any desired point".

I've been assuming that the first sample in the buffer corresponds to buffer playhead position 0 and that the last sample corresponds to playhead position buffer.duration - 1/buffer.sampleRate. However, it would be just as reasonable to assume that the first sample corresponds to playhead position 1/buffer.sampleRate and that the last sample corresponds to position buffer.duration. So the question arises whether symmetry would be better, i.e. first and last samples centred at 0.5/buffer.sampleRate from each end.

There are similar interpretations possible for the time of the first sample rendered by the AudioContext. Presumably this should be consistent with that of the buffer playhead position. I have been assuming that the first sample rendered by the context corresponds to time zero. However, there is a strong case to indicate that this is wrong.

AudioParam.setValueAtTime(value, startTime) "Schedules a parameter value change at the given time", so it produces a step-function change at startTime. If a PCM representation of this function had a sample point exactly at startTime, then that sample point would render a value midway between the values before and after startTime. No implementation AFAIK implements setValueAtTime(newValue, 0) so that the first sample has a value midway between the values before and after the change. Instead, implementations render newValue at the first sample. It is as if the first sample rendered corresponds to time 0.5/context.sampleRate.
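This is easy to observe (a sketch that renders a scheduled step offline and inspects the first frame):

```js
const off = new OfflineAudioContext(1, 128, 48000);
const src = new ConstantSourceNode(off, { offset: 0 });
src.offset.setValueAtTime(1, 0); // step change scheduled exactly at time 0
src.connect(off.destination);
src.start(0);
off.startRendering().then((rendered) => {
  // Implementations render 1 here, not the midpoint 0.5: the first frame
  // behaves as if its sample point sits at 0.5/sampleRate, not exactly at 0.
  console.log(rendered.getChannelData(0)[0]);
});
```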

start(t) means that the initial buffer playhead position (offset) will be aligned with time t.

For zero offset, with the first buffer sample corresponding to playhead position 0.5/8000 and the first ABSN output sample corresponding to time t + 0.5/48000, there will be sufficient output samples after time t to capture most of the energy of the first buffer sample.

rtoy (Member) commented Aug 27, 2019

Maybe we can rephrase the question a bit. Let's say the AudioBuffer creates a new (internal) resampled array. Let's also assume that the resampling is done using either a truncated sinc function or a typical linear-phase FIR interpolating (decimating) filter, so that you know the precise delay caused by the filter.

Apply the filter to the original buffer to create a new buffer at the context rate. We know what the delay is, so drop the samples before the delay, and just keep those samples in the new buffer.

Then we can process the AudioBuffer using the new array. If the start time is on a frame boundary, we are done.

If not, we can just linearly interpolate, or we can get fancy. Say the requested start time t0 lies between n/Fs and (n+1)/Fs. Create a new filter like the original interpolating filter, but with an additional delay of t0 - n/Fs seconds. Apply this filter to get a new audio buffer, but keep only the samples from (n+1)/Fs onward. Output this signal at the frame boundaries.

I think this approach produces the output you want, and still preserves the fact that for start(t0), the output is 0 for t < t0, and non-zero for t > t0.
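For illustration, here is a minimal sketch of the bookkeeping, with linear interpolation standing in for the interpolating FIR (a real implementation would use the filter described above, whose delay is known):

```js
// Resample `input` (at inRate) to outRate, folding the sub-frame start delay
// t0 - n/Fs (in seconds) into the interpolation so output frames stay on the
// context's frame boundaries. Reads outside the buffer are treated as silence.
function resampleWithSubFrameDelay(input, inRate, outRate, fracDelaySec) {
  const step = inRate / outRate;
  const out = new Float32Array(Math.ceil(input.length / step));
  for (let i = 0; i < out.length; i++) {
    const pos = (i - fracDelaySec * outRate) * step; // input-frame position
    const j = Math.floor(pos);
    const frac = pos - j;
    const a = j >= 0 && j < input.length ? input[j] : 0;
    const b = j + 1 >= 0 && j + 1 < input.length ? input[j + 1] : 0;
    out[i] = a + frac * (b - a); // linear interpolation between neighbors
  }
  return out;
}
```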

But fundamentally, the developers who really need this kind of precise output should not depend on WebAudio doing exactly what they want. They should ensure that all the buffers have rates matching the context sample rate, or choose a context sample rate that matches the buffers. They should also avoid subsample starts and always start on a frame boundary.

karlt (Contributor) commented Aug 28, 2019

We know what the delay is, so drop the samples before the delay, and just keep those samples in the new buffer.

Those leading samples are there because they contain some of the energy from the buffer signal. If delay is measured to the time corresponding to the center of the first buffer sample (this is not the only interpretation of delay), dropping the samples before the delay when up-sampling would drop much of the energy of the first sample.

I would agree though that it is reasonable to expect clients wanting precise output to resample themselves.

padenot (Member) commented Aug 29, 2019

AudioWG call:

We discussed this, and, bottom line, this sentence:

I would agree though that it is reasonable to expect clients wanting precise output to resample themselves.

more or less summarizes the group's position. In this day and age, with AudioWorklet available, it's reasonable to expect authors to do this on their own if they want precise control over everything.
