<pre class='metadata'>
Title: Web Audio API performance and debugging notes
Status: ED
ED: https://padenot.github.io/web-audio-perf
shortname: web-audio-perf
Level: 1
Editor: Paul Adenot <padenot@mozilla.com>
Abstract: These notes present the Web Audio API from a performance and debugging point of view, outlining some differences between implementations.
group: plain
Boilerplate: omit property-index logo copyright references property-index
</pre>
<style>
a[data-file] {
border-bottom: 1px dotted gray;
}
a[data-file]:hover {
border-bottom: 2px dotted gray;
text-decoration: none;
}
</style>
<section>
<h2>Introduction</h2>
In this tutorial, we will look at several aspects of working with the Web
Audio API.
First, we'll look at the different implementations available today, how to
inspect their source code, and how to report problems found while testing
applications that use the Web Audio API.
We'll then look into the performance characteristics of the different
<code>AudioNode</code>s available: their performance profile, and their
overall CPU and memory cost.
We'll continue by exploring the different strategies and techniques
implementors have used when writing their implementations of the Web Audio
API.
We'll then look into ways to make processing lighter while still retaining
the essence of the application, for example to make a "degraded" mode for
mobile, using techniques such as substituting rendering methods to trade
fidelity against CPU load, pre-baking assets, and minimizing resampling.
Finally, we'll touch on tools and techniques useful to debug audio problems,
both using the browser developer tools and JavaScript code designed to inspect
static and dynamic audio graphs and related Web Audio API objects.
</section>
<section>
<h2>The different implementations</h2>
Four complete (if there is such a thing, considering the standard is always
evolving) Web Audio API implementations are available in browsers as of today:
<ul>
<li>The first ever implementation was part of WebKit. At the time, Chrome
and Safari were sharing the same code.</li>
<li>Then, Blink got forked from WebKit, and the two gradually diverged. They
share a lot of code, but can be considered separate implementations these
days.</li>
<li>Gecko was the second implementation, mostly from scratch, but borrowing
a few files from the Blink fork, for some processing code.</li>
<li>Edge's source is not available, but its implementation is based on an
old snapshot of Blink.</li>
</ul>
The source code of the first three implementations can be read, compiled and
modified; here are the relevant locations:
<ul>
<li>
WebKit's implementation lives at <a
href="https://trac.webkit.org/browser/trunk/">https://trac.webkit.org/browser/trunk/</a>,
at the path <code>/Source/WebCore/Modules/webaudio</code>.
</li>
<li>
Blink's implementation can be found at
<a href="https://code.google.com/p/chromium/codesearch#chromium/src/">
https://code.google.com/p/chromium/codesearch#chromium/src/</a>
which is a handy web interface, with cross-referencing of symbols. The Web
Audio API implementation lives at
<code>third_party/WebKit/Source/modules/webaudio</code>, but a number of
classes and functions, intended to be shared among different Chromium modules
are located at <code>./third_party/WebKit/Source/platform/audio/</code>.
</li>
<li>
Gecko's implementation can be found at <a
href="https://dxr.mozilla.org/mozilla-central">
https://dxr.mozilla.org/mozilla-central</a>, which is also a nice web
interface with cross-referencing of symbols,
and the Web Audio API implementation is located in
<code>dom/media/webaudio</code>. Some shared components are in
<code>dom/media</code>.
</li>
</ul>
Issues (about performance or correctness) can be filed in the project's bug
tracker:
<ul>
<li>WebKit: <a
href="https://bugs.webkit.org/enter_bug.cgi?product=WebKit&component=Web%20Audio">https://bugs.webkit.org/enter_bug.cgi?product=WebKit&component=Web%20Audio</a>
(WebKit bugzilla account needed).</li>
<li>Blink: <a href="https://new.crbug.com/">https://new.crbug.com/</a>
(Google account needed).</li>
<li>Gecko: <a
href="https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Web%20Audio">https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Web%20Audio</a>
(GitHub, Persona or Mozilla Bugzilla account needed).</li>
</ul>
When filing issues, a minimal test case reproducing the issue is very welcome,
as is a benchmark in the case of performance problems. A standalone HTML file
is usually preferred to a jsfiddle, jsbin or similar service, for archival
purposes.
<h2>Performance analysis</h2>
<h3><code>AudioNode</code>s characteristics</h3>
This section explains the characteristics of each of the
<code>AudioNode</code>s available in the Web Audio API, from four
angles:
<ul>
<li>
CPU, that is the temporal complexity of the processing algorithm;
</li>
<li>
Memory, whether the node needs to keep buffers around, or needs internal memory
for processing;
</li>
<li>
Latency, whether the processing induces a delay in the processing chain. If
this section is not present, the node does not add latency;
</li>
<li>
Tail, whether the node can produce non-zero output when the
input is continuously silent (for example because the audio source has
stopped). If this section is not present, the node does not have a tail.
</li>
</ul>
<h4> AudioBufferSourceNode </h4>
<dl>
<dt>CPU</dt>
<dd>The <code>AudioBufferSourceNode</code> automatically resamples its
<code>buffer</code> attribute to the sample-rate of the <code>AudioContext</code>. Resampling is
done differently in different browsers. Edge-, Blink- and WebKit-based
browsers use <a class="chromium" data-file="AudioBufferSourceNode.cpp:302">linear
resampling</a>, which is cheap and adds no latency, but has low quality.
Gecko-based browsers use a <a class="firefox"
data-file="AudioBufferSourceNode.cpp:294">more expensive</a> but higher
quality technique, which introduces some latency.</dd>
<dt>Memory</dt>
<dd>The <code>AudioBufferSourceNode</code> reads samples from an
<code>AudioBuffer</code> that can be shared between multiple nodes. The
resampler used in Gecko uses some memory for the filter, but nothing major.</dd>
</dl>
<h4> ScriptProcessorNode </h4>
<dl>
<dt>CPU</dt>
<dd>On Gecko-based browsers, this node uses a <a
data-file="ScriptProcessorNode.cpp:31" class="firefox">message queue</a>
to send buffers back and forth between the main thread and the rendering
thread. On other browsers, <a data-file="ScriptProcessorNode.cpp:153"
class="chromium">buffer ping-ponging</a> is used. This means that the
former is more reliable against dropouts, but can have a higher latency
(depending on the main thread event loop load), whereas the latter drops out
more easily, but has fixed latency.</dd>
<dt>Memory</dt>
<dd>
Buffers have to be allocated to move audio back and forth between threads.
Since Gecko uses a buffer queue, more memory can be used.
</dd>
<dt>Latency</dt>
<dd> The latency is specified when creating the node. If Gecko has trouble
keeping up, the latency will increase, up to a point where audio will <a
class="firefox" data-file="ScriptProcessorNode.cpp:133">start to
drop</a>.</dd>
</dl>
<h4> AnalyserNode </h4>
<dl>
<dt>CPU</dt>
<dd>This node can give frequency domain data, computed using a Fast Fourier
Transform algorithm, which is expensive. The higher the buffer size, the more
expensive the computation. The <code>byte</code> versions of the analysis
methods are <a data-file="AnalyserNode.cpp:228" class="firefox">not
cheaper</a> than their <code>float</code> alternatives; they are provided for
convenience: the <code>byte</code> versions are computed from the
<code>float</code> versions, using simple quantization to 2^8 values.</dd>
<dt>Memory</dt>
<dd>Fast Fourier Transform algorithms use internal memory for processing.
Different platforms and browsers have different algorithms, so it's hard to
quantify exactly how much memory is going to be used. Additionally, some
memory is going to be used for the <code>AudioBuffer</code> passed in to the
analysis methods. </dd>
<dt>Latency</dt>
<dd>Because of the smoothing applied between successive analysis frames
(controlled by the <code>smoothingTimeConstant</code> attribute), there can be
some perceived latency in this node, but the smoothing can be disabled by
setting <code>smoothingTimeConstant</code> to 0.</dd>
<dt>Tail</dt>
<dd>For the same reason, there can be a tail with this node, unless the
smoothing is disabled.</dd>
</dl>
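The analysis arrays passed to those methods should be allocated once and
re-used, rather than allocated per call. A minimal sketch, assuming an
existing <code>AudioContext</code> named <code>ctx</code>:
<pre highlight=javascript>
var analyser = ctx.createAnalyser();
analyser.fftSize = 2048;
// Allocate the analysis array once, outside of the rendering loop;
// getFloatFrequencyData() fills it in place on each call.
var bins = new Float32Array(analyser.frequencyBinCount);
function draw() {
  analyser.getFloatFrequencyData(bins); // no per-frame allocation
  // ... render `bins` to a canvas, etc. ...
  requestAnimationFrame(draw);
}
draw();
</pre>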
<h4> GainNode </h4>
<dl>
<dt>CPU</dt>
<dd>In Gecko-based browsers, the gain is always <a data-file="GainNode.cpp:72"
class="firefox">applied lazily</a>, folded in just before processing that
requires touching the samples, or before sending the
rendered buffer back to the operating system, so <code>GainNode</code>s with
a fixed gain are essentially free. In other engines, the gain is applied to
the input buffer <a data-file="GainNode.cpp:51" class="chromium">as it's
received</a>. When automating the gain using <code>AudioParam</code>
methods, the gain is applied to the buffer in all browsers.</dd>
<dt>Memory</dt>
<dd>A <code>GainNode</code> is stateless and has therefore no associated
memory cost.</dd>
</dl>
<h4> DelayNode </h4>
<dl>
<dt>CPU</dt>
<dd>This node essentially copies input data into a buffer, and reads from this
buffer at a different location to compute its output buffer.</dd>
<dt> Memory</dt>
<dd> The memory cost is a function of the number of input and output
channels and the length of the delay line. </dd>
<dt>Latency</dt>
<dd>Obviously this node introduces latency, but no more than the latency set
by its parameter.</dd>
<dt>Tail</dt>
<dd>This node is being kept around (not collected) until it has finished
reading and has output all of its internal buffer.</dd>
</dl>
<h4> BiquadFilterNode </h4>
<dl>
<dt>CPU</dt>
<dd>Biquad filters are relatively cheap (<a data-file="blink/Biquad.cpp:66"
class="firefox">five multiplications and four additions per
sample</a>).</dd>
<dt>Memory</dt>
<dd>Very cheap, four floats for the memory of the filter.</dd>
<dt>Latency</dt>
<dd>Exactly <a data-file="blink/Biquad.cpp:71" class="firefox">two frames</a>
of latency, due to how the filter works.</dd>
<dt>Tail</dt>
<dd>Variable tail, depending on the filter setting (in particular the
resonance).</dd>
</dl>
<h4> IIRFilterNode </h4>
<dl>
<dt>CPU</dt>
<dd>Similarly to the biquad filter, IIR filters are rather cheap. The
complexity depends on the number of coefficients, which is set at
construction.</dd>
<dt>Memory</dt>
<dd>Again, the memory usage depends on the number of coefficients, but is
overall very small (a couple of floats per coefficient).</dd>
<dt>Latency</dt>
<dd>A frame per coefficient.</dd>
<dt>Tail</dt>
<dd>Variable, depending on the value of the coefficients.</dd>
</dl>
<h4> WaveShaperNode </h4>
<dl>
<dt>CPU</dt>
<dd>The computational complexity depends on the oversampling. If no
oversampling is used, a sample is read from the wave table, <a
data-file="WaveShaperNode.cpp:197" class="firefox">using linear
interpolation</a>, which is a cheap process in itself. If oversampling is
used, a resampler is involved. Depending on the browser engine, different
resampling techniques can be used (FIR, linear, etc.).</dd>
<dt>Memory</dt>
<dd>This node is making a copy of the curve, so it can be quite expensive in
terms of memory.</dd>
<dt>Latency</dt>
<dd>This node does not add latency if oversampling is not used. If
over-sampling is used, and depending on the resampling technique, latency can
be added by the processing.</dd>
<dt>Tail</dt>
<dd>Similarly, depending on the resampling technique used, and when using
over-sampling, a tail can be present.</dd>
</dl>
<h4> PannerNode, when <code>panningModel == "HRTF"</code> </h4>
<dl>
<dt>CPU</dt>
<dd><strong>Very</strong> expensive. This node is constantly doing
convolutions between the input data and a set of HRTF impulses that are
characteristic of the elevation and azimuth. Additionally, when the position
changes, it <a data-file="blink/HRTFPanner.cpp:272"
class="firefox">interpolates</a> (cross-fades) between the old and new
positions, so that the transition between two HRTF impulses is smooth. This
means that for a stereo source, and while moving, there can be <a
data-file="blink/HRTFPanner.cpp:258" class="firefox">four convolvers</a>
processing at once. Additionally, the HRTF panning needs short delay
lines.</dd>
<dt>Memory</dt>
<dd>The HRTF panner needs to keep a set of HRTF impulses around when
operating. Gecko loads the HRTF database only if needed, while other engines
load it unconditionally. The convolver and delay lines require memory as
well, depending on the Fast Fourier Transform implementation used.</dd>
<dt>Latency</dt>
<dd>HRTF always adds <a data-file="blink/HRTFPanner.cpp:312"
class="firefox">some amount of delay</a>, but the amount depends on the
azimuth and elevation.</dd>
<dt>Tail</dt>
<dd>Similarly, depending on the azimuth and elevation, a tail of different
duration is present.</dd>
</dl>
<h4> PannerNode, when <code>panningModel == "equalpower"</code> </h4>
<dl>
<dt>CPU</dt>
<dd>Rather cheap. The processing has two parts:
<ul>
<li>
First, the <a data-file="PannerNode.cpp:409" class="firefox">azimuth
needs to be determined</a> from the Cartesian coordinates of the source
and listener; this is a bit of vector math, and can be cached by the
implementation for static sources.
</li>
<li>
Then, <a data-file="PanningUtils.h:52" class="firefox">gain is
applied</a>, possibly blending the two channels if the source is stereo.
</li>
</ul> </dd>
<dt>Memory</dt>
<dd>The processing being stateless, this has no memory cost.</dd>
</dl>
<h4> StereoPannerNode </h4>
<dl>
<dt>CPU</dt>
<dd>Similar to the <code>"equalpower"</code> panning, but the azimuth is
cheaper to compute since there is no need to do the vector math: we already
have the position.</dd>
<dt>Memory</dt>
<dd>Stateless processing, no memory cost.</dd>
</dl>
<h4> ConvolverNode </h4>
<dl>
<dt>CPU</dt>
<dd>Very expensive, with a cost that depends on the duration of the
convolution impulse. A <a data-file="blink/ReverbConvolver.cpp:156"
class="firefox">background thread</a> is used to offload some of the
processing, but computational bursts can occur in some browsers. Basically,
multiple FFTs are computed for each block.</dd>
<dt>Memory</dt>
<dd>The node makes a copy of the buffer for internal use, so it can take
a fair bit of memory (depending on the duration of the impulse).
Additionally, some memory can be used by the Fast Fourier Transform
implementation, depending on the platform.
</dd>
<dt>Latency</dt>
<dd>A convolver can be used to create delay-like effects, so latency can
certainly be introduced by a <code>ConvolverNode</code>.</dd>
<dt>Tail</dt>
<dd>Depending on the convolution impulse, there can be a tail.</dd>
</dl>
<h4> ChannelSplitterNode / ChannelMergerNode </h4>
<dl>
<dt>CPU</dt>
<dd>This is merely splitting or merging channels, that is, copying buffers
around.
</dd>
<dt>Memory</dt>
<dd>No memory implications.</dd>
</dl>
<h4> DynamicsCompressorNode </h4>
<dl>
<dt>CPU</dt>
<dd>The exact algorithm is not specified yet. In practice, it's the same in
all browsers: a peak-detecting look-ahead compressor, with pre-emphasis and
post-de-emphasis, not too expensive.</dd>
<dt>Memory</dt>
<dd>Not very expensive in terms of memory, just some floats to track the
internal state.</dd>
<dt>Latency</dt>
<dd>Being a look ahead compressor, it introduces a fixed look-ahead of
six milliseconds.</dd>
<dt>Tail</dt>
<dd>Because of the emphasis, there is a tail. Also, compression can boost
quiet audio, so audible sound can appear to last longer.</dd>
</dl>
<h4> OscillatorNode </h4>
<dl>
<dt>CPU</dt>
<dd>The basic wave forms are implemented using multiple <a
data-file="OscillatorNode.cpp:261" class=firefox>wave tables</a> computed
using the inverse Fourier transform of a buffer with carefully chosen
coefficients (apart from the sine that is <a
data-file="OscillatorNode.cpp:219" class=firefox>computed directly</a> in
Gecko). This means that there is an initial cost when changing the wave
form, which is <a data-file="OscillatorNode.cpp:118"
class="firefox">cached</a> in Gecko-based browsers. After the initial cost,
processing essentially consists of linear interpolation between multiple wave
tables. When the frequency changes, new tables have to be computed.
</dd>
<dt>Memory</dt>
<dd>A number of wave tables have to be stored, which can take up some memory.
Those are shared in Gecko-based browsers, apart from the sine wave in Gecko,
which is computed directly.</dd>
</dl>
<h3> Other noteworthy performance characteristics </h3>
<h4> Memory model </h4>
Web Audio API implementations use two threads. The <em>control thread</em> is
the thread on which the Web Audio API calls are issued:
<code>createGain</code>, <code>setTargetAtTime</code>, etc. The <em>rendering
thread</em> is the thread responsible for rendering the audio. This
can be a normal thread (for example for an <code>OfflineAudioContext</code>)
or a system-provided, high-priority audio thread (for a normal
<code>AudioContext</code>). Of course, information has to be communicated
between the two threads.
Current Web Audio API implementations have taken two different approaches to
implement the specification. Gecko-based browsers use a <em>message
passing</em> model, whereas all the other implementations use a <em>shared
memory</em> model. This has a number of implications in practice.
First, in engines that use the <em>shared memory</em> model, changes to
the graph and to <code>AudioParam</code>s can occur at any time. This means
that in some scenarios, manipulation (from the main thread) of internal Web
Audio API data structures can be reflected more quickly on the rendering
thread: if the audio thread is currently rendering, a modification from the
main thread will be reflected immediately on the rendering thread.
A drawback of this approach is that it requires some
synchronization between the control thread and the rendering thread. The
rendering thread often has very high priority (usually the highest priority on
the system), to guarantee no under-runs (or dropouts), which are considered
catastrophic failures for most audio rendering systems. Under-runs usually
occur when the audio rendering thread misses its deadline: for example,
it took more than 5 milliseconds of processing to produce 5 milliseconds of
audio. Non-Gecko-based browsers often use <em>try locks</em> to ensure
smooth operation.
In certain parts of the Web Audio API, there is a need to be able to access,
from the main thread, data structures sent to the rendering thread.
Gecko-based browsers keep two synchronized copies of the data structures to
implement this; this has a cost in memory that other engines don't have to
pay.
<h4> AudioParam </h4>
The way a web application uses <code>AudioParam</code>s plays an important
role in the performance of the application. <code>AudioParam</code>s come in
two flavours, <em>a-rate</em> and <em>k-rate</em> parameters. <em>a-rate</em>
parameters have their value computed for each audio sample, whereas
<em>k-rate</em> parameters are computed once per 128-frame block.
<code>AudioParam</code> methods (<code>setValueAtTime</code>,
<code>linearRampToValueAtTime</code>, etc.) each insert <em>events</em> in a
list of events that is later accessed by the rendering thread.
Handling <code>AudioParam</code>s, for an engine, means first finding the
right event (or events) to consider for the block of audio to render.
Different approaches are taken by different browsers. Gecko prefers to prune
all events that are in the past, except the one right before the current time
(certain events require looking at the previous event's value). This
guarantees amortized <em>O(1)</em> complexity (amortized because deallocations
can take some time). Other engines do a linear scan of the event list to find
the right one.
In practice, both techniques perform well enough that the difference is not
noticeable most of the time. If the application uses <em>a lot</em> of
<code>AudioParam</code> events, non-Gecko-based browsers can have performance
issues, because scanning through the list starts to take a non-trivial amount
of time. This can be mitigated by creating new <code>AudioNode</code>s, with
new <code>AudioParam</code>s, that start with an empty list of events.
For example, let's take a <code>GainNode</code> that is used as the envelope
of an <code>OscillatorNode</code>-based kick drum, playing each beat at 140
beats per minute. The envelope is often implemented using a
<code>setValueAtTime</code> call to set the initial volume of the hit, which
often depends on the velocity, immediately followed by a
<code>setTargetAtTime</code> call, at the same time, with a value of 0 to have
a curve that decays to silence, and a time constant that depends on the
release parameter of the synth. At 140BPM, 280 events will be inserted per
minute. To ensure stable performance, and leveraging the fact that a
<code>GainNode</code> is very cheap to create and connect, it might be worth
considering swapping the node responsible for the envelope regularly, as in
the sketch below.
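A sketch of this pattern; the oscillator frequencies and decay values are
merely illustrative. Each hit gets a fresh <code>GainNode</code>, so no
event list grows without bound:
<pre highlight=javascript>
function triggerKick(ctx, velocity, when, release) {
  var osc = ctx.createOscillator();
  osc.frequency.setValueAtTime(150, when);
  osc.frequency.setTargetAtTime(50, when, 0.05); // pitch drop of the kick
  // A fresh GainNode per hit: its AudioParam starts with an empty event
  // list, keeping event lookup cheap.
  var envelope = ctx.createGain();
  osc.connect(envelope);
  envelope.connect(ctx.destination);
  envelope.gain.setValueAtTime(velocity, when); // initial volume of the hit
  envelope.gain.setTargetAtTime(0, when, release); // decay to silence
  osc.start(when);
  osc.stop(when + 1); // the nodes can be collected after the tail
}
</pre>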
Gecko-based browsers, because of their message-passing model, have two copies
of the event list: one on the main thread, containing all the events, and
one on the rendering thread, which is regularly pruned to only contain
relevant events. Other engines simply synchronize the event list using a
lock, and only have one copy of the event list. Depending on how the
application schedules its <code>AudioParam</code> events (the worst case
being when all the events are scheduled in advance), having two copies of the
event list can take a non-negligible amount of memory.
The second part of <code>AudioParam</code> handling is computing the actual
value (or values, for an <em>a-rate</em> parameter), based on past, present
and future events. Gecko-based browsers are not very efficient at computing
those values (and suffer from a number of bugs in the implementation); this
code is going to go through a rewrite <em>real soon (tm)</em>. Other engines
are much more efficient (which, most of the time, offsets their less
efficient way of searching for the right events to consider). Blink even has
optimized code to compute the automation curves, using SSE intrinsics on x86.
All that said, and while implementations are becoming more and more efficient
at computing automation curves, it is even more efficient to not use
<code>AudioParam</code> automation when it is not necessary: implementations
often take a different, cheaper code path if they know an
<code>AudioParam</code> value is going to be constant for a block, and this
is easier to detect if the <code>value</code> attribute has been set
directly, as shown below.
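For example (a sketch, assuming an existing <code>GainNode</code> named
<code>gainNode</code>):
<pre highlight=javascript>
// Constant for the whole block: implementations can detect this and take
// a cheaper code path.
gainNode.gain.value = 0.5;
// Scheduling automation instead forces the engine to consult the event
// list, even if the resulting value never changes:
// gainNode.gain.setValueAtTime(0.5, ctx.currentTime);
</pre>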
<h4> Node ordering </h4>
Audio nodes have to be processed in a specific order to get correct results.
For example, consider a graph with an <code>AudioBufferSourceNode</code>
connected to a <code>GainNode</code>, connected to the
<code>AudioDestinationNode</code>. The <code>AudioBufferSourceNode</code> has
to be processed first: it gets
the values from the <code>AudioBuffer</code> and maybe resamples them.
The output of this node is passed to the <code>GainNode</code>, which applies
its own processing, passing down the computed data to the
<code>AudioDestinationNode</code>.
Gecko-based browsers and other engines have taken two different approaches for
ordering Web Audio API graphs.
Gecko uses an algorithm based on Tim Leslie's iterative implementation
[<a href="http://www.timl.id.au/?p=327">1</a>][<a
href="https://github.com/scipy/scipy/blob/e2c502fca/scipy/sparse/csgraph/_traversal.pyx#L582">2</a>]
of Pearce's variant [<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.102.1707">3</a>] of Tarjan's strongly connected components (SCC)
algorithm, which is basically a kind of topological sort that takes cycles
into account, to properly handle graphs containing cycles through a
<code>DelayNode</code>.
Other engines simply do a depth-first walk of the graph, use a graph coloring
approach to detect cycles, and pull from the <code>DelayNode</code>'s internal
buffer in case of cycles. This is much simpler, but can render differently
from the other approach.
Gecko only runs the algorithm when the topology of the graph has changed
(for example when a node has been added, or a connection has been removed),
while other engines perform the traversal for each rendering quantum.
Depending on whether the application has a mostly static or a very dynamic
graph topology, this can have an impact on performance.
<h4> Latency </h4>
Audio output latency is very important for the Web Audio API. Usually,
implementations use the lowest available audio latency on the system to run
the <em>rendering thread</em>. Roughly, this means on the order of 10ms on
Windows (post-Vista, using WASAPI), very low on OSX and iOS (a few
milliseconds), and 30-40ms on Linux/PulseAudio. No engine yet supports JACK
with low-latency code, although this would certainly yield amazing latency
numbers. Windows XP does not usually have great output latency, and
licensing terms (and available time to write code!) prevent the use of ASIO
on Windows, which would also have great performance.
Android devices are a bit of a problem, as the latency can vary between
12.5ms on a Galaxy Nexus (the device with the lowest latency) and 150ms
on some very cheap devices. Although there is a way to know the latency using
system APIs (which browsers use for audio/video synchronization), Web App
authors don't currently have a way to know the output latency. This is
currently being addressed (<a
href="https://github.com/WebAudio/web-audio-api/issues/12">#12</a>).
Creating an <code>AudioContext</code> with a higher latency is a feature
currently under discussion (<a
href="https://github.com/WebAudio/web-audio-api/issues/348">#348</a>). It
would allow a number of things, such as battery savings for use-cases that
don't require very low latency. For example, an application that only does
playback, but wants to display the Fourier transform of the signal or simply
its wave form, have an equalizer in the signal path, or apply compression,
would benefit from this feature, because real-time interaction is not
required. Of course, the most battery-efficient latency, regardless of audio
output latency, depends on the operating system and hardware.
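A sketch of what this could look like, assuming the <code>latencyHint</code>
option discussed in #348 (not available everywhere at the time of writing):
<pre highlight=javascript>
// A playback-only application trades output latency for battery life.
var ctx = new AudioContext({ latencyHint: "playback" });
</pre>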
Using a bigger rendering quantum (that is, bigger buffers, obtained by
increasing the audio latency on most systems) also allows creating and
rendering more complex Web Audio API graphs. Indeed, audio systems often pull
quite a lot of memory into CPU caches to compute each rendering block (be
it audio sample data, automation timeline event arrays, the memory for the
nodes themselves, etc.). Rendering for a longer period of time keeps the
caches hotter and lets the CPU crunch numbers without having to wait for
memory. Digital Audio Workstation users often increase the latency when they
hear "audio crackles" (under-runs); the same logic applies here. Of course,
there is a point where a higher latency stops helping the computation. This
value depends on the system.
<h4> Browser architecture </h4>
Firefox (which is based on Gecko) uses only one process for all the tabs and
the chrome (the user interface of the browser, that is, the tabs, the
menus, etc., a very confusing term nowadays), whereas other browsers use
multiple processes.
While the rendering of Web Audio API graphs happens on a dedicated thread,
this has important implications in terms of responsiveness. The event loop in
Gecko is shared between all the tabs. When using a Web Application that uses
the Web Audio API, if the other tabs are still doing some processing,
receiving events, running scripts, etc. in the background, the Web Audio API
calls will be delayed, so it is necessary to plan a bit more in advance (see
Chris Wilson's <a
href="http://www.html5rocks.com/en/tutorials/audio/scheduling/">A Tale of Two
Clocks</a>). This is somewhat being addressed by the Electrolysis (e10s)
project, but there is still a long way to go to catch up in responsiveness.
Other engines usually use multiple processes, so the event loop load is split
among multiple processes, and there is less chance that the script issuing
Web Audio API calls is delayed, allowing tighter scheduling (be it for an
<code>AudioParam</code> change, the start of an
<code>AudioBufferSourceNode</code>, etc.).
<h4><code>decodeAudioData</code></h4>
Gecko-based browsers use a thread pool (with multiple threads) when handling
multiple <code>decodeAudioData</code> calls, whereas other browsers serialize
the decoding on a single thread.
Additionally, audio is resampled differently, and different audio decoders
are used, with different optimizations on different OSes and architectures,
leading to a wide variety of performance profiles.
Authors should take advantage of (nowadays ubiquitous) multi-core machines
(even phones often have four or more cores these days), and try to saturate
the CPUs with decoding operations. Since the rendering thread has higher
priority than the decoding threads, no audio under-runs are to be expected.
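A sketch, assuming <code>fetch()</code> and the promise-returning form of
<code>decodeAudioData</code> are available: firing all the decodes at once
lets the browser spread them over its decoding threads.
<pre highlight=javascript>
var urls = ["kick.ogg", "snare.ogg", "hats.ogg"]; // hypothetical assets
Promise.all(urls.map(function(url) {
  return fetch(url)
    .then(function(response) { return response.arrayBuffer(); })
    .then(function(data) { return ctx.decodeAudioData(data); });
})).then(function(buffers) {
  // All the AudioBuffers are decoded and ready to use.
});
</pre>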
<h4>Micro-optimizations</h4>
Audio processing is composed of a lot of very tight loops, repeating a
particular operation over and over on audio buffers.
Implementations take advantage of the SIMD (Single Instruction Multiple Data)
instructions available in CPUs, and optimize certain very common operations
after profiling, to make processing faster, in order to render bigger graphs
and allow lower audio latency.
Gecko has NEON (on ARM) and SSE (on x86) functions implementing all the very
common operations (applying a gain in place, copying a buffer while applying
a gain value, panning a mono buffer to a stereo position, etc.), as well as
optimized FFT code (taken from the FFmpeg project on x86, and OpenMax DL
routines on ARM). Of course, these optimizations are only used if the CPU on
which the browser is running has the necessary extensions, falling back to
normal scalar code if not. Depending on the function, those optimizations
have proven to be between three and sixteen times faster than their scalar
counterparts.
Blink has SSE code for some <code>AudioParam</code> event handling, and has
the same FFT code as Gecko (both on x86 and ARM).
WebKit can use different FFT implementations, depending on the platform, for
example the FFT from the <em>Accelerate</em> framework on OSX.
This results in a variety of performance profiles for FFT-based processing
(HRTF panner, <code>ConvolverNode</code>, <code>AnalyserNode</code>, etc.).
Additionally, the decoding code used by implementations of
<code>decodeAudioData</code> is very often well optimized, using SIMD and a
variety of other techniques.
</section>
<section>
<h2>Using lighter processing</h2>
Audio processing is a hard real-time process. If the audio has not been
computed by the time the system is about to output it, dropouts will occur,
in the form of clicks, silence, noise and other unpleasant sounds.
Web browsers run on a variety of platforms and devices, some of them very
powerful, and some of them with limited resources. To ensure a smooth
experience for all users, and depending on the application itself, it can be
necessary to use techniques that do not sound as good as the normal
technique, but still allow a convincing experience, retaining the
essence of the interaction.
<h3>Custom processing</h3>
Sometimes, the Web Audio API falls short in what it has to offer to solve a
particular problem, and it can be necessary to use an
<code>AudioWorklet</code> or a <code>ScriptProcessorNode</code> to implement
custom processing. JavaScript is a very fast language if written properly,
but there are a number of rules to obey to achieve this result.
<ul>
<li>
Using typed arrays is a must; they are very fast compared to normal
arrays. In the Web Audio API, <code>Float32Array</code> is the most
common. Re-using arrays is important: there is no need to spend a lot of
time in the allocator (see the sketch after this list).
</li>
<li>
Keeping the working set small is important: not using a lot of arrays, and
always reusing the same data, is faster.
</li>
<li>
No DOM manipulation or fiddling with the prototypes of objects during
processing, as this invalidates JITed code.
</li>
<li>
Staying as monomorphic as possible and always using the same code
path yields more optimized JITed code. Suddenly calling a function with
a normal array instead of a <code>Float32Array</code> will invalidate a lot
of code and take a slow path.
</li>
<li>
Compiling C or C++ to JavaScript (or, soon, WebAssembly) yields the best
performance. <em>emscripten</em> or other tools can be used to compile
libraries or custom code into a typed subset of JavaScript that does not
allocate and is very fast at number crunching.
</li>
<li>
Experimental extensions, such as <code>SIMD.js</code> or
<code>SharedArrayBuffer</code> can make things run faster, but they are not
available in all browsers.
</li>
</ul>
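A minimal sketch putting some of these rules together, with a simple in-place
gain standing in for the custom processing:
<pre highlight=javascript>
var ctx = new AudioContext();
var node = ctx.createScriptProcessor(1024, 1, 1);
// Pre-allocate the scratch buffer once: no allocation in the callback.
var scratch = new Float32Array(1024);
node.onaudioprocess = function(e) {
  e.inputBuffer.copyFromChannel(scratch, 0);
  // A tight, monomorphic loop over a Float32Array.
  for (var i = 0; i < scratch.length; i++) {
    scratch[i] *= 0.5;
  }
  e.outputBuffer.copyToChannel(scratch, 0);
};
</pre>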
<h3>Worker-based custom processing</h3>
All of the above techniques can be used in workers, to offload heavy
processing. While a worker does not have access to the Web Audio API,
buffers can be transferred without copy, and the audio can then be sent back
to the main thread for use with the Web Audio API.
This has latency implications, and care must be taken so that the worker
always finishes processing on time, generally using a FIFO.
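A minimal sketch, assuming a hypothetical <code>worker.js</code> that
processes the samples it receives and posts them back, and two existing
<code>AudioBuffer</code>s, <code>input</code> and <code>output</code>:
<pre highlight=javascript>
var worker = new Worker("worker.js"); // hypothetical worker script
// Copy the channel data, so we own a buffer we can transfer away.
var samples = new Float32Array(input.getChannelData(0));
// The second argument is the transfer list: ownership moves, no copy.
worker.postMessage(samples.buffer, [samples.buffer]);
worker.onmessage = function(e) {
  var processed = new Float32Array(e.data);
  // Back into an AudioBuffer, for use with the Web Audio API.
  output.copyToChannel(processed, 0);
};
</pre>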
<h3>Copying audio data</h3>
Reusing <code>AudioBuffer</code> internal data can be performed more
efficiently than by calling <code>getChannelData()</code>:
<code>copyFromChannel()</code> and <code>copyToChannel()</code> allow the
engine to optimize away some copies and allocations.
<h3>Built-in resampling</h3>
The <code>OfflineAudioContext</code> can be used to perform off-main-thread
and off-rendering-thread resampling of audio buffers. This can have two
different uses:
<ul>
<li>
By resampling an <code>AudioBuffer</code> to the sample-rate of the
<code>AudioContext</code>, no time is spent resampling in the
<code>AudioBufferSourceNode</code>.
</li>
<li>
By resampling an <code>AudioBuffer</code> to a lower sample rate, memory can
be saved. This has sound quality and CPU implications, but not all sounds
have partials that are high enough to require a full-band rate.
</li>
</ul>
<pre highlight=javascript>
function resample(input, target_rate) {
  return new Promise((resolve, reject) => {
    if (!(input instanceof AudioBuffer)) {
      return reject(new Error("input is not an AudioBuffer"));
    }
    if (typeof target_rate != "number" || target_rate <= 0) {
      return reject(new Error("target_rate is not a positive number"));
    }
    // Output length, in frames, at the target sample-rate.
    var final_length = Math.ceil(input.length * target_rate / input.sampleRate);
    var off = new OfflineAudioContext(input.numberOfChannels,
                                      final_length, target_rate);
    var source = off.createBufferSource();
    source.buffer = input;
    source.connect(off.destination);
    source.start(0);
    off.startRendering().then(resolve).catch(reject);
  });
}
</pre>
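For example, to match an asset to the context's rate ahead of time (assuming
<code>decoded</code> is an <code>AudioBuffer</code> obtained from
<code>decodeAudioData</code>):
<pre highlight=javascript>
resample(decoded, ctx.sampleRate).then(function(resampled) {
  var source = ctx.createBufferSource();
  source.buffer = resampled; // no resampling needed at playback time
  source.connect(ctx.destination);
  source.start(0);
});
</pre>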
<h3>Track or asset freezing</h3>
The <code>OfflineAudioContext</code> can be used to apply complex processing
to audio buffers ahead of playback, to avoid having to recompute the
processing each time. This is called <em>baking</em> or <em>freezing</em>.
It can be employed for a full sound-track, as well as, for example, the
individual sound effects of a video game.
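A sketch of freezing, with a simple low-pass treatment standing in for the
real (presumably more expensive) processing chain:
<pre highlight=javascript>
function freeze(input) {
  var off = new OfflineAudioContext(input.numberOfChannels,
                                    input.length, input.sampleRate);
  var source = off.createBufferSource();
  source.buffer = input;
  var filter = off.createBiquadFilter();
  filter.type = "lowpass";
  filter.frequency.value = 4000; // illustrative value
  source.connect(filter);
  filter.connect(off.destination);
  source.start(0);
  // Resolves with a new AudioBuffer holding the processed audio.
  return off.startRendering();
}
</pre>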
<h3>Cheaper reverb</h3>
When using the Web Audio API, the simplest way of adding a reverberation
effect to a sound is to connect it to a <code>ConvolverNode</code>, setting
its <code>buffer</code> attribute to a reverb impulse, often a decaying curve
of some sort, synthesized using a program, or recorded and processed from a
real location. While this setup has very good sound quality (depending, of
course, on the impulse chosen), it is very expensive to compute. While
computers and modern phones are very powerful, longer reverb tails, or
running the Web App on cheap mobile devices, might not work.
A cheaper reverb can be created using delay lines, all-pass and low-pass
filters, and still achieve a very convincing effect. This also has the
advantage of exposing parameters you can change, instead of an impulse
buffer.
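A minimal sketch of such a reverb, built from a single feedback delay line
damped by a low-pass filter; all the parameter values are illustrative
starting points, to be tuned by ear:
<pre highlight=javascript>
function cheapReverb(ctx, input, output) {
  var delay = ctx.createDelay();
  delay.delayTime.value = 0.05;
  var lowpass = ctx.createBiquadFilter();
  lowpass.type = "lowpass";
  lowpass.frequency.value = 3000; // damp the highs on each pass
  var feedback = ctx.createGain();
  feedback.gain.value = 0.6; // each pass through the loop is quieter
  input.connect(delay);
  delay.connect(lowpass);
  lowpass.connect(feedback);
  feedback.connect(delay); // cycles through a DelayNode are allowed
  lowpass.connect(output); // wet signal out
}
</pre>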
<h3>Cheaper panning</h3>
HRTF panning is based on convolution; it sounds good, but is very expensive.
It can be replaced, for example on mobile, by a very short reverb and an
equal-power panner, with a distance function and Cartesian positions adjusted
properly. This has the advantage of not requiring very expensive
computation when continuously changing the positions of the listener and
source, something that is often the case when basing those positions on
sensor data, such as accelerometers and gyroscopes.
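A sketch of the downgrade, assuming <code>isLowEndDevice</code> is an
application-specific check, and the positions come from sensor data:
<pre highlight=javascript>
var panner = ctx.createPanner();
// Fall back to the much cheaper equal-power model on weak devices.
panner.panningModel = isLowEndDevice ? "equalpower" : "HRTF";
panner.distanceModel = "inverse";
panner.setPosition(x, y, z); // e.g. driven by accelerometer data
</pre>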
</section>
<section>
<h2>Debugging Web Audio API applications</h2>
<h3>Node Wrapping</h3>
A good strategy to debug Web Audio API applications is to wrap
<code>AudioNode</code>s using ES6's <code>Proxy</code> or a normal prototype
override, to keep track of state, and to be able to tap in and re-route
audio, insert analysers, and so on.
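A minimal sketch using a prototype override: log every connection made, to
reconstruct the graph topology from the console.
<pre highlight=javascript>
var originalConnect = AudioNode.prototype.connect;
AudioNode.prototype.connect = function() {
  // arguments[0] is the destination AudioNode or AudioParam.
  console.log(this.constructor.name, "->",
              arguments[0].constructor.name);
  return originalConnect.apply(this, arguments);
};
</pre>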
<h3>Firefox's Web Audio API debugger</h3>
Firefox has a custom developer tool panel that shows the graph, with the
following features:
<ul>
<li>Ability to see the topology of the graph, and the garbage collection of
nodes and sub-graph portions.</li>
<li>Setting and getting the value of <code>AudioNode</code> attributes.</li>
<li>Bypassing nodes.</li>
</ul>
There are some upcoming features as well:
<ul>
<li>Memory consumption of <code>AudioNode</code>s and
<code>AudioBuffers</code>.</li>
<li>Tapping into nodes, inserting analysers.</li>
<li>Inspecting <code>AudioParam</code> timelines.</li>
<li>CPU profiling of the real-time budget.</li>
</ul>
<h3>Memory profiling</h3>
It is pretty easy to determine the size in bytes of the buffer space used by
an <code>AudioBuffer</code>:
<pre highlight=javascript>
function AudioBuffer_to_bytes(buffer) {
if (!(buffer instanceof AudioBuffer)) {
throw "not an AudioBuffer.";
}
const sizeoffloat32 = 4;
return buffer.length * buffer.numberOfChannels * sizeoffloat32;
}
</pre>
Other memory figures can be obtained in Firefox by going to
<code>about:memory</code>, clicking "Measure", and looking for the section
corresponding to your page.
</section>
<div data-fill-with="conformance">
</div>
<script>
var chromiumFiles = document.querySelectorAll(".chromium");
for (var i = 0; i < chromiumFiles.length; i++) {
var tuple = chromiumFiles[i].dataset.file.split(":");
var file = tuple[0];
var line = tuple[1];
chromiumFiles[i].href =
"https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/modules/webaudio/"+file+"&l="+line;
}
var firefoxFile = document.querySelectorAll(".firefox");
for (var i = 0; i < firefoxFile.length; i++) {
var tuple = firefoxFile[i].dataset.file.split(":");
var file = tuple[0];
var line = tuple[1];
firefoxFile[i].href =
"https://dxr.mozilla.org/mozilla-central/source/dom/media/webaudio/" + file + "#" + line;
}
</script>
</body>
</html>