This repository has been archived by the owner on Jan 19, 2022. It is now read-only.

RTC_API_Proposal

anantn edited this page Oct 19, 2011 · 41 revisions

RTC API Proposal

This proposal served as input for the formal specification of RTC APIs, which is currently being standardized by the W3C RTC WG. You can find the current version of the spec here. The API presented in this document is preserved for archival purposes.

The WhatWG proposal for real time media streams presents many fine ideas, as does an extension to the Streams API presented in that document (as proposed to the W3C audio working group). This proposal builds on those two documents to present an API for media capture and transmission in web browsers.

The primary motivations for this document are:

  1. Some use-cases are not satisfied with either of the earlier proposals.
  2. Some aspects of earlier proposals are amenable to simplification, and others may present unique implementation challenges, which this proposal takes into account.
  3. Firefox already supports a rich Audio API for manipulating streams and we would like to ensure that subsequent work on video and real time communication plays well with other media APIs.

Use cases

For purposes of designing this API, we present the following use-cases. We omit the use-cases that do not pertain to the RTC working group (such as local-only media capture or audio processing), but it suffices to state that from an implementation perspective, it is important to consider all media-related APIs for coherence, and that the API proposed in this document does take those use-cases into account, even though they are not presented here.

  • Simple video and voice calling site
  • Broadcasting real time video & audio streams
  • Browser based MMORPG that enables player voice communication (push to talk)
  • Video conferencing between 3 or more users (potentially on different sites)
  • [Fill in more use cases from IETF document]

API Specification

The API proposed in this section is intended to be the baseline that should be provided by the browser and to give web applications the maximum amount of flexibility. Some use-cases (such as a simple video chat application) may be fulfilled by a simpler API more intuitive to web developers; however, it is hoped that such an API may be built on top of the proposed baseline. We do not preclude that a simpler API be specified by the working group, but suggest that it be mandatory for browsers to implement the following specification to ensure that all targeted use-cases are satisfied.

We split the specification into three distinct portions for clarity: definition of media streams, obtaining device access, and establishing peer connections. Implementation of all three sections is required for an end-to-end solution satisfying all the targeted use-cases.

Media streams

A media stream is an abstraction over a particular window of time-coded audio or video data (or both). The time-codes are in the stream's own internal timeline. The internal timeline can have any base offset but always advances at the same rate as real time. Media streams are not seekable in any direction.

interface MediaStream {
    readonly attribute DOMString label;
    readonly attribute double currentTime;

    MediaStreamTrack[] tracks;
    MediaStreamRecorder record(in JSONHints hints);

    const unsigned short LIVE = 1;
    const unsigned short BLOCKED = 2;
    const unsigned short ENDED = 3;
    readonly attribute unsigned short readyState;

    attribute Function onReadyStateChange;
    ProcessedMediaStream createProcessor(in optional Worker worker);
};

When the readyState of a media stream is LIVE, the window is advancing in real-time. When the state is BLOCKED, the stream does not advance (the user-agent may replace this with silence until the stream is LIVE again); and ENDED implies no further data will be received on the stream. MediaStreams have implicit 'sources' and 'sinks'. Whenever you receive a MediaStream from the user-agent or a PeerConnection, the source is set up by whichever of them provided the stream; assigning the stream to an HTML element for rendering sets that element up as a sink.
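The state changes described above can be illustrated with a small mock. This is only a sketch under stated assumptions: `MockMediaStream` and its `_setState` helper are hypothetical stand-ins for a stream supplied by the user-agent, treating ENDED as a final state is our reading of "no further data will be received", and passing the stream to the handler is an assumption since the proposal does not define the handler's arguments. Only the constants, `readyState` and `onReadyStateChange` come from the interface above.

```javascript
// Hypothetical stand-in for a MediaStream supplied by the user-agent.
function MockMediaStream(label) {
    this.label = label;
    this.readyState = MockMediaStream.LIVE;
    this.onReadyStateChange = null;
}
MockMediaStream.LIVE = 1;
MockMediaStream.BLOCKED = 2;
MockMediaStream.ENDED = 3;

// Hypothetical helper: a real user-agent would change the state internally.
// Returns true if the state changed and the handler (if any) was notified.
MockMediaStream.prototype._setState = function (state) {
    if (this.readyState === MockMediaStream.ENDED) return false; // assumed final
    if (this.readyState === state) return false;                 // no change
    this.readyState = state;
    if (typeof this.onReadyStateChange === "function") {
        this.onReadyStateChange(this); // handler argument is an assumption
    }
    return true;
};

// Usage: track whether the stream is currently delivering data.
var demoStream = new MockMediaStream("microphone");
var demoIsLive = true;
demoStream.onReadyStateChange = function (stream) {
    demoIsLive = (stream.readyState === MockMediaStream.LIVE);
};
demoStream._setState(MockMediaStream.BLOCKED); // demoIsLive is now false
```

An application would typically use such a handler to pause or resume UI elements that depend on the stream.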

JSONHints provides hints from the application as to what kind of output it needs from the MediaStreamRecorder, and is described later in the proposal.

The createProcessor() call creates a ProcessedMediaStream object as defined in the MediaStream Processing API.

Every stream has an associated set of tracks:

 interface MediaStreamTrack {
     readonly attribute MediaStream stream;
     
     // IANA MIME media type (http://www.iana.org/assignments/media-types/index.html)
     readonly attribute DOMString kind; 
     readonly attribute DOMString label;

     attribute Function onTypeChanged;
     attribute DOMString[] supportedTypes;
     attribute JSONHints hints;

     attribute boolean enabled;
     readonly attribute MediaBuffer buffer;

     // Extension for Audio API, see:
     readonly attribute double volume;
     // Gain to be applied 
     void setVolume(in double volume, in optional double startTime, in optional double duration);
 };

A MediaStreamTrack can contain either audio, video or data. The origin of the media stream is free to choose how media data is split between multiple tracks. Typically, there will be separate tracks for audio, video, subtitles, DTMF etc. Tracks can be individually enabled or disabled through the enabled attribute. If all tracks in a stream are disabled, the stream simply plays nothing.
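As a rough illustration of per-track control, the sketch below toggles tracks by media kind and checks whether a stream would play nothing. It assumes that `kind` holds a MIME type string such as "audio/ogg" (the proposal only says it is an IANA media type), and it uses plain objects in place of real tracks. `setKindEnabled` and `playsNothing` are hypothetical helper names, not part of the proposal.

```javascript
// Enable or disable every track whose kind starts with the given prefix,
// for example "audio/" or "video/". Returns the number of tracks changed.
function setKindEnabled(stream, kindPrefix, enabled) {
    var changed = 0;
    stream.tracks.forEach(function (track) {
        if (track.kind.startsWith(kindPrefix) && track.enabled !== enabled) {
            track.enabled = enabled;
            changed++;
        }
    });
    return changed;
}

// A stream with every track disabled plays nothing.
function playsNothing(stream) {
    return stream.tracks.every(function (track) { return !track.enabled; });
}

// Usage with plain objects standing in for MediaStreamTrack instances.
var demoCall = {
    tracks: [
        { kind: "audio/ogg", label: "mic", enabled: true },
        { kind: "video/webm", label: "cam", enabled: true }
    ]
};
setKindEnabled(demoCall, "audio/", false); // mute, but keep sending video
```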

MediaBuffer allows web applications to access the underlying media data:

interface MediaBuffer {
    readonly attribute MediaStreamTrack track;

    // IANA MIME media type (http://www.iana.org/assignments/media-types/index.html)
    readonly attribute DOMString kind;

    // Codec specific (may return the next Ogg packet in the stream, for example)
    attribute int sequenceNr;
    Object getBufferData(args);
};

The programmer can also provide "hints" to the MediaStreamTrack as to the kind of data it is carrying. The MediaStreamTrack's type may then change to accommodate the provided hints, and if this is done, the onTypeChanged event handler for the track will be called. When this happens, a new track will be created with the new codec type and a reference to this new track will be provided in the only argument to the onTypeChanged callback.

JSONHints {
    "audioType": "spoken | music",
    "videoType": "fast | slow",
    "videoSize": "height x width",
    "videoFPS": "fps_n / fps_d",
    ... TBD ...
};
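Since the hint values are plain strings, an implementation has to parse them. The following sketch shows one way this might look, assuming the literal formats shown above ("height x width" and "fps_n / fps_d") with decimal integers. The hint set is still marked TBD in this proposal, so `parseVideoSize`, `parseVideoFPS` and the sample values are illustrative only.

```javascript
// Parse "height x width" into numbers; returns null if the string is malformed.
function parseVideoSize(value) {
    var match = /^\s*(\d+)\s*x\s*(\d+)\s*$/.exec(value);
    if (!match) return null;
    return { height: Number(match[1]), width: Number(match[2]) };
}

// Parse "fps_n / fps_d" into frames per second; returns null if the string
// is malformed or the denominator is zero.
function parseVideoFPS(value) {
    var match = /^\s*(\d+)\s*\/\s*(\d+)\s*$/.exec(value);
    if (!match) return null;
    var denominator = Number(match[2]);
    if (denominator === 0) return null;
    return Number(match[1]) / denominator;
}

// Hypothetical hints for a video call.
var demoHints = {
    audioType: "spoken",
    videoType: "fast",
    videoSize: "480 x 640",
    videoFPS: "30 / 1"
};
var demoSize = parseVideoSize(demoHints.videoSize);
var demoFPS = parseVideoFPS(demoHints.videoFPS);
```

Expressing the frame rate as a fraction allows rates such as "30000 / 1001" to be stated exactly.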

Streams can be associated with existing HTML media elements such as <video> and <audio>, and video streams with <canvas>. Each of these tags may serve as either the input or output for a media stream, by setting or getting the stream attribute as appropriate.

partial interface HTMLMediaElement {
    attribute MediaStream stream;
};
partial interface HTMLCanvasElement {
    attribute MediaStream stream;
};

Streams can be recorded to files, which can then be accessed via the DOM File APIs:

interface MediaStreamRecorder {
    readonly attribute MediaStream stream;
    void getRecordedData(in Function onsuccess, in Function onerror);
    void stop();
};
function onsuccess(DOMString type, DOMFile file);
function onerror(DOMString error);

The type argument passed to the onsuccess callback is a string as defined in RFC 2046. (This is the same format as the kind attribute in MediaBuffer.)

Device access

MediaStreams can be obtained from <video>, <audio> and <canvas> elements, but they can also be obtained from a user's local media devices, such as a webcam or microphone:

interface NavigatorMedia {
    void getUserMedia(in JSONHints hints, in Function onsuccess, in optional Function onerror);
};
Navigator implements NavigatorMedia;

function onsuccess(MediaStream stream);

const unsigned short PERMISSION_DENIED = 1;
const unsigned short RESOURCE_BUSY = 2;
const unsigned short RESOURCE_UNAVAILABLE = 3;
function onerror(unsigned short errorCode);

The caller may set the values in the hints object depending on its requirements. Note that these are merely suggestions from the caller; the returned stream may not match the requirements exactly (though the user-agent will make its best effort to provide the requested stream). If one of the requested inputs (audio or video) is not available, the success callback must still be called; the application must therefore check the kind attribute of the tracks in the stream handed to it to verify whether the stream contains only audio, only video, or both. If hardware to fulfill the request is unavailable, the error callback is invoked with RESOURCE_UNAVAILABLE; if the hardware is available but is currently being used by another application, the error callback is invoked with RESOURCE_BUSY. Additionally, the user-agent may offer the user the option of selecting a local file to act as the source of the media stream in place of real hardware.
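Because the success callback may deliver fewer inputs than requested, a caller has to inspect the stream it is given. The sketch below shows one possible check, together with a mapping from the error codes above to readable text. It assumes `kind` is a MIME type string such as "audio/ogg" or "video/webm". `classifyStream` and `describeError` are hypothetical helpers, not part of the proposed API.

```javascript
var PERMISSION_DENIED = 1;
var RESOURCE_BUSY = 2;
var RESOURCE_UNAVAILABLE = 3;

// Report what the stream actually contains, based on each track's kind.
function classifyStream(stream) {
    var hasAudio = stream.tracks.some(function (t) { return t.kind.startsWith("audio/"); });
    var hasVideo = stream.tracks.some(function (t) { return t.kind.startsWith("video/"); });
    if (hasAudio && hasVideo) return "audio+video";
    if (hasAudio) return "audio-only";
    if (hasVideo) return "video-only";
    return "empty";
}

// Turn an error code from the onerror callback into a short description.
function describeError(errorCode) {
    switch (errorCode) {
        case PERMISSION_DENIED: return "permission denied";
        case RESOURCE_BUSY: return "device in use by another application";
        case RESOURCE_UNAVAILABLE: return "no suitable device available";
        default: return "unknown error " + errorCode;
    }
}

// Usage: a stream that came back without the requested video.
var demoResult = classifyStream({ tracks: [{ kind: "audio/ogg" }] });
```

In a page, `classifyStream` would be called at the top of the onsuccess callback, and `describeError` inside onerror.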

Peer connections

A peer connection provides a UDP channel of communication between two user-agents.

constructor PeerConnection(DOMString config, Function sendSignalingMessage, optional DOMString negotiationServerURN)
interface PeerConnection {
    void processSignalingMessage(DOMString msg);

    const unsigned short NEW = 1;
    const unsigned short LISTENING = 2;
    const unsigned short OPENING = 3;
    const unsigned short ACTIVE = 4;
    const unsigned short CLOSED = 5;
    readonly attribute unsigned short readyState;

    void addLocalStream(in MediaStream stream);
    void removeLocalStream(in MediaStream stream);
    readonly attribute MediaStream[] localStreams;
    readonly attribute MediaStream[] remoteStreams;

    void open(in DOMString addr);
    void listen();
    void accept(in DOMString addr);
    void close();
    void send(in DOMString text);

    attribute Function onMessage;
    attribute Function onIncoming;
    attribute Function onRemoteStreamAdded;
    attribute Function onRemoteStreamRemoved;
    attribute Function onReadyStateChange;
};

The configuration string gives the address of a STUN or TURN server used to establish the connection. sendSignalingMessage is a function provided by the caller, which the user-agent calls whenever it needs to transport an out-of-band signaling message to the remote peer. When a message is received from the remote peer via this channel, it must be passed to the user-agent by calling processSignalingMessage(). The ordering of messages is important.
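The proposal does not say how ordering is to be preserved when the out-of-band channel (for example, separate XHR requests) can reorder messages. One possible approach, sketched here under that assumption, is to wrap each message in a numbered envelope and buffer early arrivals on the receiving side. The `{seq, body}` JSON envelope and both helper functions are hypothetical. `deliver` stands for whatever hands the message to the user-agent, such as the `processSignalingMessage` method of the interface above.

```javascript
// Wrap outgoing messages with an increasing sequence number.
function createOrderedSender(send) {
    var nextSeq = 0;
    return function (msg) {
        send(JSON.stringify({ seq: nextSeq++, body: msg }));
    };
}

// Deliver incoming messages strictly in sequence order, holding back
// any message that arrives before its predecessors.
function createOrderedReceiver(deliver) {
    var expected = 0;
    var pending = new Map();
    return function (raw) {
        var envelope = JSON.parse(raw);
        pending.set(envelope.seq, envelope.body);
        while (pending.has(expected)) {
            var body = pending.get(expected);
            pending.delete(expected);
            expected++;
            deliver(body);
        }
    };
}

// Usage: three messages arrive out of order but are delivered in order.
var demoWire = [];
var demoSend = createOrderedSender(function (raw) { demoWire.push(raw); });
demoSend("offer"); demoSend("candidate-1"); demoSend("candidate-2");

var demoDelivered = [];
var demoReceive = createOrderedReceiver(function (msg) { demoDelivered.push(msg); });
demoReceive(demoWire[2]); demoReceive(demoWire[0]); demoReceive(demoWire[1]);
```

This sketch does not handle lost messages; a transport that can drop messages would also need acknowledgement or retransmission.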

When a PeerConnection object is created, the readyState is set to NEW. Peers that are willing to receive incoming connections may call listen() to indicate this, and their readyState changes to LISTENING. A peer willing to initiate a connection to another peer may call open() to begin this process (its readyState changes to OPENING). The listening end will receive a callback on the onIncoming function, in which it may decide to accept() the connection. If a connection is accepted, the readyState changes to ACTIVE and the peer may start transmitting media packets. The readyState on the far end changes to ACTIVE as soon as the first packet from the initiating end is received.
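The two-party flow above can be summarized as a transition table. This is a sketch of one reading of the paragraph, not a normative state machine: it assumes the listener becomes ACTIVE on accept(), the opener becomes ACTIVE when the first packet arrives (the hypothetical `firstPacket` event), and close() leads to CLOSED from any other state. The multi-peer case, in which one connection both listens and opens, is not modeled.

```javascript
var PeerState = { NEW: 1, LISTENING: 2, OPENING: 3, ACTIVE: 4, CLOSED: 5 };

// Allowed transitions for a single two-party connection.
// "firstPacket" is a hypothetical internal event, not an API call.
var peerTransitions = {
    listen:      { from: [PeerState.NEW], to: PeerState.LISTENING },
    open:        { from: [PeerState.NEW], to: PeerState.OPENING },
    accept:      { from: [PeerState.LISTENING], to: PeerState.ACTIVE },
    firstPacket: { from: [PeerState.OPENING], to: PeerState.ACTIVE },
    close:       { from: [PeerState.NEW, PeerState.LISTENING,
                          PeerState.OPENING, PeerState.ACTIVE],
                   to: PeerState.CLOSED }
};

// Return the state that follows `current` after `event`,
// or throw if the event is not allowed in that state.
function nextPeerState(current, event) {
    var known = Object.prototype.hasOwnProperty.call(peerTransitions, event);
    var rule = known ? peerTransitions[event] : null;
    if (!rule || rule.from.indexOf(current) === -1) {
        throw new Error("Invalid transition: " + event + " from state " + current);
    }
    return rule.to;
}

// Usage: the listening side of a call.
var demoState = PeerState.NEW;
demoState = nextPeerState(demoState, "listen"); // LISTENING
demoState = nextPeerState(demoState, "accept"); // ACTIVE
```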

Using a single PeerConnection to handle multiple incoming connections presents some unique challenges, but also has the desirable property of being able to stream out a single set of MediaStreams to multiple peers (which can also be changed mid-session). However, it remains to be specified how the addresses passed to the open() and accept() calls are obtained by the JS caller.

An alternative programming model is similar to that of UNIX sockets: we can define a separate PeerListener object that will in turn create a new PeerConnection object for every accepted incoming connection. This scheme has the advantage of associating every PeerConnection with only two endpoints (signaling and addressing are very clear in this case); however, it makes it harder to use a single set of MediaStreams for multiple clients, since they will have to be set up for every individual connection. The alternative proposal is at RTC_API_Proposal:PeerListener.

Examples

Simple Video Call

Simple video calling between two users A and B. A is making a call to B:

// User-agent A executes
<video id="localPreview"/><video id="remoteView"/>
<script>
navigator.getUserMedia({}, function(stream) {
    // Display local video
    document.getElementById("localPreview").stream = stream;

    var conn = new PeerConnection("stun:foobar.net:3476", sendToB);
    function sendToB(msg) { /* send via XHR to B */ }
    function gotFromB(msg) { conn.processSignalingMessage(msg); }

    conn.addLocalStream(stream);
    conn.onRemoteStreamAdded = function(remoteStream) {
        // Display remote video
        document.getElementById("remoteView").stream = remoteStream;
    };

    // Initiate video call to B. How is b_addr obtained? TBD
    conn.open(b_addr);
});
</script>

// User-agent B executes
<video id="localPreview"/><video id="remoteView"/>
<script>
navigator.getUserMedia({}, function(stream) {
    document.getElementById("localPreview").stream = stream;

    var conn = new PeerConnection("stun:foobar.net:3476", sendToA);
    function sendToA(msg) { /* send via XHR to A */ }
    function gotFromA(msg) { conn.processSignalingMessage(msg); }

    conn.addLocalStream(stream);
    conn.onRemoteStreamAdded = function(remoteStream) {
        document.getElementById("remoteView").stream = remoteStream;
    };
    conn.onIncoming = function(msg) {
        conn.accept(msg.addr); // msg.addr = a.addr
    };
});
</script>

Simulcast Video

Simulcasting real-time video & audio streams to multiple clients:

// This code runs on the "server". Some other part of the web page magically paints the game to the canvas
<canvas id="hockeyGame"/>
<script>
function sendToPeer(msg) { /* Out of band send */ }
function gotFromPeer(msg) { conn.processSignalingMessage(msg); /* Out of band receive */ }

var conn = new PeerConnection("turns:example.org", sendToPeer);
conn.addLocalStream(document.getElementById("hockeyGame").stream);
conn.onIncoming = function(msg) {
    conn.accept(msg.addr); // Accept every incoming connection
}
conn.listen(); 
</script>

// All clients subscribing to the simulcast run this code.
// TURN server does the job of initiating onRemoteStreamAdded for every client?
<video id="gameStream"/>
<script>
    function sendToPeer(msg) { /* Out of band send */ }
    function gotFromPeer(msg) { conn.processSignalingMessage(msg); /* Out of band receive */ }

    var conn = new PeerConnection("turns:example.org", sendToPeer);
    conn.onRemoteStreamAdded = function(stream) {
        document.getElementById("gameStream").stream = stream;
    };

    conn.open(server_addr); 
</script>

MMORPG

Browser based MMORPG that enables player voice communication (push to talk):

// All players
<button id="ptt"/>
<audio id="otherPlayers"/>
<script>
var mixer;
var worker = new Worker("muxer.js");
var players = ... // an array of player addresses provided by the server, used to connect to other players
navigator.getUserMedia({audio:true}, function(stream) {
    function sendToPeer(msg) { /* Out of band send */ }
    function gotFromPeer(msg) { conn.processSignalingMessage(msg); /* Out of band receive */ }

    var conn = new PeerConnection("stuns:game-server.net", sendToPeer);
    conn.addLocalStream(stream);

    conn.onIncoming = function(msg) {
        // Only accept connections from other players
        if (players.indexOf(msg.addr) !== -1)
            conn.accept(msg.addr);
    };
    conn.onRemoteStreamAdded = function(remoteStream) {
        if (!mixer) {
            mixer = remoteStream.createProcessor(worker); // StreamProcessor API TBD
            // The mixer only exists once the first remote stream arrives,
            // so its output is attached to the <audio> element here
            document.getElementById("otherPlayers").stream = mixer.outputStream;
        } else {
            mixer.addInput(remoteStream);
        }
    };

    conn.listen();
    for (var i = 0; i < players.length; i++) {
        conn.open(players[i]);
    }

    // Push to talk: each click toggles between sending and blocking the local stream
    var streaming = false;
    document.getElementById("ptt").onclick = function() {
        if (!streaming) {
            streaming = true;
            stream.readyState = stream.LIVE;
        } else {
            streaming = false;
            stream.readyState = stream.BLOCKED;
        }
    };
});