MSC3898: Native Matrix VoIP signalling for cascaded SFUs

MSC3401 specifies how full-mesh group calls work in Matrix. While that MSC works well for small group calls, it does not work so well for large conferences due to bandwidth (and other) issues.

Selective Forwarding Units (SFUs) are servers which forward WebRTC streams between peers (which could be clients or SFUs or both). To make use of them effectively, peers need to be able to tell the SFU which streams they want to receive and at what resolutions.

To solve the issue of centralization, the SFUs are also allowed to connect to each other ("cascade"), and therefore the peers also need a way to tell an SFU which other SFUs to connect to.

Proposal

TODO: spell out how this works with active speaker detection & associated signalling

Diagrams

The diagrams of how this all looks can be found in MSC3401.

Additions to the `m.call.member` state event

This MSC proposes adding two optional fields to the m.call.member state event: m.foci.preferred and m.foci.active.

Informational: This attempts to avoid the situation where a conference is ongoing with several users in, for example, New York. These users are all connected to the focus in New York. Alice joins from London: rather than connecting to the focus in London, she connects directly to the one in New York since that's where all the other participants are connected. If more users then join from London, however, they will all make the same decision and connect to the New York focus rather than the optimal configuration of the London users connected to the London focus. With active and preferred foci, the second user that joins from London will know that although Alice's active focus is New York, her preferred is London, and can therefore choose the London focus instead.

For instance:

{
    "type": "m.call.member",
    "state_key": "@matthew:matrix.org",
    "content": {
        "m.calls": [
            {
                "m.call_id": "cvsiu2893",
                "m.devices": [{
                    "device_id": "U738KDF9WJ",
                    "m.foci.active": [
                        { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" }
                    ],
                    "m.foci.preferred": [
                        { "user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF" },
                        { "user_id": "@sfu-mon:matrix.org", "device_id": "GFSDH93EF" }
                    ]
                }]
            }
        ],
        "m.expires_ts": 1654616071686
    }
}

`m.foci.active`

This field is a list of foci the user's device is publishing to. Usually, this list will have a length of 1, yet a client might publish to multiple foci if they are on different networks, for instance, or to simultaneously fan-out in different directions from the client if there is no nearby focus. If the client is participating full-mesh, it should either omit this field from the state event or leave the list empty.

`m.foci.preferred`

This field is a list of foci the client would prefer to switch to from the current active focus, if any other client also starts using the given focus. If the client is already using one of its preferred foci, it should either omit this field from the state event or leave the list empty.
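
As an illustration of how a client might read these two fields, the following sketch extracts the active and preferred foci for every device in a given call from the content of an `m.call.member` event. The function names (`foci_for_call`, `_foci`) and the returned structure are hypothetical and not part of this MSC; only the field names come from the example above.

```python
from __future__ import annotations

from typing import Any, Tuple

# A focus is identified by (user_id, device_id).
Focus = Tuple[str, str]


def _foci(entries: list[dict[str, Any]] | None) -> list[Focus]:
    """Convert a list of {"user_id", "device_id"} objects into tuples."""
    return [(e["user_id"], e["device_id"]) for e in entries or []]


def foci_for_call(
    content: dict[str, Any], call_id: str
) -> dict[str, dict[str, list[Focus]]]:
    """Map each device_id in the given call to its active and preferred foci.

    An omitted field and an empty list are treated identically, since a
    client may do either when it has nothing to advertise.
    """
    result: dict[str, dict[str, list[Focus]]] = {}
    for call in content.get("m.calls", []):
        if call.get("m.call_id") != call_id:
            continue
        for device in call.get("m.devices", []):
            result[device["device_id"]] = {
                "active": _foci(device.get("m.foci.active")),
                "preferred": _foci(device.get("m.foci.preferred")),
            }
    return result
```

A device whose `active` list comes back empty is, under this reading, participating full-mesh.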

Choosing a focus

Discovering foci

TODO: How does a client discover foci? We could use well-known or a custom endpoint

Foci are identified by a tuple of user_id and device_id.

Determining the best focus

There are many ways to determine the best focus; this MSC recommends choosing the focus which:

- Is the quickest to respond to m.call.invite with m.call.answer.
- Is the quickest to rapidly reject a spurious HTTPS request to a high-numbered port on the SFU's IP address, if the SFU exposes its IP somewhere. This is similar to the apenwarr/blip trick and measures media-path latency rather than signalling-path latency.
- Has the best latency of data-channel traffic flows.
- Has the best latency and bandwidth, determined by sending a small burst of media down the pipe as a probe.
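
Whichever measurement is used, the outcome is an ordering of foci from best to worst. A minimal sketch of that ordering step follows, assuming each candidate focus has already been probed and the result reduced to a single round-trip time. `ProbeResult` and `order_foci` are hypothetical names, and dropping foci which did not answer the probe is an assumption rather than something this MSC specifies.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of probing one focus (hypothetical structure)."""

    user_id: str
    device_id: str
    rtt_ms: float | None  # None means the probe got no response


def order_foci(results: list[ProbeResult]) -> list[ProbeResult]:
    """Return the reachable foci sorted from lowest to highest round-trip time."""
    reachable = [r for r in results if r.rtt_ms is not None]
    return sorted(reachable, key=lambda r: r.rtt_ms)
```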

Joining a call

The following diagram explains how a client chooses a focus when joining a call.

flowchart TD;
wantsToJoin[Wants to join a call];
hasPreferred(Has preferred focus?);
callPreferred[Calls preferred foci without media to grab a slot];
publishPreferred[Publishes `m.foci.preferred`];
checkMembers(Call has more than 2 members including the client itself?);
callFullMesh[Calls other member full-mesh];
callMembersFoci[Tries calling foci from `m.call.member` events];
orderFoci[Orders foci from best to worst];
findFocusPreferredByOtherMember(Goes through ordered foci to find one which is preferred by at least one other member);
callBestPreferred[Calls the focus];
callBestActive[Calls the best active focus in room];
publishActive[Publishes `m.foci.active`];

wantsToJoin-->hasPreferred;
hasPreferred--->|Yes|callPreferred;
hasPreferred--->|No|checkMembers;
callPreferred--->publishPreferred;
publishPreferred--->checkMembers;
checkMembers--->|Yes|callMembersFoci;
checkMembers--->|No|callFullMesh;
callMembersFoci--->orderFoci;
orderFoci--->findFocusPreferredByOtherMember;
findFocusPreferredByOtherMember--->|Found|callBestPreferred;
callBestPreferred--->publishActive;
findFocusPreferredByOtherMember--->|Not found|callBestActive;
callBestActive--->publishActive;
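
The decision part of the flowchart above can be sketched as a pure function. This is only an illustration: the name `choose_focus`, its parameters, and the fallback to full mesh when no focus is in use at all are assumptions, not part of this MSC. Calling the client's own preferred foci to grab a slot and publishing the state events are left out.

```python
from __future__ import annotations

from typing import Optional, Tuple

# A focus is identified by (user_id, device_id).
Focus = Tuple[str, str]


def choose_focus(
    member_count: int,
    ordered_foci: list[Focus],
    preferred_by_others: set[Focus],
    active_in_room: set[Focus],
) -> Optional[Focus]:
    """Pick the focus to call when joining, or None to call full-mesh.

    member_count includes the joining client itself.
    ordered_foci holds the foci seen in m.call.member events, best first.
    """
    # Two or fewer members: call the other member full-mesh.
    if member_count <= 2:
        return None
    # Prefer the best focus that at least one other member lists as preferred.
    for focus in ordered_foci:
        if focus in preferred_by_others:
            return focus
    # Otherwise fall back to the best focus that is active in the room.
    for focus in ordered_foci:
        if focus in active_in_room:
            return focus
    # Assumption: with no usable focus at all, stay full-mesh.
    return None
```

After a non-`None` result the client would call that focus and publish it in `m.foci.active`.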


Mid-call changes

Once in a call, the client listens for changes to m.call.member state events and if another member starts using one of the client's preferred foci, the client switches to that focus.
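
A minimal sketch of that check, run whenever an `m.call.member` state event changes. The function name and the shape of `others_active` (a mapping from some per-member key to that member's active foci) are hypothetical; honouring the order of the client's own preferred list is also an assumption.

```python
from __future__ import annotations

from typing import Optional, Tuple

# A focus is identified by (user_id, device_id).
Focus = Tuple[str, str]


def focus_to_switch_to(
    my_active: list[Focus],
    my_preferred: list[Focus],
    others_active: dict[str, list[Focus]],
) -> Optional[Focus]:
    """Return the first of our preferred foci that another member now uses.

    Returns None when no switch is needed.
    """
    in_use = {focus for foci in others_active.values() for focus in foci}
    for focus in my_preferred:
        if focus in in_use and focus not in my_active:
            return focus
    return None
```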

TODO: other cases?

Initial offer/answer dance

During the initial offer/answer dance, the client establishes a data-channel between itself and the SFU to use later for rapid signalling.

Simulcast

RTP munging

vp8 munging

RTCP re-transmission

Data-channel messaging

The client uses the established data-channel connection to the SFU for low-latency signalling: to rapidly (un)subscribe to and (un)publish streams, exchange ping messages and metadata, set up cascading, and perform re-negotiation.

See the section about the rationale behind the use of the data channels for signaling.

TODO: Spell out how the DC traffic interacts with application-layer traffic

SDP Stream Metadata extension

The client will be receiving multiple streams from the SFU and will need to be able to distinguish them. This MSC therefore builds on MSC3077 and MSC3291 to provide the client with the necessary metadata. Some of the data-channel events include an sdp_stream_metadata field containing a description of the stream being sent either from the SFU to the client or from the client to the SFU.

Other than mute information and stream purpose, the metadata includes video track resolution. The SFU may not be able to determine the resolution of the track itself but it does need to know for simulcast; therefore, we include this in the metadata.

{
    "streamId1": {
        "purpose": "m.usermedia",
        "audio_muted": false,
        "video_muted": true,
        "tracks": {
            "trackId1": {
                "width": 1920,
                "height": 1080
            },
            "trackId2": {}
        }
    }
}
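
To show how a focus might consume this metadata for simulcast decisions, the sketch below collects the advertised resolution of every track that carries one. The function name is hypothetical, and treating a track without both `width` and `height` (such as `trackId2` above) as having no known resolution is an assumption.

```python
from __future__ import annotations

from typing import Any, Tuple

Resolution = Tuple[int, int]  # (width, height)


def video_resolutions(
    metadata: dict[str, Any],
) -> dict[Tuple[str, str], Resolution]:
    """Map (stream_id, track_id) to the resolution advertised in the metadata.

    Tracks without both width and height are skipped.
    """
    resolutions: dict[Tuple[str, str], Resolution] = {}
    for stream_id, stream in metadata.items():
        for track_id, track in stream.get("tracks", {}).items():
            if "width" in track and "height" in track:
                resolutions[(stream_id, track_id)] = (
                    track["width"],
                    track["height"],
                )
    return resolutions
```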

Event types

This MSC adds a few new m.call.* events and extends a few of the existing ones.

`m.call.track_subscription`

This event is sent to the focus to let it know about the tracks the client would like to start/stop subscribing to.

Upon receiving this event, a focus should make the subscription changes based on the subscribe and unsubscribe arrays and respond with an m.call.negotiate event.

In the case of video tracks, the client may also request a specific resolution for a given track in the subscribe array. This is the resolution the client wishes to receive, but the SFU may send a lower one, for example because of bandwidth constraints.

If the user, for example, switches from "spotlight" (one large tile) to "grid" (multiple small tiles) view, the client should send this event again with the updated resolution in the subscribe array to let the focus know of the resolution change.

Clients may request each track only once: foci should ignore multiple requests of the same track.

TODO: how do we prove to the focus that we have the right to subscribe to a track?

{
    "type": "m.call.track_subscription",
    "content": {
        "subscribe": [
            {
                "stream_id": "streamId1",
                "track_id": "trackId1",
                "width": 1920,
                "height": 1080
            },
            {
                "stream_id": "streamId2",
                "track_id": "trackId2",
                "width": 256,
                "height": 144
            }
        ],
        "unsubscribe": [
            {
                "stream_id": "streamId3",
                "track_id": "trackId4"
            },
            {
                "stream_id": "streamId4",
                "track_id": "trackId4"
            }
        ]
    }
}
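
A sketch of how a focus might apply this event to its record of one client's subscriptions. The function name and the shape of `current` are hypothetical. Three behaviours are assumptions rather than requirements of this MSC: `unsubscribe` is processed before `subscribe`, only the first request for a track within one event is honoured, and a later event for an already subscribed track updates its requested resolution.

```python
from __future__ import annotations

from typing import Any, Optional, Tuple

TrackKey = Tuple[str, str]  # (stream_id, track_id)
Resolution = Optional[Tuple[int, int]]  # (width, height), None if not requested


def apply_track_subscription(
    current: dict[TrackKey, Resolution], content: dict[str, Any]
) -> dict[TrackKey, Resolution]:
    """Return the client's subscriptions after applying one event's content."""
    updated = dict(current)  # leave the caller's mapping untouched
    for item in content.get("unsubscribe", []):
        updated.pop((item["stream_id"], item["track_id"]), None)
    seen: set[TrackKey] = set()
    for item in content.get("subscribe", []):
        key = (item["stream_id"], item["track_id"])
        if key in seen:
            continue  # ignore repeated requests for the same track
        seen.add(key)
        if "width" in item and "height" in item:
            updated[key] = (item["width"], item["height"])
        else:
            updated[key] = None
    return updated
```

Once the mapping has changed, the focus would renegotiate by sending `m.call.negotiate`.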

`m.call.negotiate`

This event works exactly like the m.call.negotiate event in 1:1 calls.

{
    "type": "m.call.negotiate",
    "content": {
        "description": {
            "type": "offer",
            "sdp": "..."
        },
        "sdp_stream_metadata": {...} // As specified in the Metadata section
    }
}

`m.call.sdp_stream_metadata_changed`

This event works very similarly to the 1:1 call m.call.sdp_stream_metadata_changed.

TODO: Spec how foci actually use this to advertise tracks

{
    "type": "m.call.sdp_stream_metadata_changed",
    "content": {
        "sdp_stream_metadata": {...} // As specified in the Metadata section
    }
}

`m.call.ping`, `m.call.pong`

A ping message must be sent by the focus to the client at an interval no greater than 30 seconds. On receiving a ping message, a client must respond immediately with a pong message. A client may therefore detect that the connection has failed after an amount of time of its choosing (greater than 30 seconds) has elapsed since it last saw a ping message. A focus may deem a client unresponsive if it does not receive a pong within some amount of time after sending a ping; again, how long the focus waits is up to the implementation. Either side should hang up once it deems the other side unresponsive.

focus -> client:

{
    "type": "m.call.ping",
    "content": {}
}

client -> focus:

{
    "type": "m.call.pong",
    "content": {}
}
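
The client side of this exchange could be tracked along the following lines. This is a sketch under stated assumptions: `PingWatchdog` and its methods are hypothetical names, timestamps are passed in as seconds so that the caller chooses the clock, and the timeout is whatever value above 30 seconds the implementation picks.

```python
PING_INTERVAL_S = 30.0  # maximum interval between pings from the focus


class PingWatchdog:
    """Tracks pings from the focus and decides when the link looks dead."""

    def __init__(self, timeout_s: float, now: float) -> None:
        if timeout_s <= PING_INTERVAL_S:
            raise ValueError("timeout must be greater than the ping interval")
        self.timeout_s = timeout_s
        self.last_ping = now

    def on_ping(self, now: float) -> dict:
        """Record the ping and return the pong to send back immediately."""
        self.last_ping = now
        return {"type": "m.call.pong", "content": {}}

    def is_unresponsive(self, now: float) -> bool:
        """True once no ping has been seen for longer than the timeout."""
        return now - self.last_ping > self.timeout_s
```

When `is_unresponsive` returns true, the client would hang up.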

`m.call.connect_to_focus`

If a user is using their focus in a call, that focus will need to know how to connect to the other foci present in order to participate in the full mesh of SFU traffic (if any). The client is responsible for telling it, using the m.call.connect_to_focus event.

{
    "type": "m.call.connect_to_focus",
    "content": {
        // TODO: How should this look?
    }
}

Notes

Hiding behind foci

We do not recommend that users utilise a focus to hide behind for privacy, but instead use a TURN server, only providing relay candidates, rather than consuming focus resources and unnecessarily mandating the presence of a focus.

Potential issues

The SFUs participating in a conference end up in a full mesh. Rather than inventing our own spanning-tree system for SFUs however, we should fix it for Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or similar to decide what better-than-full-mesh topology to use. In practice, full mesh cascade between SFUs is probably not that bad (especially if SFUs only request the streams over the trunk their clients care about) - and on aggregate will be less obnoxious than all the clients hitting a single SFU.

Too many foci will chew bandwidth due to the full mesh between them. In the worst case, if every user is on their own HS and picks a different focus, it degenerates to a full-mesh call (just server-side rather than client-side). Hopefully this shouldn't happen, as clients will converge on the SFU with the most clients, but we need to check how this works in practice.

SFrame currently mandates its own ratchet, which is almost the same as megolm but not quite. Switching it out for megolm seems reasonable right now (at least until MLS comes along).

Alternatives

An option would be to treat 1:1 (and full mesh) entirely differently to SFU based calling rather than trying to unify them. Also, it's debatable whether supporting full mesh is useful at all. In the end, it feels like unifying 1:1 and SFU calling is for the best though, as it then gives you the ability to trivially upgrade 1:1 calls to group calls and vice versa, and avoids maintaining two separate hunks of spec. It also forces 1:1 calls to take multi-stream calls seriously, which is useful for more exotic capture devices (stereo cameras; 3D cameras; surround sound; audio fields etc).

The use of the data channels for signaling

The current specification assumes that signaling works over Matrix, but side-chains to the data channel once the peer connection is established in order to perform low-latency signaling.

In an ideal scenario the use of the data channels would not be required and native Matrix signaling would be sufficient. However, because regular Matrix signaling may need to traverse different servers, e.g. client <-> home server <-> home server <-> sfu, it would not be as fast as we need it to be. The effect is compounded by the fact that certain protocols like HTTP are not as efficient for real-time communication as e.g. WebRTC data channels or WebSockets.

The problem would be solved if the clients could connect to the SFU directly and communicate via Matrix for all signaling messages. This would allow us to use a faster transport (WebSockets, QUIC etc) to transmit signaling messages. However, this is currently not possible because it would require support for P2P Matrix, which is still under development at the time of writing this MSC.

To read more about the problem and get more context, please refer to the discussion.

Cascading

One option here is for SFUs to act as an AS and sniff the m.call.member traffic of their associated server, and automatically call any other m.foci which appear. (They don't need to make outbound calls to clients, as clients always dial in).

Security considerations

Malicious users could try to DoS SFUs by specifying them as their foci.

SFrame E2EE may go horribly wrong if we can't send the new megolm session fast enough to all the participants when a participant leaves (and meanwhile, if we keep using the old session, we're technically leaking call media to the parted participant until we manage to rotate).

Need to ensure there's no scope for media forwarding loops through SFUs.

In order to authenticate that only legitimate users are allowed to subscribe to a given conf_id on an SFU, it would make sense for the SFU to act as an AS and sniff the m.call events on their associated server, and only act on to-device m.call.* events which come from a user who is confirmed to be in the room for that m.call. (In practice, if the conf is E2EE then it's of limited use to connect to the SFU without having the keys to decrypt the traffic, but this feature is desirable for non-E2EE confs and to stop bandwidth DoS)
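
The filtering step described above amounts to a membership check before an event is processed. A minimal sketch, assuming the SFU already maintains the set of users joined to the room for the call (how it learns this, for instance as an AS, is out of scope here) and that incoming events carry `type` and `sender` fields; the function name is hypothetical.

```python
from __future__ import annotations

from typing import Any


def should_act_on(event: dict[str, Any], room_members: set[str]) -> bool:
    """Accept only m.call.* events sent by a user known to be in the room."""
    event_type = event.get("type", "")
    return event_type.startswith("m.call.") and event.get("sender") in room_members
```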

Unstable prefixes

We probably don't care for this for the data-channel?

While this MSC is not considered stable, implementations should use org.matrix.msc3898 as a namespace.

| Stable (post-FCP) | Unstable |
| --- | --- |
| `m.foci.active` | `org.matrix.msc3898.foci.active` |
| `m.foci.preferred` | `org.matrix.msc3898.foci.preferred` |
