[WIP] MSC3898: Native Matrix VoIP signalling for cascaded foci (SFUs, MCUs...) #3898

proposals/3898-sfu.md

# MSC3898: Native Matrix VoIP signalling for cascaded SFUs

[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401)
specifies how full-mesh group calls work in Matrix. While that MSC works well
for small group calls, it does not work so well for large conferences due to
bandwidth (and other) issues.

Selective Forwarding Units (SFUs) are servers which forward WebRTC streams
between peers (which could be clients, other SFUs, or both). To make use of
them effectively, peers need to be able to tell the SFU which streams they want
to receive and at what resolutions.

To avoid centralization, SFUs are also allowed to connect to each other
("cascade"), and therefore peers also need a way to tell an SFU which other
SFUs it should connect to.

## Proposal

**TODO: spell out how this works with active speaker detection & associated
signalling**

**TODO: spell out how the DC traffic interacts with application-layer traffic**

**TODO: how do we prove to the SFU that we have the right to subscribe to a
track?**

### Diagrams

Diagrams of how this all fits together can be found in
[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401).

### State events

#### `m.call` state event

This MSC proposes adding an _optional_ `m.foci` field to the `m.call` state
event. It is a list of SFUs that the call initiator recommends to users who do
not want to use their own SFU (because they do not have one, or because they
would be the only person on their SFU for this call, and so choose to connect
directly to save bandwidth).

For instance:

```json
{
    "type": "m.call",
    "state_key": "cvsiu2893",
    "content": {
        "m.intent": "m.room",
        "m.type": "m.voice",
        "m.name": "Voice room",
        "m.foci": [
            "@sfu-lon:matrix.org",
            "@sfu-nyc:matrix.org"
        ]
    }
}
```

#### `m.call.member` state event

This MSC proposes adding an _optional_ `m.foci` field to the `m.call.member`
state event. It is used if the user wants to be contacted via an SFU rather
than called directly (either 1:1 or full mesh).

For instance:

```jsonc
{
    "type": "m.call.member",
    "state_key": "@matthew:matrix.org",
    "content": {
        "m.calls": [
            {
                "m.call_id": "cvsiu2893",
                // TODO: Should this be at the device level?
                "m.foci": [
                    "@sfu-lon:matrix.org",
                    "@sfu-nyc:matrix.org",
                ],
                "m.devices": [...]
            }
        ],
        "m.expires_ts": 1654616071686
    }
}
```

### Choosing an SFU

**TODO: How does a client discover SFUs?**

**TODO: Is an SFU identified by just `user_id` or by `(user_id, device_id)`?**

* When initiating a group call, we need to decide which devices to actually
  talk to.
* If the client has no SFU configured, we try to use the `m.foci` in the
  `m.call` event.
  * If there are multiple `m.foci`, we select the closest one based on
    latency, e.g. by trying to connect to all of them simultaneously and
    discarding all but the first call to answer.
  * If there are no `m.foci` in the `m.call` event, then we look at which foci
    in `m.call.member` are already in use by existing participants and select
    the most common one. (If that focus is overloaded, it can reject us and we
    should then try the next most populous one, etc.)
  * If there are no `m.foci` in `m.call.member` either, then we connect full
    mesh.
  * If `m.foci` are subsequently introduced into the conference, then we
    should transfer the call to them (effectively doing a 1:1 -> group call
    upgrade).
* If the client does have an SFU configured, then we decide whether to use it.
  * If other conference participants are already using it, then we use it.
  * If there are other users from our homeserver in the conference, then we
    use it (as presumably they should be using it too).
  * If there are no other `m.foci` (either in the `m.call` event or in the
    participant state), then we use it.
  * Otherwise, we save bandwidth on our SFU by not cascading and instead
    behave as if we had no SFU configured.
* We do not recommend that users hide behind an SFU for privacy; they should
  instead use a TURN server, providing only relay candidates, rather than
  consuming SFU resources and unnecessarily mandating the presence of an SFU.
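As an illustration, the selection steps above can be sketched as follows. All
function and type names here are illustrative rather than part of the proposal,
and the latency-racing and same-homeserver checks are elided for brevity:

```typescript
// Hypothetical sketch of the focus-selection algorithm described above.
interface SelectionInput {
  configuredFocus?: string; // SFU configured on this client, if any
  callEventFoci: string[];  // m.foci from the m.call state event
  memberFoci: string[][];   // m.foci lists from existing participants
}

// Returns the focus to use, or null to fall back to full mesh.
function chooseFocus(input: SelectionInput): string | null {
  const inUse = input.memberFoci.flat();

  if (input.configuredFocus) {
    // Use our own SFU if others already use it, or if nobody advertises foci.
    // (The "other users from our homeserver" check is omitted here.)
    if (inUse.includes(input.configuredFocus)) return input.configuredFocus;
    if (input.callEventFoci.length === 0 && inUse.length === 0) {
      return input.configuredFocus;
    }
    // Otherwise save bandwidth: behave as if we had no SFU configured.
  }

  if (input.callEventFoci.length > 0) {
    // A real client would race connections and keep the lowest-latency focus;
    // taking the first entry is a placeholder for that.
    return input.callEventFoci[0];
  }

  if (inUse.length > 0) {
    // Pick the most common focus among existing participants.
    const counts = new Map<string, number>();
    for (const f of inUse) counts.set(f, (counts.get(f) ?? 0) + 1);
    return Array.from(counts.entries()).sort((a, b) => b[1] - a[1])[0][0];
  }

  return null; // no foci anywhere: connect full mesh
}
```

Note that overload rejection would be handled by retrying with the next most
populous focus, which this sketch does not model.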

### Initial offer/answer dance

During the initial offer/answer dance, the client establishes a data channel
between itself and the SFU, which is later used for rapid signalling.
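A sketch of the order of operations is below. `PcLike` mirrors only the subset
of the `RTCPeerConnection` API used so the sketch stays runnable outside a
browser; the channel label and function names are illustrative, not mandated by
this MSC:

```typescript
// Minimal interface matching the RTCPeerConnection methods used here.
interface PcLike {
  createDataChannel(label: string): { label: string };
  createOffer(): Promise<{ type: "offer"; sdp: string }>;
  setLocalDescription(desc: { type: "offer"; sdp: string }): Promise<void>;
}

// Create the data channel *before* the offer so it is negotiated in the SDP.
async function establishSignallingChannel(pc: PcLike) {
  const channel = pc.createDataChannel("signalling"); // label is illustrative
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  // The offer is then sent to the SFU (e.g. via to-device messages), and the
  // SFU's answer is applied with setRemoteDescription.
  return channel;
}
```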

### Simulcast

#### RTP munging

#### vp8 munging

### RTCP re-transmission

### Data-channel messaging

The client uses the established data channel to the SFU to perform low-latency
signalling: rapidly (un)subscribing to and (un)publishing streams, sending
keep-alive messages and metadata, managing cascading, and performing
re-negotiation.

**TODO: It feels like these ought to be `m.` namespaced**

**TODO: Why `op` instead of `type`?**

**TODO: It feels like these ought to have a `content` field rather than having
everything on the same layer**

#### SDP Stream Metadata extension

The client will be receiving multiple streams from the SFU and will need to be
able to distinguish them. This therefore builds on
[MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077) and
[MSC3291](https://github.com/matrix-org/matrix-spec-proposals/pull/3291) to
provide the client with the necessary metadata. Some of the data-channel events
include a `metadata` field containing a description of the stream being sent,
either from the SFU to the client or from the client to the SFU.

```json5
{
    "streamId1": {
        "purpose": "m.usermedia",
        "audio_muted": false,
        "video_muted": true,
        "tracks": {
            "trackId1": {
                "width": 1920,
                "height": 1080
            },
            "trackId2": {}
        }
    }
}
```
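The structure above can be modelled with TypeScript types like the following.
These interfaces are an illustrative sketch, not normative definitions (the
`m.screenshare` purpose comes from MSC3077):

```typescript
// Illustrative types for the stream metadata shape shown above.
interface TrackMetadata {
  width?: number;  // absent for audio-only tracks
  height?: number;
}

interface StreamMetadata {
  purpose: "m.usermedia" | "m.screenshare";
  audio_muted: boolean;
  video_muted: boolean;
  tracks: Record<string, TrackMetadata>;
}

// Keyed by stream ID, as in the example above.
type SdpStreamMetadata = Record<string, StreamMetadata>;

const example: SdpStreamMetadata = {
  streamId1: {
    purpose: "m.usermedia",
    audio_muted: false,
    video_muted: true,
    tracks: { trackId1: { width: 1920, height: 1080 }, trackId2: {} },
  },
};
```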

#### Event types

##### Subscribe

```json5
{
    "op": "subscribe",
    "streamId": "streamId1",
    "trackId": "trackId1",
    "width": 1920,
    "height": 1080
}
```

##### Unsubscribe

```json5
{
    "op": "unsubscribe",
    "streamId": "streamId1",
    "trackId": "trackId1"
}
```

##### Publish

##### Unpublish

##### Offer

##### Answer

##### Metadata

```json5
{
    "op": "metadata",
    "metadata": {...} // As specified in the Metadata section
}
```

##### Keep-alive

```json5
{
    "op": "alive"
}
```

##### Connect

If a user is using their SFU in a call, it will need to know how to connect to
other SFUs present in order to participate in the full-mesh of SFU traffic (if
any). The client is responsible for doing this using the `connect` op.

```json5
{
    "op": "connect"
    // TODO: How should this look?
}
```

### Encryption

When SFUs are on the media path, they will necessarily terminate the SRTP
traffic from the peer, breaking E2EE. To address this, we apply an additional
end-to-end layer of encryption to the media using [WebRTC Encoded
Transform](https://github.com/w3c/webrtc-encoded-transform/blob/main/explainer.md)
(formerly Insertable Streams) via
[SFrame](https://datatracker.ietf.org/doc/draft-omara-sframe/).

In order to provide perfect forward secrecy (PFS), the symmetric key used for
these streams from a given participating device is a megolm key. Unlike a
normal megolm key, this is shared via `m.room_key` over Olm with the devices
participating in the conference, including an `m.call_id` and `m.room_id` field
on the key to correlate it to the conference traffic, rather than using the
`session_id` event field to correlate (given the encrypted traffic is SRTP
rather than events, and we do not want to have to send fake events from all
senders every time the megolm session is replaced).

The megolm key is ratcheted forward for every SFrame, and shared with new
participants at the current index via `m.room_key` over Olm as per above. When
participants leave, a new megolm session is created and shared with all
participants over Olm. The new session is only used once all participants have
received it.
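For illustration, such an `m.room_key` to-device message might look like the
following. This is a hypothetical sketch: beyond the `m.call_id` and
`m.room_id` fields described above, the field names and values follow the
existing `m.room_key` schema and are placeholders:

```json5
{
    "type": "m.room_key",
    "content": {
        "algorithm": "m.megolm.v1.aes-sha2",
        "room_id": "!abcdef:matrix.org",
        "session_id": "<megolm session ID>",
        "session_key": "<megolm session key at the current ratchet index>",
        // Added by this MSC to correlate the key with the conference traffic:
        "m.call_id": "cvsiu2893",
        "m.room_id": "!abcdef:matrix.org"
    }
}
```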

## Potential issues

The SFUs participating in a conference end up in a full mesh. Rather than
inventing our own spanning-tree system for SFUs however, we should fix it for
Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or
similar to decide what better-than-full-mesh topology to use. In practice, full
mesh cascade between SFUs is probably not that bad (especially if SFUs only
request the streams over the trunk their clients care about) - and on aggregate
will be less obnoxious than all the clients hitting a single SFU.

Too many foci will chew bandwidth due to the full mesh between them. In the
worst case, if every user is on their own homeserver and picks a different
focus, the call degenerates to a full-mesh call (just server-side rather than
client-side). Hopefully this shouldn't happen, as users will converge on using
the single SFU with the most clients, but we need to check how this works in
practice.

SFrame currently mandates its own ratchet, which is almost the same as megolm
but not quite. Switching it out for megolm seems reasonable right now (at least
until MLS comes along).

## Alternatives

An option would be to treat 1:1 (and full mesh) entirely differently to SFU
based calling rather than trying to unify them. Also, it's debatable whether
supporting full mesh is useful at all. In the end, it feels like unifying 1:1
and SFU calling is for the best though, as it then gives you the ability to
trivially upgrade 1:1 calls to group calls and vice versa, and avoids
maintaining two separate hunks of spec. It also forces 1:1 calls to take
multi-stream calls seriously, which is useful for more exotic capture devices
(stereo cameras; 3D cameras; surround sound; audio fields etc).

### Cascading

One option here is for SFUs to act as an AS and sniff the `m.call.member`
traffic of their associated server, and automatically call any other `m.foci`
which appear. (They don't need to make outbound calls to clients, as clients
always dial in).

## Security considerations

Malicious users could try to DoS SFUs by specifying them as their foci.

SFrame E2EE may go horribly wrong if we can't send the new megolm session to
all the participants fast enough when a participant leaves (meanwhile, if we
keep using the old session, we're technically leaking call media to the
departed participant until we manage to rotate).

Need to ensure there's no scope for media forwarding loops through SFUs.

In order to authenticate that only legitimate users are allowed to subscribe to
a given `conf_id` on an SFU, it would make sense for the SFU to act as an AS and
sniff the `m.call` events on their associated server, and only act on to-device
`m.call.*` events which come from a user who is confirmed to be in the room for
that `m.call`. (In practice, if the conf is E2EE then it's of limited use to
connect to the SFU without having the keys to decrypt the traffic, but this
feature is desirable for non-E2EE confs and to stop bandwidth DoS)

## Unstable prefixes

We probably do not need unstable prefixes for the data-channel messages?