From 00f8470a804cb3d2dbc3dedbf6c54703ff067547 Mon Sep 17 00:00:00 2001 From: Richard van der Hoff Date: Wed, 29 Aug 2018 15:43:12 +0100 Subject: [PATCH 1/7] WIP proposal for better grammars --- proposals/1597-id-grammar.md | 265 +++++++++++++++++++++++++++++++++++ 1 file changed, 265 insertions(+) create mode 100644 proposals/1597-id-grammar.md diff --git a/proposals/1597-id-grammar.md b/proposals/1597-id-grammar.md new file mode 100644 index 00000000000..bf2153721a2 --- /dev/null +++ b/proposals/1597-id-grammar.md @@ -0,0 +1,265 @@ +# Grammars for identifiers in the Matrix protocol + +## Background + +Matrix uses client- or server-generated identifiers in a number of +places. Historically the grammars for these have been underspecified, which +leads to confusion about what is or is not a valid identifier with the +possibility of incompatability between implementations. + +This proposal presents tightly-specified grammars for a number of +identifiers. + +## Common Identifiers + +[Spec](https://matrix.org/docs/spec/appendices.html#common-identifier-format) + +Proposal: + +> `localpart` may not include `:`. When parsing a Common Identifier, it should +> be split at the *leftmost* `:`. + +Rationale: server names may contain multiple `:`s (think IPv6 literals), so the +first colon is the only sane place to split them. This is a Known Thing, but I +don't think we spell it out anywhere in the spec. + +## User IDs + +User IDs are +[well-specified](https://matrix.org/docs/spec/appendices.html#user-identifiers), +however we should consider dropping `/` from the list of allowed characters, as +it interferes with URI escaping (XXX: how, specifically?) + +History: `/` was introduced to act as a heirarchical namespacing character "as +bridged IDs for things like gitter were looking incredibly ugly and using `.` +or `_` for namespacing rather than a more natural `/` feels ugly too". + +Proposal: + +> Remove `/` from the list of allowed characters in User IDs. + +`/` will of course be maintained under the grammar of "historical user +IDs". Sorting out that mess is a longer-term project. + +## Room IDs and Event IDs + +[Issue](https://github.com/matrix-org/matrix-doc/issues/667) +[Spec](https://matrix.org/docs/spec/appendices.html#room-ids-and-event-ids) + +These currently have similar formats, though it is likely that event ids will +be replaced with something else due to +[#1127](https://github.com/matrix-org/matrix-doc/issues/1127). + +Currently they are both specified as ``?opaque_id:domain``, without clues as to +what the opaque_id should be. + +Synapse uses: `[A-Za-z]{18}`. +[Dendrite](https://github.com/matrix-org/dendrite/blob/b71d922/src/github.com/matrix-org/dendrite/clientapi/routing/createroom.go#L125) +uses (I think) `[A-Za-z0-9]{16}` via +[json.go](https://github.com/matrix-org/util/blob/master/json.go#L185). However, +some server implementations/forks are known to generate event IDs (and possibly +room IDs) using a wide alphabet, which means that there exist rooms that +include unusual event IDs. + +Proposal: + +> The opaque_id part must not be empty, and must consist entirely of the +> characters `[0-9a-zA-Z.=_-]`. +> +> The total length (including sigil and domain) must not exceed 255 characters. +> +> This is only enforced for v2 rooms - servers and clients wishing to support +> v1 rooms should be more tolerant. + + +## Room Aliases + +[Issue](https://github.com/matrix-org/matrix-doc/issues/668) +[Spec](https://matrix.org/docs/spec/appendices.html#room-aliases) + +We probably want to inherit some rules from IDNs and consider unicode normalisation. + +XXX: UTF-8, presumably? +XXX: the maxlen is 255 *bytes*?? + +## Key IDs (for federation and e2e) + +These are always of the form `:`. + +Valid algorithms are defined at +https://matrix.org/docs/spec/client_server/unstable.html#key-algorithms, though +we should define the alphabet for future algorithms. + +Proposal: + +> Future algorithm identifiers will be assigned from the alphabet `[a-z0-9_.]` +> and will be at most 31 characters in length. + +For federation keys, +[Synapse](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/config/key.py#L159) +generates key ids as `ed25519:a_[A-Za-z]{4}`, though an HS admin can configure +them manually to be anything without whitespace. + +Key IDs end up in an Authorization header which looks like `X-Matrix +origin=origin.example.com,key="keyId",sig="ABCDEF..."`. The Synapse +implementation splits on `,` and `=` without regard to quoting so this +currently precludes the use of `,` or `=` in a key ID. + +For e2e, device keys have a `tok` corresponding to the device id, whilst +one-time keys are generated by libolm, which uses a base64-encoded 32-bit int, ie +`[A-Za-z0-9+/]{6}`. + +A key ID needs to be unique over the lifetime of the server (for federation) or +the device (for e2e). However, they are used fairly widely, so making them long +is unattractive as they could significantly increase the amount of data being +transmitted. Let's limit the 'tok' part of the key to 31 characters too. + +Proposal: + +> Key IDs use the following BNF grammar: +> +> ``` +> key_id = algorithm ":" tok +> +> algorithm = 1*31 alg_chars +> +> tok = 1*31 tok_chars +> +> alg_chars = %x61-7a / %30-39 / "_" / "." +> ; a-z 0-9 _ . +> +> tok_chars = ALPHA / DIGIT / "." / "=" / "_" / "-" +> ; A-Z a-z 0-9 . = _ - +> ``` +> + +Note that enforcing this grammar will mean: + +* Making sure that synapse handles "=" characters in key IDs (easy). + +* Making libolm not put + and / characters in key IDs (easy enough, but there + will be a bunch of malformed unique keys out there in the wild. Possibly they + would just get thrown away. Servers may need to continue to tolerate `+` and + `/` in e2e keys for a while.) + +## Opaque IDs + +[Issue](https://github.com/matrix-org/matrix-doc/issues/666) + +This is a class of identifier types where nobody is really meant to parse any +part of the ID - they are just unique identifiers (with varying scopes of +uniqueness). See below for discussion on what is currently in use. + +I propose to specify the almost the same grammar for all of these, for +simplicity and consistency. + +Proposal: + +> Opaque IDs must be strings consisting entirely of the characters +> `[0-9a-zA-Z.=_-]`. Their length must not exceed 255 characters and they must +> not be empty. + +For almost all of the current implementations I have looked at (listed below), +the grammar above is a superset of the generated identifiers, and a subset of +the understood identifiers. There should therefore be no +backwards-compatibility problems with its introduction. + +The exception is transaction IDs generated by some clients. I think that we'll +just have to fix those clients and accept that old versions may not work with +future servers. + +### Call IDs + +[Spec](https://matrix.org/docs/spec/client_server/unstable.html#m-call-invite) + +These are only used within the body of `m.call.*` events, as far as I am +aware. However, they are treated as globally-unique identifiers. + +[matrix-js-sdk](https://github.com/matrix-org/matrix-js-sdk/blob/4d310cd4618db4e98a8e6b5eb812480102ee4dee/src/webrtc/call.js#L72) uses `c[0-9.]{32}`. +[matrix-android-sdk](https://github.com/matrix-org/matrix-android-sdk/blob/5c6f785e53632e7b6fb3f3859a90c3d85b040e7f/matrix-sdk/src/main/java/org/matrix/androidsdk/call/MXWebRtcCall.java#L221) uses `c[0-9]{13}`. + +Additional proposal: + +> Call IDs should be long enough to make clashes unlikely. + +### Media IDs + +[Spec](https://matrix.org/docs/spec/client_server/r0.3.0.html#id67) + +These are generated by the server on upload, and then embedded in `mxc://` URIs +and used in the C-S API and the S-S API. + +They must be URI-safe to be sensibly embedded in `mxc://` URIs. + +[Synapse](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/rest/media/v1/media_repository.py#L153) +uses `[A-Za-z]{24}`. + +### Filter IDs + +[Spec](https://matrix.org/docs/spec/client_server/unstable.html#post-matrix-client-r0-user-userid-filter) + +These are generated by the server and then used in the CS API. They are only +required to be unique for a given user. `{` is already forbidden by the spec. + +[Synapse](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/storage/filtering.py#L70-L73) +uses a stringified int. + +### Auth Session IDs + +[Spec](https://matrix.org/docs/spec/client_server/r0.3.0.html#user-interactive-authentication-api) + +These are generated by the server during auth, and then used in the CS +API. However, they need to be unique for a given server. + +[Synapse](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/handlers/auth.py#L494) uses `[A-Za-z]{24}`. + +### Transaction IDs (for federation) + +[Spec](https://matrix.org/docs/spec/server_server/unstable.html#put-matrix-federation-v1-send-txnid) + +Generated by sending server. Needs to be unique for a given pair of servers. + +[Synapse](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/federation/transaction_queue.py#L593) uses a stringified int and accepts pretty much anything. + +### Transaction IDs (for C-S API) + +[Spec](https://matrix.org/docs/spec/client_server/unstable.html#put-matrix-client-r0-rooms-roomid-send-eventtype-txnid) + +These are generated by the client. They only need to be unique within the +context of a single access_token/device. + +Synapse doesn't appear to do any sanity-checking here currently. + +[matrix-js-sdk](https://github.com/matrix-org/matrix-js-sdk/blob/c6b500bc09994ab5924ef8aab9bd10fc7ded5dae/src/base-apis.js#L123) +uses `m[0-9]{13}.[0-9]{1,}`. +[matrix-android-sdk](https://github.com/matrix-org/matrix-android-sdk/blob/088414fb187cae341690c3a01493b87d97f0169f/matrix-sdk/src/main/java/org/matrix/androidsdk/rest/model/Event.java#L503) +uses a room ID plus a timestamp, hence kinda could be anything, but certainly +will include a `!`. + +### Device IDs + +[Spec](https://matrix.org/docs/spec/client_server/unstable.html#relationship-between-access-tokens-and-devices) + +These are normally generated by the server on login. It's possible for clients +to present their own device_ids, but we're not aware of this feature being +widely used. + +They are used between users and across federation for E2E and to-device +messages. They need to be unique for a particular user. They also appear in key +IDs and must therefore be a subset of that grammar. + +[Synapse](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/handlers/device.py#L89) +generates device IDs with `[A-Z]{10}`. It appears to do little sanity-checking +of client-generated device IDs currently. + +Additional proposal: + +> Device IDs must not exceed 31 characters in length. + +### Message IDs + +These are used in the server-server API for +[Send-to-device messaging](https://matrix.org/docs/spec/server_server/unstable.html#send-to-device-messaging). + +Synapse uses `[A-Za-z]{16}`, and accepts anything that fits in a postgres TEXT +field. Ref: [devicemessage.py](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/handlers/devicemessage.py#L102). From 180ba147002c570c1287ec4d5e4b3db0d11a2202 Mon Sep 17 00:00:00 2001 From: Richard van der Hoff Date: Wed, 29 Aug 2018 23:07:34 +0100 Subject: [PATCH 2/7] descope aliases --- proposals/1597-id-grammar.md | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/proposals/1597-id-grammar.md b/proposals/1597-id-grammar.md index bf2153721a2..93248c3d313 100644 --- a/proposals/1597-id-grammar.md +++ b/proposals/1597-id-grammar.md @@ -72,16 +72,6 @@ Proposal: > v1 rooms should be more tolerant. -## Room Aliases - -[Issue](https://github.com/matrix-org/matrix-doc/issues/668) -[Spec](https://matrix.org/docs/spec/appendices.html#room-aliases) - -We probably want to inherit some rules from IDNs and consider unicode normalisation. - -XXX: UTF-8, presumably? -XXX: the maxlen is 255 *bytes*?? - ## Key IDs (for federation and e2e) These are always of the form `:`. @@ -263,3 +253,9 @@ These are used in the server-server API for Synapse uses `[A-Za-z]{16}`, and accepts anything that fits in a postgres TEXT field. Ref: [devicemessage.py](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/handlers/devicemessage.py#L102). + + +## Room Aliases + +These are a complex topic and are discussed in [MSC +1608](https://github.com/matrix-org/matrix-doc/issues/1608). From 900be84bb646de9988973ffa9c37c1fc1ea30d49 Mon Sep 17 00:00:00 2001 From: Richard van der Hoff Date: Thu, 30 Aug 2018 00:11:46 +0100 Subject: [PATCH 3/7] address review feedback --- proposals/1597-id-grammar.md | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/proposals/1597-id-grammar.md b/proposals/1597-id-grammar.md index 93248c3d313..85a0ba56962 100644 --- a/proposals/1597-id-grammar.md +++ b/proposals/1597-id-grammar.md @@ -27,12 +27,17 @@ don't think we spell it out anywhere in the spec. User IDs are [well-specified](https://matrix.org/docs/spec/appendices.html#user-identifiers), -however we should consider dropping `/` from the list of allowed characters, as -it interferes with URI escaping (XXX: how, specifically?) - -History: `/` was introduced to act as a heirarchical namespacing character "as -bridged IDs for things like gitter were looking incredibly ugly and using `.` -or `_` for namespacing rather than a more natural `/` feels ugly too". +however we should consider dropping `/` from the list of allowed characters, +because HTTP proxies might rewrite +`/_matrix/client/r0/profile/@foo%25bar:matrix.org/displayname` to +`/_matrix/client/r0/profile/@foo/bar:matrix.org/displayname`, messing things +up. + +History: `/` was introduced with the intention of acting as a hierarchical +namespacing character, particularly with consideration to the gitter protocol +which uses it as a hierarchical separator. However, this was not as effective +as hoped because `@foo/bar:example.com` looks like the ID is partitioned into +`@foo` and `bar:example.com`. Proposal: @@ -163,7 +168,9 @@ future servers. [Spec](https://matrix.org/docs/spec/client_server/unstable.html#m-call-invite) These are only used within the body of `m.call.*` events, as far as I am -aware. However, they are treated as globally-unique identifiers. +aware. They should be unique within the lifetime of a room. (Some +implementations currently treat them as globally unique, but that is considered +an implementation bug.) [matrix-js-sdk](https://github.com/matrix-org/matrix-js-sdk/blob/4d310cd4618db4e98a8e6b5eb812480102ee4dee/src/webrtc/call.js#L72) uses `c[0-9.]{32}`. [matrix-android-sdk](https://github.com/matrix-org/matrix-android-sdk/blob/5c6f785e53632e7b6fb3f3859a90c3d85b040e7f/matrix-sdk/src/main/java/org/matrix/androidsdk/call/MXWebRtcCall.java#L221) uses `c[0-9]{13}`. From 3dc630263ea43e4046b9baf880bb8eceda1c4cc0 Mon Sep 17 00:00:00 2001 From: Richard van der Hoff Date: Thu, 30 Aug 2018 00:30:23 +0100 Subject: [PATCH 4/7] synapse uses a different format of media IDs for URL previews --- proposals/1597-id-grammar.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/proposals/1597-id-grammar.md b/proposals/1597-id-grammar.md index 85a0ba56962..0a42192b146 100644 --- a/proposals/1597-id-grammar.md +++ b/proposals/1597-id-grammar.md @@ -189,7 +189,8 @@ and used in the C-S API and the S-S API. They must be URI-safe to be sensibly embedded in `mxc://` URIs. [Synapse](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/rest/media/v1/media_repository.py#L153) -uses `[A-Za-z]{24}`. +uses `[A-Za-z]{24}`, though it also uses `[0-9A-Za-z_-]{27}` for +[URL previews](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/rest/media/v1/preview_url_resource.py#L285). ### Filter IDs From cff427562fe63606b1b1b53d648c7ea38a0b08ca Mon Sep 17 00:00:00 2001 From: Richard van der Hoff Date: Thu, 30 Aug 2018 00:32:58 +0100 Subject: [PATCH 5/7] add a note on what matrix-media-repo does --- proposals/1597-id-grammar.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/proposals/1597-id-grammar.md b/proposals/1597-id-grammar.md index 0a42192b146..bee1e71873b 100644 --- a/proposals/1597-id-grammar.md +++ b/proposals/1597-id-grammar.md @@ -190,7 +190,11 @@ They must be URI-safe to be sensibly embedded in `mxc://` URIs. [Synapse](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/rest/media/v1/media_repository.py#L153) uses `[A-Za-z]{24}`, though it also uses `[0-9A-Za-z_-]{27}` for -[URL previews](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/rest/media/v1/preview_url_resource.py#L285). +[URL +previews](https://github.com/matrix-org/synapse/blob/74854a97191191b08101821753c2672efc2a65fd/synapse/rest/media/v1/preview_url_resource.py#L285). + +[matrix-media-repo](https://github.com/turt2live/matrix-media-repo/blob/539f25ee75ba6cdbb0410314b29978f4b8b1d7fe/src/github.com/turt2live/matrix-media-repo/controllers/upload_controller/upload_controller.go#L50) +uses `[A-Za-z0-9]{32}`, via [random.go](https://github.com/turt2live/matrix-media-repo/blob/539f25ee75ba6cdbb0410314b29978f4b8b1d7fe/src/github.com/turt2live/matrix-media-repo/util/random.go#L18-L27). ### Filter IDs From 2abfb1fee6e05e206c5809e64c4236bfd6e65543 Mon Sep 17 00:00:00 2001 From: Richard van der Hoff Date: Thu, 30 Aug 2018 11:09:24 +0100 Subject: [PATCH 6/7] ISes use keys too --- proposals/1597-id-grammar.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/1597-id-grammar.md b/proposals/1597-id-grammar.md index bee1e71873b..102f652cb41 100644 --- a/proposals/1597-id-grammar.md +++ b/proposals/1597-id-grammar.md @@ -77,7 +77,7 @@ Proposal: > v1 rooms should be more tolerant. -## Key IDs (for federation and e2e) +## Key IDs (for federation, e2e, and identity servers) These are always of the form `:`. From d6b704ab69c587271543ddd0f03cdd4515c9fca2 Mon Sep 17 00:00:00 2001 From: Richard van der Hoff <1389908+richvdh@users.noreply.github.com> Date: Tue, 13 Apr 2021 16:46:07 +0100 Subject: [PATCH 7/7] Replace `=` with `~` in various grammars ... to match the characters which do not need URL-encoding --- proposals/1597-id-grammar.md | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/proposals/1597-id-grammar.md b/proposals/1597-id-grammar.md index 102f652cb41..80468249260 100644 --- a/proposals/1597-id-grammar.md +++ b/proposals/1597-id-grammar.md @@ -69,7 +69,7 @@ include unusual event IDs. Proposal: > The opaque_id part must not be empty, and must consist entirely of the -> characters `[0-9a-zA-Z.=_-]`. +> characters `[0-9a-zA-Z._~-]`. > > The total length (including sigil and domain) must not exceed 255 characters. > @@ -123,15 +123,13 @@ Proposal: > alg_chars = %x61-7a / %30-39 / "_" / "." > ; a-z 0-9 _ . > -> tok_chars = ALPHA / DIGIT / "." / "=" / "_" / "-" -> ; A-Z a-z 0-9 . = _ - +> tok_chars = ALPHA / DIGIT / "." / "~" / "_" / "-" +> ; A-Z a-z 0-9 . ~ _ - > ``` > Note that enforcing this grammar will mean: -* Making sure that synapse handles "=" characters in key IDs (easy). - * Making libolm not put + and / characters in key IDs (easy enough, but there will be a bunch of malformed unique keys out there in the wild. Possibly they would just get thrown away. Servers may need to continue to tolerate `+` and @@ -151,7 +149,7 @@ simplicity and consistency. Proposal: > Opaque IDs must be strings consisting entirely of the characters -> `[0-9a-zA-Z.=_-]`. Their length must not exceed 255 characters and they must +> `[0-9a-zA-Z._~-]`. Their length must not exceed 255 characters and they must > not be empty. For almost all of the current implementations I have looked at (listed below),